Introduction
We are implementing an existing paper that explores using Transformers for image captioning in place of RNNs, which are currently the industry standard for captioning tasks. We chose this paper because it is well documented and uses a publicly accessible dataset. The paper reports strong results and combines two deep learning techniques we learned in class, CNNs and Transformers, to caption images. Captioning is an end-to-end sequence-to-sequence problem, similar to some of the word-based projects we did in class.
Challenges: What has been the hardest part of the project you’ve encountered so far?
The hardest part of the project so far has been obtaining the data and building the preprocessing pipeline. Our original plan was to load the data through the TensorFlow dataset, but that dataset was corrupted. Instead, we found a JSON file whose URL links were dead but whose file names were correct; by appending those file names to a different base URL, we could eventually retrieve the photos. We also looked into downloading the data from the COCO website, but the site no longer works and the file downloads were dead. This stalled our development: we have only just found a way to access the data and have not yet been able to test our model. The next challenge is converting the data into usable images (i.e., resizing them to a common size and extracting bounding boxes). Building the model itself hasn't been that tough, but the challenges in getting the data have put us behind schedule.
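The URL repair described above can be sketched roughly as follows. This is a minimal illustration, not our actual code: `BASE_URL` and `build_image_urls` are hypothetical names, and the base URL shown is an assumption about where the images are actually served from.

```python
import json

# ASSUMPTION: a base URL that still serves the images. The links inside the
# annotation JSON were dead, but the file names in it were correct.
BASE_URL = "http://images.cocodataset.org/train2014/"

def build_image_urls(annotation_json):
    """Append each correct file name to the working base URL."""
    data = json.loads(annotation_json)
    return [BASE_URL + img["file_name"] for img in data["images"]]

# Example annotation fragment: a dead "coco_url" but a valid "file_name".
sample = json.dumps({
    "images": [
        {"file_name": "COCO_train2014_000000000009.jpg",
         "coco_url": "http://dead.example.com/000000000009.jpg"}
    ]
})

urls = build_image_urls(sample)
# urls[0] -> "http://images.cocodataset.org/train2014/COCO_train2014_000000000009.jpg"
```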
Insights: Are there any concrete results you can show at this point? How is your model performing compared with expectations?
We have no concrete results to show at this point. The model is built but has not been run yet; once it runs, the remaining work should mostly be tweaking, so we are quite close. The bulk of our work so far has gone into preprocessing, and we can show the pipeline that gets the data into our program and into the model, even though we cannot yet show the model's efficacy. We have two working preprocessing strategies: fetching 100 images at a time for a batch and discarding them when done, or downloading all of the images onto a computer and reading them as needed. The first method loads a batch of 100 images in about 10 seconds, so 80,000 images take roughly 2 hours of preprocessing every time we want to run the model (this is in old_preprocessing.py). The second option is much faster once the initial 2-hour download completes, but it takes around 30 GB of storage space (this is in preprocessing.py).
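The first strategy (fetch a batch, use it, discard it) can be sketched as below. This is a simplified illustration of the idea in old_preprocessing.py, not the file itself; `fetch` is a hypothetical stand-in for whatever download-and-resize call is actually used, injected as a parameter so the sketch works without network access.

```python
def batched(items, batch_size=100):
    """Yield successive fixed-size batches of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def stream_batches(urls, fetch, batch_size=100):
    """Fetch one batch of images at a time, so only ~batch_size images
    are ever held in memory; each batch is discarded after use."""
    for batch in batched(urls, batch_size):
        # `fetch` stands in for the real download-and-resize step.
        images = [fetch(url) for url in batch]
        yield images  # the caller consumes this batch, then it is freed

urls = [f"img{i}.jpg" for i in range(250)]
batches = list(stream_batches(urls, fetch=len))  # dummy fetch for the sketch
# 250 urls with batch_size=100 -> 3 batches of sizes 100, 100, 50
```

The alternative in preprocessing.py simply replaces `fetch` with a read from the local 30 GB cache, trading disk space for the per-run 2-hour download.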
Plan: Are you on track with your project? What do you need to dedicate more time to? What are you thinking of changing, if anything?
We are on track to finish the project by the deadline, but we need to dedicate more time to developing the model. The difficulties in downloading the data held us up for a long time, so data work has been the bulk of our effort so far. Tweaking the model and presenting its results will be the bulk of our work over the next week.