Final write-up https://docs.google.com/document/d/14Zo1c3AaoHkezfAbzESQSCKaawt2m4TOvm7zZ191Uzc/edit?usp=sharing

Visual question answering with recipes

Bojun Lin(blin26), Yichen Chai(ychai3), Peter Lyu(zlyu6)

1.Introduction In recent years, there have been some efforts to connect recipes and instructions with food images to let machine learning models learn about the relationship between a food image and its corresponding recipe, like the Recipe1M+ dataset (https://arxiv.org/pdf/1810.06553.pdf). There’s a new dataset called RecipeQA(https://arxiv.org/pdf/1809.00812.pdf) which combines the food-recipe task with the question-answering task, where the instructional recipes with images are provided, and the questions require a joint understanding of images and texts. We’re planning to build a model that uses NLP techniques like RNN/Transformer as well as CV techniques like ResNet to obtain information from the text and image respectively and use these pieces of information to answer the question, to achieve a multimodal comprehension of cooking-recipes on the RecipeQA dataset.

2.Relate Works There are a lot of researches trying to solve the similar food recipe dataset, for example, the Recipe1M+ dataset. The RecipeQA is similar to Recipe1M+ in that both datasets include texts(recipe) and images(food pictures), but RecipeQA is more complicated with the introduction of question answering task, and images for each steps in the recipes. Different researchers have created different model structures to deal with the recipe/food image joint understanding problem. We read the following research papers for inspiration: Learning Cross-modal Embeddings for Cooking Recipes and Food Images (http://pic2recipe.csail.mit.edu/im2recipe.pdf) Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images (http://pic2recipe.csail.mit.edu/tpami19.pdf) VQA: Visual Question Answering (https://arxiv.org/pdf/1505.00468v6.pdf) VideoBERT: A Joint Model for Video and Language Representation Learning (https://arxiv.org/pdf/1904.01766.pdf)

  1. Data As the introduction says, our project will train on the RecipeQA dataset. The dataset contains the recipes(texts), corresponding step pictures(images), and some query questions we wish our model can answer. The size of the RecipeQA dataset is 2.68GB and it is written in json format. Therefore we need to do preprocessing to extract the recipes and query questions from the dataset, and load the corresponding images in batches to form the data for training. There are four kinds of tasks in the dataset, textual cloze, visual cloze, visual coherence and visual ordering. We plan to tackle the last three tasks first, and if time permits, also the textual cloze task. The following image shows an example of the visual cloze task.

4.Methodology To make the model able to answer questions about both recipes and corresponding cooking images, we need to let our model deal with both natural languages and images. We will combine NLP techniques such as self-attention, long-short term memory and CV techniques, such as convolutions, pooling in our model. After processing natural languages and images separately, we can aggregate the results from them to make the predictions, so that the model can relate the steps described in recipes to the corresponding images and have a multimodal understanding of the problem. The following image shows a general architecture of visual question answering models. Our model will be based on this architecture, but we’ll also modify the structure and may integrate advanced techniques like VideoBERT into our model, to achieve better performance on the dataset.

  1. Metrics After training the model, we can ask the questions about recipe and give the model multiple images as potential answers. Let the model chooses the images mostly likely answer the questions. We can test accuracy of our model by the correctness of the answers. Since this is a new dataset, not a lot of researches has been done on this. The website () can give score to the performance of model. Since their baseline is 20, our base will be 20, and our traget goal is 30, the strech goal will above 30.

  2. Ethics What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain? Our dataset are recipes and food images. Since the recipes are mostly about food, we don’t think there will be privacy problems when collecting them. Since the dataset mostly are recipes for western foods or western styles for making foods. It might be not representative in different areas whose chefs have different cooking habits. And the dataset might have many recipes containing how to make meats, which will train the model unfriendly to vegetarians. Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm? I think the major stakeholders are the people who try to learning how to make specific dishes from this model. The mistakes produced by algorithm can cause serious problems if the model asks people to mix something not supposed to mixed. It might cause health problems.

  3. Division of labor: Since there are three different types of questions, we plan to train three different models separately so that each can deal with one single type of question. If time permits, we will try to merge three models to one.

Built With

Share this project: