Inverse Recipe Generation from Food Images
Team Members
- Michael Tu - mstu
- Rhea Goyal - rgoyal6
- Elijah Whang - eswhang
- Michael Sun - msun59
Check In 3
Final Write-up
Introduction
Have you ever seen a food picture on social media and wished you could make it yourself? It’s extremely hard for humans to be able to look at a picture of food and devise a recipe to recreate the dish. We are attempting to train a neural network to solve this problem. We will take in inputs of food images and output a step-by-step recipe to recreate the food in the picture. Using this model, we hope to make cooking more accessible for the average home cook, so that their cooking repertoire is not limited to the cookbooks they own, or the well-taught tutorials they can find online, but any picture of food on the Internet.
The goal of this paper is to create a model that can generate cooking instructions / a recipe based on an image of a dish and its ingredients. Specifically, the paper explores how different attention strategies allow for a model to learn from both images and text simultaneously. And, the paper focuses on different representations of ingredient lists to improve ingredient generation.
We chose this paper because it aligned with our interests for the final project (combining NLP and image processing), and it has good documentation in terms of code and theory. We also thought that the idea was particularly interesting, and could prove extremely useful.
There are several aspects of the project that involve different types of learning at different points. The main problem that this model tackles involves multi-label classification, with unsupervised learning in the feature extraction from the images. Once we know what ingredients are in each dish, the model will use a more structured prediction to generate cooking instructions based on the ingredients and the image of the final dish.
Related Work
The paper “Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images” introduced the Recipe1M+ dataset, a new large-scale, structured corpus of over one million cooking recipes and 13 million food images, that is used in the paper that we are re-implementing. Using this data, a neural network is trained to learn a joint embedding of recipes and images that yielded great results on an image-retrieval task. The paper also showed that regularization via the addition of high-level classification objectives improves retrieval performance and enables semantic vector arithmetic.
Other Public Implementations
Data
We are using the Yummly20K dataset that was curated by web scraping a cooking website. It has 27,638 recipes in total. Based on the preprocessing steps mentioned in the paper, we will likely use the same steps. One step is reducing the size of the ingredient vocabulary found in the dataset (i.e. by replacing various kinds of the same ingredient with just the core ingredient). We will need to add sentence markers for the beginning and end of recipes and individual instructions. We will also probably preprocess (crop) the dataset’s images.
Other Possible Datasets
Methodology
The two main parts of this model are ingredient prediction and then using the ingredients to generate cooking instructions. The first part will use an image encoder to encode the food images into a feature vector (the paper uses ResNet-50). The image features are then fed into a transformer-based model in the decoder block that predicts the set of ingredients used.
After this, the encoded image as well as the predicted ingredients (also encoded into embeddings) will be input into the second part, which predicts the instructions. The part responsible for this will be a transformer-based model that will make predictions based on concatenated embeddings of the image and ingredients.
The first step would involve training the ingredient prediction model, evaluated using binary cross-entropy loss, since a prediction is either correct or incorrect for a given ingredient. Once the model is trained, the cooking instruction generation model will be trained using the predicted ingredients and image features as input. The paper trained their model to minimize negative log-likelihood.
There are two novel methods that the paper uses that might be hard to implement. The first is the “set transformer” where the ingredient decoder attempts to predict a “set” of ingredients, instead of a list of ingredients. We need to be careful when implementing the loss function used to train this transformer. The second is conditioning instruction generation on image encodings and a set of ingredients. Since we have two inputs for the instructions transformer, we need to consider how to compute attention on both inputs and feed both inputs into the feed-forward layers.
Metrics
We could test our model against recipes scraped from other recipe websites (as the Yummly dataset only has recipes from Yummly) to measure the accuracy of our model’s ingredient list generation and overall recipe generation. It could also be interesting to conduct user studies like the ones described in the paper.
It does make sense to measure the accuracy of our model since the goal is to be able to generate an accurate list of ingredients and instructions for making a dish. So, we will be able to compare what our model generates with the ‘actual’ ingredients/recipe attached to an image. At the same time, it is important to consider that cooking isn’t a strict science – two recipes for the same dish could have different ingredients and instructions based on the person who created the recipe. So, we would only be measuring the accuracy of guessing the most likely ingredients and recipe.
The authors of the paper hoped to find that their model performed better than other related models for both ingredient prediction and recipe generation. The authors split their dataset and used it to test their models. They then compared the results of the models (including a human baseline system). The authors used several different metrics to evaluate the two transformers of the model. The paper used IoU (intersection over union) and F1 score to evaluate the performance of the ingredient prediction transformer and compared the ingredient prediction transformer with other related models that predict ingredients from images. Additionally, the paper used perplexity and precision & recall between the generated and the original instructions to evaluate the performance of the recipe generation transformer. Most crucially, the paper used a human survey to evaluate the generated recipes against human-made recipes.
Goals
- Base: re-implement the model outlined in the paper and have it train and validate properly on our dataset (without worrying about the accuracy)
- Target: successfully re-implement the model outlined in the paper with equal accuracy to that achieved in the paper
- Stretch: successfully re-implement the model outlined in the paper with better accuracy than that achieved in the paper
Ethics
Deep Learning is a good approach because it can be difficult if not impossible for humans to identify what ingredients make up a certain dish. Therefore, Deep Learning can assist in this task that humans may not be very capable of.
The major stakeholders would likely be anyone who has an interest in cooking. This could include the average person who cooks for themselves or maybe even chefs. Mistakes can make the food taste bad if the ingredients are incorrect, and someone were to try to recreate the dish.
Division of Labor
We will split the project into two main parts of the model: generating an ingredient list based on an image and generating a set of instructions based on an ingredient list.
- Ingredient transformer – Michael Sun and Rhea Goyal
- Instruction generator – Michael Tu and Elijah Whang
Built With
- keras
- python
- tensorflow
Log in or sign up for Devpost to join the conversation.