Inverse Recipe Generation from Food Images

Team Members

  • Michael Tu - mstu
  • Rhea Goyal - rgoyal6
  • Elijah Whang - eswhang
  • Michael Sun - msun59

Check In 3

Link

Final Write-up

Link

Introduction

Have you ever seen a food picture on social media and wished you could make it yourself? It is hard for most people to look at a picture of food and work out a recipe that recreates the dish. We are training a neural network to solve this problem: it takes a food image as input and outputs a step-by-step recipe for the food in the picture. With this model, we hope to make cooking more accessible for the average home cook, so that their repertoire is no longer limited to the cookbooks they own or the tutorials they can find online, but extends to any picture of food on the Internet.

The goal of the paper we are re-implementing is to create a model that generates a recipe (cooking instructions) from an image of a dish and its ingredients. Specifically, the paper explores how different attention strategies allow a model to learn from images and text simultaneously. It also examines different representations of ingredient lists to improve ingredient generation.

We chose this paper because it aligned with our interests for the final project (combining NLP and image processing), and it has good documentation in terms of code and theory. We also thought that the idea was particularly interesting, and could prove extremely useful.

The project involves different types of learning at different stages. The main problem the model tackles is multi-label classification of ingredients, with unsupervised learning in the feature extraction from the images. Once the ingredients of a dish are known, the model uses structured prediction to generate cooking instructions from the ingredients and the image of the final dish.

Related Work

The paper “Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images” introduced the Recipe1M+ dataset, a large-scale, structured corpus of over one million cooking recipes and 13 million food images. This is the dataset used in the paper that we are re-implementing. Using this data, the authors trained a neural network to learn a joint embedding of recipes and images, which yielded strong results on an image-retrieval task. The paper also showed that regularization via the addition of a high-level classification objective improves retrieval performance and enables semantic vector arithmetic.

Other Public Implementations

Data

We are using the Yummly-28K dataset, which was curated by scraping a cooking website. It has 27,638 recipes in total. We will likely follow the preprocessing steps described in the paper. One step is reducing the size of the ingredient vocabulary found in the dataset (i.e. replacing variants of the same ingredient with the core ingredient). We will also need to add markers for the beginning and end of each recipe and of each individual instruction. Finally, we will probably preprocess (crop) the dataset’s images.
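The two text preprocessing steps above could be sketched roughly as follows. This is only an illustration: the variant-to-core mapping and the marker tokens (`<start>`, `<eoi>`, `<end>`) are made-up placeholders, not the paper's actual vocabulary or tokens.

```python
import re

# Hypothetical mapping from ingredient variants to a core ingredient.
# A real mapping would be derived from the dataset's vocabulary.
CORE_INGREDIENTS = {
    "extra virgin olive oil": "olive oil",
    "unsalted butter": "butter",
    "kosher salt": "salt",
}

# Hypothetical marker tokens; the paper's actual tokens may differ.
START, END_STEP, END = "<start>", "<eoi>", "<end>"


def normalize_ingredient(name):
    """Lowercase, collapse whitespace, and map a variant to its core ingredient."""
    cleaned = re.sub(r"\s+", " ", name.strip().lower())
    return CORE_INGREDIENTS.get(cleaned, cleaned)


def tokenize_recipe(instructions):
    """Wrap a list of instruction strings in recipe and instruction markers.

    Each instruction is followed by END_STEP, and the whole recipe is
    bracketed by START and END.
    """
    tokens = [START]
    for step in instructions:
        tokens.extend(step.lower().split())
        tokens.append(END_STEP)
    tokens.append(END)
    return tokens


# Three raw names collapse to two core ingredients.
ingredients = sorted({normalize_ingredient(i)
                      for i in ["Unsalted  Butter", "butter", "Kosher salt"]})
tokens = tokenize_recipe(["Melt the butter.", "Add salt."])
```

Collapsing variants shrinks the label space for ingredient prediction, and the markers let the instruction decoder learn where each step and the whole recipe end.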

Other Possible Datasets

Methodology

The model has two main parts: ingredient prediction, and generation of cooking instructions from the predicted ingredients. The first part uses an image encoder (the paper uses ResNet-50) to encode the food image into a feature vector. The image features are then fed into a transformer-based decoder that predicts the set of ingredients used.

Next, the encoded image and the predicted ingredients (also encoded as embeddings) are passed to the second part, a transformer-based model that predicts the instructions from the concatenated image and ingredient embeddings.

The first step is to train the ingredient prediction model with a binary cross-entropy loss, since each ingredient is either present in the dish or not. Once that model is trained, the instruction generation model is trained using the predicted ingredients and image features as input. The paper trains this model to minimize negative log-likelihood.
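As a rough illustration of the two training objectives, here is a dependency-free sketch of plain binary cross-entropy over ingredient slots and of mean negative log-likelihood over the ground-truth instruction tokens. It assumes the models already output probabilities, and it leaves out any extra terms the paper's losses may include. The function names are our own.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all ingredient slots.

    probs[i] is the predicted probability that ingredient i is present;
    targets[i] is 1 if it really is present, else 0.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def negative_log_likelihood(token_probs):
    """Mean NLL, given the probability the model assigned to each true token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Toy values: three ingredient slots and a two-token instruction.
bce = binary_cross_entropy([0.9, 0.2, 0.7], [1, 0, 1])
nll = negative_log_likelihood([0.5, 0.5])
```

In a real implementation, these would be replaced by the framework's built-in losses applied to logits, which is numerically safer than working with probabilities directly.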

The paper uses two novel methods that might be hard to implement. The first is the “set transformer”, where the ingredient decoder predicts a “set” of ingredients instead of an ordered list. We need to be careful when implementing the loss function used to train this transformer. The second is conditioning instruction generation on both the image encoding and the set of ingredients. Since the instruction transformer has two inputs, we need to consider how to compute attention over both and how to feed both into the feed-forward layers.
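One way to attend over both inputs is to concatenate the image features and ingredient embeddings along the sequence axis and run a single attention over the result. The sketch below shows only that idea, using plain scaled dot-product attention with no learned projections, a single head, and made-up toy vectors. It is not the paper's implementation, and it assumes both inputs share the same embedding dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over memory rows.

    Keys and values are both the memory rows themselves (no learned
    projections), which keeps the sketch short.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, row)) / scale for row in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    context = [sum(w * row[d] for w, row in zip(weights, memory))
               for d in range(dim)]
    return context, weights


# Toy embeddings with a shared dimension of 2 (values are made up).
image_features = [[1.0, 0.0], [0.5, 0.5]]   # e.g. two image regions
ingredient_embeddings = [[0.0, 1.0]]        # e.g. one predicted ingredient

# Concatenate along the sequence axis so one attention call sees both inputs.
memory = image_features + ingredient_embeddings
context, weights = attend([1.0, 0.0], memory)
```

The attention weights form one distribution over image positions and ingredient positions together, so the decoder can decide at each step how much to rely on each modality.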

Metrics

We could test our model against recipes scraped from other recipe websites (since our dataset only has recipes from Yummly) to measure the accuracy of its ingredient list generation and overall recipe generation. It could also be interesting to conduct user studies like the ones described in the paper.

Measuring accuracy makes sense for our model, since the goal is to generate an accurate list of ingredients and instructions for making a dish. We can compare what our model generates with the ‘actual’ ingredients and recipe attached to an image. At the same time, cooking isn’t a strict science: two recipes for the same dish could have different ingredients and instructions depending on who wrote them. So, we would only be measuring how well the model guesses the most likely ingredients and recipe.

The authors of the paper hoped to find that their model performed better than related models on both ingredient prediction and recipe generation. They split their dataset, tested their models on the held-out portion, and compared the results across models (including a human baseline). Different metrics were used for the two transformers. For the ingredient prediction transformer, the paper used IoU (intersection over union) and F1 score, and compared against other models that predict ingredients from images. For the recipe generation transformer, the paper used perplexity, as well as precision and recall between the generated and the original instructions. Most crucially, the paper used a human survey to evaluate the generated recipes against human-made recipes.
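For reference, the automatic metrics above can be sketched in a few lines. This version computes IoU and F1 for a single pair of ingredient sets and perplexity for a single token sequence. The paper may aggregate these over the dataset differently, so treat this as a per-sample illustration with function names of our own choosing.

```python
import math


def iou(predicted, actual):
    """Intersection over union of two ingredient sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    return len(predicted & actual) / len(union) if union else 1.0


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall over two ingredient sets."""
    predicted, actual = set(predicted), set(actual)
    overlap = len(predicted & actual)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(actual)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the ground-truth tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


predicted = {"flour", "sugar", "egg"}
actual = {"flour", "sugar", "butter"}
sample_iou = iou(predicted, actual)   # 2 shared / 4 total
sample_f1 = f1_score(predicted, actual)
sample_ppl = perplexity([0.25, 0.25])
```

In this toy example the two sets share two of four distinct ingredients, so the IoU is 0.5. Precision and recall are both 2/3, so the F1 score is also 2/3.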

Goals

  • Base: re-implement the model outlined in the paper and have it train and validate properly on our dataset (without worrying about the accuracy)
  • Target: successfully re-implement the model outlined in the paper with equal accuracy to that achieved in the paper
  • Stretch: successfully re-implement the model outlined in the paper with better accuracy than that achieved in the paper

Ethics

Deep learning is a good approach because it can be difficult, if not impossible, for humans to identify which ingredients make up a certain dish. Deep learning can therefore assist with a task that humans are not well equipped for.

The major stakeholders would likely be anyone who has an interest in cooking. This could include the average person who cooks for themselves, or even professional chefs. If the predicted ingredients are incorrect and someone tries to recreate the dish, the food may taste bad.

Division of Labor

We will split the project into two main parts of the model: generating an ingredient list based on an image and generating a set of instructions based on an ingredient list.

  • Ingredient transformer – Michael Sun and Rhea Goyal
  • Instruction generator – Michael Tu and Elijah Whang

Updates

CHECK IN 3

Challenges

So far, the hardest part has been determining which aspects of the previous PyTorch implementation we should keep and convert to TensorFlow, and which parts are extraneous. In converting some of the classes and functions to TensorFlow, we have had to make decisions that we cannot yet verify in the context of the entire project. Converting some of the PyTorch methods has also been challenging: many are used throughout the code, and several are substantial enough to require a lot of research into their TensorFlow equivalents. This likely means we will have to dedicate more time in the future to putting the different pieces of the project together.

Additionally, because our dataset is different from the one used in the paper, we needed to load the data from a CSV into the correct Python data structures before applying pre-processing functions similar to those of the original paper. This required us to read it into a pandas DataFrame, whereas the original paper used the Python json library to read data from their dataset’s JSON files.
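The loading step might look roughly like the sketch below. To keep it dependency-free, it uses the standard-library `csv` module as a stand-in for pandas (with pandas, `pandas.read_csv` would play the same role). The column names, the `|` ingredient delimiter, and the sample rows are all hypothetical and do not reflect the actual layout of our CSV.

```python
import csv
import io

# Hypothetical CSV layout; the real file's columns may differ.
SAMPLE_CSV = """name,ingredients,image_path
Pancakes,flour|milk|egg,images/0001.jpg
Omelette,egg|butter,images/0002.jpg
"""


def load_recipes(handle):
    """Read recipe rows into dicts, splitting the ingredient field into a list."""
    recipes = []
    for row in csv.DictReader(handle):
        recipes.append({
            "name": row["name"],
            "ingredients": row["ingredients"].split("|"),
            "image_path": row["image_path"],
        })
    return recipes


# A real run would pass an open file handle instead of an in-memory string.
recipes = load_recipes(io.StringIO(SAMPLE_CSV))
```

Once each recipe is a plain dict with a list of ingredients, the same pre-processing functions can be applied regardless of whether the data originally came from CSV or JSON.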

Insights

We do not have any results yet, as we are still re-implementing the model. We have finished converting the preprocessing and some modules of the model architecture.

Plan

At the moment, we still need to implement the training loop of our model and put the two components of the model together. Since we have been working on different parts of the model separately, we expect that making the parts compatible with each other will take some tweaking once they are individually done. We will also likely need to dedicate more time to testing our implementation once it’s complete, as there could be bugs that only surface when we try to train the model.
