
Introduction: We are implementing the model from the paper “Inverse Cooking: Recipe Generation from Food Images”, which takes an image of food as input and generates a novel set of ingredients, as well as cooking instructions based on those ingredients. We chose this project because it seemed like an interesting and novel way to combine different aspects of deep learning that we had learned about, such as transformers, language models, and image models.

Challenges: So far our greatest challenge has been obtaining a good dataset. We have been unable to access the dataset we intended to use, Recipe1M, due to a technical glitch on its website, so for now we are training our architecture on a smaller dataset that lacks instructions, available at http://www.ub.edu/cvub/recipes5k/. This dataset consists of approximately 5,000 food images with corresponding ingredient lists. This is a challenging situation because higher accuracy requires more data, which means it will be very tough to reach the metrics achieved by the original researchers, who trained on Recipe1M.
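To sanity-check the new dataset before building the model, here is a minimal loading sketch. The layout it assumes is hypothetical (each image file paired with a same-named .txt file listing one ingredient per line); the actual Recipes5k layout may differ and the paths would need adjusting once the download is inspected.

```python
from pathlib import Path

def load_image_ingredient_pairs(root):
    """Pair each image with its ingredient list.

    Assumed (hypothetical) layout: images under root/images/, each with a
    sibling .txt file listing one ingredient per line. Adjust once the
    actual Recipes5k download is inspected.
    """
    samples = []
    for img_path in Path(root).glob("images/**/*.jpg"):
        ing_path = img_path.with_suffix(".txt")
        if not ing_path.exists():
            continue
        # Keep non-empty lines as the ingredient list for this image.
        ingredients = [ln.strip() for ln in ing_path.read_text().splitlines()
                       if ln.strip()]
        samples.append((img_path, ingredients))
    return samples
```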

We plan to email the researchers who built the Recipe1M dataset to obtain the full dataset required for this project. Either way, we can start working on the first part, where we predict ingredients from an image.

Insights: We haven’t implemented our architecture yet, but we wrote the initial preprocessing code that builds the ingredient vocabulary. We also thoroughly outlined our model, discussing how we go from the preprocessed data to the output of the set transformer that decodes the ingredients, producing the list of ingredients for each image. We got into the details of the transformer, which will be tailored to the task of classifying ingredients. This includes max-pooling across time steps so the model is not penalized for ingredient order: the order of ingredients in a recipe should not matter, whereas an autoregressive model like the transformer is sensitive to order, as with the sequence of words in a sentence. It also includes learning when the model should stop predicting ingredients for a given image. So far, we are very clear about how the ingredient decoder will work; both ideas are sketched below.
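As a concrete illustration of the preprocessing step, here is a minimal vocabulary-building sketch. The function name, special tokens, and frequency cutoff are our own placeholders, not fixed design decisions:

```python
from collections import Counter

def build_ingredient_vocab(recipes, min_freq=2):
    """Build a token->index vocabulary over all ingredients in the dataset.

    `recipes` is assumed to be an iterable of ingredient lists, e.g.
    [["flour", "sugar"], ["egg", "flour"], ...].
    """
    counts = Counter(ing for recipe in recipes for ing in recipe)
    # Reserve special tokens: <pad> for padding, <eos> so the decoder can
    # learn when to stop predicting ingredients for a given image.
    vocab = {"<pad>": 0, "<eos>": 1}
    for ing, freq in counts.most_common():
        if freq >= min_freq:  # drop rare ingredients to keep the vocab small
            vocab[ing] = len(vocab)
    return vocab
```

And here is a rough PyTorch sketch of the order-invariance idea: per-step logits from a transformer decoder are max-pooled over the time dimension, so the loss sees a set rather than a sequence. Layer sizes and the BCE pairing are assumptions for illustration, not the paper’s exact configuration:

```python
import torch
import torch.nn as nn

class IngredientSetDecoder(nn.Module):
    """Sketch: transformer decoder whose per-step logits are max-pooled
    over time so the training loss is insensitive to ingredient order."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, prev_tokens, image_features):
        # prev_tokens: (batch, steps) ids of previously predicted ingredients
        # image_features: (batch, regions, d_model) from the image encoder
        x = self.embed(prev_tokens)
        steps = x.size(1)
        # Standard causal mask so step t only attends to steps <= t.
        causal = torch.triu(
            torch.full((steps, steps), float("-inf"), device=x.device),
            diagonal=1,
        )
        h = self.decoder(x, image_features, tgt_mask=causal)
        step_logits = self.to_logits(h)        # (batch, steps, vocab)
        # Max-pool over time: each ingredient keeps its strongest activation
        # at any step, so permuting the target order leaves the pooled
        # prediction unchanged.
        return step_logits.max(dim=1).values   # (batch, vocab)

# With a multi-hot target over the ingredient vocabulary, binary
# cross-entropy is a natural fit for the pooled set prediction:
# loss = nn.BCEWithLogitsLoss()(model(tokens, feats), multi_hot_targets)
```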

Plan: Since we have basic preprocessing done and have thoroughly detailed the architecture, we are on track. One major change is that we might only be able to implement the first part of our originally proposed project: predicting the ingredients from the image, rather than both the ingredients and the instructions. Upon further review, we realized that implementing the set transformer that predicts the ingredients is a substantial task in itself, and if we cannot find the data for recipe instructions, we plan to optimize our set transformer as much as possible to get the best results on our dataset. Our current plan is to implement the ingredient decoder and finish the first part of the project; given the dataset issue, this is now our target goal. Once that is done, if we have the Recipe1M data, we will try to implement the instruction decoder (which should be similar to the ingredient decoder). If we don’t have access to the data, we might write our own scraper to gather it, which is now our stretch goal, since it involves significant data-retrieval code.
