Title
AutoChef: turning fridge ingredients into recipes
Who
Names (login): Shiyu Yan (syan43); Paul Zhu (zpaul2); Yiqun Zhang (zyiqun); Yuhan Mao (myuhan).
Introduction
Our goal is to help the many people who buy raw ingredients but struggle to turn them into finished dishes. This project develops a deep learning model that generates detailed, easy-to-follow recipes from images of raw ingredients. Our motivation stems from the growing interest in home cooking, which digital tools can support. The project falls under structured prediction in machine learning, combining image recognition with a language model to produce recipes from visual input.
Related Work
Cooking recipes are semi-structured text data containing recipe titles, ingredients, and cooking instructions. Thanks to recent advances in natural language generation, language models trained on large-scale data can produce fluent, human-readable text. The paper Ratatouille: A Tool for Novel Recipe Generation explores using LSTM and GPT-2 (Transformer-based) models to generate cooking recipes. The researchers initially trained LSTM models on the RecipeDB dataset for character-level and word-level generation. However, because these models mainly capture structural features rather than the detailed relationships between ingredients and instructions, the researchers later turned to GPT-2, a Transformer-based language model, which showed better performance in terms of BLEU score. This work sits within the broader line of research on text generation with deep neural networks, and is novel in its specific application to recipe generation and its focus on the relationship between ingredients and recipes.
Data
The first dataset we use is the Food Ingredients and Recipes Dataset with Images from Kaggle, which its creator compiled from the Epicurious website.
The dataset consists of a CSV file and a compressed image archive, together holding 13,582 rows of data and their corresponding images, with a total size of 216 MB. The CSV file has five columns: Title (the name of the dish), Ingredients (its raw ingredient list), Instructions (the cooking instructions for the dish), Image_Name (the corresponding image file name in the archive), and Cleaned_Ingredients (the processed ingredient list). The archive contains one image per CSV row, named according to Image_Name.
The second dataset we use is Food.com Recipes with Search Terms and Tags, also from Kaggle. It contains 500,000 recipes, is 242 MB in size, and is sourced mainly from user-uploaded content on the Food.com platform. Each entry includes the recipe's metadata along with the search terms and tags users assigned to it. We chose this dataset because it contains complete recipe instructions, which is very helpful for training our recipe-generation language model and thus for delivering a better experience to our users.
Methodology
The core of our architecture consists of two main components: a pre-trained Convolutional Neural Network (CNN) for detecting ingredients in images, and a language model for generating the corresponding recipes.
Step 1: Image recognition using CNN:
We utilize YOLOv8, a pre-trained CNN-based object detection model. The model is fine-tuned on our dataset, which consists of a variety of food images, so that it accurately recognizes dishes and their ingredients. The CNN is the first step in our workflow: it extracts key visual features and ingredient information from the input image.
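The fine-tuning and detection step can be sketched as follows, assuming the `ultralytics` package; the dataset config name `food.yaml` and image path are hypothetical placeholders, and the helper that deduplicates detected class names is our own.

```python
def unique_ingredients(class_names, class_ids):
    """Map detected class ids to a deduplicated, order-preserving ingredient list."""
    seen = []
    for cid in class_ids:
        name = class_names[int(cid)]
        if name not in seen:
            seen.append(name)
    return seen

if __name__ == "__main__":
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")                 # pre-trained YOLOv8 weights
    model.train(data="food.yaml", epochs=20)   # fine-tune on our food images
    results = model("fridge_photo.jpg")        # run detection on one image
    for r in results:
        # r.names maps class ids to labels; r.boxes.cls holds detected ids
        print(unique_ingredients(r.names, r.boxes.cls))
```

The deduplicated ingredient list is what gets passed downstream to the recipe generator.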
Step 2: Generating Recipes with Transformers
Model Architecture: We use the Transformer because of its ability to handle sequential data and maintain context over long text sequences. Its architecture is built around a self-attention mechanism that weighs the importance of different tokens in the recipe text, allowing it to generate more coherent and context-aware recipes.
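The self-attention weighting described above can be illustrated with a minimal NumPy sketch of single-head scaled dot-product attention; the function and variable names are our own, not from any framework.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # pairwise token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ v, weights                       # weighted mix of values

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                           # 5 tokens, 8-dim embeddings
out, w = self_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
```

Each row of `w` sums to 1, so every output token is a convex combination of the value vectors, weighted by how relevant the other tokens are to it.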
Data Tokenization and Embedding: Prior to training, recipe text is tokenized at the character level and embeddings are generated to uniquely represent each character. This fine-grained approach helps the model learn the deep structure of the cooking language.
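A minimal sketch of the character-level tokenization step, in plain Python with illustrative function names; each integer id would then index a row of a learned embedding matrix.

```python
def build_char_vocab(texts):
    """Map every character appearing in the corpus to a unique integer id."""
    chars = sorted(set("".join(texts)))
    stoi = {ch: i for i, ch in enumerate(chars)}   # string -> id
    itos = {i: ch for ch, i in stoi.items()}       # id -> string
    return stoi, itos

def encode(text, stoi):
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    return "".join(itos[i] for i in ids)

stoi, itos = build_char_vocab(["mix eggs", "add salt"])
ids = encode("eggs", stoi)
assert decode(ids, itos) == "eggs"   # encoding round-trips losslessly
```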
Integration with image recognition: The output of the CNN (i.e., the recognized ingredients and dish) serves as the input to the Transformer. This connection is critical because it grounds the generated text in the visual data observed by the CNN.
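One way to realize this connection is to serialize the CNN output into a conditioning prompt for the Transformer; the exact prompt format below is our own assumption, not a fixed part of the design.

```python
def build_prompt(ingredients, dish=None):
    """Serialize detected ingredients (and an optional dish name) into the
    text prompt that conditions recipe generation."""
    header = f"Dish: {dish}\n" if dish else ""
    return header + "Ingredients: " + ", ".join(ingredients) + "\nRecipe:"

prompt = build_prompt(["egg", "tomato", "scallion"], dish="tomato omelette")
```

The generator then continues the text after `Recipe:`, so everything it writes is conditioned on what the CNN actually detected.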
Training process: The Transformer is trained on pairs of inputs (CNN outputs) and target sequences (the corresponding recipes). During training, the model learns to predict the next character in the sequence, gradually improving its ability to generate full recipes.
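The next-character objective amounts to building (input, target) pairs where the target is the input shifted one position to the right; a minimal sketch, with our own helper name and a hypothetical `block_size`:

```python
def make_training_pairs(token_ids, block_size):
    """Split an encoded recipe into (input, target) windows where the
    target is the input shifted one character to the right."""
    pairs = []
    for i in range(len(token_ids) - block_size):
        x = token_ids[i : i + block_size]          # what the model sees
        y = token_ids[i + 1 : i + block_size + 1]  # what it must predict
        pairs.append((x, y))
    return pairs

pairs = make_training_pairs([0, 1, 2, 3, 4], block_size=3)
# pairs[0] is ([0, 1, 2], [1, 2, 3]): each target character is the
# character that follows its input position
```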
Backup strategy:
If the initial results are unsatisfactory, we plan to explore alternative model architectures and training methods. One possible adjustment is generating text with an LSTM model, which may have advantages in capturing the global structure of a recipe.
Metrics
Definition of Success: Success refers to the ability of a model to generate accurate, coherent recipes from dish images, with a focus on correct ingredient recognition and sound recipe structure.
Accuracy: Accuracy alone does not fully reflect the efficiency of recipe generation. We will use other metrics for a comprehensive evaluation.
Experimentation:
Ingredient Detection Accuracy: precision, recall, and F1 score to evaluate the detection model's ability to recognize ingredients.
Recipe Generation Quality: BLEU and ROUGE scores to measure n-gram overlap between generated recipes and reference recipes.
User Feedback: user studies to evaluate the usefulness, coherence, and clarity of the generated recipes.
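The precision/recall/F1 evaluation for ingredient detection can be computed per image by comparing the predicted and reference ingredient sets; a minimal sketch with our own function name:

```python
def ingredient_prf(predicted, reference):
    """Precision, recall, and F1 for one image's detected ingredients."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)                              # correctly detected
    precision = tp / len(pred) if pred else 0.0       # of predictions, how many right
    recall = tp / len(ref) if ref else 0.0            # of true ingredients, how many found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)             # harmonic mean of the two
    return precision, recall, f1

p, r, f1 = ingredient_prf(["egg", "milk", "basil"], ["egg", "milk", "flour"])
# 2 of 3 predictions are correct and 2 of 3 true ingredients are found,
# so precision = recall = f1 = 2/3
```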
Basic goals:
The Transformer generates English sentences with basic logical structure, and the generated text contains the names of some ingredients.
Target goals:
We achieve an F1 score of 70% or higher and generate grammatically correct recipes. We also aim for at least 5% of generated outputs to correspond to recognizable dishes.
Extended goals:
We achieve an F1 score of 80% or higher, and the generated recipes' BLEU/ROUGE scores closely match those of the reference recipes, showing strong coherence and user satisfaction.
Ethics
Why is Deep Learning a good approach to this problem?
Our final project comprises two stages: image recognition and text generation, both of which are well-suited to deep learning models. Specifically, CNNs excel in image recognition tasks, as they can automatically and efficiently capture spatial hierarchies in images by learning filters that recognize patterns, such as edges and textures, essential for distinguishing different objects and features within an image. Meanwhile, transformers, with their attention mechanisms, excel at understanding context and generating coherent, contextually relevant text. Thus, deep learning is the ideal approach for solving the problem we aim to address.
How are you planning to quantify or measure error or success? What implications does your quantification have?
We plan to use the F1 score as our measure of success because it strikes a balance between precision, which measures the accuracy of positive predictions, and recall, which assesses the model's ability to identify all actual positive cases. Consequently, it serves as a well-rounded metric of success. However, a high F1 score doesn’t guarantee a natural and easy-to-understand response. Therefore, we might consider adding a step to optimize our output.
Division of Labor
Paul Zhu is responsible for data collection and pre-processing. Shiyu Yan is responsible for word embedding and model architecture. Yuhan Mao is responsible for model tuning and fine-tuning experiments. Yiqun Zhang is responsible for multi-object detection with YOLOv8.
Check-In 3 Update Reflection
Reflection: Reflection
Slides
Slides: Slides
Final Report
Final Report: Report
Github Link
Github Link: Github
Built With
- python
- tensorflow