Topic modeling is an important approach to aid in understanding complex and diverse bodies of text. Today many topic modeling algorithms have been deployed to produce suggestions to users based on known their preferences. These models employ statistical methods that examine the frequency of words in a document in an attempt to extract the topics that compose the document.
We saw that topic modeling could be applied to new areas beyond traditional long-form text documents in order to improve user interaction with the abundance of data available. In this project we decided to apply topic modeling to a dataset of recipes in order to produce recommendations based on users' established recipe preferences.
Our approach is novel in that instead of using simple word frequency as input to our model, we developed a unique "bag-of-ingredients model" ( an adaptation of the bag-of-words model ) that contains core ingredients for each respective recipe with the relative amounts of each ingredients called for in the recipe. This is used as input to train our model. We believe that with such an approach we can discover common patterns in recipes to recommend new recipes that would taste similar and be appealing to the user.
What it does
The front-end web-app provides the user with an interface to our custom built database of recipes and their interrelationships, as determined by our model. The user can search for an existing recipe in our database of over 20,000 recipes and find full recipes with instructions and ingredients. In addition to the recipe, each query displays eight of the most similar recipes in the database as classified by our model. Users can then traverse the database of diverse recipes, discovering recipes that are interrelated in a unique way.
How we built it
The web-app revolves around the central recipe database we built from the original recipe dataset ( from kaggle ). Using Python we first scrapped the relevant recipe ingredient information from the dataset, transformed it to be input into the Latent Dirichlet allocation (LDA) algorithm to identify topics in the dataset, and then finally wrapped the final JSON Firebase database with all of the original data and the generated predictions to be accessed by the front-end web app developed with flask, react, and bootstrap.
In order to train the LDA model and construct the "bag-of-ingredients", a significant amount of pre-processing was performed on the dataset. All of the ~20,000 recipe's ingredients were parsed to extract their relative quantity within the recipe ( ex: 2 cups of .... ), as well as the 2,000 core ingredient names associated with these quantities that were manually classified from the entire corpus of recipes, in order to properly train the model. This processing included the use of several tools from the NLTK such as stop words and lemmatization.
LDA model training
With the bag-of-ingredients constructed and our associated dictionary of core ingredients, we used the LDA model provided in the NLTK Python library to train our model. We assume that each recipe contains a probability distribution of each topic. Recipes with similar probability distributions of ingredients, we then conclude are related and contain the same topics. After experimenting with the model hyperparameters to achieve what appeared to be sensible topics we transformed the model and applied it to our dataset using a k-nearest neighbors approach to determine the 8 most topically similar recipes for each recipe.
Front-end and database construction
A number of different web technologies were incorporated to develop the front-end web-app. The web-app has the ability to search our database of recipes and return relevant images alongside the displaced recipes, using data collected from the Bing Image Search API via Microsoft Azure.
Challenges we ran into
The two largest challenges we faced were
- Processing the selected dataset to build coherent and well-structured input for training of the LDA model
- Creating a polished web-app that was both performant and visually appealing
Due to the complexity of the dataset and the task of transforming the recipe data in our own bag-of-ingredients model, building the training dataset was one of the most time-consuming parts of the project. Extensive trial and error was required to determine the trade-off between performance and feasibility.
Combining multiple web technologies and frameworks allowed the functionality we desired for the web-app, but it made the development process somewhat slow and cumbersome. Further optimization of the structure could allow for better performance.
Accomplishments that we're proud of
The novelty of our project lies in the development and successful deployment of the bag-of-ingredients model to generate unique recipe recommendations using our trained LDA model. We found that from the diverse recipes included in our database, the model is quite good at creating cohesive groupings of recipes that have a clear interrelation.
What we learned
We anticipated that the data pre-processing step would be somewhat time consuming, but this project certainly demonstrated the effort involved in developing a sound dataset for training a machine learning model. This project furthered our familiarity and understanding with a number of Python, web, and database tools.
What's next for Recipe Revealer
There are certainly improvements to be made on the front-end of the web-app to provide a better user experience. Even though our model seems to work fairly well there is still room for improvement. In order to improve the LDA model to produce better recipe predictions we need to gather a larger database of recipes to train the model and incorporate the associated recipes into our database. This will allow for more diversity in recipe choices and will strength the prediction model when recommending new recipes to users.