Inspiration
While learning about nutrition in my high school health class, I realized that many products common in our home pantries are highly processed or have low nutritional value. However, it isn't easy to tell which products are healthy from long, complex nutrition labels. Even when you can, it is harder still to find objectively healthier alternatives with similar taste profiles.
Thus, when I came across the Open Food Facts dataset, I saw the opportunity to use ML to streamline this process for the average consumer in an easy, intuitive way.
What it does
cleanLabels is a web app featuring two main tools:
- Intelligent Swaps: Users search for a specific food product within our database to see its nutritional profile and Nutri-Score grade (A-E). They can then set a healthier 'target grade' (e.g., aiming for a C or better), and our AI will recommend the top 3 alternative products that share the most similar ingredient/nutritional profile but meet the target grade.
- Custom Recipe Predictor: In cases where a product is not within the database, this tool allows users to input raw ingredients and macronutrients to instantly see what our ML model would predict the Nutri-Score to be.
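The target-grade filter behind Intelligent Swaps can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the app's actual code: the function names `meets_target` and `recommend_swaps`, the tuple layout, and the candidate products are all hypothetical, and the similarity values are made up.

```python
GRADE_ORDER = "ABCDE"  # Nutri-Score grades from best (A) to worst (E)


def meets_target(grade: str, target: str) -> bool:
    """Return True if `grade` is at least as good as `target`."""
    return GRADE_ORDER.index(grade.upper()) <= GRADE_ORDER.index(target.upper())


def recommend_swaps(candidates, target_grade, top_n=3):
    """Pick the most similar candidates that satisfy the target grade.

    `candidates` is a list of (name, grade, similarity) tuples, where a
    higher similarity means a closer ingredient/nutritional profile.
    """
    eligible = [c for c in candidates if meets_target(c[1], target_grade)]
    eligible.sort(key=lambda c: c[2], reverse=True)
    return eligible[:top_n]


# Hypothetical candidates, invented for illustration only.
candidates = [
    ("Cereal W", "D", 0.97),  # most similar, but fails a target of C
    ("Cereal X", "C", 0.91),
    ("Cereal Y", "A", 0.74),
    ("Cereal Z", "B", 0.88),
    ("Cereal V", "C", 0.69),
]

swaps = recommend_swaps(candidates, target_grade="C")
print([name for name, _, _ in swaps])
```

Filtering before ranking means the closest match overall (Cereal W here) is skipped when it misses the target grade, so the user only ever sees products that are both similar and healthier.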
How we built it
- We built the frontend using the Streamlit library in Python.
- The data was sourced from the Open Food Facts dataset (from Kaggle), which we cleaned to focus on only US-sourced products for the purpose of this initial version. We only preserved ingredient text data and the FDA standard macronutrient columns for each entry to model real-world data.
- We then vectorized each entry using TF-IDF vectorization on the raw ingredient text alongside scaled numerical data for the macronutrients to build a multimodal vector space.
- We built our models with the scikit-learn library in Python. The first is a K-Nearest Neighbors (KNN) model that uses cosine similarity to find products with similar taste profiles (via ingredient and nutritional similarity). The second is a Random Forest (RF) regressor that infers health scores for novel products. On our evaluation data, the regressor achieved an MAE of 0.27, an RMSE of 0.56, and an R² of 0.9961, which suggests it can score products it has not seen before with high accuracy.
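The steps above could look roughly like the following sketch. It assumes scikit-learn's `TfidfVectorizer`, `StandardScaler`, `NearestNeighbors`, and `RandomForestRegressor`, stacked with `scipy.sparse.hstack`; the toy rows, the column choice, the target scores, and every hyperparameter are invented for illustration and are not taken from the project.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration; not rows from Open Food Facts.
ingredients = [
    "whole grain oats, sugar, salt",
    "corn, sugar, corn syrup, salt",
    "whole grain oats, almonds, honey",
    "wheat flour, sugar, palm oil, cocoa",
    "whole grain wheat, raisins, salt",
    "rice, sugar, salt, malt flavoring",
]
# Assumed columns: energy (kcal), fat (g), sugars (g), protein (g) per 100 g.
macros = np.array([
    [380, 7.0, 5.0, 13.0],
    [390, 1.0, 35.0, 5.0],
    [410, 12.0, 15.0, 11.0],
    [480, 22.0, 38.0, 6.0],
    [340, 2.0, 18.0, 10.0],
    [385, 1.0, 10.0, 7.0],
])
scores = np.array([-2.0, 14.0, 3.0, 22.0, 1.0, 9.0])  # made-up health scores

# Text and numeric features are built separately, then stacked column-wise.
tfidf = TfidfVectorizer()
scaler = StandardScaler()
text_features = tfidf.fit_transform(ingredients)
numeric_features = csr_matrix(scaler.fit_transform(macros))
X = hstack([text_features, numeric_features]).tocsr()

# Model 1: nearest neighbors under cosine distance (1 - cosine similarity).
knn = NearestNeighbors(n_neighbors=3, metric="cosine", algorithm="brute")
knn.fit(X)
distances, indices = knn.kneighbors(X[0])
print("Neighbors of product 0:", indices[0])

# Model 2: random forest regressor for the health score.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, scores)
predictions = forest.predict(X)

# Scored on the training rows only to show the metric calls;
# a real evaluation needs a held-out split.
print("MAE:", mean_absolute_error(scores, predictions))
print("R^2:", r2_score(scores, predictions))
```

Scaling the macronutrient columns before stacking keeps large raw values such as energy from swamping the TF-IDF weights, which all lie between 0 and 1.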
Challenges we ran into
- Our initial Random Forest model was too large (over 160 MB) because of the number of decision trees it contained, so we couldn't push it to GitHub for deployment. We used joblib compression and restructured the model to shrink the file without a noticeable loss in accuracy.
- The dataset follows European conventions, such as reporting energy in kJ rather than kcal (the US standard), so we had to convert units consistently when training the models and building the frontend.
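Both challenges can be illustrated with a short sketch. The unit helpers use the standard definition of 1 kcal = 4.184 kJ. The compression part uses the `compress` argument of `joblib.dump`; the synthetic data, the forest size, the compression level, and the file names are assumptions for the demo and not the project's real settings.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

KJ_PER_KCAL = 4.184  # 1 kcal is defined as 4.184 kJ


def kj_to_kcal(energy_kj: float) -> float:
    """Convert energy from kilojoules to kilocalories."""
    return energy_kj / KJ_PER_KCAL


def kcal_to_kj(energy_kcal: float) -> float:
    """Convert energy from kilocalories to kilojoules."""
    return energy_kcal * KJ_PER_KCAL


print(round(kj_to_kcal(1000), 1))  # 239.0

# Synthetic regression data, used only to get a forest worth compressing.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "model_raw.joblib")
    small_path = os.path.join(tmp, "model_small.joblib")
    joblib.dump(model, raw_path)                # no compression
    joblib.dump(model, small_path, compress=3)  # zlib, level 3
    raw_size = os.path.getsize(raw_path)
    small_size = os.path.getsize(small_path)
    restored = joblib.load(small_path)

print("uncompressed bytes:", raw_size)
print("compressed bytes:", small_size)
```

Compression is lossless, so the restored model predicts exactly as before. Restructuring the forest itself (for example fewer or shallower trees) is the step that trades size against accuracy and needs re-evaluation.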
Accomplishments that we're proud of
- Successfully combining text and numerical data into a single, cohesive data pipeline.
- Building an intuitive frontend that allows users to tailor their results for their specific goals.
What we learned
- TF-IDF vectorization keeps generic ingredients (like water or salt) from dominating the similarity search and reducing its accuracy, so the search can focus on the distinctive ingredients that better capture similar taste profiles.
- Raw numerical output (such as a floating-point predicted score for a novel product) is not very helpful on its own. It is important to invest in the frontend so that the experience is streamlined and has as little friction as possible for a potential user.
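The TF-IDF point can be made concrete with a small worked example. The sketch below computes the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which is the formula scikit-learn's `TfidfVectorizer` applies by default. The four ingredient lists are invented, and `smoothed_idf` is a hypothetical helper written for this illustration.

```python
import math

# Invented ingredient lists for illustration.
products = [
    ["water", "salt", "tomato"],
    ["water", "salt", "oats"],
    ["water", "sugar", "cocoa"],
    ["water", "salt", "basil"],
]


def smoothed_idf(term, docs):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)  # documents containing the term
    return math.log((1 + n) / (1 + df)) + 1


for term in ["water", "salt", "cocoa"]:
    print(term, round(smoothed_idf(term, products), 3))
# water 1.0
# salt 1.223
# cocoa 1.916
```

An ingredient found in every product (water) gets the minimum weight, while one found in a single product (cocoa) is weighted almost twice as heavily, so shared distinctive ingredients contribute more to cosine similarity than shared generic ones.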
What's next for cleanLabels
We would like to integrate a computer vision (CV) component so that, instead of entering ingredient and nutritional information manually (which can be arduous), users can simply upload a picture of a nutrition label.
Built With
- kaggle
- pandas
- scikit-learn
- streamlit