Gallery captions:
- Page through the gallery to play VISION MAMAAAAA!!
- Read our detailed Devpost writeup for more!
- We ran experiments on fine-tuning a 7B model to write recipes!
- We pretrained a tiny language model that is 4,375x smaller than GPT-3. And it worked kinda OK!!
- Here's the training loss curve for the LLM, if you like that kind of thing! :)
- No Vision Pro? Check out our web-based AI agent at recipes.reflex.run
- We use the SERP API and a neat optimization algorithm to find the cheapest ingredients for your recipe with minimal stores visited.
- Click through our slideshow for a neat overview, and check out our demo video!
The future of computing 🍎 👓 ⚙️ 🤖 🍳 👩🍳
How could Mixed Reality, Spatial Computing, and Generative AI transform our lives? And what happens when you combine Vision Pro and AI? (spoiler: magic! 🔮)
Our goal was to create an interactive visionOS app 🍎 powered by AI. While our app could be applied to many things (like math tutoring, travel planning, etc.), we decided to make the demo use case fun.
We loved playing the game Cooking Mama 👩🍳 as kids, so we made a voice-activated conversational AI agent that teaches you to cook healthy meals, invents recipes based on your preferences, and helps you find and order ingredients.
Overall, we wanted to demonstrate how the latest tech advances could transform our lives. Food is one of the most important basic needs, so we felt it was an interesting topic. Additionally, many people struggle with nutrition, so our project could help people eat healthier foods and live better, longer lives.
What we created
- Conversational Vision Pro app that lets you talk to an AI nutritionist who speaks back to you in a realistic voice with low latency.
- Built-in AI agent that creates a custom recipe according to your preferences, identifies the most efficient and cheapest way to purchase the necessary ingredients in your area (fewest stores visited, lowest cost), and creates Instacart orders using a simulated version of their API.
- Web version of the agent at recipes.reflex.run with a chat interface
- InterSystems IRIS vector database of 10k recipes with HyDE-enabled semantic search
- Pretrained a 40M-parameter LLM from scratch to generate recipes
- Fine-tuned Mistral-7b using MonsterAPI to generate recipes
How we built it
We divided tasks efficiently given the time frame to make sure we weren't bottlenecked by each other. For instance, Gao's first priority was to get a recipe LLM deployed so Molly and Park could use it in their tasks. While we split up tasks, we also worked together to help each other debug, often pair programmed, and swapped tasks when needed. Tools used: Xcode, Cursor, OpenAI API, MonsterAPI, InterSystems IRIS vector database, Reflex.dev, SERP API, and more.
visionOS
- Talk to Vision Mama by running Whisper fully on-device using CoreML and Metal
- Chat capability powered by GPT-3.5-turbo, our custom recipe-generating LLM (Mistral-7b backbone), and our agent endpoint
- To let you see both Vision Mama's chats and her agentic skills, a split view shows your conversation alongside your generated recipes
- Lastly, Vision Mama's voice is synthesized with the ElevenLabs text-to-speech API (sketched below)
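For reference, here's a minimal sketch of the ElevenLabs text-to-speech call behind her voice. The voice ID, the `speak` helper, and the file output are placeholders (the app plays the audio back on device rather than saving it):

```python
import os
import requests

VOICE_ID = "YOUR_VOICE_ID"  # placeholder: whichever ElevenLabs voice you configure

def speak(text: str) -> bytes:
    """Synthesize `text` with the ElevenLabs REST API and return MP3 bytes."""
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_monolingual_v1"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content

# Example: save a line of Vision Mama dialogue to disk
with open("mama.mp3", "wb") as f:
    f.write(speak("Let's chop those bananas!"))
```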
AI Agent Pipeline for Recipe Generation, Food Search, and Instacart Ordering
We built an endpoint that both the Vision Pro app and our Reflex site hit. The user submits a desired food, such as "banana soup". We pass that to our fine-tuned Mistral-7b LLM to generate a recipe, then use GPT-4-turbo to parse the recipe and extract its ingredients. For each ingredient, we query the SERP API to find where it can be purchased nearby. We prioritize cheaper ingredients and use an algorithm that tries to visit the fewest stores needed to buy all the ingredients. Finally, we populate an Instacart Order API call to purchase the ingredients (simulated for now, since we do not have partner access to Instacart's API).
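Minimizing stores visited is essentially a set-cover problem. Below is a runnable sketch of one plausible greedy heuristic for that step; the `pick_stores` name and the `offers` data shape are our illustration (the real endpoint feeds it SERP API results), and it assumes every ingredient is stocked somewhere:

```python
def pick_stores(offers: dict[str, list[tuple[str, float]]]) -> dict[str, tuple[str, float]]:
    """offers maps ingredient -> [(store, price), ...].
    Greedily picks the store covering the most still-missing ingredients
    (ties broken by lowest total cost) until everything is covered."""
    chosen: dict[str, tuple[str, float]] = {}
    remaining = set(offers)
    while remaining:
        stores = {s for ing in remaining for s, _ in offers[ing]}
        best_store, best_key, best_cover = None, None, {}
        for store in stores:
            # cheapest price at this store for each missing ingredient it stocks
            cover = {ing: min(p for s, p in offers[ing] if s == store)
                     for ing in remaining
                     if any(s == store for s, _ in offers[ing])}
            key = (len(cover), -sum(cover.values()))  # most coverage, then cheapest
            if best_key is None or key > best_key:
                best_store, best_key, best_cover = store, key, cover
        for ing, price in best_cover.items():
            chosen[ing] = (best_store, price)
        remaining -= set(best_cover)
    return chosen

# Fake SERP-style results for a "banana soup" recipe:
offers = {
    "bananas":      [("TraderJoes", 1.20), ("Safeway", 1.50)],
    "coconut milk": [("Safeway", 2.80), ("WholeFoods", 3.40)],
    "curry paste":  [("Safeway", 3.10), ("WholeFoods", 2.90)],
}
print(pick_stores(offers))  # Safeway alone covers all three ingredients
```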
Pre-training (using the nanoGPT architecture):
Created a large dataset of recipes and tokenized it using BPE (the GPT-2 tokenizer).
Dataset details (9:1 split): train: 46,826,468 tokens; val: 5,203,016 tokens.
Trained for 1,000 iterations with settings: layers = 12, attention heads = 12, embedding dimension = 384, batch size = 32.
In total, the LLM had 40.56 million parameters!
It took several hours to train on an M3 Mac with Metal Performance Shaders.
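These settings map directly onto a nanoGPT-style training config. The sketch below uses nanoGPT's standard config variable names; the out_dir and dataset values are placeholders for our setup:

```python
# train_recipes.py -- nanoGPT-style config (placeholder paths/names)
out_dir = "out-recipes"
dataset = "recipes"   # expects data/recipes/{train,val}.bin of GPT-2 BPE token ids

n_layer = 12          # layers
n_head = 12           # attention heads
n_embd = 384          # embedding dimension
batch_size = 32
max_iters = 1000

device = "mps"        # Metal Performance Shaders on the M3 Mac
compile = False       # torch.compile generally isn't supported on MPS
```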

Fine-tuning
While the pre-trained LLM worked OK and mostly generated coherent (but silly) English recipes, we couldn't figure out how to deploy it within the time frame, and it still wasn't good enough for our agent. So we tried fine-tuning Mistral-7b, which is 175 times bigger and much more capable. We curated fine-tuning datasets of several sizes (10k, 50k, and 250k recipes) and prepared them in a specific prompt/completion format:
You are an expert chef. You know about a lot of diverse cuisines. You write helpful tasty recipes.\n\n###Instruction: please think step by step and generate a detailed recipe for {prompt}\n\n###Response:{completion}
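For illustration, here is how each training record could be assembled into that format; the JSONL field names and file name are assumptions on our part, since MonsterAPI's exact expected schema may differ:

```python
import json

TEMPLATE = (
    "You are an expert chef. You know about a lot of diverse cuisines. "
    "You write helpful tasty recipes.\n\n"
    "###Instruction: please think step by step and generate a detailed "
    "recipe for {prompt}\n\n###Response:"
)

def to_record(dish: str, recipe: str) -> dict:
    # prompt/completion pair in the format above
    return {"prompt": TEMPLATE.format(prompt=dish), "completion": recipe}

with open("recipes_250k.jsonl", "w") as f:
    for dish, recipe in [("banana soup", "1. Peel the bananas. 2. ...")]:
        f.write(json.dumps(to_record(dish, recipe)) + "\n")
```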
We fine-tuned and deployed the 250k-fine-tuned model on the MonsterAPI platform, one of the sponsors of TreeHacks. We observed that using more fine-tuning data led to lower loss, but at diminishing returns.
Reflex.dev Web Agent
Most people don't have Vision Pros, so we wrapped our versatile agent endpoint into a Python-based Reflex app that you can chat with! Try it at recipes.reflex.run.
Note that heavy demand may overload our agent.
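For a sense of the shape of the app, here's a minimal Reflex chat-page sketch; AGENT_URL and the request payload are placeholders for our real agent endpoint:

```python
import reflex as rx
import requests

AGENT_URL = "https://example.com/agent"  # placeholder for our endpoint

class ChatState(rx.State):
    question: str = ""
    history: list[tuple[str, str]] = []

    def send(self):
        # hit the shared agent endpoint and append the exchange to the chat
        answer = requests.post(AGENT_URL, json={"food": self.question}).text
        self.history = self.history + [(self.question, answer)]
        self.question = ""

def index() -> rx.Component:
    return rx.vstack(
        rx.foreach(
            ChatState.history,
            lambda qa: rx.vstack(rx.text(qa[0]), rx.text(qa[1])),
        ),
        rx.input(value=ChatState.question, on_change=ChatState.set_question),
        rx.button("Ask Vision Mama", on_click=ChatState.send),
    )

app = rx.App()
app.add_page(index)
```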
IRIS Semantic Recipe Discovery
We used the InterSystems IRIS vector database, running it on a Mac with Docker. We embedded 10,000 unique recipes from diverse cuisines using OpenAI's text-embedding-ada-002 model and stored the embeddings and recipes in the IRIS database. Then we let the user input a "vibe", such as "cold rainy winter day". We use Mistral-7b to generate three Hypothetical Document Embeddings (HyDE) queries in a structured format, then query the IRIS DB with those three Mistral-generated queries. The key insight is that regular semantic search does not let you search by vibe effectively: a semantic search on "cold rainy winter day" is more likely to return results related to cold or rain than to food. Our prompting encourages Mistral to understand the vibe of your input and convert it into better HyDE queries.
Real example:
User input: "something for a chilly winter day"
Generated search queries: {'queries': ['warming winter dishes recipes', 'comfort food recipes for cold days', 'hearty stews and soups for chilly weather']}
Result: recipes that match the intent of the user rather than the literal meaning of their query.
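Here's a hedged sketch of that query flow, assuming the intersystems-irispython DB-API driver and IRIS's TO_VECTOR/VECTOR_DOT_PRODUCT vector-search SQL; the table name, column names, and connection credentials are illustrative:

```python
import iris                      # intersystems-irispython driver
from openai import OpenAI

client = OpenAI()
conn = iris.connect("localhost", 1972, "USER", "demo", "demo")  # placeholder creds

def embed(text: str) -> str:
    """Embed text with OpenAI and render it as the comma-separated
    string that TO_VECTOR() accepts."""
    out = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return ",".join(str(x) for x in out.data[0].embedding)

def search(query: str, k: int = 3) -> list[str]:
    cur = conn.cursor()
    cur.execute(
        f"""SELECT TOP {k} title FROM recipes
            ORDER BY VECTOR_DOT_PRODUCT(embedding, TO_VECTOR(?, double)) DESC""",
        [embed(query)],
    )
    return [row[0] for row in cur.fetchall()]

# The three HyDE-style queries Mistral generated for the chilly-winter-day example:
for q in ["warming winter dishes recipes",
          "comfort food recipes for cold days",
          "hearty stews and soups for chilly weather"]:
    print(q, "->", search(q))
```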
Challenges we ran into
- Programming for the Vision Pro, a new way of coding without much documentation available
- Two of our team members wear glasses, so they couldn't actually use the Vision Pro :(
- Figuring out how to work with Docker
- Package version conflicts :((
- Cold starts on the Replicate API
- Many tutorials we found used the old version of the OpenAI API, which is no longer supported
Accomplishments that we're proud of
- Learning how to hack on the Vision Pro!
- Making the Vision Mama 3D model blink
- Pretraining a 40M-parameter LLM
- Running fine-tuning experiments
- Using a variant of HyDE to turn user intent into better semantic search queries
What we learned
- How to pretrain LLMs and adjust the parameters
- How to use the IRIS Vector Database
- How to use Reflex
- How to use Monster API
- How to create APIs for an AI Agent
- How to develop for Vision Pro
- How to do Hypothetical Document Embeddings for semantic search
- How to work under pressure
What's next for Vision Mama: LLM + Vision Pro + Agents = Fun & Learning
- Improve the pre-trained LLM: MORE DATA, MORE COMPUTE, MORE PARAMS!!!
- Host the InterSystems IRIS vector database online and let the Vision Mama agent query it
- Implement the meal-tracking photo analyzer in the visionOS app
- Complete the payment processing for the Instacart API once we get developer access
Impacts
Mixed reality and AI could enable more serious use cases like:
- Assisting doctors with remote robotic surgery
- Making high quality education and tutoring available to more students
- Delivering amazing live concert and event experiences remotely
- Serving as a language-learning practice partner
Concerns
- The Vision Pro is very expensive, so most people can't afford it for now; this limits edtech applications in the near term.
- Data privacy
Thanks for checking out Vision Mama!