foodly.ai: Your AI Sous-Chef That Actually Sees What's Cooking
Inspiration
As college students, we all dread having to cook our own meals. Takeout isn't sustainable on a budget, and recipes tell you what to do but never teach you how to cook. In a world with AI, why can't there be an on-demand assistant that helps with the whole cooking process? That question inspired us to create foodly.ai, a cooking agent that takes you through recipes, shopping, and cooking.
What it does
foodly.ai transforms your phone into an active sous chef. Just speak to foodly or point your camera to unlock:
- Live Voice and Video Guidance: Using the Gemini Live API, foodly provides real-time voice feedback while watching over your cooking, telling you when to stir or if your oil is shimmering.
- Remembers Your Past: It uses long-term "Session Memory" to recall that you struggled with a specific technique last week or that you're allergic to peanuts, building a personalized cooking profile over time.
- Adapts to You: Whether you're a beginner or an advanced chef, foodly adjusts its tone and complexity based on your demonstrated skill level and preferences.
- Generates Recipes on the Fly: Need a 15-minute vegan pasta? foodly drafts the recipe, saves it to your history, and then walks you through it step-by-step with visual guidance.
- Summarizes Your Progress: Every session ends with a structured breakdown of what you did well, techniques to practice, and personalized tips for next time (see the sketch just below).
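For illustration, that end-of-session breakdown can be produced with a single structured-output call. The sketch below assumes the google-genai Python SDK; the `SessionSummary` schema and the prompt wording are illustrative stand-ins, not foodly.ai's actual code:

```python
# Hypothetical sketch: turn a session transcript into a structured summary.
# Assumes the google-genai SDK; schema and prompt are illustrative.
from google import genai
from pydantic import BaseModel


class SessionSummary(BaseModel):
    what_went_well: list[str]
    techniques_to_practice: list[str]
    tips_for_next_time: list[str]


client = genai.Client()  # reads GEMINI_API_KEY from the environment


def summarize_session(transcript: str) -> SessionSummary:
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=f"Summarize this cooking session for the cook:\n{transcript}",
        config={
            "response_mime_type": "application/json",
            "response_schema": SessionSummary,
        },
    )
    return response.parsed  # parsed into the Pydantic model above
```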
How we built it
foodly.ai is built on a high-performance multimodal stack designed for low-latency feedback and seamless real-time interaction:
- The Frontend: A sleek React Native (Expo) app that handles real-time audio (16kHz PCM) and video frame streaming via WebSockets, optimized for kitchen environments with clear, readable UI elements.
- The Brain: We integrated the Gemini Live API to process bidirectional audio and video in real time. For non-interactive tasks such as recipe generation and session summarization, we use the faster Gemini 2.5 Flash Lite.
- The Memory: We used Supabase with the pgvector extension. Every session is turned into a semantic embedding using text-embedding-004, allowing foodly to "remember" relevant past experiences during a live session through intelligent retrieval.
- The Backend: A FastAPI (Python) server acts as the orchestrator, managing WebSocket connections, handling authentication via Supabase Auth, and building system prompts infused with user-specific data and memory context (the relay pattern is sketched after this list).
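To make that orchestration concrete, here is a heavily simplified sketch of the relay pattern, assuming the google-genai Python SDK's Live API. The prompt builder, message framing, and model choice are illustrative stand-ins, not our production code:

```python
# Simplified relay sketch: phone WebSocket <-> Gemini Live session.
# Assumes the google-genai SDK; prompt builder and framing are illustrative.
import asyncio

from fastapi import FastAPI, WebSocket
from google import genai
from google.genai import types

app = FastAPI()
client = genai.Client()  # reads GEMINI_API_KEY from the environment


def build_system_prompt() -> str:
    # Stand-in for foodly's real prompt builder, which injects user
    # preferences and retrieved session memories.
    return "You are a patient sous-chef. Guide the cook step by step."


@app.websocket("/ws/session")
async def cooking_session(ws: WebSocket):
    await ws.accept()
    config = {
        "response_modalities": ["AUDIO"],
        "system_instruction": build_system_prompt(),
    }
    # gemini-2.0-flash-live-001 is a documented Live model; substitute as needed.
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:

        async def uplink():
            # Forward raw 16 kHz PCM audio chunks from the phone to the model.
            while True:
                chunk = await ws.receive_bytes()
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def downlink():
            # Stream the model's spoken guidance back to the phone.
            async for message in session.receive():
                if message.data:
                    await ws.send_bytes(message.data)

        await asyncio.gather(uplink(), downlink())
```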
Challenges we ran into
- Latency & Synchronization: Streaming video frames and audio chunks simultaneously over WebSockets while maintaining a truly "live" feel required significant tuning of frame rates and audio buffer sizes (see the back-of-envelope budget after this list). Balancing responsiveness with bandwidth constraints on mobile networks was a critical challenge.
- Context Injection: Feeding "memory" into a live AI stream is tricky: injecting too much context makes responses slow, while too little loses the personalization magic.
- Hardware Realities: Designing a UI that works when a user is 3 feet away from their phone and their hands are wet or covered in food required high-contrast "Status Pills," animated recording dots, and clear, loud audio cues. Mobile phone cameras and microphones in kitchen lighting also presented unique challenges.
- Multimodal Coordination: Ensuring that visual understanding, audio input, and LLM reasoning stayed synchronized required careful orchestration of data pipelines and fallback mechanisms when network conditions degraded.
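To give a feel for the tuning behind the first bullet, here is the kind of back-of-envelope budget we mean; the chunk duration and frame rate below are illustrative, not our shipped values:

```python
# Back-of-envelope bandwidth budget for 16 kHz mono 16-bit PCM plus JPEG frames.
# The chunk duration and frame rate here are illustrative choices.
SAMPLE_RATE = 16_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 250         # smaller chunks cut latency but add per-message overhead

chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000
print(chunk_bytes)     # 8000 bytes every 250 ms -> 32 kB/s of audio

FRAME_KB = 40          # one compressed ~480p JPEG frame
FPS = 1                # a frame per second is often enough for "is it simmering?"
print(FRAME_KB * FPS * 8)  # ~320 kbit/s of video on top of the audio
```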
Accomplishments that we're proud of
- Seamless Continuity: Seeing the AI say, "Hey, remember last time when the garlic burnt? Let's turn the heat down a bit earlier today," felt like magic. The personal, contextual nature of the guidance created a genuine coaching experience.
- Multimodal Integration: Bridging the gap between a mobile camera/mic and a cloud-based LLM with minimal delay. Sub-second response times for cooking guidance create a natural interaction that feels like having a sous-chef present.
- The Aesthetic: Creating a "Warm foodly" brand using cream, amber, and brown tones to make the kitchen feel like a cozy, inviting space rather than a high-tech lab. The design philosophy prioritized approachability over technical complexity.
- Real-time Session Memory: Implementing a working system where past cooking sessions inform current guidance without creating latency bottlenecks demonstrates the practical potential of RAG in real-time applications (sketched below).
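Here is a minimal sketch of that retrieval step, assuming supabase-py plus the google-genai SDK. The `match_memories` RPC (a Postgres function running an `ORDER BY embedding <=> query` nearest-neighbour search over pgvector) and the character budget are hypothetical stand-ins:

```python
# Illustrative memory retrieval: embed the current context, pull the
# nearest past sessions from pgvector, and trim to a budget so the
# injected memory never slows the live stream down.
from google import genai
from supabase import create_client

gemini = genai.Client()
supabase = create_client("https://<project>.supabase.co", "<service-role-key>")


def recall_memories(user_id: str, context: str, budget_chars: int = 1_500) -> str:
    # 1. Embed what the user is doing right now.
    result = gemini.models.embed_content(
        model="text-embedding-004", contents=context
    )
    query_vec = result.embeddings[0].values

    # 2. Nearest-neighbour search via a hypothetical Postgres RPC over pgvector.
    rows = supabase.rpc(
        "match_memories",
        {"query_embedding": query_vec, "match_user": user_id, "match_count": 5},
    ).execute()

    # 3. Keep only what fits the budget; too much context slows responses.
    picked, used = [], 0
    for row in rows.data:
        if used + len(row["summary"]) > budget_chars:
            break
        picked.append(row["summary"])
        used += len(row["summary"])
    return "\n".join(picked)
```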
What we learned
We learned that working with Gemini Live video is fundamentally different from using Gemini Flash, especially in complexity. While Gemini Flash is optimized for fast, text-based interactions and simple API calls, Gemini Live requires handling real-time streams, continuous context updates, and synchronizing visual input with user interactions. Compared to plugging pre-recorded video into an LLM, using a live model was significantly more challenging: it introduced latency, state management, and the need to process and react to frames on the fly rather than analyzing a static input. This forced us to rethink our architecture and build for responsiveness and adaptability rather than a straightforward request-response flow. Overall, it showed us that live multimodal systems demand a much deeper level of engineering, but also unlock far more interactive and context-aware experiences.
What's next for foodly.ai
We are incredibly excited by foodly.ai's potential and are committed to expanding this vision. Our immediate next steps include:
- Smart Glasses: Moving Foodly from the phone stand to your face. Integrating with Meta Glasses for a zero-friction, hands-free experience where the AI sees exactly what you’re seasoning in real-time.
- Multi-Device Sync: Allowing a tablet to act as the "Recipe Screen" while the phone acts as the "Stove Camera," enabling family cooking scenarios where multiple people can follow along together.
- Grocery Integration: Automatically adding missing ingredients from a generated recipe to an Instacart or Amazon Fresh cart, creating a seamless bridge between planning and shopping.
- Community: Letting users share their recipes and "Session Summaries" with friends and followers, creating a sense of cooking community.
- Advanced Vision: Training the model to recognize specific stovetop temperatures, meat doneness levels, and other visual indicators through advanced computer vision, reducing reliance on user verbal descriptions.
- Dietary & Preference Profiles: Expanding the memory system to include complex dietary requirements, cultural cuisine preferences, and ingredient substitution knowledge, making foodly an increasingly personalized companion.
Built With
- expo-router
- expo-sdk-52
- fastapi
- gemini-2.5-flash-lite
- gemini-3.1-flash-live-preview
- geminilive
- geminitextembedding
- node.js
- pgvector
- authentication
- postgresql
- python
- react-native
- reanimated
- supabase
- technology-models
- websockets