☕ CafeEye AI — The Restaurant That Sees, Thinks & Speaks

💡 What Inspired Me

Walking into a busy restaurant and watching staff struggle to track empty tables manually sparked the idea. Cameras are already everywhere in restaurants — but they just record. They don't think. I asked myself: what if a camera could understand what it sees and talk to you about it in real time?

That question became CafeEye AI.

🏗️ How I Built It

CafeEye is a full-stack multimodal AI system built in layers:

Layer 1 — Vision (Eyes)

  • Live camera feed processed in real time using YOLO (You Only Look Once) object detection
  • Detects people at tables with bounding boxes and confidence scores
  • 6 configurable table zones tracked simultaneously
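
As a rough illustration of the zone tracking above, here is a minimal sketch that assigns each person detection to a table zone by the center of its bounding box. The `Detection` type, the `ZONES` layout (six rectangles on a 600x400 frame), and `count_people_per_zone` are hypothetical names for this example, not CafeEye's actual code. The detections are assumed to come from the YOLO model, already filtered to the person class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One person bounding box in pixel coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float


# Assumed layout: zone name -> (x1, y1, x2, y2) on a 600x400 frame.
ZONES = {
    "table_1": (0, 0, 200, 200),
    "table_2": (200, 0, 400, 200),
    "table_3": (400, 0, 600, 200),
    "table_4": (0, 200, 200, 400),
    "table_5": (200, 200, 400, 400),
    "table_6": (400, 200, 600, 400),
}


def count_people_per_zone(detections, zones=ZONES, min_confidence=0.4):
    """Count detections whose box center falls inside each zone."""
    counts = {name: 0 for name in zones}
    for det in detections:
        # Keep only detections strictly above the confidence threshold.
        if det.confidence <= min_confidence:
            continue
        cx = (det.x1 + det.x2) / 2
        cy = (det.y1 + det.y2) / 2
        for name, (zx1, zy1, zx2, zy2) in zones.items():
            if zx1 <= cx < zx2 and zy1 <= cy < zy2:
                counts[name] += 1
                break  # a box center belongs to at most one zone
    return counts
```

Using the box center keeps the assignment unambiguous when a box overlaps two neighbouring zones. An overlap-area rule would be a reasonable alternative.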

Layer 2 — Intelligence (Brain)

  • Gemini 2.5 Flash Native Audio as the core reasoning model
  • Camera frames + sensor data sent together for multimodal understanding
  • The model sees AND reasons — not just pattern matching

Layer 3 — Voice (Voice)

  • Gemini Live API for real-time voice interactions
  • Staff can ask questions like "Which table has been waiting longest?"
  • AI responds with spoken natural language answers
  • Order confirmation spoken aloud to customers

Layer 4 — Ordering System

  • Full restaurant menu with veg/non-veg categories
  • AI recommends dishes by voice
  • Customers type orders, AI confirms by speaking

The Math Behind Detection:

A detection is kept as a person only when its confidence exceeds the threshold τ:

$$P(\text{person} | \text{box}) > \tau, \quad \tau = 0.4$$

Table occupancy duration is tracked as:

$$D_{\text{table}} = t_{\text{current}} - t_{\text{arrival}}$$

An alert is triggered when:

$$D_{\text{table}} > 1800 \text{ seconds} \quad (30 \text{ mins})$$
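
The duration and alert rules above can be sketched in a few lines of Python. This is an illustrative sketch only: `OccupancyTracker` and its method names are hypothetical, and timestamps are passed in explicitly so the logic can be checked without waiting 30 minutes.

```python
import time

ALERT_AFTER_SECONDS = 1800  # 30 minutes, the alert threshold stated above


class OccupancyTracker:
    """Tracks arrival time per table and derives D_table (hypothetical helper)."""

    def __init__(self):
        self._arrival = {}  # table id -> arrival timestamp in seconds

    def update(self, table_id, occupied, now=None):
        """Record the latest occupancy reading for one table."""
        now = time.time() if now is None else now
        if occupied:
            # Keep the first arrival time; later readings do not reset it.
            self._arrival.setdefault(table_id, now)
        else:
            self._arrival.pop(table_id, None)

    def duration(self, table_id, now=None):
        """D_table = t_current - t_arrival, or 0 for an empty table."""
        now = time.time() if now is None else now
        arrival = self._arrival.get(table_id)
        return 0.0 if arrival is None else now - arrival

    def needs_alert(self, table_id, now=None):
        """True once the table has been occupied for more than 30 minutes."""
        return self.duration(table_id, now) > ALERT_AFTER_SECONDS
```

For example, a table first seen occupied at t = 1000 s has a duration of 1000 s at t = 2000 s, and the alert condition becomes true only after t = 2800 s.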

🛠️ Tech Stack

| Component | Technology |
| --- | --- |
| Vision Detection | YOLO v11 (Ultralytics) |
| AI Brain | Gemini 2.5 Flash Native Audio |
| Live Voice | Gemini Live API |
| Real-time Streaming | Google GenAI SDK |
| Frontend Dashboard | Streamlit |
| Camera Processing | OpenCV |
| Audio I/O | SoundDevice |
| Deployment | Google Cloud Run + Streamlit Cloud |

🧠 What I Learned

Building a truly live multimodal agent is fundamentally different from building a chatbot. The hardest challenges were:

  1. Synchronizing vision and audio — camera frames and voice responses run at different speeds and need careful threading

  2. Gemini Live API nuances — the response comes in multiple chunks (model_turn, generation_complete, turn_complete) and must be collected carefully before playback

  3. Real-time state management — tracking 6 tables simultaneously while handling voice input without freezing the UI required background threading

  4. Cloud deployment of vision apps — cameras don't exist on cloud servers, requiring creative solutions for demo and deployment
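
To make the chunk-collection point in item 2 concrete, here is a minimal sketch of buffering audio until the turn ends. It deliberately does not call the real SDK: `FakeMessage` is a simplified stand-in that mimics only an audio payload and a `turn_complete` flag, and `fake_stream` simulates a response stream. The real message objects are structured differently, so treat the field names as assumptions.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass
class FakeMessage:
    """Simplified stand-in for one streamed server message (not the SDK type)."""
    audio: Optional[bytes] = None
    turn_complete: bool = False


async def collect_turn_audio(messages: AsyncIterator[FakeMessage]) -> bytes:
    """Accumulate audio chunks and return them once the turn is complete."""
    buffer = bytearray()
    async for msg in messages:
        if msg.audio:
            buffer.extend(msg.audio)
        if msg.turn_complete:
            break  # stop reading; anything after this is a later turn
    return bytes(buffer)


async def fake_stream():
    """Simulated stream: two audio chunks, then the end-of-turn marker."""
    yield FakeMessage(audio=b"\x01\x02")
    yield FakeMessage(audio=b"\x03")
    yield FakeMessage(turn_complete=True)
    yield FakeMessage(audio=b"\xff")  # next turn, never consumed here


if __name__ == "__main__":
    audio = asyncio.run(collect_turn_audio(fake_stream()))
    print(len(audio), "bytes ready for playback")
```

Playing the returned buffer in one go trades a little latency for output that does not stutter or repeat.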

🚧 Challenges Faced

  • Audio quality — Initial voice output was breaking and repeating. Fixed by collecting the entire audio buffer before playback instead of streaming chunks

  • Gemini quota limits — Free tier hit rate limits during testing. Solved by increasing analysis intervals and optimizing API calls

  • OpenCV on cloud: libGL.so.1 was missing on cloud servers, which required switching to opencv-python-headless along with the needed system library dependencies

  • Streamlit + asyncio — Streamlit's event loop conflicts with Python's asyncio. Solved by running each voice session in a dedicated thread with its own fresh event loop
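
The thread-per-session pattern from the last bullet can be sketched as follows. This is a generic illustration, not the project's code: `run_in_own_loop` and `fake_voice_session` are hypothetical names, and the coroutine simply sleeps instead of opening a real voice session.

```python
import asyncio
import threading


def run_in_own_loop(coro_fn, *args):
    """Run an async function in a new thread that owns a fresh event loop.

    Returns the thread and a dict that receives "value" or "error".
    """
    result = {}

    def worker():
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)  # make it the current loop for this thread
        try:
            result["value"] = loop.run_until_complete(coro_fn(*args))
        except Exception as exc:  # surface failures to the caller
            result["error"] = exc
        finally:
            loop.close()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread, result


async def fake_voice_session(question):
    """Placeholder for a real voice session (assumption for this sketch)."""
    await asyncio.sleep(0.01)
    return f"answer to: {question}"


if __name__ == "__main__":
    thread, result = run_in_own_loop(fake_voice_session, "Which table is free?")
    thread.join(timeout=5)
    print(result.get("value", result.get("error")))
```

Because the loop is created and closed inside the worker thread, it never touches whatever loop the UI framework manages on its own threads.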

🎯 Real-World Impact

CafeEye addresses a genuine problem:

  • Restaurants lose revenue when tables aren't turned over efficiently
  • Staff waste time manually checking table availability
  • Customers wait longer than necessary to be seated

CafeEye solves all three — automatically, intelligently, and in real time.

🚀 What's Next

  • Multi-camera support for large restaurants
  • Customer emotion detection (happy, waiting, frustrated)
  • Integration with POS systems for automatic billing
  • Mobile app for restaurant managers
  • Analytics dashboard with weekly/monthly reports
