In a busy restaurant, everything moves fast — except information. Staff walk between tables, trying to remember who just sat down, who’s been waiting too long, and which table is free. Meanwhile, cameras quietly record everything… but understand nothing.

That gap inspired CafeEye AI.

CafeEye AI is built on a simple idea: What if a restaurant could see, understand, and respond — just like a human assistant?

To bring this to life, I designed a real-time multimodal AI system that combines vision, reasoning, and voice into one seamless experience.

At its core, CafeEye uses YOLO-based computer vision to detect people and monitor table occupancy across multiple zones. But detection alone isn’t enough — so I integrated Gemini 2.5 Flash Native Audio to act as the brain, allowing the system to interpret live visual data and understand the state of the restaurant.
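As a rough illustration of the occupancy step, here is a minimal sketch of how person detections could be mapped onto table zones. It assumes the detector has already produced person bounding boxes as `(x1, y1, x2, y2)` pixel tuples and that each table is a rectangular zone. The `Zone` class, the `occupancy` function, and the centre-point rule are hypothetical names and choices made for this example, not CafeEye's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A rectangular table zone in pixel coordinates (hypothetical layout)."""
    name: str
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, x: float, y: float) -> bool:
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


def occupancy(zones, person_boxes):
    """Count people per zone, assigning each box to the first zone
    that contains the centre of the box."""
    counts = {zone.name: 0 for zone in zones}
    for x1, y1, x2, y2 in person_boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        for zone in zones:
            if zone.contains(cx, cy):
                counts[zone.name] += 1
                break  # a person belongs to at most one zone
    return counts


# Illustrative values only: two tables and three detected people.
zones = [Zone("table_1", 0, 0, 100, 100), Zone("table_2", 100, 0, 200, 100)]
boxes = [(10, 10, 50, 90), (120, 20, 180, 80), (300, 300, 320, 340)]
print(occupancy(zones, boxes))
```

A zone with a non-zero count can be treated as occupied. The third box falls outside every zone, so it is not counted.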

Then comes the most human part: voice. Using the Gemini Live API, CafeEye can hold real-time conversations. Staff can simply ask, “Which table has been waiting the longest?” and get an instant spoken answer.
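A question like this needs some record of when each table became occupied. The sketch below shows one possible way to keep that state so it could be handed to the model as context. The `WaitTracker` class and its methods are hypothetical, and the write-up does not say how CafeEye actually computes the answer.

```python
import time


class WaitTracker:
    """Remembers when each table first became occupied (hypothetical helper)."""

    def __init__(self):
        self._seated_at = {}  # table name -> timestamp of first occupied frame

    def update(self, occupied, now=None):
        """`occupied` maps table name -> bool for the latest frame."""
        now = time.monotonic() if now is None else now
        for table, is_occupied in occupied.items():
            if is_occupied:
                # Keep the earliest timestamp while the table stays occupied.
                self._seated_at.setdefault(table, now)
            else:
                self._seated_at.pop(table, None)

    def longest_waiting(self, now=None):
        """Return (table, seconds_waiting), or None if no table is occupied."""
        if not self._seated_at:
            return None
        now = time.monotonic() if now is None else now
        table = min(self._seated_at, key=self._seated_at.get)
        return table, now - self._seated_at[table]


# Illustrative timeline with explicit timestamps in seconds.
tracker = WaitTracker()
tracker.update({"table_1": True, "table_2": False}, now=0)
tracker.update({"table_1": True, "table_2": True}, now=10)
print(tracker.longest_waiting(now=30))
```

Passing `now` explicitly keeps the example deterministic. In a live system the default monotonic clock would be used instead.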

I also built a lightweight AI-powered ordering system, where customers can place orders and receive voice confirmations, making the experience interactive and intuitive.

Building CafeEye wasn’t just about connecting APIs — it was about making them work together in real time. Synchronizing camera feeds, voice input, AI reasoning, and UI updates required careful handling of threading, event loops, and state management. I faced challenges like broken audio streams, API rate limits, and deploying vision-based systems on the cloud — each one pushing me to think deeper and optimize smarter.
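To make the threading and event-loop problem concrete, here is a minimal sketch of one common pattern: a blocking camera thread hands occupancy snapshots to an asyncio event loop through a queue. The names `camera_worker` and `consume` are hypothetical, and the list of dictionaries stands in for real camera output. This illustrates the general technique under those assumptions and is not CafeEye's implementation.

```python
import asyncio
import threading


def camera_worker(loop, queue, snapshots):
    """Runs in a plain thread; stands in for a blocking capture/detect loop."""
    for snapshot in snapshots:
        # asyncio.Queue is not thread-safe, so schedule the put on the loop.
        loop.call_soon_threadsafe(queue.put_nowait, snapshot)
    loop.call_soon_threadsafe(queue.put_nowait, None)  # sentinel: no more data


async def consume(snapshots):
    """Merge snapshots into a single shared state on the event loop."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()
    worker = threading.Thread(
        target=camera_worker, args=(loop, queue, snapshots), daemon=True
    )
    worker.start()

    state = {}
    while True:
        snapshot = await queue.get()
        if snapshot is None:
            break
        state.update(snapshot)  # only the event loop mutates the state
    worker.join()
    return state


# Illustrative data: per-table person counts arriving over time.
frames = [{"table_1": 1}, {"table_2": 0}, {"table_1": 2}]
print(asyncio.run(consume(frames)))
```

Because only the event loop touches `state`, coroutines that serve voice or UI updates can read it without locks.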

What I learned is powerful: Real innovation happens when systems don’t just process data — they understand and interact with the real world.

CafeEye AI is more than a project. It’s a step toward a future where everyday environments become intelligent, responsive, and alive.

From passive cameras to active intelligence — CafeEye AI is the restaurant that sees, thinks, and speaks.
