Inspiration

I started this project on a particularly stressful day. I was juggling two full-time jobs: my actual career and helping my younger sister with her homework. I had meetings to attend and deadlines to hit, but she needed help understanding her math problems. I found myself rushing her, giving her answers instead of explanations just to buy time. That wasn't fair to her.

I realized: what if there was a "digital big sibling"? Someone who could sit by her side, see her homework through a camera, and patiently guide her through it step-by-step without me needing to pause my work? That was the spark for Lumi.

What it does

Lumi is a real-time, multimodal AI tutor that mimics a video call with a human teacher.

  • It Sees: Lumi uses the device's camera to look at physical textbooks, worksheets, or screens.
  • It Hears: Students talk to Lumi naturally. No typing required.
  • It Teaches: Instead of giving answers, Lumi uses the Socratic method. It asks guiding questions to help the student find the solution themselves.
  • It Visualizes: If a student is stuck, Lumi can generate diagrams or visual aids in real-time to explain concepts (e.g., "Show me a pie chart of this fraction").

How we built it

Lumi is built on the cutting edge of browser capabilities and Generative AI.

  • Frontend: We used React with Vite for a snappy, modern UI, styled with Tailwind CSS.
  • The Brain: The core is Google's Gemini 2.5 Flash Multimodal Live API. This model allows for native audio and video streaming. We stream microphone audio (PCM 16-bit) and camera frames directly to Gemini over a WebSocket.
  • Audio Processing: We implemented custom AudioWorklets to handle raw pulse-code modulation (PCM) audio processing in the browser, ensuring low-latency communication.
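The AudioWorklet's core job is converting the Float32 samples the Web Audio API produces into the 16-bit little-endian PCM we stream to the Live API. A minimal sketch of that conversion (function name and details are illustrative, not our exact worklet code):

```typescript
// Convert Web Audio Float32 samples (range [-1, 1]) into 16-bit
// little-endian PCM, the raw format we stream over the WebSocket.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so hot signals don't wrap around on conversion.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale negatives toward -32768 and positives toward 32767.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

In the worklet, a function like this runs on each 128-sample render quantum before the chunk is posted back to the main thread for streaming.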

Challenges we ran into

  • Latency Is Everything: To feel like a real person, the delay had to be minimal. Managing the bidirectional WebSocket stream for audio chunks and video frames without creating a "laggy" experience was the hardest engineering challenge.
  • The "Tutor" Persona: LLMs love to be helpful, which often means just solving the problem outright. We spent significant time prompt-engineering the system instruction to force Lumi to be a teacher, not a calculator. It has to withhold the answer and guide the student instead.
  • Audio Gremlins: Handling echo cancellation and microphone sample rates across different browsers and devices (desktop vs. mobile) required a lot of trial and error with the Web Audio API.
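Much of the sample-rate wrangling came down to resampling: browsers typically capture at 44.1 or 48 kHz, while the stream targets a lower rate. A naive decimation sketch under those assumptions (illustrative only; a production path should low-pass filter first to avoid aliasing):

```typescript
// Naive downsampler: picks the nearest source sample for each output
// sample. Adequate for speech prototyping; real audio pipelines should
// apply an anti-aliasing low-pass filter before decimating.
function downsample(
  input: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  if (toRate >= fromRate) return input; // nothing to do (or would upsample)
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    out[i] = input[Math.floor(i * ratio)];
  }
  return out;
}
```

For example, a 48 kHz capture buffer shrinks by a factor of three when targeting 16 kHz, which also cuts the bytes sent over the WebSocket by the same factor.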

Accomplishments that we're proud of

  • True Real-Time Interaction: We achieved a conversation flow that feels almost human. You can interrupt Lumi, show it a new image, and it adapts instantly.
  • Visual Intelligence: Seeing Lumi correctly identify a handwritten math problem and immediately start explaining it without any text input from the user was a "magic moment."
  • Emotional Support: We successfully prompted the AI to be encouraging and empathetic, turning frustration into confidence for the student.

What we learned

  • Multimodal > Text: Voice and vision add dimensions of context that text chat simply can't match. The AI can hear hesitation in a voice or see confusion on the page.
  • Stream Management: We gained deep expertise in handling binary data streams (ArrayBuffers) in JavaScript.
  • Gemini's Power: The Gemini Multimodal Live API is a game-changer for accessibility, letting us build interfaces that require no keyboard or touchscreen, only a voice.
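One recurring ArrayBuffer chore: binary audio chunks must be base64-encoded before they can ride inside a JSON WebSocket message. A sketch of the kind of helper we leaned on (name and chunk size are illustrative):

```typescript
// Encode a binary chunk as base64 for embedding in a JSON message.
// Encoding in 32 KB sub-chunks avoids the argument-count limit that
// String.fromCharCode(...allBytes) hits on large buffers.
function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  const chunkSize = 0x8000; // 32 KB sub-chunks
  let binary = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    const chunk = bytes.subarray(i, i + chunkSize);
    binary += String.fromCharCode(...Array.from(chunk));
  }
  return btoa(binary);
}
```

The inverse direction (base64 audio from the model back into an ArrayBuffer for playback) is the mirror image with `atob`.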

What's next for Lumi

  • Subject Expansion: Adding specialized modules for history, science, and literature.
  • Parent Dashboard: A companion app for guardians to see what their child learned and where they struggled, without needing to supervise the session.
  • Mobile Native: Porting the web experience to React Native for a truly portable "tutor in your pocket" experience.

Built With

React · Vite · Tailwind CSS · Gemini Multimodal Live API · Web Audio API (AudioWorklets) · WebSockets
