Inspiration

Students are often alone when they get stuck on a difficult problem. They search online, read textbooks, or watch videos, but learning is far more effective when a teacher can see the student’s work and guide them step by step.

At the same time, many students around the world do not have access to high-quality tutoring. Parents also struggle to understand where their children are facing difficulties or how to support them.

I wanted to build a tutor that doesn’t just answer questions but actually collaborates with the student: seeing what they see, hearing their reasoning, and guiding them interactively.

By leveraging Gemini’s multimodal capabilities, I created an AI tutor that behaves more like a real teacher sitting next to the student.


What it does

EDUVA AI Private Tutor is a real-time multimodal learning assistant powered by Gemini.

Students can talk to the tutor using natural voice conversation while sharing their learning materials.
The AI can analyze:

  • textbooks
  • handwritten notes
  • PDFs
  • whiteboards
  • screen-shared applications

Instead of simply answering questions, the tutor actively collaborates with the student by:

  • highlighting mistakes
  • pointing to important concepts
  • drawing visual explanations
  • guiding the student step-by-step through the solution

Key Capabilities

  • 🎤 Real-time voice conversation with the AI tutor
  • 👁 Visual understanding of PDFs, whiteboards, and screen sharing
  • ✏️ Live annotations and highlights directly on the student's workspace
  • 💡 Context-aware suggestions for the next learning step
  • 📓 Persistent notebook storing explanations, formulas, and summaries

This creates a shared learning workspace where the student and AI tutor solve problems together.


How I built it

The AI Private Tutor runs on Gemini 2.5 Flash, accessed through the Gemini Live API, which enables real-time multimodal interaction.

The system streams voice and visual context simultaneously to the AI, allowing the tutor to understand both what the student is saying and what they are looking at.
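To make the audio half of that stream concrete, here is a minimal sketch of one step such a pipeline usually needs. The WebAudio API delivers microphone samples as 32-bit floats in the range [-1, 1], while speech streaming endpoints commonly expect 16-bit PCM. The helper name `floatTo16BitPCM` is hypothetical and is not taken from the project, and the exact audio format the Live API expects should be checked against its current documentation.

```typescript
// Hypothetical helper (not from the EDUVA codebase): convert WebAudio
// float samples in [-1, 1] into signed 16-bit PCM samples.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  samples.forEach((sample, i) => {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, sample));
    // Negative values scale to -32768, positive values to 32767.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  });
  return out;
}
```

In a browser, the resulting `Int16Array` would typically be sent in small chunks so the tutor can start responding before the student has finished speaking.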

Core Technologies

  • Gemini 2.5 Flash – multimodal reasoning
  • Gemini Live API – real-time voice & vision streaming
  • Google GenAI SDK – model integration
  • React + TypeScript – interactive frontend
  • Node.js – session orchestration
  • WebAudio API – low-latency voice streaming
  • pdfjs-dist – PDF analysis
  • KaTeX – mathematical formula rendering
  • Google Cloud Run – scalable cloud deployment

A custom context capture engine composites the student's visual workspace (PDFs, whiteboards, screen share) into optimized frames that are streamed to Gemini alongside the voice input.
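What "optimized" means is not spelled out here. One plausible reading is that frames are downscaled and rate-limited before they are sent. The sketch below illustrates that idea only; `fitFrame`, `FrameThrottle`, the 1024-pixel cap, and the one-second interval are all assumptions made for illustration, not details of the actual engine.

```typescript
interface Size {
  width: number;
  height: number;
}

// Assumed policy: shrink a frame so its longest side is at most maxSide,
// preserving the aspect ratio. Smaller frames are returned unchanged.
function fitFrame(src: Size, maxSide: number): Size {
  const longest = Math.max(src.width, src.height);
  if (longest <= maxSide) {
    return { width: src.width, height: src.height };
  }
  const scale = maxSide / longest;
  return {
    width: Math.round(src.width * scale),
    height: Math.round(src.height * scale),
  };
}

// Assumed policy: send at most one frame per minIntervalMs.
class FrameThrottle {
  private lastSentMs = -Infinity;
  private minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Returns true when enough time has passed since the last sent frame.
  shouldSend(nowMs: number): boolean {
    if (nowMs - this.lastSentMs < this.minIntervalMs) {
      return false;
    }
    this.lastSentMs = nowMs;
    return true;
  }
}

// Example: a 1920x1080 screen share capped at 1024 pixels on its longest side.
const target = fitFrame({ width: 1920, height: 1080 }, 1024);
const throttle = new FrameThrottle(1000);
```

Capping resolution and frame rate trades some visual detail for lower bandwidth and latency, which matters when audio is streaming at the same time.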


Challenges we ran into

One of the biggest challenges was synchronizing voice explanations with precise visual annotations.

When the tutor says:

“Look at this equation.”

the system must ensure the annotation appears exactly at the correct location in the student's workspace.

This required building:

  • a coordinate transformation engine
  • a multimodal synchronization pipeline
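
The core of a coordinate transformation like this can be sketched in a few lines. The example assumes the model reports positions on a normalized 0 to 1000 grid relative to the frame it was sent, and that the app knows which on-screen rectangle that frame was captured from. The names `normalizedToWorkspace`, `Rect`, and `NORM_MAX` are hypothetical, and the real engine likely also handles scrolling, zoom, and composited layouts.

```typescript
interface Point {
  x: number;
  y: number;
}

interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Assumption: model coordinates are normalized to the range 0..1000.
const NORM_MAX = 1000;

// Map a point in normalized frame coordinates to workspace pixels,
// given the on-screen rectangle the frame was captured from.
function normalizedToWorkspace(p: Point, capturedRegion: Rect): Point {
  return {
    x: capturedRegion.x + (p.x / NORM_MAX) * capturedRegion.width,
    y: capturedRegion.y + (p.y / NORM_MAX) * capturedRegion.height,
  };
}

// Example: the model points at the horizontal middle, a quarter of the way
// down, of a frame captured from an 800x600 region offset by (100, 50).
const marker = normalizedToWorkspace(
  { x: 500, y: 250 },
  { x: 100, y: 50, width: 800, height: 600 },
);
```

Keeping the captured region alongside each frame is what lets an annotation land in the right place even if the student has resized or moved the material since the frame was sent.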

Another challenge was enabling natural interruptions.
Students often interrupt teachers mid-explanation, so we implemented real-time barge-in logic allowing students to stop the tutor and ask follow-up questions naturally.
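
On the client side, barge-in largely comes down to discarding tutor audio that has not been played yet and ignoring late chunks from the interrupted response. The sketch below shows one way to do that with a turn counter. `TutorPlayback` and its methods are hypothetical names, and the sketch leaves out speech detection and whatever interruption signalling the Live API itself provides.

```typescript
// Hypothetical playback queue (not from the EDUVA codebase).
class TutorPlayback {
  private queue: Int16Array[] = [];
  private activeTurn = 0;

  // Begin a new tutor response and return its turn id.
  startTurn(): number {
    this.activeTurn += 1;
    this.queue = [];
    return this.activeTurn;
  }

  // Queue an audio chunk; chunks from an interrupted turn are rejected.
  enqueue(turn: number, chunk: Int16Array): boolean {
    if (turn !== this.activeTurn) {
      return false;
    }
    this.queue.push(chunk);
    return true;
  }

  // The student started speaking: drop pending audio and invalidate the turn.
  // Returns how many queued chunks were discarded.
  bargeIn(): number {
    const dropped = this.queue.length;
    this.queue = [];
    this.activeTurn += 1;
    return dropped;
  }

  // Number of chunks still waiting to be played.
  pending(): number {
    return this.queue.length;
  }
}
```

Invalidating the turn id, rather than only clearing the queue, matters because audio for the interrupted answer can keep arriving over the network for a short time after the student starts talking.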

We also optimized real-time streaming to maintain low latency while processing both audio and visual inputs.


Accomplishments that we're proud of

We successfully built a fully interactive multimodal tutoring system that goes far beyond a traditional chatbot.

Highlights include:

  • 🎤 Low-latency real-time voice conversation
  • 👁 Visual understanding of the student workspace
  • ✏️ AI-generated annotations on learning materials
  • 🌍 Tutor personas and cultural adaptation
  • 📓 Automatic notebook creation with formulas and summaries

The result is an AI tutor that can see, hear, and collaborate with the student in real time.


What we learned

Building a multimodal AI tutor taught us that true learning assistance requires more than text interaction.

Voice alone is not enough; visual context is essential.

By combining voice and vision through Gemini, we created a far more natural tutoring experience.

We also learned that real-time AI systems require precise synchronization between:

  • audio streams
  • visual context
  • AI reasoning


What's next for EDUVA AI Private Tutor

Our next goal is to expand EDUVA into a complete AI learning ecosystem.

Future developments include:

  • 🎓 personalized subject-specific AI tutors
  • 📊 deeper learning progress tracking
  • 📈 adaptive learning paths based on student performance
  • 👨‍👩‍👧 parent insight dashboards
  • 🌍 support for more languages and education systems

My vision is a world where every student has access to a personalized AI tutor available anytime they need help.
