Inspiration

Learning is often a lonely experience. Many students study by themselves late at night or face physical and social barriers that make it hard to ask for help. We wanted to make something that feels human: a presence that learns with you, listens to your reasoning, and helps only when you truly need it.

That idea became Tutoroo, an AI that behaves like a study partner who learns beside you. It listens, observes, and sometimes even acts confused, asking “dumb” or curious questions to make you explain your ideas out loud. When you get stuck or ask for help, it transforms into a knowledgeable tutor that gives a complete explanation and then steps back so you can keep thinking on your own.

What it does

Tutoroo is an adaptive, multimodal AI learning companion that uses voice, vision, and machine learning to make studying feel like a conversation. It combines five input streams to understand the learner in real time:

🎙️ Microphone: Listens to the learner’s thoughts and speech to follow their reasoning and detect when help is needed.

🎥 Camera: Watches the chalkboard or workspace to read handwritten math or notes.

🖥️ Screen share: Sees the problem or question being solved to connect context with action.

👁️ Eye tracking: Observes what the learner is focusing on to understand attention and pacing.

🙂 Facial recognition: Detects confusion, frustration, or confidence and adjusts tone and timing accordingly.

Tutoroo runs in two synchronized modes:

Co-Student mode: The default mode where the AI behaves like a slightly clueless but curious classmate. It asks simple or funny questions like “Wait, why are you dividing here?” or “So that means the fraction gets smaller, right?” These moments make the learner explain and reflect, which strengthens understanding.

Tutor mode: Activated only when the user clearly says “help.” The system instantly provides a complete, step-by-step explanation that it already solved in the background. Once it’s done, it returns to Co-Student mode and continues learning alongside the user.
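The mode switch described above can be pictured as a tiny state machine. This is an illustrative sketch only: the class name is invented, and the real system interprets speech rather than matching the keyword "help" literally.

```python
class ModeController:
    """Toy two-state controller: Co-Student by default, Tutor on request."""

    CO_STUDENT = "co-student"
    TUTOR = "tutor"

    def __init__(self):
        self.mode = self.CO_STUDENT

    def on_utterance(self, text: str) -> str:
        # An explicit request for help flips the system into Tutor mode.
        if "help" in text.lower().split():
            self.mode = self.TUTOR
        return self.mode

    def on_explanation_done(self) -> str:
        # After the full step-by-step explanation, drop back to Co-Student.
        self.mode = self.CO_STUDENT
        return self.mode
```

So a question like "why divide here?" leaves the system in Co-Student mode, while "ok, help please" switches it to Tutor mode until the explanation finishes.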

Over time, Tutoroo learns how each person learns. It uses a machine learning model that studies speaking patterns, reaction speed, gaze direction, and emotional signals to adapt its personality and response depth.
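One simple way to realize that kind of gradual per-learner adaptation is an exponential moving average over a habit vector. The sketch below is a stand-in for the actual model: the feature layout (speaking rate, reaction speed, gaze stability, and so on) is an invented example.

```python
def update_profile(profile: list[float],
                   signals: list[float],
                   alpha: float = 0.1) -> list[float]:
    """Blend the newest per-session signals into the stored learner profile.

    Small alpha means the profile drifts slowly, so one odd session
    does not overwrite what Tutoroo has learned about the person.
    """
    return [(1 - alpha) * p + alpha * s for p, s in zip(profile, signals)]
```

Each session nudges the stored profile toward the latest observed behavior without discarding history.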

How we built it

Tutoroo’s architecture combines real-time AI reasoning, adaptive modeling, and scalable data infrastructure:

Frontend: Built with React, Next.js, and Tailwind CSS to create a responsive, FaceTime-style interface. The UI supports real-time camera, mic, and screen sharing using WebRTC.

Authentication: Managed through Auth0, which provides secure user login, session management, and encrypted access to personalized learning data.

Backend: Powered by Gemini AI API, which handles multimodal reasoning across text, visual, and audio input streams. Gemini runs on an event-driven backend that routes tasks through a message queue for fast response.
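The event-driven routing can be pictured with a small asyncio sketch. Handler names and event shapes here are invented for illustration; the production queue and Gemini calls are more involved.

```python
import asyncio

async def router(queue: asyncio.Queue, handlers: dict) -> list:
    """Pull events off the queue and dispatch each to its modality handler."""
    results = []
    while True:
        event = await queue.get()
        if event is None:          # sentinel: shut down the router
            break
        kind, payload = event
        results.append(await handlers[kind](payload))
    return results

async def demo():
    q = asyncio.Queue()
    # Stand-in handlers; asyncio.sleep(0, result=...) fakes async work.
    handlers = {
        "audio": lambda p: asyncio.sleep(0, result=f"transcribed:{p}"),
        "frame": lambda p: asyncio.sleep(0, result=f"ocr:{p}"),
    }
    for ev in [("audio", "why divide?"), ("frame", "2x^2+7x-3=23"), None]:
        await q.put(ev)
    return await router(q, handlers)
```

Decoupling producers (mic, camera, screen) from consumers this way keeps one slow stream from blocking the others.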

Machine Learning Pipeline:

  • A custom PyTorch model monitors user tone, gaze, and engagement to predict focus and frustration levels.
  • Reinforcement learning fine-tunes question difficulty and timing.
  • The adaptive learning model stores embeddings representing each learner’s habits, making Tutoroo more personal over time.

Computer Vision Layer: Built with OpenCV and MediaPipe, performing real-time eye tracking, gaze detection, and facial emotion recognition.
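As an illustration of the eye-tracking side, one common low-level signal is the eye aspect ratio (EAR), computed from eye landmarks like those MediaPipe produces. The six-point layout below is the generic EAR formulation, not Tutoroo's exact pipeline.

```python
import math

def eye_aspect_ratio(pts) -> float:
    """EAR from six 2-D eye landmarks (p1..p6).

    p1/p4 are the eye corners; p2/p6 and p3/p5 are upper/lower lid pairs.
    A low EAR means the eye is nearly closed; a sustained drop can
    signal fatigue or lost focus.
    """
    p1, p2, p3, p4, p5, p6 = pts
    return (math.dist(p2, p6) + math.dist(p3, p5)) / (2.0 * math.dist(p1, p4))
```

Thresholding this ratio over a few frames is a cheap attention cue that complements the heavier emotion-recognition model.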

Data Storage and Analytics:

  • Snowflake serves as the centralized data warehouse for session logs, performance metrics, and ML feedback loops.
  • Data pipelines run through Apache Airflow, which manages daily aggregation and anonymized analytics that feed back into the adaptive models.

Voice System: ElevenLabs TTS provides high-quality speech output, filtered through a natural-language pre-processor to preserve clarity and emotional tone.

Security and Privacy: Sensitive data such as facial-tracking output and speech logs is encrypted at rest and in transit with AES-256, and access is gated by JWT-based tokens from Auth0.

Performance Optimization: Tutoroo issues asynchronous inference calls against Gemini’s streaming API and uses TensorRT acceleration to keep responses low-latency.
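A sketch of the anonymization step that runs before records reach the warehouse; the field names and salt handling are invented for illustration, and the real job runs inside Airflow ahead of the Snowflake load.

```python
import hashlib

def anonymize_session(record: dict, salt: str = "rotate-me") -> dict:
    """Replace the user identifier with a salted hash.

    The hash is deterministic per user (so sessions still aggregate),
    but the raw identity never leaves the application.
    """
    out = dict(record)
    digest = hashlib.sha256((salt + record["user_id"]).encode()).hexdigest()
    out["user_id"] = digest[:16]
    return out
```

Because the hash is stable for a given salt, downstream analytics can still group sessions by learner without ever storing who the learner is.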

Tutoroo also supports LaTeX rendering for clear math output. For example, it can display a problem such as:

$$2x^2 + 7x - 3 = 23, \quad \text{solve for } x.$$
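For reference, the sample problem works out with the standard quadratic formula (a quick hand check, not Tutoroo's rendered output):

```latex
2x^2 + 7x - 3 = 23
\;\Longrightarrow\; 2x^2 + 7x - 26 = 0
\;\Longrightarrow\; x = \frac{-7 \pm \sqrt{7^2 + 4 \cdot 2 \cdot 26}}{2 \cdot 2}
                      = \frac{-7 \pm \sqrt{257}}{4},
```

giving $x \approx 2.26$ or $x \approx -5.76$.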

Challenges we ran into

  • Synchronizing multimodal data streams across Gemini’s API, ML inference, and WebRTC inputs.
  • Building real-time gaze and emotion tracking without privacy risks.
  • Keeping latency low while integrating multiple APIs.
  • Training an adaptive model that changes behavior without overfitting to one user.
  • Ensuring ElevenLabs TTS output stayed natural even for complex, technical content.

Accomplishments that we're proud of

We built a truly adaptive learning system that feels human. Tutoroo can see, listen, and respond in real time while learning who the user is. It does not just provide answers; it learns how you learn and grows with you every session.

What we learned

We learned that AI becomes most powerful when it learns with the user, not just for them. Building Tutoroo taught us how to synchronize multimodal data in real time, how to personalize AI behavior through reinforcement signals, and how to balance empathy with technical precision.

We also realized that accessibility is not just a feature; it is a foundation. Eye tracking, emotion recognition, and natural speech made Tutoroo accessible to learners who may otherwise feel left behind.

What's next for Tutoroo

We plan to integrate multilingual support, expand personalization analytics through Snowflake dashboards, and build an open educator API that allows teachers to monitor engagement data (with full user consent). We are also exploring integrating Gemini 2.0’s streaming multimodal API for faster, higher-fidelity reasoning and using Federated Learning to train personalization models locally for stronger privacy.

Tutoroo’s mission is simple: to make learning feel shared, human, and intelligent.
