Inspiration

As students and tutors ourselves, we know that a truly good tutor is more than just a provider of information: they support, reassure and encourage.

The AI study market is flooded with tools that push out content. But how do students respond emotionally to that information? This area of the EdTech sphere lacks a human touch.

What it does

That is why we built Scolara, an AI study companion that adapts its teaching strategy to your emotional response. Confused? Scolara knows to split the problem into atomic steps before you even ask. Focused? Scolara will keep you on track. Scolara leverages AI TTS and STT technology to simulate a real human tutor, ensuring our students feel supported at every step.

How we built it

Expression Engine: Python FastAPI backend captures webcam frames from the browser every 2 seconds, sends them to Google Gemini Flash Lite for facial expression classification (Focused/Fatigued/Distracted), and logs timestamped results.
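The classification step can be sketched in plain Python. The prompt wording and the `parse_label` helper below are illustrative, not our exact code; the idea is to constrain Gemini to three labels and to normalize whatever free text it returns:

```python
# Sketch of the expression-classification step (prompt and names are illustrative).
PROMPT = (
    "Classify the facial expression in this image as exactly one of: "
    "Focused, Fatigued, Distracted. Reply with the label only."
)

LABELS = ("Focused", "Fatigued", "Distracted")

def parse_label(reply: str, default: str = "Focused") -> str:
    """Normalize a free-text model reply to one of the three expected labels."""
    cleaned = reply.strip().lower()
    for label in LABELS:
        if label.lower() in cleaned:
            return label
    return default  # fall back if the model returns something unexpected
```

Normalizing the reply matters because LLMs occasionally add punctuation or extra words even when asked for a bare label.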

AI Tutoring Layer: A separate FastAPI service takes the detected expression and user input, then prompts Gemini to generate study materials tailored to the student's emotional state — simpler explanations when confused, encouragement when fatigued.
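The mapping from detected expression to teaching strategy can be sketched as a prompt-builder. The hint wording below is illustrative rather than our production prompts:

```python
# Illustrative mapping from detected expression to a teaching-strategy hint
# that is prepended to the tutoring prompt sent to Gemini.
STRATEGY_HINTS = {
    "Focused": "Keep the pace steady and introduce the next concept.",
    "Fatigued": "Be encouraging, keep answers short, and suggest a brief recap.",
    "Distracted": "Break the problem into small atomic steps and ask a check-in question.",
}

def build_tutor_prompt(expression: str, user_input: str) -> str:
    """Combine the student's emotional state and question into one prompt."""
    hint = STRATEGY_HINTS.get(expression, STRATEGY_HINTS["Focused"])
    return f"Teaching strategy: {hint}\nStudent says: {user_input}"
```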

Speech-to-Text: Node.js WebSocket server receives audio segments from the browser's MediaRecorder API and transcribes them via ElevenLabs' Scribe API in near real-time.

Text-to-Speech: FastAPI endpoint proxies text to the ElevenLabs TTS API and streams audio back to the React frontend for playback.
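The point of proxying is that the frontend can begin playback before the full audio has arrived. Stripped of the HTTP layer, the core is a generator that re-chunks the upstream byte stream into fixed-size pieces (a sketch; the upstream call to ElevenLabs is omitted):

```python
from typing import Iterable, Iterator

def rechunk(upstream: Iterable[bytes], chunk_size: int = 4096) -> Iterator[bytes]:
    """Re-chunk an upstream byte stream into fixed-size pieces so playback
    can start as soon as the first chunk lands, instead of after the whole file."""
    buffer = b""
    for piece in upstream:
        buffer += piece
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        yield buffer  # flush the final partial chunk
```

A generator like this can be handed directly to FastAPI's `StreamingResponse`.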

Frontend: React 19 with Vite and Tailwind CSS 4, featuring a bookshelf-style session library with Framer Motion animations and a glassmorphic input interface.

Database: Supabase Postgres schema with tables for sessions, messages, topic mastery tracking, and a pgvector-powered knowledge base for RAG. Google OAuth via Supabase Auth handles user identity.
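In Postgres, pgvector handles the similarity ranking for RAG; the underlying retrieval idea can be sketched in plain Python with cosine similarity (embeddings and document ids below are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return ids of the k stored documents most similar to the query embedding."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
    return ranked[:k]
```

pgvector performs the same ranking in SQL (via its distance operators) so the nearest chunks never leave the database.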

Dev tooling: Claude Code guided API integration, WebSocket architecture, and debugging throughout.

Challenges we ran into

  • Difficulty leveraging Presage for stress detection and expression assignment, compounded by its lack of a Python API
  • Providing the GenAI models with sufficient context about the user's learning goals
  • Integrating the different functionalities: the languages and modules we used were not always compatible with one another
  • ElevenLabs sometimes transcribed speech into other languages and picked up background sound effects
  • High latency from ElevenLabs calls, which noticeably slowed the application

Accomplishments that we're proud of

  • Facial expression recognition via a pragmatic workaround: rather than streaming a continuous live feed, we passed images to the Gemini API every 2 seconds
  • Minimisation of latency in retrieving speech output from ElevenLabs
  • Building the backend, authentication and database

What we learned

  • Importance of communication and active collaboration to ensure compatibility
  • How to build data schemas, manage a PostgreSQL user database and implement authentication
  • How to call external APIs, including ElevenLabs, Supabase and the Google Gemini API

What's next for Scolara

  • Reduce text-to-speech latency so that user expressions can inform teaching strategy closer to real time
  • Integration of interactive media types, e.g. videos, dynamic diagrams, fill-in-the-gaps exercises and timed challenges
  • More sophisticated emotion-adaptive learning for each mood
  • A wider range of recognised moods
  • Establishment of a closer partnership with students to understand their evolving needs.
  • Teaching plan structured around deadlines and learning goals
