Inspiration

Right now, AI is one of the biggest issues facing education in Canada. According to a 2024 KPMG study, six in ten Canadian students use generative AI for their schoolwork, and universities across the country are scrambling to figure out how to respond, with some reverting to pen-and-paper exams and others experimenting with oral assessments. The problem isn't that students have access to AI. It's that the most common way students use it is as a shortcut: paste the question in, copy the answer out, learn nothing. We thought that was a massive missed opportunity. AI shouldn't be replacing the learning process; it should be embedded in it. The best tutors don't communicate through text boxes. They talk to you, react to what you're writing, and adjust their tone based on whether you're stuck or on a roll. We were inspired by the idea of an AI that feels less like a chatbot and more like a real person sitting next to you: one that watches your work, speaks to you naturally, listens to you think out loud, and guides you toward the answer without ever just giving it to you. Instead of AI doing the work for students, we wanted to build AI that helps students do the work themselves, through real conversation, not copy-paste.

What it does

Tandem is a full-cycle AI study platform that teaches, tutors, and tracks student learning through three core features.

Lecture Mode: Students upload their course materials (notes, textbooks, slides, lecture recordings) and Tandem generates a full visual lesson, complete with dynamically created slides and a voice-over that teaches the content like a professor would. It doesn't just summarize the material; it walks through concepts, builds on ideas, and explains them in a structured, spoken format while letting students freely ask clarifying questions.
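As a sketch of the last step of this pipeline, here is how structured lesson output from the model might be parsed into renderable slides. The JSON schema and field names below are illustrative assumptions, not Tandem's actual format:

```python
import json
from dataclasses import dataclass

@dataclass
class Slide:
    title: str
    bullets: list[str]
    narration: str  # script later sent to voice synthesis


def parse_lesson(raw: str) -> list[Slide]:
    """Parse structured lesson output into renderable slides.

    Expects JSON shaped like:
      {"slides": [{"title": ..., "bullets": [...], "narration": ...}, ...]}
    This schema is a hypothetical example for illustration.
    """
    data = json.loads(raw)
    slides = []
    for s in data.get("slides", []):
        slides.append(Slide(
            title=s.get("title", "Untitled"),
            bullets=list(s.get("bullets", [])),
            narration=s.get("narration", ""),
        ))
    return slides
```

Each `Slide` then maps onto one visual in the lesson, with `narration` handed off to voice synthesis.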

Attention Tracking: During lecture recordings, Tandem uses facial landmark detection to monitor student engagement in real time. When it detects that a student's attention dropped during a section, it flags those moments so that follow-up lessons are generated with extra emphasis on the missed sections.
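The flagging step can be sketched as a small function over timestamped engagement scores. The score range, threshold, and minimum duration below are illustrative assumptions, not the values Tandem uses:

```python
def flag_low_attention(samples, threshold=0.5, min_duration=5.0):
    """Return (start, end) intervals where attention stayed below
    threshold for at least min_duration seconds.

    samples: iterable of (timestamp_seconds, attention_score) pairs,
    with scores assumed to be in [0, 1] from the landmark model.
    """
    intervals = []
    start = None
    last_t = None
    for t, score in samples:
        if score < threshold:
            if start is None:
                start = t      # a low-attention stretch begins
            last_t = t
        else:
            # stretch ended; keep it only if it was long enough
            if start is not None and last_t - start >= min_duration:
                intervals.append((start, last_t))
            start = None
    # close out a stretch that runs to the end of the recording
    if start is not None and last_t - start >= min_duration:
        intervals.append((start, last_t))
    return intervals
```

The returned intervals are what get surfaced back to the student and fed into lesson generation.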

Whiteboard Tutoring: After learning the material, students move to a real-time digital whiteboard to practice problems. An AI tutor named Artie watches their work through computer vision, listens to them speak, and responds with natural voice, all in real time. If the student is on track, Artie encourages them. If they make a mistake, Artie gently points out where things went wrong and asks a guiding question to help them self-correct. When they get the answer right, Artie celebrates. The entire interaction happens through voice conversation, not text boxes. After questions are completed, students are given scores for specific skills, and new questions are adjusted based on the student's profile.
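The decision Artie makes after each whiteboard check can be sketched as a mapping from an analysis verdict to a spoken intent plus an interrupt flag. The verdict and intent names here are hypothetical labels for illustration, not the real prompt payloads:

```python
from enum import Enum

class Verdict(Enum):
    ON_TRACK = "on_track"
    MISTAKE = "mistake"
    FINAL_ANSWER = "final_answer"
    NO_CHANGE = "no_change"


def tutor_action(verdict: Verdict) -> tuple[str, bool]:
    """Map a whiteboard-analysis verdict to (intent, interrupt?).

    Only mistakes and final answers are urgent enough to cut off
    Artie's current speech; routine progress just queues encouragement.
    """
    if verdict is Verdict.MISTAKE:
        return ("ask_guiding_question", True)
    if verdict is Verdict.FINAL_ANSWER:
        return ("celebrate", True)
    if verdict is Verdict.ON_TRACK:
        return ("encourage", False)
    return ("stay_silent", False)
```

The intent string would then be folded into the voice agent's context so the next utterance matches what just happened on the board.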

Together, these three features create a complete study loop: learn the material, identify weak spots, and practice with guided support.

How we built it

The frontend is built with Next.js, React, TypeScript, and Tailwind CSS. The whiteboard is powered by tldraw, which gives us a smooth drawing canvas that we can snapshot after each stroke. Math expressions in the problem display are rendered with KaTeX.

When a student draws on the board, the snapshot is sent to a Python backend built with FastAPI. Google Gemini analyzes the whiteboard image against the current problem and works with an ElevenLabs agent to detect whether the student's work is correct, contains a mistake, or has reached the final answer.

Voice interaction runs through ElevenLabs' conversational AI. Artie listens to the student's speech and responds naturally, and we pipe the whiteboard analysis into the conversation context so Artie can react to what the student writes, even interrupting itself mid-sentence to address new work on the board.

Lecture generation takes uploaded course materials and uses Gemini to produce structured lesson content, which is then rendered as visual slides and narrated with ElevenLabs voice synthesis to deliver a full spoken lesson.

Attention tracking is handled by MediaPipe and the Presage SDK on the frontend, using facial landmark detection to monitor whether the student is engaged. Sections where attention drops are flagged so the student can revisit them.

The backend runs on Google Cloud and orchestrates session state, problem sequencing, and file processing, with Presage handling document summarization.

Full tech stack: TypeScript, Python, Next.js, React, Tailwind CSS, Google Cloud, Gemini, ElevenLabs, Presage, FastAPI, tldraw, KaTeX, MediaPipe.
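One small but necessary piece of the snapshot pipeline is not re-analyzing a board that hasn't changed. That can be sketched as a content-hash gate on the backend; this is a simplification under our own assumptions, not the exact batching logic Tandem uses:

```python
import hashlib

class SnapshotGate:
    """Forward a whiteboard snapshot to the vision model only when
    its content actually changed since the last analyzed snapshot.

    A real deployment would likely also debounce by time; this sketch
    deduplicates purely by content hash.
    """

    def __init__(self):
        self._last_digest = None

    def should_analyze(self, png_bytes: bytes) -> bool:
        digest = hashlib.sha256(png_bytes).hexdigest()
        if digest == self._last_digest:
            return False  # identical board: skip the model call
        self._last_digest = digest
        return True
```

The gate sits in front of the Gemini call, so repeated snapshots of an idle board cost nothing.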

Challenges we ran into

The hardest challenge was getting the ElevenLabs voice agent and the whiteboard analysis to work together seamlessly in real time. The voice conversation and the visual analysis are two completely independent systems that need to stay in sync. When a student writes something new on the board, Artie needs to immediately know about it and react, even if it's in the middle of saying something else. We had to build a system where whiteboard observations are injected into the voice conversation as contextual updates, and when something important happens (like a mistake or a correct answer), we forcefully interrupt the AI's current speech so it can pivot to address what just appeared on the board. Getting the timing, queuing, and context injection right between Gemini and the ElevenLabs agent so that conversations felt natural rather than robotic took significant iteration.
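The queue-plus-interruption behavior described above can be sketched as a small context injector: routine observations wait their turn, while urgent events jump the queue and signal that current speech should be cut off. Event names and shapes here are assumptions for illustration:

```python
from collections import deque

class ContextInjector:
    """Queue whiteboard observations for the voice agent.

    Routine updates are delivered between utterances; high-priority
    events (a mistake or a final answer) jump to the front and signal
    that the agent's current speech should be interrupted.
    """

    INTERRUPTING = {"mistake", "final_answer"}

    def __init__(self):
        self._queue = deque()

    def push(self, event_type: str, detail: str) -> bool:
        """Enqueue an observation; return True if speech should be cut off."""
        if event_type in self.INTERRUPTING:
            self._queue.appendleft((event_type, detail))
            return True
        self._queue.append((event_type, detail))
        return False

    def next_context(self):
        """Pop the next observation to inject, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

In practice the popped observation is formatted into a contextual update for the ElevenLabs agent, which is what lets Artie pivot mid-sentence.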

Accomplishments that we're proud of

We're really proud of how all the different APIs and systems come together into one cohesive experience. ElevenLabs for voice, Gemini for vision, MediaPipe and Presage for attention tracking, tldraw for the whiteboard: each one is complex on its own, but making them all talk to each other in real time and feel like a single unified tutor was the real accomplishment. The attention tracking feature is something we're particularly proud of. Using facial landmarks to gauge student engagement during lectures and then surfacing the exact moments where focus dropped gives students something actionable, not just a vague sense that they should study more. The fact that Artie guides students toward answers rather than giving them away is also something we worked hard to get right through careful prompt engineering. It's easy to make an AI that answers questions. It's much harder to make one that teaches.

What we learned

We learned a lot about the complexity of orchestrating multiple real-time AI systems. Individually, each API is well-documented and straightforward. But combining voice AI, computer vision, lecture generation, attention tracking, and a live drawing canvas into a single interactive loop exposed a ton of edge cases around timing, state management, and context windows. We also gained a deeper appreciation for prompt engineering in a tutoring context. Getting an AI to be a good teacher is a fundamentally different challenge than getting it to be a good assistant.

What's next for Tandem

The immediate next step is supporting live drawing transmission from external devices like iPads, so students can write with a stylus on their tablet and have it stream directly into the Tandem whiteboard. This would make the experience feel much more natural for students who prefer pen-on-screen over mouse drawing. Beyond that, we want to add session analytics so students can track their progress over time and explore multiplayer tutoring sessions where Artie can guide a small group of students working through problems together.
