
Contigo - Project Story

Inspiration

Most popular language learning apps share a core problem: they are "gamified" with points, streaks, and matching games. They feel productive, but they don't push you toward fluency. It's the same pattern as learning to code by only reading tutorials, or a baby bird learning to fly by watching from the nest. Applying knowledge is the most effective way to learn anything, and language learning is no different.

As a self-taught Spanish learner, I started out studying grammar, memorizing vocabulary, and completing every exercise in my grammar book, but I would freeze when trying to talk with a native speaker. My big break came on a trip to Spain, where I had endless opportunities to speak Spanish and sometimes no other choice. My fluency advanced further in a matter of weeks than it had in a whole year of studying. The difference was simple, and the best part was that these conversations with native speakers were just a daily part of my life there. It didn't feel like studying, and I didn't have to set aside extra time to practice grammar alone, because everything was being applied while I lived my life.

But not everyone can up and fly to Spain or another Spanish-speaking country and live there for an extended period. Finding someone to converse with consistently in your target language can be difficult, and even when you do find the opportunity to practice, there is the anxiety of making mistakes in front of a real person. For this reason I built Contigo: an AI voice tutor that feels like having a real conversation partner who happens to have infinite patience, adapts to your level in real time, and never makes you feel embarrassed for getting something wrong.

What it does

Contigo is a real-time, voice-based Spanish conversation partner powered by Gemini 3 and ElevenLabs. You speak, and it responds in natural Spanish adapted to your level.

Core experience: Open the app, pick your starting difficulty, and start talking. The voice agent responds with natural-sounding audio and adapts its complexity based on how you're doing, while also tracking your learning patterns across sessions.

Gemini 3 powers the brain behind the voice:

Real-time turn analysis with Thought Signatures that maintain reasoning continuity -- the AI remembers why you struggled with ser vs. estar three turns ago and references it later
Adaptive difficulty orchestration -- Gemini autonomously decides when to promote or ease the difficulty based on error patterns, confidence, and fluency signals
Cross-session learning synthesis -- using Gemini's 1M token context window, Contigo analyzes your full conversation history and synthesizes patterns across multiple sessions ("You've improved on subjunctive mood since last week")
Session summaries with extracted Spanish snippets, personal connections, and learning insights
Cultural reference detection -- Gemini identifies songs, idioms, and cultural references mentioned in conversation and saves them to your Reference Library

Features that make it feel human:

Cycled greetings -- each load into the home screen starts with a different authentic Spanish greeting from across the Spanish-speaking world (Argentina's "Che!", Venezuela's "Epa!", Mexico's "Que onda?"), with cultural context tooltips that inspire curiosity about regional diversity
Reference Library -- paste in song lyrics, a news article, or a story you love, and practice conversing about your interests. Learning stops feeling like homework when you're talking about things you actually care about
Live captions + translation -- blurred by default (to encourage listening), revealed on tap with one-click English translation for when you're stuck
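
The cycled greetings in the list above could be driven by something as small as a load counter. The sketch below is illustrative only: the `Greeting` type and `next_greeting` function are hypothetical names, the rotation order is an assumption, and only the three greetings themselves come from this write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Greeting:
    text: str
    region: str


# Greetings named in the write-up; the rotation logic is an assumption.
GREETINGS = [
    Greeting("Che!", "Argentina"),
    Greeting("Epa!", "Venezuela"),
    Greeting("Que onda?", "Mexico"),
]


def next_greeting(load_count: int) -> Greeting:
    """Return the greeting for the given home-screen load (0-based)."""
    return GREETINGS[load_count % len(GREETINGS)]
```

A simple modulo keeps the order predictable; picking at random would work just as well if repeats are acceptable.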

How we built it

Frontend: React 18 + Vite with Tailwind CSS. Designed with a warm aesthetic -- textured backgrounds, geometric color blocks -- inspired by Luis Barragán, a Mexican architect known for "Emotional Architecture." This design makes the app feel more like a creative space than a clinical learning tool. Motion (Framer Motion) provides fluid animations.

Voice Pipeline: ElevenLabs Conversational AI handles real-time speech-to-text and text-to-speech via WebSocket. Audio is captured as 16-bit PCM at 16kHz, streamed to the backend, and responses are queued for seamless playback.
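
To make the audio format concrete, here is a minimal sketch of producing 16-bit PCM at 16 kHz and splitting it into streamable chunks. It is not Contigo's actual capture code: mono audio, little-endian byte order, the 100 ms chunk size, and the function names `float_to_pcm16` and `chunk_pcm` are all assumptions made for illustration.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # from the write-up
BYTES_PER_SAMPLE = 2     # 16-bit PCM


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = []
    for s in samples:
        clamped = max(-1.0, min(1.0, s))  # clip out-of-range input
        ints.append(int(round(clamped * 32767)))
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm: bytes, chunk_ms: int = 100):
    """Split PCM bytes into fixed-duration chunks (the last may be shorter)."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]
```

At these settings one second of mono audio is 32,000 bytes, so a 100 ms chunk is 3,200 bytes.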

Backend (Voice Engine): Python FastAPI service that orchestrates the conversation. Each cluster of 4 turns is analyzed by Gemini 3 for grammar, vocabulary, and fluency patterns. Thought Signatures (persisted in Redis) maintain reasoning state across turns and sessions. The tutor service coordinates between ElevenLabs for voice, Gemini 3 for analysis, and PostgreSQL for learning history.
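
The four-turn clustering described above can be sketched as a small buffer that releases a batch once it fills. `TurnClusterBuffer` and its method names are hypothetical, and the real service may count or group turns differently; only the cluster size of 4 comes from this write-up.

```python
from dataclasses import dataclass, field

CLUSTER_SIZE = 4  # from the write-up: each cluster of 4 turns is analyzed


@dataclass
class TurnClusterBuffer:
    """Collects turns and hands back a full cluster once enough have arrived."""

    size: int = CLUSTER_SIZE
    _pending: list = field(default_factory=list)

    def add_turn(self, speaker: str, text: str):
        """Append a turn; return the completed cluster, or None if not full yet."""
        self._pending.append({"speaker": speaker, "text": text})
        if len(self._pending) < self.size:
            return None
        # Hand the full cluster to the caller and start a fresh buffer.
        cluster, self._pending = self._pending, []
        return cluster
```

The caller would send each returned cluster off for analysis while the conversation continues, so analysis never blocks the next spoken turn.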

Backend (Core API): Node.js Hono framework handling authentication (Google OAuth + JWT), session management, and the Reference Library CRUD API.

AI Architecture: Gemini 3 is the primary intelligence layer with Cerebras Llama 3.3-70B as a fallback. Gemini handles: turn cluster analysis (medium thinking), session summarization (high thinking), difficulty assessment, and cultural reference detection. The system uses configurable thinking levels -- quick analysis uses medium reasoning depth, while end-of-session synthesis uses high reasoning for deeper pattern recognition.
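
The primary-plus-fallback arrangement can be illustrated with a provider-agnostic wrapper. This is a sketch under assumptions, not the project's code: `analyze_with_fallback` is a hypothetical name, the two callables stand in for the Gemini and Cerebras clients, and falling back on any exception is just one possible policy.

```python
from typing import Callable


def analyze_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Try the primary model; if it raises, use the fallback instead.

    Returns (provider_label, response_text) so callers can log which
    model actually answered.
    """
    try:
        return "primary", primary(prompt)
    except Exception:
        # Assumed policy: any primary failure triggers the fallback.
        return "fallback", fallback(prompt)
```

Passing the clients in as callables keeps the wrapper testable without network access; a production version would likely add timeouts and narrower exception handling.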

Infrastructure: Production deploys to Google Cloud Run (backend) and Netlify (frontend) with Neon for managed PostgreSQL and Upstash for managed Redis.

Challenges we ran into

Thought Signature persistence: Getting Gemini's reasoning state to carry across sessions required a Redis-backed storage layer. This was a highly effective solution in getting the agent to feel more natural and create this intelligent "memory" that makes the conversation feel personal.
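
A rough sketch of that storage layer, treating the signature as an opaque string that is saved after one analysis and prepended to the next. Everything named here is an assumption: `InMemoryStore` stands in for a Redis client so the example runs without a server, and the key scheme and prompt wording are hypothetical.

```python
class InMemoryStore:
    """Stand-in for Redis so the sketch runs without a server."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def signature_key(user_id: str) -> str:
    # Hypothetical key scheme; the real one is not given in the write-up.
    return f"thought_signature:{user_id}"


def save_signature(store, user_id: str, signature: str) -> None:
    """Persist the latest reasoning state for this learner."""
    store.set(signature_key(user_id), signature)


def build_prompt(store, user_id: str, new_turns: list) -> str:
    """Prepend the stored signature (if any) to the new analysis prompt."""
    prior = store.get(signature_key(user_id))
    parts = []
    if prior:
        parts.append(f"Prior reasoning summary:\n{prior}")
    parts.append("New turns:\n" + "\n".join(new_turns))
    return "\n\n".join(parts)
```

Because the store only needs `get` and `set`, a managed Redis client could be swapped in without touching the prompt-building logic.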

Balancing AI depth vs. latency: Gemini 3 with high thinking produces remarkably better session summaries, but takes longer. We use tiered thinking levels -- medium for real-time mid-session analysis, high only for end-of-session synthesis where latency is acceptable.
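
One way to express this tiering is a lookup from task to thinking level. The task names and the `thinking_level_for` helper are hypothetical, and defaulting unknown tasks to medium is an assumption; only the medium/high split between mid-session analysis and end-of-session synthesis comes from this write-up.

```python
from enum import Enum


class ThinkingLevel(str, Enum):
    MEDIUM = "medium"
    HIGH = "high"


# Task names are illustrative; the medium/high split follows the write-up.
TASK_THINKING = {
    "turn_cluster_analysis": ThinkingLevel.MEDIUM,  # real-time, latency-sensitive
    "session_summary": ThinkingLevel.HIGH,          # end of session, latency tolerated
}


def thinking_level_for(task: str) -> ThinkingLevel:
    """Look up the level for a task, defaulting to MEDIUM (an assumption)."""
    return TASK_THINKING.get(task, ThinkingLevel.MEDIUM)
```

Keeping the mapping in one table makes the latency trade-off easy to audit and adjust per task.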

Primary Beginner level: A beginner with little to no knowledge still needs some background in vocabulary and grammar to form even a few sentences. For a long time, the beginner-level voice agent would not slow down, and it carried on conversations with complex grammar structures and non-beginner vocabulary. Tuning this agent down to a very basic conversational partner took several strategies, including limiting vocabulary, detecting when the user is struggling heavily in beginner mode, and detailed prompting.

Accomplishments that we're proud of

Thought Signatures. Gemini references earlier turns naturally -- "You're still mixing up ser/estar like you did earlier" -- which makes the tutoring feel coherent rather than stateless.
The adaptive difficulty system makes autonomous decisions. Gemini analyzes error rates, confidence, and fluency signals, then decides independently whether to promote or ease the difficulty. No hard-coded thresholds for the AI decision -- the model reasons through it.
Cross-session synthesis is useful. After multiple sessions, summaries reference your trajectory: "You've made significant progress on past tense conjugation since your first session." This is only possible because of Gemini 3's 1M token context.
It feels like a real conversation, not a drill. The combination of natural voice, adaptive difficulty, and personal interests through the Reference Library creates something that feels closer to chatting with a friend than doing exercises.
Increased immersion. The Reference Library lets users import their own interests, turning a learning experience into further exploration and immersion. For example, if I want to understand what Bad Bunny says in one of his songs, I can paste the lyrics and learn along with the voice agent. Although a very small feature, the cycled greetings from various Spanish-speaking countries are another way Contigo aims to unlock discovery. The user sees a greeting they don't know, or one they want to start using, and this can lead down a whole rabbit hole of learning about that region's culture. This genuine interest and thirst for discovery is what keeps learners inspired and prevents burnout.

What we learned

Thinking levels matter. The difference between Gemini 3 with low vs. high thinking on session analysis is dramatic. High thinking catches subtle patterns (gender agreement errors only with indirect objects) that low thinking misses entirely.
Thought Signatures are a form of memory. By persisting the model's own reasoning summary and feeding it back as context, you get a lightweight but effective form of cross-session continuity without fine-tuning.
Voice-first changes everything about language learning UX. When the interface is conversation, not forms, learning feels easier and low-stress. The blurred captions with revealable translations were a deliberate choice to keep the focus on listening.

What's next for Contigo

Multi-language support: The architecture is language-agnostic, so expanding to other languages should be quick
Classroom-style system: In a classroom setting, Contigo could assist a professor by compiling each student's level, interests, and common mistakes to help guide the direction of lessons, provide insights about each student, and give students a tool to practice outside the classroom
Overall systematic tuning: Further refining agentic prompts, priming guardrails, and improving access for new beginners to ease the start of their language learning journey
Situational nudges: Although agents can remember personal facts about the user across sessions, the next step is for the agent to really push to apply them ("You mentioned you're visiting Mexico City next month -- have you practiced ordering food?")
Pronunciation scoring: Using Gemini's audio understanding capabilities for accent and pronunciation feedback
Mobile app: React Native with offline-first capabilities for practicing on the go
Open-source curriculum: Community-contributed conversation topics and cultural deep-dives
Weekly Spotlight articles: Highlighted and relevant Spanish-language content to spark new conversation topics

Gemini 3 Integration Description

Contigo uses Gemini 3 as its core intelligence layer for real-time language learning analysis. Five distinct Gemini-powered capabilities drive the experience:

1. Turn Cluster Analysis -- Every 4 conversational turns are analyzed by Gemini 3 with medium-depth thinking to identify grammar errors, vocabulary gaps, and fluency patterns. Results include learner-facing suggestions and tutor-facing guidance for mid-session adaptation.

2. Thought Signatures -- Gemini's reasoning state is persisted to Redis and re-injected as prior context in subsequent analyses, creating cross-turn and cross-session reasoning continuity. The model naturally references earlier struggles.

3. Adaptive Difficulty Assessment -- Gemini autonomously evaluates whether a learner should be promoted or eased to a different difficulty level, reasoning through error rates, confidence signals, and conversation complexity.

4. Session Summarization -- Using Gemini 3's 1M token context with high-depth thinking, full conversation transcripts plus prior session summaries are synthesized into structured learning reports with Spanish snippets, personal connections, and longitudinal insights.

5. Cultural Reference Detection -- Gemini identifies songs, idioms, movies, and cultural references mentioned in conversation, extracting them into the learner's Reference Library for future study.
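
Item 3 above leaves the promote-or-ease judgment to the model, so the surrounding code only has to apply whatever decision comes back. The sketch below shows that last step under assumptions: the three-level ladder, the decision strings "promote", "ease", and "hold", and the `apply_decision` name are all hypothetical.

```python
# Assumed difficulty ladder; the write-up names beginner, intermediate,
# and advanced experiences but does not specify the exact set.
LEVELS = ["beginner", "intermediate", "advanced"]

# Assumed vocabulary for the model's decision.
STEPS = {"promote": 1, "ease": -1, "hold": 0}


def apply_decision(current: str, decision: str) -> str:
    """Move one step along LEVELS per the model's decision, clamped at the ends."""
    if decision not in STEPS:
        raise ValueError(f"unknown decision: {decision!r}")
    idx = LEVELS.index(current) + STEPS[decision]
    return LEVELS[max(0, min(len(LEVELS) - 1, idx))]
```

Clamping keeps a stray "ease" at the beginner level, or "promote" at the top level, from producing an invalid state, while the judgment itself stays with the model.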

Built With

Updates

Matt Evitts started this project — Feb 09, 2026 07:56 PM EST
