Inspiration 💡

For nearly 10 years as a TESOL educator, I've heard the same question echoed across classrooms from Ulaanbaatar to Izmir, from Saigon to Kuala Lumpur: "How do I improve my speaking?"

The answer has always been frustratingly simple yet impossibly difficult: practice with real people. But for millions of learners worldwide, that's not an option. They're too shy to approach native speakers. They lack access to conversation partners in their communities. They can't afford private tutors or language schools.

SpeakFlow is my answer to that decade-long question. It's an attempt to democratize what was once a privilege: getting personalized, expert feedback on your speaking skills. No matter where you live, no matter your income, you deserve the chance to practice, to improve, and to be heard.

This isn't just a language learning tool—it's a bridge to opportunity, confidence, and connection for every learner who's ever felt their voice trapped behind a language barrier.


What it does

SpeakFlow is an AI-powered English speaking practice platform that provides learners with:

  • Real-time voice conversations with adaptive AI tutors powered by ElevenLabs, offering natural dialogue across diverse topics (casual, business, academic, travel, technology, health)
  • Automated CEFR placement testing using official Cambridge University assessment rubrics to accurately determine proficiency from A1 (beginner) to C2 (proficient)
  • Native audio analysis via Google Vertex AI Gemini 2.0 Flash that evaluates pronunciation, intonation, and prosody—not just text transcription
  • Comprehensive multi-criteria feedback across five key speaking dimensions: Range (vocabulary), Accuracy (grammar), Fluency (tempo & continuity), Interaction (turn-taking), and Coherence (organization)
  • Multilingual feedback support in 40+ languages so learners can understand suggestions in their native language
  • Progress tracking dashboard with session history and performance analytics to monitor improvement over time

Users can assess their level in 90 seconds, receive detailed evidence-based feedback, and practice at CEFR-appropriate difficulty levels tailored to their proficiency.
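
To make the shape of that feedback concrete, here is a minimal TypeScript sketch of how the six CEFR levels and five criteria could be modelled. The type names and the `weakestCriterion` helper are hypothetical and only illustrate the idea; they are not taken from the SpeakFlow codebase.

```typescript
// Hypothetical data model for criterion-level feedback (illustrative only).
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
type CefrLevel = (typeof CEFR_LEVELS)[number];

const CRITERIA = ["range", "accuracy", "fluency", "interaction", "coherence"] as const;
type Criterion = (typeof CRITERIA)[number];

interface CriterionFeedback {
  level: CefrLevel;      // level judged for this criterion alone
  evidence: string[];    // quotes or observations from the learner's speech
  suggestions: string[]; // actionable next steps
}

type SpeakingFeedback = Record<Criterion, CriterionFeedback>;

// Returns the criterion with the lowest CEFR level, i.e. a sensible focus
// for the next practice session. Ties resolve to the first criterion listed.
function weakestCriterion(feedback: SpeakingFeedback): Criterion {
  let weakest: Criterion = CRITERIA[0];
  for (const criterion of CRITERIA) {
    const rank = CEFR_LEVELS.indexOf(feedback[criterion].level);
    const weakestRank = CEFR_LEVELS.indexOf(feedback[weakest].level);
    if (rank < weakestRank) {
      weakest = criterion;
    }
  }
  return weakest;
}
```

Keeping one level per criterion, rather than a single overall score, is what would let a dashboard show that a learner is, say, B2 for vocabulary but only B1 for fluency.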


How I built it 🛠️

Frontend:

  • Next.js 16 (App Router) and React 19 for modern, performant UI
  • TypeScript for type safety across the entire codebase
  • Tailwind CSS for responsive, beautiful design
  • Framer Motion for smooth animations
  • Recharts for data visualization in progress tracking

AI & Backend:

  • ElevenLabs Conversational AI for ultra-low latency, natural voice interactions with level-specific agents (A1-C2 + assessment agent)
  • Google Cloud Vertex AI with Gemini 2.0 Flash for native multimodal audio analysis (processes raw audio, not just transcripts)
  • Firebase Authentication for secure Google Sign-In
  • Firestore for real-time database and session storage

Key Implementation Details:

  • Custom CEFR assessment prompts based on official Cambridge University ESOL descriptors (Table 5.5)
  • Unbiased assessment function that evaluates across the full A1-C2 range without anchor bias
  • Audio recording and processing pipeline that captures conversations and sends both audio + transcript to Vertex AI
  • Multi-agent system with 7 different ElevenLabs agents configured for different CEFR levels and assessment
  • Comprehensive feedback generation system that provides criterion-specific, actionable suggestions
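
As a rough illustration of the multi-agent setup, the sketch below resolves one agent ID per CEFR level plus the assessment agent from environment variables. The variable naming scheme (`ELEVENLABS_AGENT_<KEY>`) and the `loadAgentIds` function are assumptions made for this example; the real project may name and load them differently.

```typescript
// Hypothetical agent lookup: six CEFR agents plus one assessment agent.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
type CefrLevel = (typeof CEFR_LEVELS)[number];
type AgentKey = CefrLevel | "ASSESSMENT";

// Passing the environment in (instead of reading a global) keeps this testable.
type Env = Record<string, string | undefined>;

// Assumed naming convention, e.g. ELEVENLABS_AGENT_B1.
function envVarName(key: AgentKey): string {
  return `ELEVENLABS_AGENT_${key}`;
}

// Reads all seven agent IDs and fails fast, listing every missing variable.
function loadAgentIds(env: Env): Record<AgentKey, string> {
  const keys: AgentKey[] = [...CEFR_LEVELS, "ASSESSMENT"];
  const ids = {} as Record<AgentKey, string>;
  const missing: string[] = [];

  for (const key of keys) {
    const value = (env[envVarName(key)] ?? "").trim();
    if (value.length > 0) {
      ids[key] = value;
    } else {
      missing.push(envVarName(key));
    }
  }

  if (missing.length > 0) {
    throw new Error(`Missing agent IDs: ${missing.join(", ")}`);
  }
  return ids;
}
```

Failing at startup with the full list of missing variables tends to be easier to debug than discovering a missing agent only when a learner at that level starts a session.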

Challenges I ran into 🧗

  1. Native Audio Analysis Integration - Getting Vertex AI Gemini 2.0 Flash to properly process raw audio files required careful handling of audio buffer encoding, MIME types, and multimodal prompting. I had to ensure the AI could analyze prosody, pronunciation, and intonation beyond just text transcription.
  2. Eliminating Assessment Bias - Creating a truly unbiased CEFR assessment was challenging. I had to design prompts that evaluate speakers across the full A1-C2 spectrum without "anchoring" to a preset level, ensuring accurate placement for both absolute beginners and near-native speakers.
  3. Real-time Conversation Flow - Orchestrating seamless transitions between user speech, AI processing, and agent responses while managing audio recording, transcription, and state required careful event handling and timing coordination with ElevenLabs' conversational AI SDK.
  4. Type Safety with AI Responses - Ensuring type-safe handling of Vertex AI responses was tricky, especially when the AI sometimes returned strings instead of arrays for feedback items. I implemented robust type checking and conversion utilities to handle edge cases.
  5. Multi-Agent Configuration - Managing 7 different ElevenLabs agents (one for each CEFR level plus assessment) with topic-specific prompts and CEFR-appropriate language required systematic configuration management and environment variable handling.
  6. Audio Recording & Processing - Implementing reliable browser-based audio capture, handling the WebM output that browsers produce, and converting it to base64 for API transmission while maintaining quality was technically demanding.
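
To illustrate the encoding and MIME-type handling mentioned in items 1 and 6, here is a minimal sketch that turns a base64 data URL (for example, the result of reading a recorded blob as a data URL) into an inline audio part with a clean MIME type. The `dataUrlToAudioPart` name is hypothetical, and the `inlineData` shape is my assumption about how the audio would be attached to a multimodal request; treat it as a sketch rather than the project's actual pipeline.

```typescript
// Assumed shape of an inline audio part in a multimodal request.
interface InlineAudioPart {
  inlineData: { mimeType: string; data: string };
}

// Splits "data:audio/webm;codecs=opus;base64,<payload>" into a bare MIME type
// (codec parameters dropped) and the raw base64 payload.
function dataUrlToAudioPart(dataUrl: string): InlineAudioPart {
  const comma = dataUrl.indexOf(",");
  if (dataUrl.indexOf("data:") !== 0 || comma === -1) {
    throw new Error("Expected a data URL");
  }

  // Header looks like "audio/webm;codecs=opus;base64".
  const header = dataUrl
    .slice("data:".length, comma)
    .split(";")
    .map((segment) => segment.trim());

  const mimeType = (header[0] ?? "").toLowerCase();
  if (header.indexOf("base64") === -1) {
    throw new Error("Expected a base64-encoded data URL");
  }
  if (mimeType.indexOf("audio/") !== 0) {
    throw new Error(`Expected an audio MIME type, got "${mimeType}"`);
  }

  const data = dataUrl.slice(comma + 1);
  if (data.length === 0) {
    throw new Error("Audio payload is empty");
  }
  return { inlineData: { mimeType, data } };
}
```

Validating the MIME type and payload before the request is sent surfaces recording problems in the browser, where they are cheaper to diagnose than an opaque API error.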

Accomplishments that I'm proud of 🏆

  1. Evidence-Based Assessment - I successfully implemented authentic CEFR assessment using official Cambridge University standards, not ad-hoc criteria. This makes SpeakFlow's feedback credible and aligned with international language proficiency frameworks.
  2. Three-Way Service Integration - I integrated three major cloud services (ElevenLabs, Google Vertex AI, Firebase) into a cohesive, production-ready platform in a hackathon timeframe.
  3. Native Audio Analysis - To my knowledge, SpeakFlow is one of the few language learning platforms that analyze audio directly for pronunciation and prosody insights, going beyond simple text-based evaluation.
  4. Natural Conversations - Using ElevenLabs Conversational AI, I created truly natural, adaptive voice interactions that feel like talking to a human tutor, not a robotic Q&A system.
  5. Multilingual Accessibility - Supporting feedback in 40+ languages makes the platform accessible to learners worldwide, regardless of their native language.
  6. Full-Stack TypeScript Application - I built a type-safe, scalable application with modern React patterns, comprehensive error handling, and a polished user experience.

What I learned 💡

  • Multimodal AI has immense potential - Native audio analysis with Gemini 2.0 Flash opened my eyes to possibilities beyond text-based assessment. Analyzing prosody, intonation, and pronunciation provides insights that transcription alone cannot capture.
  • Assessment design is hard - Creating unbiased, accurate language proficiency tests requires deep understanding of linguistic frameworks and careful prompt engineering. I learned how important it is to avoid anchor bias in AI assessment.
  • Real-time voice AI is transformative - ElevenLabs' Conversational AI showed me how natural and responsive voice interactions can be. The ultra-low latency makes practice sessions feel like real conversations.
  • User authentication and session management - Implementing secure Firebase authentication, session cookies, and usage limits taught me valuable lessons about production-ready user management.
  • The importance of evidence-based design - Using official Cambridge CEFR descriptors gave my feedback credibility and structure. Building on established educational frameworks is more valuable than inventing my own criteria.
  • Type safety matters in AI applications - Working with unpredictable AI outputs reinforced the importance of robust type checking, validation, and graceful error handling.
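
As an example of the kind of validation the last point refers to, the sketch below normalizes a feedback field that may arrive as an array, as a single string, or as something else entirely. The `toStringArray` name and the bullet-stripping rules are assumptions for illustration, not the project's actual utility.

```typescript
// Coerces an unpredictable model output into a clean list of feedback items.
function toStringArray(value: unknown): string[] {
  if (Array.isArray(value)) {
    // Keep only string entries, then drop blanks.
    return value
      .filter((item): item is string => typeof item === "string")
      .map((item) => item.trim())
      .filter((item) => item.length > 0);
  }

  if (typeof value === "string") {
    // Treat each line as one item and strip leading list markers
    // such as "-", "*", "•", "1." or "2)".
    return value
      .split(/\r?\n/)
      .map((line) => line.replace(/^\s*(?:[-*•]|\d+[.)])\s*/, "").trim())
      .filter((line) => line.length > 0);
  }

  // Anything else (null, numbers, objects) yields an empty list
  // so the UI can render "no suggestions" instead of crashing.
  return [];
}
```

Returning an empty array for unexpected input is a deliberate trade-off: it favors a degraded but working feedback screen over a hard failure in the middle of a session.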

What's next for Speak Flow 🚀

Short-term:

  • Pronunciation-specific feedback with detailed phoneme-level analysis and practice drills for problem sounds
  • More conversation scenarios including job interviews, academic presentations, medical consultations, and customer service interactions
  • Adaptive in-conversation difficulty where AI dynamically adjusts complexity based on user performance mid-conversation

Medium-term:

  • Mobile applications with native iOS and Android apps for practice on-the-go
  • Curriculum integration with alignment to ESL/EFL textbooks and structured learning paths
  • Group conversation practice for multi-user sessions and collaborative speaking
  • Vocabulary and grammar insights with detailed analysis of lexical range and grammatical patterns

Long-term:

  • Gamification and achievements with streaks, badges, and challenges to boost motivation
  • Community features to connect learners for peer practice and language exchange
  • Teacher dashboard with tools for educators to assign practice, monitor student progress, and customize feedback
  • Accent training with specialized modules for specific accent goals (American, British, Australian, etc.)
  • Speaking exam preparation with targeted practice for IELTS, TOEFL, and Cambridge exams

My vision is to make high-quality, personalized English speaking practice accessible to every learner worldwide, regardless of geography or income.
