Inspiration
I noticed that most language and speech apps feel like talking to a brick wall. You record a clip and wait for a loading spinner. You get a grade. I wanted to build something that felt like a real human coach sitting across from you. I wanted a mentor who interrupts you the second you make a mistake and guides you through the physical mechanics of speech.
What it does
Fluently is a real-time multimodal speech coaching app built around an AI coach named Aura. Unlike standard apps, it uses a hybrid interface: a gamified dashboard for progress and goal setting that transitions into an immersive Gemini Live session. Aura proactively leads the conversation, listens for the user's phonetic accuracy, and gives instant verbal feedback without the user ever having to touch a submit button.
How we built it
I architected Fluently to be Google Cloud native from the ground up. I used the Gemini Multimodal Live API via the Google GenAI SDK to handle full-duplex audio streaming. Gemini 1.5 Pro serves as the brain. The session relies on server-side Voice Activity Detection to handle natural conversation flow and interruptions. I hosted the logic on Google Cloud Run and used Firestore to store user profiles and lesson progress. The frontend is a React-based UI styled with Tailwind CSS to create a bubbly experience that remains responsive while the heavy audio lifting happens in the background.
Challenges we ran into
The biggest hurdle was latency and audio clarity. Early prototypes had crackling audio and robotic pauses. I had to dig deep into PCM audio formatting and WebSocket synchronization to ensure the microphone data reached Gemini in a high-fidelity format. Audio was processed in chunks of length L, where L ≈ 100 ms, to balance latency against per-message overhead. I also struggled with the silence problem, where the AI would wait for the user to lead. I solved this by implementing a Proactive Lesson State Machine that forces the agent to initiate and drive the session.
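The chunking step can be illustrated with a minimal Python sketch. It is not the project's actual code: the 16 kHz mono, 16-bit little-endian format and the helper names float_to_pcm16 and chunk_pcm are assumptions made for illustration, and only the roughly 100 ms chunk length comes from the write-up.

```python
import struct
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000   # assumed input rate, not stated in the write-up
CHUNK_MS = 100            # chunk length L from the write-up (about 100 ms)
BYTES_PER_SAMPLE = 2      # 16-bit PCM


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM."""
    # Clip first so out-of-range samples do not wrap around and click.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm: bytes,
              sample_rate_hz: int = SAMPLE_RATE_HZ,
              chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration slices of a PCM buffer; the last may be shorter."""
    chunk_bytes = sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


if __name__ == "__main__":
    one_second_of_silence = float_to_pcm16([0.0] * SAMPLE_RATE_HZ)
    sizes = [len(c) for c in chunk_pcm(one_second_of_silence)]
    print(len(sizes), sizes[0])
```

Under these assumptions each 100 ms chunk is 3,200 bytes, so one second of audio becomes ten WebSocket messages. Smaller chunks reduce latency but add per-message overhead, which is the trade-off described above.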
Accomplishments that we're proud of
I am incredibly proud of achieving barge-in functionality. Being able to interrupt the AI mid-sentence creates a live factor that feels like magic. The AI stops and listens immediately. This required fine-tuning the threshold on the probability P(S), where S is the event that the user has started speaking. I also successfully bridged a traditional UI with a high-performance WebSocket stream. This makes the app feel like a polished product rather than just a technical demo.
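The idea of thresholding P(S) can be sketched generically. The write-up does not say how P(S) is computed or which value was chosen, so the BargeInDetector class, the 0.6 threshold, and the three-frame requirement below are all hypothetical. This is one common way to debounce a speech-probability signal, not a description of Gemini's server-side detector.

```python
from dataclasses import dataclass, field


@dataclass
class BargeInDetector:
    """Signals barge-in once P(S) stays above a threshold for several frames."""
    threshold: float = 0.6            # hypothetical value, tuned by ear in practice
    min_consecutive_frames: int = 3   # hypothetical debounce length
    _streak: int = field(default=0, init=False, repr=False)

    def update(self, speech_probability: float) -> bool:
        """Feed one frame's P(S); return True when playback should stop."""
        if speech_probability >= self.threshold:
            self._streak += 1
        else:
            # A single quiet frame resets the streak, filtering short noises.
            self._streak = 0
        return self._streak >= self.min_consecutive_frames


if __name__ == "__main__":
    detector = BargeInDetector()
    for p in [0.9, 0.2, 0.7, 0.8, 0.9]:
        print(p, detector.update(p))
```

A lower threshold or shorter streak makes the coach stop sooner but raises the chance that a cough or background noise cuts her off. A higher one does the opposite.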
What we learned
I learned that in the world of Live Agents the user experience is defined by timing. A 500 ms delay is the difference between a natural coach and a laggy bot. I also gained a deep understanding of how to ground a generative model using structured system instructions. This keeps the coaching advice linguistically accurate and stops the model from hallucinating pronunciations.
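One way to picture a structured system instruction is a prompt assembled from vetted lesson data rather than free text. The sketch below is an assumption about the general approach, not the prompt Fluently actually uses: TargetWord, build_system_instruction, and the wording of the rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetWord:
    word: str
    ipa: str   # reference pronunciation supplied by the lesson author
    tip: str   # articulation hint the coach may repeat


def build_system_instruction(coach_name: str, targets: list[TargetWord]) -> str:
    """Assemble a system instruction that pins the coach to vetted data."""
    lines = [
        f"You are {coach_name}, a speech coach.",
        "Only use the pronunciations listed under TARGETS; never invent one.",
        "If a word is not listed, say you are unsure instead of guessing.",
        "TARGETS:",
    ]
    for target in targets:
        lines.append(f"- {target.word} | IPA: {target.ipa} | tip: {target.tip}")
    return "\n".join(lines)


if __name__ == "__main__":
    lesson = [TargetWord("think", "/θɪŋk/", "tongue tip between the teeth")]
    print(build_system_instruction("Aura", lesson))
```

Because the reference pronunciations come from the lesson data, the model is asked to repeat vetted facts instead of generating them, which is the grounding effect described above.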
What's next for Fluently
I want to reintegrate the Vision component to its full potential. I will use Gemini to analyze lip and tongue placement via the camera feed in real time. I also plan to add Veo-generated modeling clips. The app will generate a custom video of a human mouth pronouncing the exact word the user is struggling with. This provides a truly 360-degree multimodal learning experience.
Built With
- geminiapi
- react
- tailwind