💡 Inspiration

As students and lifelong learners, we've all faced the same problem: a mountain of documents to read. Whether it's a 40-slide lecture deck, a dense research paper, or a company onboarding manual, the process of reading alone is inherently passive and isolating.

You can't ask questions when you're confused.
You can't get a concept explained in a different way.
And you have no way of knowing if you actually understood the material.

The inspiration for Leksa came from that exact frustration.

What if every document could have its own teacher?

When we discovered the Gemini Live API with its real-time voice interaction and barge-in interruption capabilities, we realized we could build something powerful: an AI that doesn't just read documents — it teaches them.

Leksa transforms a lonely reading session into an interactive classroom experience.


🤖 What It Does

Leksa is your personal AI voice teacher that transforms documents into immersive real-time lectures.

Instead of silently reading a document, you can listen, interact, and ask questions, just like in a real classroom.

How it works

Upload Your Document
Upload any PDF or PowerPoint presentation.

Listen to Your Lecture
An AI teacher begins explaining the content out loud in a natural conversational voice.

Interrupt Anytime
If you're confused, simply speak. The AI pauses instantly, listens to your question, and answers before continuing.

Adaptive Learning
If a concept isn't clear, Leksa re-explains it in another way.

Comprehension Checks
After important segments, Leksa asks questions to ensure you understood the material.

Leksa turns passive reading into active learning.


🏗️ How We Built It

Leksa is a full-stack, cloud-native AI application designed for real-time voice interaction.

Our architecture combines Gemini AI models with Google Cloud infrastructure.

Backend — FastAPI on Google Cloud Run

We built the backend using FastAPI because it provides:

  • High performance
  • Native WebSocket support
  • Efficient real-time communication

The backend is:

  • Containerized using Docker
  • Deployed on Google Cloud Run
  • Automatically scalable for multiple users

Document Processing & Lecture Generation

When a user uploads a document:

  1. The file is stored in Google Cloud Storage

  2. The backend extracts text using:

     • PyMuPDF for PDFs
     • python-pptx for PowerPoint files

  3. The extracted text is sent to gemini-2.0-flash

This model acts as our Lecture Planner, converting the document into:

  • Structured lecture segments
  • Natural explanations
  • Comprehension questions

  4. The generated lecture structure is stored in Firestore.
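
The extraction step can be sketched roughly as follows. This is a minimal illustration, not our exact code: the function names and the `EXTRACTORS` table are hypothetical, and it assumes PyMuPDF (imported as `fitz`) and python-pptx are installed. The libraries are imported lazily so the dispatcher itself loads without them.

```python
from pathlib import Path


def extract_pdf_text(path: str) -> str:
    """Concatenate the plain text of every page using PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        return "\n\n".join(page.get_text() for page in doc)


def extract_pptx_text(path: str) -> str:
    """Collect the text of every text-bearing shape, one block per slide."""
    from pptx import Presentation  # python-pptx, imported lazily

    slides = []
    for slide in Presentation(path).slides:
        parts = [
            shape.text_frame.text
            for shape in slide.shapes
            if shape.has_text_frame
        ]
        slides.append("\n".join(parts))
    return "\n\n".join(slides)


# Hypothetical dispatch table: file extension -> extractor function.
EXTRACTORS = {".pdf": extract_pdf_text, ".pptx": extract_pptx_text}


def extract_text(path: str) -> str:
    """Pick an extractor by file extension and return the document text."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
    return extractor(path)
```

Keeping the extractors behind one `extract_text` entry point means a new format only needs one more entry in the table.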

Real-Time Voice Interaction

This is the core innovation of Leksa, powered by gemini-2.0-flash-live-001.

  • The React frontend captures microphone input using the Web Audio API
  • Audio is streamed to the backend via WebSockets
  • The backend forwards the stream to the Gemini Live API
  • Gemini processes speech and streams AI-generated voice responses back

Because Gemini Live supports barge-in detection, users can interrupt the AI mid-sentence.

This creates a natural conversational learning experience.
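
The relay pattern above can be sketched with plain asyncio. Everything here is an assumption made for illustration: the `relay` function, the `("audio", bytes)` / `("interrupted", None)` event shape, and the `flush` message are hypothetical stand-ins, not the google-genai or FastAPI interfaces. The point is only that the uplink and downlink run concurrently, and that an interruption event tells the client to drop audio it has queued for playback.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Dict, Optional, Tuple

# Hypothetical event shape: ("audio", pcm_bytes) or ("interrupted", None).
Event = Tuple[str, Optional[bytes]]


async def relay(
    mic_chunks: AsyncIterator[bytes],
    send_upstream: Callable[[bytes], Awaitable[None]],
    model_events: AsyncIterator[Event],
    send_to_client: Callable[[Dict[str, Any]], Awaitable[None]],
) -> None:
    """Pump audio in both directions until both streams are exhausted."""

    async def uplink() -> None:
        # Browser microphone -> model.
        async for chunk in mic_chunks:
            await send_upstream(chunk)

    async def downlink() -> None:
        # Model -> browser speaker.
        async for kind, payload in model_events:
            if kind == "interrupted":
                # Barge-in: ask the client to discard queued playback.
                await send_to_client({"type": "flush"})
            elif kind == "audio":
                await send_to_client({"type": "audio", "data": payload})

    # Run both directions concurrently so neither blocks the other.
    await asyncio.gather(uplink(), downlink())
```

In a real deployment the four arguments would wrap the browser WebSocket and the live model session. Here they are left abstract so the control flow is easy to see.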


Frontend — React + Web Audio

The user interface is built using React.

It handles:

  • Document uploads
  • WebSocket connections
  • Real-time audio streaming
  • Voice playback

The frontend captures microphone input through the browser's WebRTC media capture and Web Audio APIs, and uses WebSockets for real-time communication with the backend.


⚡ Challenges We Ran Into

Bidirectional Audio Streaming

Managing stable two-way audio streaming over WebSockets was our biggest challenge. We had to carefully handle audio chunking, buffering, and synchronization to prevent lag.
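
To make the chunking and buffering concrete, here is a small sketch under stated assumptions: 16 kHz, 16-bit mono PCM, 20 ms chunks, and a prefill of a few chunks before playback starts. These numbers, and the `chunk_pcm` and `JitterBuffer` names, are illustrative choices rather than the values we shipped.

```python
from collections import deque
from typing import Deque, Iterator, Optional

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
BYTES_PER_SAMPLE = 2   # assumed 16-bit mono PCM
CHUNK_MS = 20          # assumed chunk duration


def chunk_pcm(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Split raw PCM into fixed-duration chunks (the last may be shorter)."""
    size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


class JitterBuffer:
    """Hold back playback until `prefill` chunks have arrived."""

    def __init__(self, prefill: int = 3) -> None:
        self._chunks: Deque[bytes] = deque()
        self._prefill = prefill
        self._primed = False

    def push(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def pop(self) -> Optional[bytes]:
        """Return the next chunk to play, or None while (re)buffering."""
        if not self._primed:
            if len(self._chunks) < self._prefill:
                return None
            self._primed = True
        if not self._chunks:
            # Underrun: wait for the prefill again before resuming.
            self._primed = False
            return None
        return self._chunks.popleft()

    def clear(self) -> None:
        """Drop queued audio, e.g. when the user interrupts."""
        self._chunks.clear()
        self._primed = False
```

A larger prefill smooths over network jitter at the cost of added latency, so the value is a trade-off to tune rather than a fixed constant.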

Document Parsing

PDFs and PowerPoints often contain messy structures. We built a pre-processing pipeline to clean and structure the text before sending it to Gemini.
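
A minimal sketch of the kind of clean-up such a pipeline might perform, using only the standard library. The specific rules (rejoining words hyphenated across line breaks, dropping lines that are only page numbers, collapsing whitespace) and the `clean_extracted_text` name are assumptions for illustration, not a full description of our pipeline.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Normalize text extracted from a PDF or slide deck."""
    # Normalize line endings first so the later rules see only "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split across lines ("Intro-\nduction" -> "Introduction").
    # Note: this also joins genuine hyphenated compounds that happen to
    # break at the hyphen, which is an accepted trade-off in this sketch.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    lines = []
    for line in text.split("\n"):
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Skip lines that contain nothing but a short number (page numbers).
        if re.fullmatch(r"\d{1,4}", line):
            continue
        lines.append(line)

    # Collapse runs of blank lines into a single paragraph break.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```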

State Management for Interruptions

When users interrupt the AI, the system must pause, answer, and resume exactly where it stopped. We used Firestore session state tracking to manage this.
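
The pause, answer, and resume cycle can be modeled as a small state object. The sketch below is illustrative only: the `SessionState` fields, the character-offset bookkeeping, and the `to_dict` / `from_dict` helpers are hypothetical, and the dictionary stands in for whatever would actually be written to a Firestore document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class Mode(Enum):
    LECTURING = "lecturing"
    ANSWERING = "answering"


@dataclass
class SessionState:
    segment_index: int = 0            # which lecture segment is playing
    char_offset: int = 0              # how much of it has been spoken
    mode: Mode = Mode.LECTURING

    def advance(self, chars_spoken: int) -> None:
        """Record lecture progress; ignored while answering a question."""
        if self.mode is Mode.LECTURING:
            self.char_offset += chars_spoken

    def interrupt(self) -> None:
        """The user started speaking: freeze the lecture position."""
        self.mode = Mode.ANSWERING

    def resume(self, segments: List[str]) -> str:
        """Return the unspoken remainder of the current segment."""
        self.mode = Mode.LECTURING
        return segments[self.segment_index][self.char_offset:]

    def to_dict(self) -> Dict[str, Any]:
        """Plain-dict form, suitable for a document store."""
        return {
            "segment_index": self.segment_index,
            "char_offset": self.char_offset,
            "mode": self.mode.value,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SessionState":
        return cls(
            segment_index=data["segment_index"],
            char_offset=data["char_offset"],
            mode=Mode(data["mode"]),
        )
```

Persisting the state as a plain dictionary means a reconnecting client, or a different backend instance, can pick the lecture up from the same position.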

Cloud Deployment & Cold Starts

Cloud Run initially introduced cold start latency. We solved this by enabling minimum instances and optimizing our Docker image.


🏆 Accomplishments We're Proud Of

True Real-Time Barge-in

Users can interrupt the AI mid-sentence, creating a natural conversation instead of a static chatbot interaction.

Scalable Cloud Architecture

Leksa is a fully deployed Google Cloud application, capable of handling multiple users.

Adaptive AI Teaching

Leksa doesn't just read text aloud — it explains concepts, rephrases ideas, and quizzes users.

Voice-First Learning

The entire interaction is voice-driven, creating a hands-free learning experience.

Multi-Language Support

Leksa will support multiple languages, making it accessible worldwide.


📚 What We Learned

The Power of Gemini Live

The Gemini Live API makes natural-sounding voice agents practical, with response latency often under 500 ms.

Voice UX Design

Voice applications require clear listening indicators and feedback signals to build user trust.

Full-Stack Cloud Architecture

We learned how to integrate Cloud Run, Cloud Storage, Firestore, and Cloud Build into a production-ready pipeline.

State Management

Tracking lecture progress and conversation context with Firestore was essential for smooth real-time interaction.


🚀 What's Next for Leksa

Visual Understanding

Future versions will use Gemini multimodal capabilities to explain charts, diagrams, and images.

Learning Analytics

Users will get dashboards showing comprehension progress and weak areas.

Collaborative Learning

Multiple users will be able to join the same lecture session and learn together.


🌟 Our Vision

Our goal is simple:

Make every piece of written knowledge accessible through conversation.

Because every document deserves to be heard.

Built With

  • fastapi
  • gemini-2.0-flash
  • gemini-2.0-flash-live-001
  • gemini-2.5-flash-native-audio-preview-12-2025
  • google-genai
  • pymupdf
  • python
  • python-dotenv
  • python-pptx
  • react
  • uvicorn
  • webrtc
  • websockets