Inspiration

As a final-year BSc Mathematics and Computer Science student at JKUAT, I've seen how often students get stuck on a problem or diagram and need more than a wall of text. Existing tools are either text-only or feel disconnected from the page in front of you. The hackathon's "Creative Storyteller" track and the idea of breaking the text box pushed me to build something that feels like a tutor sitting next to you: you show the page, ask in your own words (typed or spoken), and get a live narrated explainer with diagrams that appear as the story unfolds, with no copying prompts or switching apps.

What it does

The Living Textbook turns any textbook page or homework problem into an interactive, narrated lesson. You capture a photo (camera or upload), ask a question in text or by voice (e.g. "Explain how photosynthesis works"), and the app:

Analyzes the image and your question with Gemini's vision and language models.
Plans a short, step-by-step explainer script with places for diagrams.
Streams narrated audio in real time via the Gemini Live API (natural voice, not robotic TTS).
Generates educational diagrams with Imagen 3 and weaves them into the same flow as the narration.

You see and hear one continuous "living document": title, sections, typewriter-style text, and images appearing in sync with the voice. It's multimodal input (image + text/voice) and multimodal output (audio + text + images) in a single experience.

How we built it

Frontend (Next.js 14, TypeScript, Tailwind, shadcn/ui): Camera/file capture, optional voice input, WebSocket client, and a "living document" canvas that appends title, status, section headers, typewriter text, and images as messages arrive. Audio is played from base64 PCM chunks (24 kHz) using the Web Audio API.
Backend (Python 3.11, FastAPI, WebSocket): A single /ws endpoint keeps session state (photo, question). When the user sends a question, it runs an async pipeline: Vision Agent (Gemini 2.0 Flash + image) → Script Agent (structured JSON with narration + image_prompt per section) → two branches in parallel: the Visual Asset Agent (Imagen 3 per section, upload to GCS, signed URLs) and Narration (Gemini Live API session, section-by-section TTS). Results are streamed back as audio, text, image_url, section_start/section_end, and other message types.
Google Cloud: Backend on Cloud Run (Docker image, 1 Gi memory, 3600 s timeout); Vertex AI for Gemini Live (gemini-2.0-flash-live-preview-04-09), Gemini 2.0 Flash, and Imagen 3; Cloud Storage for generated images (signed URLs). Frontend is on Vercel; env var NEXT_PUBLIC_WS_URL points at the Cloud Run WebSocket URL.
Credentials and config: Service account for Vertex AI, GCS, and Cloud Run; .env for project, region, bucket, and model names; load_dotenv() in main.py so GOOGLE_APPLICATION_CREDENTIALS is set before any Google client is created.
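The backend flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the agent functions (analyze_image, plan_script, generate_diagram, narrate) are hypothetical stubs standing in for the Gemini, Imagen, and Live API calls, the message fields are assumed, and send stands in for the WebSocket send. It shows only the shape of the pipeline: vision first, then the script, then diagrams and narration running concurrently.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Section:
    title: str
    narration: str
    image_prompt: str


# Hypothetical stubs: real versions would call Gemini, Imagen, and the Live API.
async def analyze_image(photo: bytes, question: str) -> str:
    return f"analysis of {len(photo)} bytes for: {question}"


async def plan_script(analysis: str) -> list[Section]:
    return [
        Section("Light", "Leaves absorb light.", "diagram of a leaf in sunlight"),
        Section("Sugar", "The plant makes glucose.", "diagram of glucose forming"),
    ]


async def generate_diagram(index: int, section: Section, send) -> None:
    await asyncio.sleep(0)  # placeholder for image generation and upload
    url = f"https://example.invalid/{index}.png"  # placeholder URL
    await send({"type": "image_url", "section": index, "url": url})


async def narrate(sections: list[Section], send) -> None:
    # Narration stays sequential so sections are spoken in order.
    for index, section in enumerate(sections):
        await send({"type": "section_start", "section": index, "title": section.title})
        await send({"type": "text", "section": index, "text": section.narration})
        await send({"type": "section_end", "section": index})


async def run_pipeline(photo: bytes, question: str, send) -> None:
    analysis = await analyze_image(photo, question)
    sections = await plan_script(analysis)
    # Diagrams for every section and the narration run concurrently.
    await asyncio.gather(
        narrate(sections, send),
        *(generate_diagram(i, s, send) for i, s in enumerate(sections)),
    )
    await send({"type": "done"})


async def collect(photo: bytes, question: str) -> list[dict]:
    """Run the pipeline with an in-memory send and return the messages."""
    messages: list[dict] = []

    async def send(message: dict) -> None:
        messages.append(message)

    await run_pipeline(photo, question, send)
    return messages
```

Calling asyncio.run(collect(photo_bytes, "Explain how photosynthesis works")) returns the message list that a WebSocket handler would otherwise forward to the client, which makes the ordering easy to inspect without any network or model calls.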

Challenges we ran into

Live API model name: The generic name (gemini-2.0-flash-live-001) was wrong for Vertex AI; the correct one is gemini-2.0-flash-live-preview-04-09. Fixing that removed the "Publisher Model … not found" style errors.
Credentials not seen by Google libraries: GOOGLE_APPLICATION_CREDENTIALS lived only in Pydantic settings; the Google SDK reads it from os.environ. Calling load_dotenv() at the very top of main.py (before any google.* imports) fixed the "default credentials not found" error.
Cloud Run deploy from source: The service account didn't have permission to create the Artifact Registry repo; Cloud Build failed with "Permission 'artifactregistry.repositories.create' denied." We switched to building the Docker image locally (with corrected requirements.txt versions for the container), pushing to the existing Artifact Registry repo, and deploying that image to Cloud Run.
Vercel 404: The app lives in the frontend/ subdirectory. Having vercel.json at the repo root with custom outputDirectory broke framework detection. Moving vercel.json into frontend/ and setting the Vercel project Root Directory to frontend fixed it.
Python 3.14 and google-adk: On a machine with Python 3.14, some wheels weren't available and pip spent a long time compiling native deps. Using a venv and pinning versions that matched a working install (e.g. google-adk==1.26.0) made local and Docker builds consistent.
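The credentials issue above comes down to ordering: the value has to be in the process environment before any Google client is built. The project fixed this with load_dotenv(); the following is only a stdlib sketch of the same idea, and the helper name export_credentials_path and the example path are hypothetical.

```python
import os


def export_credentials_path(configured_path, environ=os.environ):
    """Copy a credentials path from app settings into the process environment.

    Google's auth library looks up GOOGLE_APPLICATION_CREDENTIALS in the
    environment, so a value held only in a settings object is invisible to it.
    A value that is already exported is left untouched.
    """
    if configured_path and "GOOGLE_APPLICATION_CREDENTIALS" not in environ:
        environ["GOOGLE_APPLICATION_CREDENTIALS"] = configured_path
    return environ.get("GOOGLE_APPLICATION_CREDENTIALS")


# Call this at startup, before constructing any Google client.
# The path below is a placeholder, not the project's real key file.
demo_env = {}
export_credentials_path("/secrets/service-account.json", environ=demo_env)
```

Passing a plain dict as environ, as in the last two lines, keeps the example side-effect free; in an application the default os.environ would be used.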

Accomplishments that we're proud of

Real interleaved output: Narration and images are generated and streamed together so the experience feels like one story, not separate text and image steps.
End-to-end on the required stack: Gemini (Live + Flash), Imagen, ADK-style agent flow, and multiple GCP services (Vertex AI, Cloud Run, GCS), with a clear path for judges to run it (README, env example, deploy notes).
Deployment: Backend on Cloud Run and frontend on Vercel, with WebSockets and env config working so the app is usable from a single URL.
Doing it as a final-year student: Shipping a full multimodal pipeline and deployment within the hackathon timeframe while balancing a final-year BSc (Math & CS) workload at JKUAT.

What we learned

Vertex AI vs "generic" Gemini names: Model identifiers and availability differ between the Gemini API and Vertex AI; always check the Vertex AI model list and docs for the correct name (e.g. for Live).
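Where the Google SDK looks for credentials: It reads GOOGLE_APPLICATION_CREDENTIALS from the process environment (os.environ), not from a .env file or Pydantic settings, so loading .env must happen before any client is constructed.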
Monorepo + Vercel: For a Next.js app in a subdirectory, the Vercel Root Directory must point at that folder; mixing root-level vercel.json with custom output dirs can break detection and cause 404s.
Designing for streaming: Defining a small WebSocket message protocol (photo, question, audio, text, image_url, section_*, done) and keeping client state in sync with that stream made the "living document" UI straightforward to implement.
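To illustrate the streaming point above, here is a small sketch of how a client could fold protocol messages into document state. It is written in Python for consistency with the backend, although the real client is TypeScript. The message type names come from the list above; the field names (section, title, text, url, data) and the state classes are assumptions made for this example.

```python
from dataclasses import dataclass, field

SECTION_KINDS = {"section_start", "text", "image_url", "section_end"}


@dataclass
class SectionState:
    title: str = ""
    text: str = ""
    image_urls: list = field(default_factory=list)
    finished: bool = False


@dataclass
class DocumentState:
    sections: dict = field(default_factory=dict)
    audio_chunks: list = field(default_factory=list)
    done: bool = False


def apply_message(state: DocumentState, message: dict) -> DocumentState:
    """Fold one server message into the document state."""
    kind = message.get("type")
    if kind == "done":
        state.done = True
        return state
    if kind == "audio":
        # Base64 PCM chunk; decoding is left to the playback layer.
        state.audio_chunks.append(message["data"])
        return state
    if kind not in SECTION_KINDS:
        return state  # ignore message types this sketch does not model
    # Images may arrive before or after the section's text, so create on demand.
    section = state.sections.setdefault(message["section"], SectionState())
    if kind == "section_start":
        section.title = message.get("title", "")
    elif kind == "text":
        section.text += message["text"]
    elif kind == "image_url":
        section.image_urls.append(message["url"])
    else:  # section_end
        section.finished = True
    return state
```

Because every message is applied through one function, the UI can re-render from DocumentState after each message without tracking arrival order separately.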

What's next for The Living Textbook

Voice-in for the question: Use the Live API (or a dedicated ASR) so the user can ask the question by voice only, and optionally support follow-up turns over the same session.
Interactivity: Let users tap a generated diagram or a step in the script to get "go deeper" or "give me an example" for that part, still in narrated + visual form.
Offline / low bandwidth: Cache generated diagrams by topic or hash of the image+question, and optionally pre-generate audio for common scripts so repeat visits or slow networks still feel responsive.
Accessibility and language: Add language selection (e.g. Swahili, French) and clearer keyboard/screen-reader support so the Living Textbook is usable in more classrooms and by more students, including those with visual or motor impairments.
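For the caching idea above, one possible way to derive a key from the image and question is sketched below. Nothing like this exists in the project yet; the function name cache_key and the normalization rules (lowercasing, collapsing whitespace) are assumptions chosen for illustration.

```python
import hashlib


def cache_key(image_bytes: bytes, question: str) -> str:
    """Derive a stable cache key from the photo and a normalized question."""
    # Normalize so trivial differences in case or spacing hit the same entry.
    normalized = " ".join(question.lower().split())
    digest = hashlib.sha256()
    # Hash the image first so the two inputs cannot blur into each other.
    digest.update(hashlib.sha256(image_bytes).digest())
    digest.update(normalized.encode("utf-8"))
    return digest.hexdigest()
```

Note that an exact hash only matches byte-identical images, so a re-taken photo of the same page would miss the cache; caching by topic, as also suggested above, would need a different key.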

Built With

docker
fastapi
gemini-2.0-flash
google-cloud
google-cloud-run
google-vertex-ai
gemini-live-api
imagen-3
next.js-14
pydantic
python
shadcn/ui
tailwind-css
typescript
vercel
web-audio-api
websocket

Updates

Eugene Gabriel started this project — Mar 08, 2026 02:05 PM EDT
