Inspiration

I was bad at school. Not for lack of trying. I just couldn't retain anything from textbooks. I'd stare at a page for 20 minutes and come away with zero. But the second someone drew it out or I watched a video? Got it. I'm a visual learner and most classrooms don't really account for that. When the Gemini API hackathon showed up I thought: I could build the thing I wished existed when I was struggling through homework. Snap a photo of a confusing textbook page or a whiteboard, and get back a short narrated video lesson with actual illustrations. That became Lumio.

What it does

You feed Lumio a photo, video, or typed question, and it produces a video lesson. Gemini analyzes the input, writes a script, and generates illustrations for each section. ElevenLabs narrates, Suno adds background music, and the whole thing gets stitched into a playable video with chapters. There's a quiz at the end.

How we built it

Backend is FastAPI on Google Cloud Run. It runs the full pipeline: Gemini 2.0 Flash for content and images, ElevenLabs for TTS, Suno for music. Frontend is Next.js with Framer Motion, also on Cloud Run.

The flow:

  1. User uploads media or types a question
  2. Personalization step: pick a character, tone, difficulty
  3. Gemini streams the script and generates illustrations in parallel
  4. Narration and music are generated server-side
  5. Assembled into a video player with chapter navigation
  6. AI-generated quiz at the end
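
A minimal sketch of how steps 3 and 4 above could be orchestrated with asyncio, assuming each external service sits behind an async function. The function names (`write_section`, `draw_illustration`, `narrate`, `compose_music`) and the `Lesson` type are hypothetical stubs for illustration, not Lumio's actual code or the real Gemini, ElevenLabs, or Suno client APIs.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Lesson:
    script: list[str]
    illustrations: list[str]
    narration: list[str]
    music: str


# Hypothetical stubs standing in for the Gemini, ElevenLabs, and Suno calls.
async def write_section(topic: str, index: int) -> str:
    await asyncio.sleep(0)  # placeholder for a network round trip
    return f"Section {index + 1}: {topic}"


async def draw_illustration(topic: str, index: int) -> str:
    await asyncio.sleep(0)
    return f"illustration-{index + 1}.png"


async def narrate(text: str) -> str:
    await asyncio.sleep(0)
    return f"narration for '{text}'"


async def compose_music(topic: str) -> str:
    await asyncio.sleep(0)
    return f"background track for {topic}"


async def build_lesson(topic: str, sections: int = 3) -> Lesson:
    # Step 3: script sections and illustrations are requested concurrently.
    script, images = await asyncio.gather(
        asyncio.gather(*(write_section(topic, i) for i in range(sections))),
        asyncio.gather(*(draw_illustration(topic, i) for i in range(sections))),
    )
    # Step 4: one narration clip per section, plus a single music track.
    narration, music = await asyncio.gather(
        asyncio.gather(*(narrate(text) for text in script)),
        compose_music(topic),
    )
    return Lesson(list(script), list(images), list(narration), music)


if __name__ == "__main__":
    lesson = asyncio.run(build_lesson("photosynthesis", sections=2))
    print(lesson.script)
```

`asyncio.gather` returns results in the order the awaitables were passed, so sections, illustrations, and narration clips stay aligned by index even when the underlying calls finish out of order.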

Challenges I ran into

Audio in production was a saga. Cloud Run throttles CPU between requests, which silently killed background tasks. I switched to --no-cpu-throttling and kept a warm instance.

Then came the 413 errors: chapters with base64 images were too large for the video prep endpoint. I had to strip the images client-side before sending.

CORS bit me too. NEXT_PUBLIC_* env vars in Next.js resolve at build time, not runtime, so my prod frontend was still hitting localhost:8080 after I'd already set the var in Cloud Run. Fix: hardcode the API URL in the Dockerfile before the build step. Not elegant, but it works.

Prompt engineering for Gemini was its own rabbit hole. Getting it to produce structured, consistent lesson content that also worked visually took more iterations than I expected.
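
The 413 fix mentioned above, dropping base64 images from the chapter payload before it reaches the video prep endpoint, comes down to the logic sketched here. The real fix ran client-side in the frontend; this version is in Python purely for illustration, and the field name `image_base64` and the chapter shape are assumptions.

```python
import json
from typing import Any


def strip_images(
    chapters: list[dict[str, Any]], image_key: str = "image_base64"
) -> list[dict[str, Any]]:
    """Return copies of the chapters without the (hypothetical) image field."""
    return [
        {key: value for key, value in chapter.items() if key != image_key}
        for chapter in chapters
    ]


def payload_size(chapters: list[dict[str, Any]]) -> int:
    """Size in bytes of the JSON body that would be sent."""
    return len(json.dumps(chapters).encode("utf-8"))


if __name__ == "__main__":
    chapters = [
        {"title": "Intro", "text": "What is a cell?", "image_base64": "A" * 500_000},
    ]
    slim = strip_images(chapters)
    print(payload_size(chapters), "->", payload_size(slim))
```

The originals are left untouched, so the player can keep showing the images while only the slimmed copy is sent over the wire.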

Accomplishments that I'm proud of

It works end-to-end. That's the main thing. You upload a photo and a few minutes later there's a video lesson with narration, illustrations, and music playing back at you. I didn't fully believe the pipeline would hold together until I saw it happen for real.

The personalization turned out better than expected. Users pick a character, a tone, and difficulty. Gemini adapts the teaching style based on those picks, and it does it more naturally than I thought it would. Kids especially seem to respond to the character thing.

Deploying the whole pipeline on Cloud Run and keeping it stable was its own project. Streaming, image generation, TTS, music, video assembly: any one of those can break in production in ways that never show up on localhost. Getting all of them to behave together took a lot of debugging.

And honestly, just finishing. Solo project, tight deadline. I shipped something I'd use myself. That's enough.

What we learned

First time going truly end-to-end: prompts, cloud infra, frontend polish. Wiring Gemini + ElevenLabs + Suno together taught me a lot about orchestrating services that all fail in different ways. Streaming in React has its own headaches. Cloud Run has quirks nobody warns you about. And building something I actually needed as a kid made it easier to keep going at 2am when things broke.

What's next for Lumio: AI Video Lessons from Any Source

Multi-language is the most obvious next step. Gemini already handles it, so it's really about adding a language picker and localized TTS voices. Spanish first since that's my audience, then French, then we'll see.

Right now a lesson is one short video. I want to support longer content: upload something dense and get back a series of progressive lessons, more like a mini-course than a one-off explainer. I'd also like to add a classroom mode where teachers create lessons and share them, students take the quiz, and progress gets tracked. That's a bigger lift though.

A mobile app feels like the natural home for this. Point your phone camera at a textbook, get a lesson. The current web version works on mobile, but a native app would be faster and smoother.

The quiz system is bare-bones right now. I want spaced repetition, more question types, and some kind of dashboard so you can see what you've actually retained vs. what you just clicked through. That part needs the most work.

Voice interaction is the stretch goal. Gemini's live audio could let students ask follow-up questions mid-lesson and get spoken answers back. No idea how hard that'll be in practice, but it's where I want to get eventually.

Built With

  • elevenlabs-api
  • fastapi
  • framer-motion
  • google-cloud-build
  • google-cloud-run
  • google-gemini-2.0-flash
  • next.js
  • python
  • react
  • suno-api
  • tailwind-css
  • typescript