What it does
LearnAloud is a voice AI tutor that teaches you from any document while visually annotating exactly what it's explaining — in real time.
Upload a PDF, say "teach me this," and as the tutor speaks, it:
- Highlights the sentence it's currently explaining
- Drops margin notes with key definitions
- Draws arrows between related concepts
- Auto-scrolls to keep you in sync
It's like having a private tutor sitting next to you, pointing at the page and saying "right here — this is the part that matters."
Why Linode / Akamai Cloud
LearnAloud isn't a toy demo — it's a latency-sensitive, compute-heavy pipeline that needed infrastructure that could actually keep up. I chose Linode Kubernetes Engine (LKE) because it gave me the flexibility to deploy each component as an independent, scalable service without fighting the platform.
The workload breaks down into three distinct compute profiles:
- Embedding + vector search — CPU-bound, bursty, scales horizontally with LKE node pools
- PDF processing — memory-intensive per-request work, isolated in its own microservice
- Voice pipeline coordination — low-latency orchestration layer where consistent network performance matters
Having these as separate LKE deployments meant I could scale the embedding service independently during heavy load without over-provisioning everything else. Linode's straightforward pricing made this actually practical to run.
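As a concrete sketch of that independent scaling, a HorizontalPodAutoscaler on the embedding deployment lets it grow under load while the other services stay put. The names and thresholds below are illustrative assumptions, not the project's actual manifest:

```yaml
# Hypothetical HPA for the embedding service (autoscaling/v2 API).
# Deployment name and CPU target are assumed for illustration.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: embedding-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: embedding-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Because the PDF and orchestration services have their own deployments, a burst of embedding work scales only this pool — nothing else gets over-provisioned.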
Architecture on LKE
┌───────────────────────────────────────────────┐
│           Linode Kubernetes Engine            │
│                                               │
│  ┌──────────────┐     ┌───────────────────┐   │
│  │ PDF Service  │     │ Embedding Service │   │
│  │ (PyMuPDF)    │────▶│ (OpenAI + FAISS)  │   │
│  │ 2 replicas   │     │ 3 replicas        │   │
│  └──────────────┘     └───────────────────┘   │
│         │                       │             │
│         ▼                       ▼             │
│  ┌─────────────────────────────────────────┐  │
│  │          Orchestration Service          │  │
│  │    (LLM calls + annotation triggers)    │  │
│  └─────────────────────────────────────────┘  │
│                       │                       │
│                       ▼                       │
│  ┌─────────────────────────────────────────┐  │
│  │ React Frontend (served via LKE ingress) │  │
│  └─────────────────────────────────────────┘  │
└───────────────────────────────────────────────┘
All services are fully open-source:
- PyMuPDF for PDF parsing
- FAISS for vector search
- LangChain for RAG orchestration
- LiveKit (open-source) for WebRTC voice
- React frontend
- Kubernetes manifests and Helm charts — fully reproducible deployment
The Hard Technical Problems
PDF coordinate extraction — Getting exact bounding boxes for text spans requires parsing PDF internals at the span level, not just extracting raw text. PyMuPDF's block/line/span hierarchy gave me pixel-accurate positions that feed directly into the annotation renderer.
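The walk through that hierarchy is simple once you know the shape of PyMuPDF's `page.get_text("dict")` output: blocks contain lines, lines contain spans, and each span carries a `bbox` of `(x0, y0, x1, y1)` in page points. The helper below is an illustrative sketch, not the project's code; the `sample` dict mimics the real structure so it runs standalone:

```python
def extract_spans(page_dict):
    """Walk PyMuPDF's block -> line -> span hierarchy and return
    (text, bbox) pairs; bbox is (x0, y0, x1, y1) in page points."""
    spans = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                spans.append((span["text"], tuple(span["bbox"])))
    return spans

# With a real document this dict would come from:
#   import fitz  # PyMuPDF
#   page_dict = fitz.open("doc.pdf")[0].get_text("dict")
sample = {
    "blocks": [
        {"lines": [{"spans": [
            {"text": "Entropy", "bbox": (72.0, 100.0, 118.5, 112.0)},
            {"text": " measures disorder.", "bbox": (118.5, 100.0, 210.0, 112.0)},
        ]}]},
        {"type": 1},  # image block: skipped, nothing to highlight
    ]
}
print(extract_spans(sample))
```

Those per-span boxes are exactly what the annotation renderer needs — a highlight is just a rectangle drawn over a span's `bbox`.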
Sub-200ms voice-visual sync — Any perceptible lag between the tutor's voice and the highlight breaks the experience completely. The solution: pre-compute annotation trigger timestamps inside the LLM response, buffer them against the audio timeline, and fire Client Actions slightly ahead of the corresponding speech. LKE's consistent network performance between services was critical here — variable inter-pod latency would have broken this entirely.
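The buffering logic can be sketched as a small scheduler: each trigger is an offset into the TTS audio, converted to an absolute fire time minus a fixed lead. Names and the 120 ms lead are assumptions for illustration, not the project's tuned values:

```python
import heapq

LEAD_MS = 120  # fire highlights slightly ahead of the speech (assumed value)

def schedule_triggers(triggers, audio_start_ms, lead_ms=LEAD_MS):
    """Turn (offset_ms, annotation_id) pairs — offsets into the TTS
    audio — into a min-heap of absolute wall-clock fire times."""
    heap = []
    for offset_ms, annotation_id in triggers:
        fire_at = audio_start_ms + max(0, offset_ms - lead_ms)
        heapq.heappush(heap, (fire_at, annotation_id))
    return heap

def due(heap, now_ms):
    """Pop every annotation whose fire time has arrived."""
    fired = []
    while heap and heap[0][0] <= now_ms:
        fired.append(heapq.heappop(heap)[1])
    return fired

heap = schedule_triggers([(500, "hl-1"), (2300, "note-2")], audio_start_ms=10_000)
print(due(heap, now_ms=10_400))  # → ['hl-1']  (fire time 10_380 has passed)
print(due(heap, now_ms=12_200))  # → ['note-2']
```

The polling loop just calls `due()` against the audio clock; keeping that clock and the heap in one pod is why stable inter-pod latency mattered — the clock source and the trigger source must not drift apart.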
Stateful interruption handling — When a user asks a follow-up question mid-explanation, the annotation state pauses and resumes cleanly from the correct position. This required careful state management across the orchestration layer.
GPU-accelerated embedding (bonus) — The embedding service is deployed on a GPU-enabled Linode node pool, cutting batch embedding time for large PDFs by ~4x compared to CPU-only inference. This matters for the initial document ingestion step where the user is waiting.
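The win comes from batching: chunking the document's text into fixed-size batches amortizes per-call overhead across each GPU forward pass. A trivial sketch — the batch size of 64 is an assumed default, not a value tuned in the project:

```python
def batch(items, size=64):
    """Chunk texts into fixed-size batches for embedding calls.
    size=64 is an illustrative default, not a measured optimum."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# A 300-page PDF might yield ~1,500 text chunks; batching turns that
# into ~24 embedding passes instead of 1,500 single-item calls.
print(batch(["intro", "method", "results"], size=2))
# → [['intro', 'method'], ['results']]
```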
Open Source & Reproducibility
Full source code, Kubernetes manifests, and a one-command deployment guide are available in the public GitHub repository. Anyone can fork this and deploy their own instance on LKE in under 15 minutes using the provided Helm chart.
What's Next
- Edge deployment: Move the voice coordination layer to Akamai edge nodes to reduce STT/TTS round-trip latency for global users
- Chrome extension: Bring LearnAloud to any web article without a PDF upload step
- Multi-agent architecture: A Pedagogue agent that teaches + a Librarian agent that pulls in supporting research on demand, both running as separate LKE services
- Mobile: Camera scan for physical textbooks
Why It Matters
Students and researchers spend enormous time fighting the gap between reading and understanding. LearnAloud collapses that gap by making the explanation inseparable from the document itself. Built open-source, deployed on LKE, and designed to scale — this is what modern AI-native applications should look like.