What it does
LearnAloud is a voice AI tutor that teaches you from any document while visually annotating exactly what it's explaining — in real time.
Upload a PDF, say "teach me this," and as the tutor speaks, it:
- Highlights the sentence it's currently explaining
- Drops margin notes with key definitions
- Draws arrows between related concepts
- Auto-scrolls to keep you in sync
It's like having a private tutor sitting next to you, pointing at the page and saying "right here — this is the part that matters."
Why Linode / Akamai Cloud
LearnAloud isn't a toy demo — it's a latency-sensitive, compute-heavy pipeline that needed infrastructure that could actually keep up. I chose Linode Kubernetes Engine (LKE) because it gave me the flexibility to deploy each component as an independent, scalable service without fighting the platform.
The workload breaks down into three distinct compute profiles:
- Embedding + vector search — CPU-bound, bursty, scales horizontally with LKE node pools
- PDF processing — memory-intensive per-request work, isolated in its own microservice
- Voice pipeline coordination — low-latency orchestration layer where consistent network performance matters
Having these as separate LKE deployments meant I could scale the embedding service independently during heavy load without over-provisioning everything else. Linode's straightforward pricing made this actually practical to run.
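As a concrete sketch of that independent scaling, a HorizontalPodAutoscaler on the embedding deployment lets it grow under load while the other services stay put. The names and thresholds below are illustrative assumptions, not the project's actual manifest:

```yaml
# Hypothetical HPA for the embedding service (autoscaling/v2 API).
# Deployment name and CPU target are assumed for illustration.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: embedding-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: embedding-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Because the PDF and orchestration services have their own deployments, a burst of embedding work scales only this pool — nothing else gets over-provisioned.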
Architecture on LKE
┌───────────────────────────────────────────────┐
│           Linode Kubernetes Engine            │
│                                               │
│  ┌──────────────┐     ┌───────────────────┐   │
│  │ PDF Service  │     │ Embedding Service │   │
│  │ (PyMuPDF)    │────▶│ (OpenAI + FAISS)  │   │
│  │ 2 replicas   │     │ 3 replicas        │   │
│  └──────────────┘     └───────────────────┘   │
│         │                       │             │
│         ▼                       ▼             │
│  ┌─────────────────────────────────────────┐  │
│  │          Orchestration Service          │  │
│  │    (LLM calls + annotation triggers)    │  │
│  └─────────────────────────────────────────┘  │
│                       │                       │
│                       ▼                       │
│  ┌─────────────────────────────────────────┐  │
│  │ React Frontend (served via LKE ingress) │  │
│  └─────────────────────────────────────────┘  │
└───────────────────────────────────────────────┘
All services are fully open-source:
- PyMuPDF for PDF parsing
- FAISS for vector search
- LangChain for RAG orchestration
- LiveKit (open-source) for WebRTC voice
- React frontend
- Kubernetes manifests and Helm charts — fully reproducible deployment
The Hard Technical Problems
PDF coordinate extraction — Getting exact bounding boxes for text spans requires parsing PDF internals at the span level, not just extracting raw text. PyMuPDF's block/line/span hierarchy gave me pixel-accurate positions that feed directly into the annotation renderer.
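The walk through that hierarchy is simple once you know the shape of PyMuPDF's `page.get_text("dict")` output: blocks contain lines, lines contain spans, and each span carries a `bbox` of `(x0, y0, x1, y1)` in page points. The helper below is an illustrative sketch, not the project's code; the `sample` dict mimics the real structure so it runs standalone:

```python
def extract_spans(page_dict):
    """Walk PyMuPDF's block -> line -> span hierarchy and return
    (text, bbox) pairs; bbox is (x0, y0, x1, y1) in page points."""
    spans = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                spans.append((span["text"], tuple(span["bbox"])))
    return spans

# With a real document this dict would come from:
#   import fitz  # PyMuPDF
#   page_dict = fitz.open("doc.pdf")[0].get_text("dict")
sample = {
    "blocks": [
        {"lines": [{"spans": [
            {"text": "Entropy", "bbox": (72.0, 100.0, 118.5, 112.0)},
            {"text": " measures disorder.", "bbox": (118.5, 100.0, 210.0, 112.0)},
        ]}]},
        {"type": 1},  # image block: skipped, nothing to highlight
    ]
}
print(extract_spans(sample))
```

Those per-span boxes are exactly what the annotation renderer needs — a highlight is just a rectangle drawn over a span's `bbox`.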
Sub-200ms voice-visual sync — Any perceptible lag between the tutor's voice and the highlight breaks the experience completely. The solution: pre-compute annotation trigger timestamps inside the LLM response, buffer them against the audio timeline, and fire Client Actions slightly ahead of the corresponding speech. LKE's consistent network performance between services was critical here — variable inter-pod latency would have broken this entirely.
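The buffering logic can be sketched as a small scheduler: each trigger is an offset into the TTS audio, converted to an absolute fire time minus a fixed lead. Names and the 120 ms lead are assumptions for illustration, not the project's tuned values:

```python
import heapq

LEAD_MS = 120  # fire highlights slightly ahead of the speech (assumed value)

def schedule_triggers(triggers, audio_start_ms, lead_ms=LEAD_MS):
    """Turn (offset_ms, annotation_id) pairs — offsets into the TTS
    audio — into a min-heap of absolute wall-clock fire times."""
    heap = []
    for offset_ms, annotation_id in triggers:
        fire_at = audio_start_ms + max(0, offset_ms - lead_ms)
        heapq.heappush(heap, (fire_at, annotation_id))
    return heap

def due(heap, now_ms):
    """Pop every annotation whose fire time has arrived."""
    fired = []
    while heap and heap[0][0] <= now_ms:
        fired.append(heapq.heappop(heap)[1])
    return fired

heap = schedule_triggers([(500, "hl-1"), (2300, "note-2")], audio_start_ms=10_000)
print(due(heap, now_ms=10_400))  # → ['hl-1']  (fire time 10_380 has passed)
print(due(heap, now_ms=12_200))  # → ['note-2']
```

The polling loop just calls `due()` against the audio clock; keeping that clock and the heap in one pod is why stable inter-pod latency mattered — the clock source and the trigger source must not drift apart.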
Stateful interruption handling — When a user asks a follow-up question mid-explanation, the annotation state pauses and resumes cleanly from the correct position. This required careful state management across the orchestration layer.
GPU-accelerated embedding (bonus) — The embedding service is deployed on a GPU-enabled Linode node pool, cutting batch embedding time for large PDFs by ~4x compared to CPU-only inference. This matters for the initial document ingestion step where the user is waiting.
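The win comes from batching: chunking the document's text into fixed-size batches amortizes per-call overhead across each GPU forward pass. A trivial sketch — the batch size of 64 is an assumed default, not a value tuned in the project:

```python
def batch(items, size=64):
    """Chunk texts into fixed-size batches for embedding calls.
    size=64 is an illustrative default, not a measured optimum."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# A 300-page PDF might yield ~1,500 text chunks; batching turns that
# into ~24 embedding passes instead of 1,500 single-item calls.
print(batch(["intro", "method", "results"], size=2))
# → [['intro', 'method'], ['results']]
```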
Open Source & Reproducibility
Full source code, Kubernetes manifests, and a one-command deployment guide are available in the public GitHub repository. Anyone can fork this and deploy their own instance on LKE in under 15 minutes using the provided Helm chart.
What's Next
- Edge deployment: Move the voice coordination layer to Akamai edge nodes to reduce STT/TTS round-trip latency for global users
- Chrome extension: Bring LearnAloud to any web article without a PDF upload step
- Multi-agent architecture: A Pedagogue agent that teaches + a Librarian agent that pulls in supporting research on demand, both running as separate LKE services
- Mobile: Camera scan for physical textbooks
Why It Matters
Students and researchers spend enormous time fighting the gap between reading and understanding. LearnAloud collapses that gap by making the explanation inseparable from the document itself. Built open-source, deployed on LKE, and designed to scale — this is what modern AI-native applications should look like.