Anatroc
An AI tutor that watches your screen, listens to your voice, and teaches you in real time.
Inspiration
Learning physical and technical tasks — assembling hardware, following lab protocols, navigating complex codebases, configuring cloud architectures — has always required either an in-person instructor or passive video tutorials. Neither scales well. Video tutorials can't see what you're doing. They can't tell you that you wired the wrong pin, missed a step, or are staring at the wrong config file.
We asked: what if an AI could actually watch your screen, hear your questions, and coach you through tasks the way a real instructor would — live, with memory, and with visual aids?
That's Anatroc. A multimodal AI assistant that observes, remembers, and responds — creating an active feedback loop between the learner and the AI.
What It Does
Anatroc is a real-time learning companion that combines screen understanding, voice conversation, and visual diagram generation into a single live experience.
You share your screen. You speak. Anatroc watches, listens, and helps.
Here's what happens in a live session:
- Screen awareness — Anatroc captures your screen every few seconds, extracts text via OCR, and builds a semantic memory of everything you've been looking at.
- Voice conversation — You speak naturally. Anatroc responds with voice in real time through Nova Sonic speech-to-speech. It supports barge-in — you can interrupt the assistant mid-sentence, just like talking to a real person.
- Contextual memory — Ask "what was I reading 5 minutes ago?" or "summarize what I've been doing" and Anatroc retrieves past screen content and conversation turns from its vector memory to give you a grounded answer.
- Live diagram overlays — Say "generate an architecture diagram" and a Mermaid diagram appears as a floating overlay right on your screen. Move it, minimize it, export it, or remove it — all by voice.
- Text Q&A — Type a question about your screen content and get a retrieval-augmented answer powered by Nova Lite reasoning.
┌───────────────────────────────────────────────┐
│                 Your Browser                  │
│                                               │
│  ┌──────────────┐  ┌────────────────────┐     │
│  │ Your Screen  │  │ Nova Sync Panel    │     │
│  │ (shared)     │  │                    │     │
│  │              │  │ Sonic Voice [ON]   │     │
│  │ ┌─────────┐  │  │ OCR: "class Db…"   │     │
│  │ │ Overlay │  │  │ Ask Nova: [____]   │     │
│  │ │ Diagram │  │  │ Diagram: [____]    │     │
│  │ └─────────┘  │  │                    │     │
│  └──────────────┘  └────────────────────┘     │
│                                               │
│  [🎤 Mic] [📷 Camera] [🖥 Screen] [Leave]     │
└───────────────────────────────────────────────┘
How We Built It
Anatroc is built entirely on Amazon Nova models running on Amazon Bedrock, with a React frontend and a Python FastAPI backend.
Architecture
              Browser (React + Vite)
                         │
           ┌─────────────┼─────────────┐
           │             │             │
     Stream Video    WebSocket     REST API
       (call UI)   (Sonic audio    (frames,
                    + overlays)     queries)
           │             │             │
           └─────────────┼─────────────┘
                         │
                  FastAPI Server
                 (token_server.py)
                         │
           ┌─────────────┼─────────────┐
           │             │             │
      Nova Sonic     Nova Lite  Nova Embeddings
       (speech ↔    (reasoning, (1024-dim vectors
        speech)   OCR, diagrams) for semantic search)
           │             │             │
           └─────────────┼─────────────┘
                         │
                 Aurora PostgreSQL
                    (pgvector)
                         │
                       Redis
                 (retrieval cache)
The Stack
| Layer | Technology | Why |
|---|---|---|
| Voice AI | Nova 2 Sonic | Real-time bidirectional speech with barge-in support. No transcribe-then-respond — true speech-to-speech. |
| Reasoning brain | Nova 2 Lite | Bedrock Converse API for screen analysis, OCR extraction, and Mermaid diagram generation. Multimodal — accepts text + images. |
| Semantic memory | Nova Multimodal Embeddings | 1024-dim vectors from screen text and conversation turns. Powers "what did I see earlier?" retrieval. |
| Vector store | Aurora PostgreSQL + pgvector | Cosine similarity search over session embeddings. IVFFlat index for fast retrieval. |
| Cache | Redis | 30-min TTL retrieval cache. Normalized question keys so similar queries hit the same entry. |
| Call UI | Stream Video React SDK | Handles WebRTC, camera, mic, and screen share. We focus on the AI — Stream handles the plumbing. |
| Frontend | React + TypeScript + Vite | Mic capture → 16kHz PCM16 → WebSocket. Screen capture → JPEG → REST ingest. Mermaid.js for overlay rendering. |
| Backend | Python + FastAPI + uvicorn | Async pipeline managing Sonic sessions, frame ingestion, retrieval, and Bedrock calls. |
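The Cache row mentions normalized question keys with a 30-minute TTL. A minimal sketch of what that normalization could look like follows. The function names, the `retrieval:` key prefix, and the specific rules (lowercasing, stripping punctuation, collapsing whitespace, hashing) are assumptions for illustration, not the project's actual code.

```python
import hashlib
import re

CACHE_TTL_SECONDS = 30 * 60  # 30-minute TTL, as stated in the stack table


def normalize_question(question: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = question.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())


def cache_key(session_id: str, question: str) -> str:
    """Build a per-session cache key (the 'retrieval:' prefix is hypothetical)."""
    normalized = normalize_question(question)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"retrieval:{session_id}:{digest[:16]}"
```

With redis-py, a key like this would typically be written with something like `client.set(key, payload, ex=CACHE_TTL_SECONDS)`. Note that this only merges questions that differ in case, punctuation, or spacing; paraphrases would still miss the cache.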
How the pieces connect
- Frontend joins a Stream Video call and opens a WebSocket to our backend for Sonic voice.
- When screen share is active, the frontend captures a JPEG every 3–5 seconds and POSTs it to the backend.
- Nova Lite extracts OCR text from each frame. Nova Embeddings converts that text into a vector. Aurora pgvector stores it.
- When the user speaks, Nova Sonic handles the conversation. On memory-seeking questions, the backend retrieves relevant past context from Aurora (with Redis caching) and injects it into the Sonic stream.
- When the user asks for a diagram, Nova Lite generates Mermaid syntax from retrieved context, and the frontend renders it as a floating overlay.
- Voice commands like "move overlay to top right" or "remove overlay" are detected by keyword matching in the backend and translated into WebSocket control events.
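The keyword matching described in the last step can be sketched roughly as follows. The event names (`overlay.move`, `overlay.remove`, and so on), the position labels, and the function name are hypothetical; the writeup does not specify the actual control-event format.

```python
from typing import Optional

# Hypothetical phrase-to-position table; real values are not given in the writeup.
POSITIONS = {
    "top right": "top-right",
    "top left": "top-left",
    "bottom right": "bottom-right",
    "bottom left": "bottom-left",
}


def parse_overlay_command(transcript: str) -> Optional[dict]:
    """Map a spoken phrase to a control event, or return None if nothing matches."""
    text = transcript.lower()
    # Only consider utterances that mention the overlay or the diagram.
    if "overlay" not in text and "diagram" not in text:
        return None
    if "remove" in text:
        return {"type": "overlay.remove"}
    if "minimize" in text:
        return {"type": "overlay.minimize"}
    if "export" in text:
        return {"type": "overlay.export"}
    if "move" in text:
        for phrase, position in POSITIONS.items():
            if phrase in text:
                return {"type": "overlay.move", "position": position}
    return None
```

A matched event would then be serialized (for example with `json.dumps`) and pushed to the frontend over the existing WebSocket. Plain substring matching is easy to reason about but brittle, which is why a classifier is listed under future work.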
Challenges We Ran Into
Making Sonic context-aware without adding latency. Nova Sonic is a streaming model — it works best when you just feed it audio and let it respond. But we needed it to know about the screen content and past conversation history. Injecting retrieval context on every turn would add noticeable delay. We solved this with intent-gated retrieval: context is only fetched and injected when the user asks memory-seeking questions (keywords like "summarize", "earlier", "recap"), with a cooldown to prevent retrieval storms.
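A minimal sketch of intent-gated retrieval with a cooldown is shown below. The keyword list comes from the paragraph above; the class name, the 10-second cooldown, and the injectable clock are assumptions made for illustration and testing.

```python
import time
from typing import Callable, Optional

MEMORY_KEYWORDS = ("summarize", "earlier", "recap")  # keywords quoted in the text


class RetrievalGate:
    """Allow retrieval only for memory-seeking turns, at most once per cooldown."""

    def __init__(
        self,
        cooldown_seconds: float = 10.0,  # assumed value, not from the writeup
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._last_fired: Optional[float] = None

    def should_retrieve(self, transcript: str) -> bool:
        text = transcript.lower()
        # Gate 1: skip retrieval unless the turn looks memory-seeking.
        if not any(keyword in text for keyword in MEMORY_KEYWORDS):
            return False
        # Gate 2: skip retrieval if one fired too recently.
        now = self._clock()
        if self._last_fired is not None and now - self._last_fired < self.cooldown_seconds:
            return False
        self._last_fired = now
        return True
```

Ordinary turns never touch the vector store, so they pay no retrieval latency, and the cooldown caps how often retrieval can fire even when several memory-seeking phrases arrive in quick succession.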
Async PostgreSQL on Windows. psycopg's async connections, which we use to reach Aurora, are not compatible with asyncio's ProactorEventLoop, the default event loop on Windows. We routed Aurora operations through blocking calls in worker threads (asyncio.to_thread) so development could continue on Windows without switching OS.
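The worker-thread pattern looks roughly like this. `blocking_insert` is a hypothetical stand-in for a synchronous database write (here it just sleeps and returns a fake ID); only `asyncio.to_thread` itself is taken from the text.

```python
import asyncio
import time


def blocking_insert(session_id: str, text: str) -> str:
    """Stand-in for a synchronous database write; sleep simulates I/O."""
    time.sleep(0.05)
    return f"{session_id}:{len(text)}"


async def store_frame_text(session_id: str, text: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_insert, session_id, text)


async def main() -> list[str]:
    # The two writes overlap because each runs in its own worker thread.
    return await asyncio.gather(
        store_frame_text("s1", "class Db"),
        store_frame_text("s1", "def query"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The trade-off is that each in-flight database call occupies a thread from the default executor, which is acceptable for a handful of concurrent sessions but is not a substitute for a native async driver at larger scale.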
Keeping Sonic conversational while overlays are being generated. Diagram generation involves retrieval plus Lite reasoning, which takes a couple of seconds. We had to carefully order the turn-end processing: overlay intent is handled before the Sonic turn is closed, so the assistant can still verbally acknowledge the request while the diagram renders asynchronously.
Front-end audio engineering. The frontend captures browser mic audio, downsamples it to 16kHz PCM16, streams it over WebSocket, and plays back Sonic's response audio, all while handling barge-in (clearing the playback queue when the user starts talking) and echo suppression (muting the mic briefly after the assistant finishes). Getting this pipeline reliable across Chrome and Firefox required careful buffer management.
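The downsample-and-encode step is plain arithmetic, sketched here in Python for consistency with the other examples even though the real code runs in the browser. The 48 kHz source rate, the block-averaging approach (a crude low-pass filter), and the function names are assumptions; a production pipeline may use a proper resampler.

```python
import struct
from typing import Sequence


def downsample(samples: Sequence[float], source_rate: int, target_rate: int = 16_000) -> list[float]:
    """Average each group of source_rate / target_rate samples (integer ratios only)."""
    if source_rate % target_rate != 0:
        raise ValueError("this sketch only handles integer ratios such as 48000 -> 16000")
    step = source_rate // target_rate
    usable = len(samples) - len(samples) % step  # drop a trailing partial group
    return [sum(samples[i:i + step]) / step for i in range(0, usable, step)]


def to_pcm16(samples: Sequence[float]) -> bytes:
    """Clamp floats to [-1, 1] and pack them as little-endian signed 16-bit integers."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Each 16 kHz sample becomes two bytes, so a 10 ms chunk (160 samples) is 320 bytes on the wire.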
Model ID normalization. Nova on-demand model IDs (amazon.nova-2-lite-v1:0) don't work with the Converse API — you need inference profile IDs (us.amazon.nova-2-lite-v1:0). This was a silent failure that took time to diagnose. We built automatic normalization into our config layer so it never bites again.
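A normalization helper along these lines could live in the config layer. The function name, the default `us` prefix, and the list of recognized prefixes are assumptions for illustration; the two model IDs in the usage note are the ones quoted above.

```python
# Assumed set of inference-profile prefixes; adjust to the regions actually in use.
KNOWN_GEO_PREFIXES = ("us.", "eu.", "apac.", "global.")


def normalize_model_id(model_id: str, geo_prefix: str = "us") -> str:
    """Turn a bare Nova model ID into an inference profile ID; leave others untouched."""
    # ARNs and IDs that already carry a geography prefix pass through unchanged.
    if model_id.startswith("arn:") or model_id.startswith(KNOWN_GEO_PREFIXES):
        return model_id
    if model_id.startswith("amazon.nova"):
        return f"{geo_prefix}.{model_id}"
    return model_id
```

For example, `normalize_model_id("amazon.nova-2-lite-v1:0")` yields `us.amazon.nova-2-lite-v1:0`, and calling it again on that result is a no-op, so it is safe to apply at every config read.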
Accomplishments That We're Proud Of
Nova Sonic + Nova Embeddings + Aurora pgvector + Redis working together in one live voice experience. This isn't just a chatbot — it's a voice-first AI that can see your screen, remember what you did, and generate visual aids, all in real time.
True barge-in support. You can interrupt the assistant mid-sentence. The playback queue clears instantly, mic suppression kicks in to prevent echo, and the conversation continues naturally.
Voice-controlled diagram overlays. "Generate architecture diagram" → floating Mermaid overlay appears. "Move it to top right" → it moves. "Export diagram" → copied to clipboard. All hands-free. This is the kind of interaction that makes people stop and say "wait, it can do that?"
Graceful degradation everywhere. Redis down? System continues without cache. Aurora writes disabled? Synthetic IDs returned, pipeline keeps running. Screen share stopped? Voice continues. We built resilience into every layer.
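The "Redis down? System continues without cache" behavior can be sketched as a small wrapper. The `Cache` protocol, the function name, and the logger name are hypothetical; the point is only that cache failures are logged and swallowed rather than propagated.

```python
import logging
from typing import Callable, Optional, Protocol

logger = logging.getLogger("anatroc.cache")  # hypothetical logger name


class Cache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


def cached_retrieve(cache: Optional[Cache], key: str, compute: Callable[[], str]) -> str:
    """Use the cache when it works; fall back to computing the answer when it does not."""
    if cache is not None:
        try:
            hit = cache.get(key)
            if hit is not None:
                return hit
        except Exception as exc:  # e.g. a connection error when Redis is down
            logger.warning("cache read failed, continuing without cache: %s", exc)
    value = compute()
    if cache is not None:
        try:
            cache.set(key, value)
        except Exception as exc:
            logger.warning("cache write failed, continuing without cache: %s", exc)
    return value
```

Passing `cache=None` covers the case where caching is disabled by configuration, and a failing cache costs one logged warning per call instead of a failed request.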
Session memory that actually works. Ask "summarize what I've been doing" during a 20-minute session, and you get a real answer grounded in OCR snapshots and past conversation turns — not a hallucination.
What We Learned
Nova Sonic is genuinely different from transcribe-then-respond. The latency characteristics and conversational feel of true speech-to-speech change what's possible in real-time AI interactions. Barge-in isn't a feature you bolt on; it's native to the streaming model.
Retrieval-augmented voice is harder than retrieval-augmented text. In a text chat, you can afford 2 seconds of retrieval latency. In voice, that's a noticeable pause. Intent gating, aggressive caching, and bounded context injection are essential — not optional.
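Bounded context injection can be as simple as a character budget over the ranked snippets. The sketch below assumes the snippets arrive already sorted by relevance; the function name and the default limits (1200 characters, 5 snippets) are illustrative, not the project's actual values.

```python
def build_context(snippets: list[str], max_chars: int = 1200, max_snippets: int = 5) -> str:
    """Join the highest-ranked snippets without exceeding the character budget."""
    selected: list[str] = []
    used = 0
    for snippet in snippets[:max_snippets]:
        cleaned = " ".join(snippet.split())  # collapse OCR whitespace noise
        if not cleaned:
            continue
        remaining = max_chars - used
        if remaining <= 0:
            break
        piece = cleaned[:remaining]  # truncate the last snippet to fit the budget
        selected.append(piece)
        used += len(piece) + 1  # +1 accounts for the newline separator
    return "\n".join(selected)
```

A hard cap like this keeps both the injected prompt size and the per-turn cost predictable, regardless of how much the session has accumulated.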
pgvector + Nova Embeddings is a powerful combination. Cosine similarity search over 1024-dim Nova embeddings in Aurora was fast and accurate enough for real-time retrieval in our sessions. The IVFFlat index keeps query time low as the number of stored embeddings grows, at the cost of approximate rather than exact nearest-neighbor results.
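For reference, a cosine-distance lookup in pgvector might look like the following. The table name `session_memory`, its columns, and the `lists = 100` setting are hypothetical; `<=>` (cosine distance) and `vector_cosine_ops` are pgvector's own operator and operator class. The small Python function computes the same quantity in memory so the ordering is easy to check by hand.

```python
import math
from typing import Sequence

# Hypothetical schema: table and column names are illustrative only.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS session_memory_embedding_idx
ON session_memory USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
"""

# <=> is pgvector's cosine distance operator (smaller means more similar).
TOP_K_SQL = """
SELECT content, embedding <=> %(query)s::vector AS distance
FROM session_memory
WHERE session_id = %(session_id)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, i.e. 1 - cosine similarity, for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

Identical directions give a distance of 0, orthogonal vectors give 1, and opposite directions give 2, which is why the query sorts ascending.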
Frontend audio is an iceberg. The visible part: "capture mic and play audio." The hidden part: sample rate conversion, PCM16 encoding, silence detection thresholds, playback queue management, echo suppression timing, AudioContext resume policies, and browser compatibility quirks.
Cost control should be designed in, not patched on. Frame sampling intervals, retrieval cooldowns, bounded context sizes, Redis caching — these aren't afterthoughts. Without them, a 30-minute demo session could rack up significant Bedrock costs.
What's Next for Anatroc
Rolling session summaries. Instead of retrieving raw past turns, periodically compress the session into a running summary. Better context quality, lower retrieval cost.
Smarter intent classification. Move beyond keyword matching for retrieval triggers and overlay commands. Use a lightweight classifier to understand user intent more accurately.
Multi-overlay support. Stack multiple diagrams, side-by-side comparisons, and todo lists — all voice-controllable and spatially arranged.
Hand-drawn overlay interaction. Let users draw on the overlay with their mouse or touchscreen, then ask the assistant for feedback on what they drew. Bridging voice + visual + spatial input.
Document and PDF ingestion. Upload reference materials before a session. The assistant can cite specific pages and sections when answering questions.
Collaborative sessions. Multiple users in the same call, with shared screen context and overlays. The assistant coaches the group, not just one person.