Turner

Inspiration

Every presenter knows the awkward moment — fumbling with a clicker, passing it between teammates, or struggling to hold it while demoing something hands-on. Keynote and Google Slides offer scripted recording, but that's rigid and time-restricted. Real presentations drift: Q&A sessions interrupt flow, a client asks you to jump back three slides, or you want to freestyle beyond your script. Turner was born from that frustration — the gap between how presentations should feel and how they actually go.

What it does

Turner is a fully touchless browser plugin for live presentations. It lets you:

Navigate slides with hand gestures captured by your webcam
Jump semantically to any slide by speaking naturally — say a topic, and Turner matches it to the right slide using LLM embeddings, no exact keyword required
Control flow with voice commands via real-time speech transcription
Monitor everything on a live dashboard showing current slide state, transcript feed, and gesture input

No clicker, no remote, no hands occupied. Just you and your audience.

How we built it

Turner is a multi-pipeline system running locally:

Gesture layer: MediaPipe hand and pose landmarkers detect swipe, point, and twist gestures from a webcam feed in real time
Voice layer: AWS Transcribe streams live microphone input, producing finalized transcript segments
Semantic layer: Google Gemini processes your slide deck (PDF) and speaker script, building per-slide semantic context. As you speak, finalized transcript chunks are matched against this context to decide whether to advance, hold, or jump to a specific slide
Slide control: PyAutoGUI sends keystrokes to control Keynote or PowerPoint running natively on your machine
Dashboard: A Next.js frontend connects to a FastAPI backend over Server-Sent Events (SSE), providing live visibility into slide state, camera/microphone selection, and debug controls
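
To make the semantic layer's advance / hold / jump decision concrete, here is a minimal sketch. It is an illustration, not Turner's actual code: it assumes each slide and each finalized transcript chunk has already been turned into a vector, and it compares them with cosine similarity. The function names, the `jump_threshold` and `margin` values, and the toy 3-dimensional vectors are all hypothetical; the real system relies on Gemini and may decide differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def decide(chunk_vec, slide_vecs, current, jump_threshold=0.8, margin=0.1):
    """Return ("hold", current), ("advance", current + 1) or ("jump", index)."""
    scores = [cosine(chunk_vec, v) for v in slide_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    # Stay put unless another slide is a confident match...
    if best == current or scores[best] < jump_threshold:
        return ("hold", current)
    # ...and clearly better than the slide already on screen.
    if scores[best] - scores[current] < margin:
        return ("hold", current)
    if best == current + 1:
        return ("advance", best)
    return ("jump", best)

# Toy vectors standing in for real per-slide embeddings (assumption).
slides = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

With these toy vectors, a chunk close to the second slide's vector while on the first slide yields an advance, a chunk close to the third yields a jump, and an ambiguous chunk holds. Requiring both a minimum score and a margin over the current slide is one way to keep wandering speech from triggering navigation.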

Challenges we ran into

Latency in the semantic loop: AWS Transcribe only finalizes segments after a pause, creating a natural lag. We decoupled transcript intake from the Gemini retry path so slow LLM responses don't block transcript capture.
Gesture false positives: Distinguishing intentional navigation gestures from natural presenter movement required careful tuning of the swipe and twist detectors, and extensive unit testing against edge cases.
Semantic context caching: Building per-slide embeddings from a full PDF + script on first run is slow. We added pickle caching so subsequent runs on the same deck load the saved context instead of rebuilding it.
Cross-process slide control: Reliably sending keystrokes to Keynote or PowerPoint from a Python background process required careful handling of macOS focus and permissions.
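
The caching idea from the list above can be sketched as follows. This is a hedged illustration rather than Turner's implementation: the cache is keyed by a SHA-256 hash of the deck bytes and the script text, which is an assumption about how "the same deck" is recognized. `deck_key`, `load_or_build`, the `build` callback, and the `.pkl` file layout are hypothetical names chosen for the example.

```python
import hashlib
import pickle
from pathlib import Path

def deck_key(pdf_bytes: bytes, script_text: str) -> str:
    """Fingerprint the inputs so an edited deck or script misses the cache."""
    digest = hashlib.sha256()
    digest.update(pdf_bytes)
    digest.update(b"\x00")  # separator between the two inputs
    digest.update(script_text.encode("utf-8"))
    return digest.hexdigest()

def load_or_build(pdf_bytes, script_text, build, cache_dir):
    """Return the cached context if present, otherwise build and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{deck_key(pdf_bytes, script_text)}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    # Slow path: build once (for example, by calling the LLM), then persist.
    context = build(pdf_bytes, script_text)
    with path.open("wb") as fh:
        pickle.dump(context, fh)
    return context
```

Hashing the content rather than the filename means a revised deck triggers a rebuild automatically. Pickle files should only be loaded when they were written by the same trusted process, since unpickling untrusted data can execute arbitrary code.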

Accomplishments that we're proud of

A fully working end-to-end pipeline: speak a topic mid-presentation, and Turner jumps to the right slide in seconds — no clicking, no searching
Hot-swappable camera and microphone selection through the live dashboard, without restarting the system
Modular architecture that cleanly separates gesture, voice, semantic, and control layers — each independently testable
A robust unit test suite covering swipe detection, body twist, hand geometry, and the gesture controller
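
For readers curious what a testable swipe detector might look like, here is a small sketch. It is not the detector Turner ships: it assumes the input is a stream of timestamped, normalized horizontal positions (MediaPipe landmarks use normalized image coordinates) for a single tracked point such as the wrist. The `SwipeDetector` class and its `min_distance`, `max_duration`, and `cooldown` defaults are hypothetical.

```python
from collections import deque

class SwipeDetector:
    """Fire when a point travels far enough horizontally within a short window."""

    def __init__(self, min_distance=0.25, max_duration=0.5, cooldown=1.0):
        self.min_distance = min_distance  # fraction of frame width
        self.max_duration = max_duration  # seconds a swipe may take
        self.cooldown = cooldown          # seconds to ignore input after firing
        self._samples = deque()           # (timestamp, x) pairs inside the window
        self._last_fired = float("-inf")

    def update(self, t, x):
        """Feed one sample; return "left", "right", or None."""
        self._samples.append((t, x))
        # Drop samples that are older than the swipe window.
        while t - self._samples[0][0] > self.max_duration:
            self._samples.popleft()
        if t - self._last_fired < self.cooldown:
            return None
        dx = x - self._samples[0][1]
        if abs(dx) < self.min_distance:
            return None
        self._last_fired = t
        self._samples.clear()
        return "right" if dx > 0 else "left"
```

Because the detector takes plain numbers instead of camera frames, edge cases such as slow drift or a return stroke during the cooldown can be unit tested without a webcam. Whether "right" should map to next or previous slide depends on whether the camera image is mirrored.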

What we learned

Real-time AI pipelines require aggressive decoupling — any synchronous dependency between layers creates visible stalls for the user
Semantic slide matching is surprisingly robust even with informal, wandering speech — the LLM handles paraphrasing and topic drift well
Gesture recognition in presenter environments (variable lighting, background movement, expressive body language) is a much harder problem than controlled demos suggest
Building for a real use case forced us to think about reliability and graceful degradation over raw demo polish
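
The decoupling lesson above can be illustrated with a producer/consumer sketch using only the Python standard library. It is a simplified stand-in under stated assumptions, not Turner's pipeline: `start_matcher`, `match`, and `on_result` are hypothetical names, and `match` represents any slow call such as an LLM request.

```python
import queue
import threading

def start_matcher(match, on_result, maxsize=32):
    """Start a background worker; return (submit, stop) callables."""
    pending = queue.Queue(maxsize=maxsize)
    stop_signal = object()

    def worker():
        while True:
            chunk = pending.get()
            if chunk is stop_signal:
                break
            try:
                on_result(chunk, match(chunk))
            except Exception as exc:
                # Report the failure instead of killing the worker thread.
                on_result(chunk, exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    def submit(chunk):
        """Never blocks: returns False and drops the chunk if the queue is full."""
        try:
            pending.put_nowait(chunk)
            return True
        except queue.Full:
            return False

    def stop():
        """Let queued chunks finish, then shut the worker down."""
        pending.put(stop_signal)
        thread.join()

    return submit, stop
```

The transcript side only ever calls `submit`, which returns immediately, so a slow or failing matcher cannot stall transcript capture. Dropping chunks when the queue is full is one possible policy; keeping only the most recent chunk would be another.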

What's next for Turner

Virtual laser pointer: highlight and annotate slides in real time using finger-tracking, no hardware required — partial implementation already in progress
Presenter profiles: learn each presenter's personal gesture vocabulary and speaking style, reducing false positives and eliminating interruptions
Enterprise customization: configurable gesture sets and semantic sensitivity for large pitch events, boardrooms, and conference stages
Expanded platform support: beyond Keynote and PowerPoint to Google Slides and browser-based presentation tools