Inspiration
Most presenters have two fears: forgetting what to say and looking like they forgot what to say. Note cards break eye contact. Teleprompters cost thousands. Rehearsing the same talk twenty times just to feel confident is a tax on your time that never fully pays off.
We wanted to see if a laptop webcam and a pair of earbuds could replace all of that. No hardware. No stage setup. Just you, your slides, and a quiet voice in your ear.
What it does
AirDeck is a hands-free presentation copilot. Hold your hand up and swipe right to advance a slide, swipe left to go back, pinch to zoom, or point to drop a laser highlight your audience can see on the projected screen. No clicker required.
As each slide appears, Claude reads the slide content and whispers a concise one or two sentence cue into your earpiece via text-to-speech. Just enough to keep you on track without sounding scripted. The audience hears nothing.
How we built it
The client is React + TypeScript + Vite. MediaPipe's HandLandmarker model runs at ~30fps off the main thread, feeding 21 hand landmarks per frame into a layer of pure-function gesture detectors and a finite state machine that classifies swipes, pinches, and points while enforcing cooldown windows so the same gesture never double-fires.
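The cooldown part of that design can be illustrated with a small gate. This is a minimal sketch, not AirDeck's actual state machine: the `GestureGate` class, the gesture names, and the cooldown durations are all hypothetical placeholders.

```typescript
// Hypothetical gesture labels; the real detector output may differ.
type Gesture = "swipe-left" | "swipe-right" | "pinch" | "point";

// Placeholder cooldowns in milliseconds, not tuned values.
const COOLDOWN_MS: Record<Gesture, number> = {
  "swipe-left": 800,
  "swipe-right": 800,
  pinch: 500,
  point: 0,
};

class GestureGate {
  // Timestamp of the last accepted firing, per gesture.
  private lastFired = new Map<Gesture, number>();

  /** Returns true and records the firing if `gesture` is out of its cooldown at `nowMs`. */
  tryFire(gesture: Gesture, nowMs: number): boolean {
    const last = this.lastFired.get(gesture);
    if (last !== undefined && nowMs - last < COOLDOWN_MS[gesture]) {
      return false; // still cooling down; a rejected attempt does not extend the window
    }
    this.lastFired.set(gesture, nowMs);
    return true;
  }
}
```

Passing the timestamp in, rather than reading the clock inside the gate, keeps the logic a pure function of its inputs and easy to test frame by frame.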
Slide control goes through a Chrome extension that lives in the same browser session as the Google Slides tab. The AirDeck client sends a postMessage; the extension relays it as an ArrowRight or ArrowLeft keydown into the Slides tab. It needs only activeTab and scripting permissions.
The backend is Node + TypeScript. When a slide changes, the client hits a /api/cue/stream SSE endpoint. The server calls Claude via the Claude Agent SDK query() function with maxTurns: 1, no tools, and a tight system prompt:
"You are a teleprompter whispering into the presenter's earpiece. Write the exact words the presenter should say out loud — one or two natural spoken sentences they can repeat verbatim."
We stream tokens back over SSE as they arrive and speak them using the browser's SpeechSynthesis API routed to the presenter's earpiece. A lookahead cache prefetches the next slide's cue the moment the current slide loads, so by the time the presenter swipes, the cue is already waiting.
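The SSE side of that pipeline can be sketched roughly as follows. This is an illustration under assumptions: `sseFrame`, `toSse`, the `token` and `done` event names, and the JSON payload shape are hypothetical, and the token source is modeled as a generic `AsyncIterable<string>` rather than the SDK's own message stream.

```typescript
/** Formats one SSE frame: an optional `event:` line, one `data:` line per line of data, then a blank line. */
function sseFrame(data: string, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  for (const line of data.split(/\r?\n/)) lines.push(`data: ${line}`);
  return lines.join("\n") + "\n\n";
}

/** Wraps each text chunk in a `token` frame and finishes with a `done` frame. */
async function* toSse(tokens: AsyncIterable<string>): AsyncGenerator<string> {
  for await (const token of tokens) {
    // JSON-encoding keeps newlines inside a token from splitting the frame.
    yield sseFrame(JSON.stringify({ text: token }), "token");
  }
  yield sseFrame("{}", "done");
}
```

On the server, each yielded frame would be written to a response whose Content-Type is text/event-stream; the client appends each `token` payload to the speech queue and stops when `done` arrives.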
The end-to-end latency target is 700 to 1200ms from gesture to first audio.
$$ t_{total} = t_{gesture} + t_{slide} + t_{cue} + t_{tts} $$
$$ \approx 50\text{ms} + 100\text{ms} + \underbrace{0\text{ms}}_{\text{prefetched}} + 300\text{ms} \approx 450\text{ms} $$
When the cue is prefetched, we comfortably beat the target.
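The prefetch behind that zero-cost cue term might look something like this. It is a sketch under assumptions: `LookaheadCueCache`, `CueFetcher`, and the zero-based slide indices are hypothetical, and the fetcher stands in for whatever actually calls the cue endpoint.

```typescript
// Hypothetical fetcher signature: resolves to the cue text for one slide.
type CueFetcher = (slideIndex: number) => Promise<string>;

class LookaheadCueCache {
  // One promise per slide, shared by prefetches and real requests.
  private pending = new Map<number, Promise<string>>();
  private readonly fetchCue: CueFetcher;

  constructor(fetchCue: CueFetcher) {
    this.fetchCue = fetchCue;
  }

  /** Returns the cue for a slide, reusing an in-flight or finished request if there is one. */
  get(slideIndex: number): Promise<string> {
    let cue = this.pending.get(slideIndex);
    if (cue === undefined) {
      cue = this.fetchCue(slideIndex);
      // Forget failed requests so a later call can retry.
      cue.catch(() => this.pending.delete(slideIndex));
      this.pending.set(slideIndex, cue);
    }
    return cue;
  }

  /** Call when a slide loads: returns its cue and starts fetching the next slide's cue. */
  onSlideShown(slideIndex: number, slideCount: number): Promise<string> {
    const current = this.get(slideIndex);
    if (slideIndex + 1 < slideCount) void this.get(slideIndex + 1);
    return current;
  }
}
```

Caching the promise rather than the resolved string means a swipe that arrives while the prefetch is still in flight joins the existing request instead of starting a second one.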
Challenges we ran into
Gesture false positives. Normal hand movement mid-sentence looks a lot like a slow swipe. We solved this with a rolling velocity window that evicts stale position samples older than 300ms, a mirror-flip correction so left/right match the user's perspective rather than raw camera space, and a calibration flow that tunes thresholds to each presenter's natural hand speed.
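A rolling velocity window with mirror correction can be sketched as below. Only the 300ms eviction age comes from the description above; the `VelocityWindow` class, its method names, and the choice to measure velocity between the oldest and newest surviving samples are assumptions for illustration. It also assumes x is a normalized coordinate in [0, 1], so flipping is `1 - x`.

```typescript
interface Sample {
  x: number;   // normalized horizontal position in the user's perspective
  tMs: number; // capture time in milliseconds
}

const WINDOW_MS = 300; // samples older than this are evicted

class VelocityWindow {
  private samples: Sample[] = [];
  private readonly mirrored: boolean;

  constructor(mirrored: boolean = true) {
    this.mirrored = mirrored;
  }

  /** Adds a sample, flipping x if the camera image is unmirrored, and evicts stale samples. */
  push(rawX: number, tMs: number): void {
    const x = this.mirrored ? 1 - rawX : rawX;
    this.samples.push({ x, tMs });
    while (true) {
      const oldest = this.samples[0];
      if (oldest === undefined || tMs - oldest.tMs <= WINDOW_MS) break;
      this.samples.shift();
    }
  }

  /** Horizontal velocity in normalized units per millisecond; positive means toward the user's right. */
  velocity(): number {
    const first = this.samples[0];
    const last = this.samples[this.samples.length - 1];
    if (first === undefined || last === undefined || first === last) return 0;
    const dt = last.tMs - first.tMs;
    return dt > 0 ? (last.x - first.x) / dt : 0;
  }
}
```

A swipe detector would then compare `velocity()` against a per-presenter threshold from calibration; a slow drift never builds enough velocity inside the window to cross it.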
Slide advance with no official API. Google Slides has no programmatic "next slide" method. We went through a few approaches before landing on a Chrome extension that injects keydown events into the active Slides tab. It works reliably and requires minimal permissions, but it does mean the presenter needs to install the extension.
Audio routing. SpeechSynthesis has no setSinkId equivalent, so we can't programmatically route its audio to a specific output device. For now the presenter sets their earpiece as the system default audio output before the talk. Neural TTS providers like ElevenLabs would fix this, because their audio can be played through a media element that does support setSinkId, but they add latency and cost.
Keeping cue latency near zero. The naive path is: slide changes → request cue → wait for Claude → speak. Even on a fast model that's 400 to 600ms of dead air. The lookahead cache closes most of this gap by generating slide N+1's cue while the presenter is still on slide N.
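For comparison, the same budget can be worked through for the uncached path. This assumes the 400 to 600ms wait is the whole of the cue term and that the other three terms keep the estimates used in the prefetched case:

$$ t_{total} \approx 50\text{ms} + 100\text{ms} + \underbrace{400\text{ to }600\text{ms}}_{\text{not prefetched}} + 300\text{ms} \approx 850\text{ to }1050\text{ms} $$

Under those assumptions a cache miss still lands inside the 700 to 1200ms target, but it uses up most of the margin that the prefetched path leaves free.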
Accomplishments that we're proud of
Getting gesture detection to feel trustworthy was harder than expected and matters more than any other part of the system. A false swipe mid-sentence is worse than no swipe at all. The FSM with per-gesture cooldowns and detector priority ordering made the gesture layer feel solid.
The SSE streaming pipeline is clean. Tokens from the Claude Agent SDK flow directly into the SSE response and then into the speech queue with almost no buffering logic in between.
The latency budget math worked out. With the prefetch cache, the common case spends no time waiting on cue generation, which puts the estimated total near 450ms.
What we learned
The Claude Agent SDK's query() with settingSources: [] is the right call for a latency-sensitive path: it skips loading any project config and gets straight to the model. Keeping maxTurns: 1 and allowedTools: [] matters too; every option you strip is latency you recover.
Browser SpeechSynthesis is surprisingly good for a zero-cost baseline. The voices are natural enough on macOS that most presenters won't notice the difference unless they're comparing directly to ElevenLabs.
Gesture UX is product design, not just engineering. The cooldown durations, the velocity threshold, which gesture gets priority — those decisions shape whether the system feels like a tool or a liability. You can't unit test your way to good defaults. You have to use it in front of a room.
What's next for AirDeck
- Neural TTS with a streaming provider (ElevenLabs or OpenAI) for voice quality and proper setSinkId device routing
- Pre-talk mode where Claude generates richer notes for the full deck before the presenter walks on stage, using a more capable model
- Laser overlay that maps the fingertip landmark to screen coordinates with a smoothed trail the audience can follow
- PPTX and PDF support so the tool isn't locked to Google Slides
- Offline calibration profiles so frequent presenters don't recalibrate every session
Built With
- chrome-extensions-mv3
- claude-agent-sdk
- eslint
- google-identity-services
- google-slides-api
- mediapipe-tasks-vision
- node.js
- prettier
- react
- server-sent-events
- typescript
- vite
- vitest
- web-speech-api