Whiteboard architecture sessions are where critical design decisions happen — yet they rarely get expert review in real time. Flaws in security, scalability, or reliability often surface weeks later when fixing them is 10x more expensive. We asked: what if an AI senior architect could join every whiteboard session, watching and listening, just like a real colleague?

What it does

Whiteboard Architect is a real-time AI architecture reviewer. Point your camera at a whiteboard, talk through your design, and Archie — an AI senior cloud architect — provides instant voice feedback.

  • Sees your diagrams through the camera and understands them as you draw
  • Listens to your explanations via microphone
  • Responds with real-time voice feedback on security, scalability, reliability, cost, and operations
  • Interrupts naturally — barge-in support lets you cut in mid-sentence, just like a real conversation
  • Annotates your whiteboard with visual markers highlighting issues and components
  • Generates diagrams — converts hand-drawn sketches into clean SVG diagrams
  • Records findings as structured review notes with severity levels and recommendations

How we built it

We built a multi-model architecture using three Gemini models, each chosen for its task:

Model                            Role
Gemini 2.5 Flash Native Audio    Live bidirectional voice + vision conversation
Gemini 3.1 Flash Lite            Background whiteboard analysis & English→Japanese translation
Gemini 2.0 Flash                 SVG diagram generation from sketches

The backend (FastAPI + Google ADK) runs four parallel async tasks per WebSocket connection:

  1. Upstream — routes client audio/video to Gemini Live API via LiveRequestQueue
  2. Downstream — streams Gemini responses back to the client
  3. Recovery — monitors session health and detects false barge-ins
  4. Perception — runs periodic deep analysis of the whiteboard
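
The four tasks above can be sketched as parallel coroutines sharing one connection. This is a minimal illustration of the pattern, not the actual ADK/Live API code — the queues, sleep intervals, and log entries are placeholders:

```python
import asyncio

async def run_session(inbound: asyncio.Queue, outbound: asyncio.Queue, log: list) -> None:
    """Run the four per-connection tasks until the client stream ends (None sentinel)."""
    stop = asyncio.Event()

    async def upstream():
        # Route client audio/video chunks toward the model.
        while (chunk := await inbound.get()) is not None:
            log.append(("up", chunk))
        stop.set()

    async def downstream():
        # Stream model responses back to the client.
        while not stop.is_set():
            await asyncio.sleep(0.01)
            await outbound.put("response")

    async def recovery():
        # Periodically check session health / flag false barge-ins.
        while not stop.is_set():
            await asyncio.sleep(0.05)
            log.append(("health", "ok"))

    async def perception():
        # Periodic deep analysis of the latest whiteboard frame.
        while not stop.is_set():
            await asyncio.sleep(0.05)
            log.append(("analysis", "frame"))

    await asyncio.gather(upstream(), downstream(), recovery(), perception())
```

Running all four under one `asyncio.gather` means a failure in any task tears down the whole session cleanly when the WebSocket closes.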

The frontend (Next.js 16) uses custom React hooks for AudioWorklet-based capture (PCM16 16kHz), gapless audio playback (PCM 24kHz), and canvas-based video capture (JPEG @ 1fps).
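
The PCM16 wire format the capture hook produces can be illustrated in Python (the real conversion runs in a browser AudioWorklet in TypeScript; this sketch only shows the encoding): float32 samples in [-1, 1] are clipped and packed as little-endian 16-bit integers.

```python
import struct

def float_to_pcm16(samples: list[float]) -> bytes:
    """Clip float samples to [-1, 1] and pack as little-endian int16 PCM."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Each sample becomes two bytes, so a 16 kHz mono stream costs 32 KB/s upstream.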

Infrastructure is fully automated with Terraform deploying to Google Cloud Run, Firestore, and Cloud Storage.

Challenges we ran into

  • Live API tool-calling bug — The -12-2025 model variant triggers WebSocket 1008 errors when tools are invoked. We built an automatic model probing system that tests candidates at startup and falls back to working variants.
  • Barge-in responsiveness — Achieving natural interruption required a dual approach: server-side detection via Gemini Live API plus client-side RMS-based VAD to immediately clear audio buffers.
  • Language quality — The native audio model produces significantly better responses in English. Rather than forcing Japanese, we adopted an English-first approach with a dedicated translation service for bilingual display.
  • Background analysis coordination — Running a separate perception model alongside the live conversation required careful state management to inject analysis context naturally without disrupting the flow.
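
The client-side half of the barge-in fix boils down to a frame-level RMS energy check. A minimal sketch in Python (the real check runs in the browser; the threshold here is illustrative, not the value the app uses):

```python
import math

def is_speech(frame: list[float], threshold: float = 0.02) -> bool:
    """Return True when the frame's RMS energy exceeds the barge-in threshold.

    `threshold` is an illustrative placeholder, not the app's tuned value.
    """
    if not frame:
        return False
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold
```

When this fires, the client clears its playback buffer immediately instead of waiting for the server-side interruption signal, which is what makes the barge-in feel instant.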

Accomplishments that we're proud of

  • Truly real-time multimodal interaction — audio, video, and text stream bidirectionally with minimal latency
  • Native barge-in powered by ADK + Live API — no custom interruption logic needed
  • A perception layer that continuously understands whiteboard structure and auto-generates visual annotations
  • Graceful degradation — the app works without any cloud services beyond a Gemini API key
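
The graceful-degradation idea can be sketched as a storage selector that falls back to an in-memory store when no cloud project is configured. The names here (`make_note_store`, `MemoryNoteStore`, the env key check) are hypothetical illustrations, not the project's actual code:

```python
class MemoryNoteStore:
    """In-memory fallback used when no cloud project is configured."""
    def __init__(self):
        self.notes = []

    def save(self, note: dict) -> None:
        self.notes.append(note)

def make_note_store(env: dict):
    """Use Firestore when a cloud project is configured, else fall back.

    The Firestore branch is elided in this sketch; only the fallback runs here.
    """
    if env.get("GOOGLE_CLOUD_PROJECT"):
        raise NotImplementedError("Firestore-backed store (requires cloud credentials)")
    return MemoryNoteStore()
```

With this shape, a Gemini API key alone is enough to run the app; cloud services only add persistence.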

What we learned

  • LiveRequestQueue is the key abstraction for multiplexing audio/video to Gemini while receiving responses simultaneously
  • ADK dramatically simplifies agent development — session management, tool execution, and Live API integration come built-in
  • Multi-model architectures outperform single-model approaches when tasks have different latency/capability requirements
  • English-first with translation produces better quality than forcing non-English in system prompts
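
The multi-model split described above reduces to a simple task-to-model routing table. The mapping function is a sketch, and the model identifier strings are assumptions based on the roles in the table, not verified API model IDs:

```python
# Illustrative routing table; identifiers are assumed, not verified model IDs.
MODEL_FOR_TASK = {
    "live_conversation": "gemini-2.5-flash-native-audio",
    "whiteboard_analysis": "gemini-3.1-flash-lite",
    "translation": "gemini-3.1-flash-lite",
    "diagram_generation": "gemini-2.0-flash",
}

def model_for(task: str) -> str:
    """Resolve a task to its model, defaulting to the live conversation model."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["live_conversation"])
```

Keeping the routing explicit makes the latency/capability trade-off visible: the low-latency native-audio model only handles conversation, while slower tasks go to cheaper models.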

What's next

  • Multi-language support beyond English/Japanese
  • Collaborative multi-user review sessions
  • Architecture Decision Record (ADR) generation from review notes
  • Historical snapshot comparison across sessions

Built With

  • audioworklet-api
  • fastapi
  • gemini-2.0-flash
  • gemini-3.1-flash-lite
  • gemini-live-api
  • google-adk
  • google-cloud
  • google-cloud-firestore
  • google-cloud-run
  • next.js
  • python
  • react
  • tailwind-css
  • terraform
  • typescript
  • websocket