Voice-First Visual Thinking Canvas (Gemini 3 Hackathon)
What inspired me
I kept thinking about how hard “getting started” can be for kids who struggle with fine motor control, handwriting, or just expressing ideas on paper. A blank page is intimidating. And if drawing is physically hard, imagination can get stuck behind the hand.
So I wanted to flip it:
Let the child talk first; the canvas follows.
No pressure to be perfect. No “I can’t draw.” Just: say it, see it appear, then gently tweak it.
That’s where this project came from: a voice-first visual thinking canvas where spoken ideas turn into a scene, step by step, like building a story.
What I built
A desktop-web app that lets you:
- speak a scene (“draw a sky… put a house on the left… add a tree… move it a bit…”)
- see the canvas update live
- edit objects by voice (move, resize, recolor, delete)
- preview new objects as “ghosts” before committing them
- export the final image and the underlying scene JSON
Under the hood, it’s built around one idea:
The Scene Graph is the single source of truth
Everything on the canvas is represented as structured data (JSON), not pixels. That means edits are reliable, reversible, and testable.
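For concreteness, a scene document might look roughly like this (field names are illustrative, not the project's exact schema):

```json
{
  "layers": ["sky", "background", "ground", "foreground", "preview"],
  "objects": [
    {
      "id": "house-1",
      "layer": "foreground",
      "transform": { "x": 25, "y": 70, "scale": 1, "rotation": 0 },
      "primitives": [
        { "type": "rect", "x": 0, "y": 0, "width": 20, "height": 15, "fill": "#c96" },
        { "type": "polygon", "points": [[0, 0], [10, -8], [20, 0]], "fill": "#833" }
      ]
    }
  ]
}
```

Because this document, not the rendered pixels, is the source of truth, "move the house left" is just a change to one `transform`, and the renderer redraws from scratch.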
How I built it (the architecture)
1) A strict drawing language (vector primitives)
At the start, I was tempted to hardcode “tree”, “house”, “mountain”… but that scales badly. So I moved to a vector-primitive vocabulary:
- primitives: `rect`, `circle`, `ellipse`, `line`, `polyline`, `polygon`, `path`, `text`
- grouped into objects with transforms (`x`/`y`/`scale`/`rotation`)
- layered (sky / background / ground / foreground / preview overlay)
This was a big turning point: once the app can render primitives generically, Gemini can describe *anything* using those primitives, not just a fixed list of objects.
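In TypeScript terms, the vocabulary can be sketched as a tagged union (names and fields are assumptions, not the project's exact types):

```typescript
// Sketch of the primitive vocabulary as a tagged union.
// Field names are assumptions, not the project's exact types.
type Primitive =
  | { type: "rect"; x: number; y: number; width: number; height: number; fill?: string }
  | { type: "circle"; cx: number; cy: number; r: number; fill?: string }
  | { type: "line"; x1: number; y1: number; x2: number; y2: number; stroke?: string }
  | { type: "polygon"; points: [number, number][]; fill?: string }
  | { type: "text"; x: number; y: number; content: string };

// One drawable object: a group of primitives with a transform, on a named layer.
interface SceneObject {
  id: string;
  layer: "sky" | "background" | "ground" | "foreground" | "preview";
  transform: { x: number; y: number; scale: number; rotation: number };
  primitives: Primitive[];
}
```

A "tree" is then just a `SceneObject` whose primitives happen to form a trunk and a canopy; nothing in the renderer knows what a tree is.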
2) Structured outputs: Gemini returns JSON commands, not chat text
I don’t want “maybe draw a tree”. I want tool-driving commands.
So Gemini 3 outputs a CommandEnvelope (JSON) that includes drawing commands like:
- `add_preview_object`
- `update_preview_object`
- `commit_preview_object`
- `add_object`
- `update_object`
- `delete_object`
And the app refuses to apply anything unless it validates.
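A minimal sketch of that validate-or-refuse gate, assuming a hypothetical envelope shape (the real schema is stricter):

```typescript
// Hypothetical CommandEnvelope shape; field names are assumptions.
type Command =
  | { op: "add_object"; object: { id: string; layer: string } }
  | { op: "update_object"; id: string; patch: Record<string, unknown> }
  | { op: "delete_object"; id: string };

interface CommandEnvelope {
  commands: Command[];
}

const KNOWN_OPS = new Set([
  "add_preview_object", "update_preview_object", "commit_preview_object",
  "add_object", "update_object", "delete_object",
]);

// Reject the whole envelope if any command is malformed: never "half apply".
function validateEnvelope(raw: unknown): CommandEnvelope | null {
  if (typeof raw !== "object" || raw === null) return null;
  const cmds = (raw as { commands?: unknown }).commands;
  if (!Array.isArray(cmds)) return null;
  for (const c of cmds) {
    if (typeof c !== "object" || c === null) return null;
    if (!KNOWN_OPS.has((c as { op?: string }).op ?? "")) return null;
  }
  return raw as CommandEnvelope;
}
```

Anything that comes back `null` is refused outright and the app asks the model (or the user) to try again.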
3) Deterministic state engine (replayable)
I built the reducer so the whole session is replayable:
command history → reconstruct scene exactly
That made undo/redo reliable and eliminated “weird canvas drift” bugs.
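The reducer idea can be sketched like this (state shape simplified to illustrate replay; not the project's actual reducer):

```typescript
// Simplified state and commands, just to show the replay property.
interface Obj { id: string; x: number; y: number }
type State = { objects: Obj[] };
type Cmd =
  | { op: "add_object"; object: Obj }
  | { op: "update_object"; id: string; patch: Partial<Obj> }
  | { op: "delete_object"; id: string };

// Pure function: same state + same command always yields the same next state.
function reduce(state: State, cmd: Cmd): State {
  switch (cmd.op) {
    case "add_object":
      return { objects: [...state.objects, cmd.object] };
    case "update_object":
      return { objects: state.objects.map(o => (o.id === cmd.id ? { ...o, ...cmd.patch } : o)) };
    case "delete_object":
      return { objects: state.objects.filter(o => o.id !== cmd.id) };
  }
}

// Replaying the full history reconstructs the scene exactly;
// undo is just replaying history minus the last command.
const replay = (history: Cmd[]): State => history.reduce(reduce, { objects: [] });
```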
4) Server-authoritative state + WebSockets
The server holds the canonical scene state and broadcasts updates to all clients. That made debugging and multi-tab behavior way more stable (and it’s demo-friendly).
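Stripped of any real socket library, the pattern is just: apply the command on the server, then push the new canonical state to every connected client. A sketch with a hypothetical `ClientLike` interface standing in for a WebSocket:

```typescript
// Hypothetical stand-in for a WebSocket connection.
interface ClientLike {
  send(msg: string): void;
}

// After the server's reducer produces a new canonical state,
// every client (including the sender) gets the same update.
function broadcast(clients: Set<ClientLike>, state: unknown): void {
  const msg = JSON.stringify({ type: "scene_update", state });
  for (const c of clients) c.send(msg);
}
```

Because clients only render what the server sends, two tabs can never disagree about the scene.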
5) Voice → text pipeline (always-on)
This was the hardest part.
I tried to rely on Gemini-native live voice early on, but in practice it wasn’t stable enough for my “always-on short commands” requirement.
I tested:
- Deepgram (it worked, but accuracy/segmentation wasn't reliable for directional words like left/right/beside)
- Picovoice Cheetah, which I finally adopted for the live transcription pipeline; it gave me much more consistent results for short command phrases.
6) Latency fixes with caching
Once the pipeline worked, the next enemy was latency.
The biggest win was caching:
- the system prompt
- the JSON schema
- the examples / formatting rules
So each turn doesn’t re-send the same heavy instruction payload. That kept live drawing feeling responsive.
What I learned (the real lessons)
✅ “Hardcoding is the enemy of creativity”
At first, my “tree/house/mountain templates” were too rigid. The more I hardcoded, the more the system broke when the user asked for something slightly different.
Switching to primitives made the system open-ended.
✅ Schema discipline beats clever prompting
The system only became stable when I treated the schema like a contract:
- validate everything
- reject anything malformed
- never “half apply” commands
- keep preview vs committed rules strict
✅ Preview workflows are a superpower
The “ghost object” workflow solved a real UX problem:
- users often don’t know exactly what they want until they see it
- so the system should propose something safe, let them tweak it, then commit
This turned voice drawing into a smooth conversation instead of a one-shot gamble.
✅ Determinism matters more than I expected
Even with AI in the loop, the app needs to behave like engineering, not magic. Deterministic replay made bugs debuggable and made the experience consistent.
Challenges I faced (and how I solved them)
1) Vector shapes didn’t “look right” at the beginning
Early shapes felt off (proportions, layering, composition). The fix wasn’t “more art” — it was better structure:
- clearer primitive rules
- tighter bounds (0–100 grid)
- better default composition logic (horizon band, depth scaling, anchoring)
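One such default can be sketched as a depth-scaling rule on the 0-100 grid (the horizon position and scale range here are assumed values, not the project's actual constants):

```typescript
// Assumed horizon line on the 0-100 logical grid.
const HORIZON_Y = 40;

// Objects placed nearer the horizon render smaller, which keeps depth believable.
// y in [HORIZON_Y, 100] maps to a scale factor in [0.3, 1.0]; values above the
// horizon clamp to the minimum so sky-layer objects stay small.
function depthScale(y: number): number {
  const t = Math.min(Math.max((y - HORIZON_Y) / (100 - HORIZON_Y), 0), 1);
  return 0.3 + 0.7 * t;
}
```

The same shape of rule handles anchoring ("feet on the ground band") and overlap avoidance: all of it is arithmetic over the logical grid, not model judgment.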
2) JSON outputs weren’t always schema-valid
Even good models sometimes return slightly wrong field names or inconsistent shapes. My solution:
- enforce strict validation
- add normalization/repair only when it’s deterministic and safe
- otherwise refuse + ask for clarification
3) Composition and object placement
A scene isn’t just objects — it’s relationships:
- “house on the left”
- “tree beside it”
- “sun behind mountains”
- “path winding forward”
I had to teach the system to behave like a layout director, not just a generator:
- avoid overlaps
- keep depth believable
- scale objects by layer and horizon distance
4) Voice STT accuracy and segmentation
This nearly killed the project.
For short commands, small mistakes ruin everything:
- “left” ↔ “right”
- “tree” ↔ “three”
- merged utterances
Deepgram worked, but didn't consistently match my command style. Picovoice Cheetah ended up being more reliable for this specific use case.
5) Latency (the “feels live” problem)
Even if the model is correct, slow feels broken. Caching the schema/prompt + keeping scene summaries compact was the difference between:
- a cool tech demo, and
- an actually usable live experience
A bit of math (because my canvas is a logical world)
I use a normalized logical coordinate system:
- x, y ∈ [0, 100]
Mapping to screen pixels is:
x_px = (x / 100) · W,   y_px = (y / 100) · H
This made layout rules consistent across any screen size, and it made “move it slightly left” predictable.
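In code, the mapping is one line per axis:

```typescript
// Map logical (0-100) coordinates to pixels for a W x H canvas.
function toPixels(x: number, y: number, W: number, H: number): [number, number] {
  return [(x / 100) * W, (y / 100) * H];
}
```

So "move it slightly left" can be defined as a fixed logical delta (say, 5 units) and it means the same thing on every screen size.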
What I’d do next (if I had more time)
- Better “clarify which object?” UX (selection + disambiguation prompts)
- Stronger realism/styling pass (more painterly textures + consistent strokes)
- Multi-speaker / classroom mode (teacher guiding, student watching)
- Accessibility polish: spoken confirmations, reduced cognitive load UI, bigger transcript overlay
Closing thought
The main thing I’m proud of is that this isn’t “AI draws one picture.”
It’s a conversation with a canvas, where the system remembers, edits, and stays consistent — and where voice becomes a real input method for visual thinking.
That’s the direction I want: make creativity accessible even when drawing by hand is difficult.
Built With
- contracts
- express.js
- gemini
- genai
- konva.js
- node.js
- typescript/javascript
- vite
- vue
- websockets