Voice-First Visual Thinking Canvas (Gemini 3 Hackathon)
What inspired me
I kept thinking about how hard “getting started” can be for kids who struggle with fine motor control, handwriting, or just expressing ideas on paper. A blank page is intimidating. And if drawing is physically hard, imagination can get stuck behind the hand.
So I wanted to flip it:
Let the child talk first; the canvas follows.
No pressure to be perfect. No “I can’t draw.” Just: say it, see it appear, then gently tweak it.
That’s where this project came from: a voice-first visual thinking canvas where spoken ideas turn into a scene, step by step, like building a story.
What I built
A desktop-web app that lets you:
- speak a scene (“draw a sky… put a house on the left… add a tree… move it a bit…”)
- see the canvas update live
- edit objects by voice (move, resize, recolor, delete)
- preview new objects as “ghosts” before committing them
- export the final image and the underlying scene JSON
Under the hood, it’s built around one idea:
The Scene Graph is the single source of truth
Everything on the canvas is represented as structured data (JSON), not pixels. That means edits are reliable, reversible, and testable.
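For concreteness, a scene document might look roughly like this (field names are illustrative, not the project's exact schema):

```json
{
  "layers": ["sky", "background", "ground", "foreground", "preview"],
  "objects": [
    {
      "id": "house-1",
      "layer": "foreground",
      "transform": { "x": 25, "y": 70, "scale": 1, "rotation": 0 },
      "primitives": [
        { "type": "rect", "x": 0, "y": 0, "width": 20, "height": 15, "fill": "#c96" },
        { "type": "polygon", "points": [[0, 0], [10, -8], [20, 0]], "fill": "#833" }
      ]
    }
  ]
}
```

Because this document, not the rendered pixels, is the source of truth, "move the house left" is just a change to one `transform`, and the renderer redraws from scratch.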
How I built it (the architecture)
1) A strict drawing language (vector primitives)
At the start, I was tempted to hardcode “tree”, “house”, “mountain”… but that scales badly. So I moved to a vector-primitive vocabulary:
- primitives: `rect`, `circle`, `ellipse`, `line`, `polyline`, `polygon`, `path`, `text`
- grouped into objects with transforms (`x`/`y`/`scale`/`rotation`)
- layered (sky / background / ground / foreground / preview overlay)
This was a big turning point: once the app can render primitives generically, Gemini can describe *anything* using those primitives, not just a fixed list of objects.
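In TypeScript terms, the vocabulary can be sketched as a tagged union (names and fields are assumptions, not the project's exact types):

```typescript
// Sketch of the primitive vocabulary as a tagged union.
// Field names are assumptions, not the project's exact types.
type Primitive =
  | { type: "rect"; x: number; y: number; width: number; height: number; fill?: string }
  | { type: "circle"; cx: number; cy: number; r: number; fill?: string }
  | { type: "line"; x1: number; y1: number; x2: number; y2: number; stroke?: string }
  | { type: "polygon"; points: [number, number][]; fill?: string }
  | { type: "text"; x: number; y: number; content: string };

// One drawable object: a group of primitives with a transform, on a named layer.
interface SceneObject {
  id: string;
  layer: "sky" | "background" | "ground" | "foreground" | "preview";
  transform: { x: number; y: number; scale: number; rotation: number };
  primitives: Primitive[];
}
```

A "tree" is then just a `SceneObject` whose primitives happen to form a trunk and a canopy; nothing in the renderer knows what a tree is.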
2) Structured outputs: Gemini returns JSON commands, not chat text
I don’t want “maybe draw a tree”. I want tool-driving commands.
So Gemini 3 outputs a CommandEnvelope (JSON) that includes drawing commands like:
- `add_preview_object`
- `update_preview_object`
- `commit_preview_object`
- `add_object`
- `update_object`
- `delete_object`
And the app refuses to apply anything unless it validates.
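A minimal sketch of that validate-or-refuse gate, assuming a hypothetical envelope shape (the real schema is stricter):

```typescript
// Hypothetical CommandEnvelope shape; field names are assumptions.
type Command =
  | { op: "add_object"; object: { id: string; layer: string } }
  | { op: "update_object"; id: string; patch: Record<string, unknown> }
  | { op: "delete_object"; id: string };

interface CommandEnvelope {
  commands: Command[];
}

const KNOWN_OPS = new Set([
  "add_preview_object", "update_preview_object", "commit_preview_object",
  "add_object", "update_object", "delete_object",
]);

// Reject the whole envelope if any command is malformed: never "half apply".
function validateEnvelope(raw: unknown): CommandEnvelope | null {
  if (typeof raw !== "object" || raw === null) return null;
  const cmds = (raw as { commands?: unknown }).commands;
  if (!Array.isArray(cmds)) return null;
  for (const c of cmds) {
    if (typeof c !== "object" || c === null) return null;
    if (!KNOWN_OPS.has((c as { op?: string }).op ?? "")) return null;
  }
  return raw as CommandEnvelope;
}
```

Anything that comes back `null` is refused outright and the app asks the model (or the user) to try again.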
3) Deterministic state engine (replayable)
I built the reducer so the whole session is replayable:
command history → reconstruct scene exactly
That made undo/redo reliable and eliminated “weird canvas drift” bugs.
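The reducer idea can be sketched like this (state shape simplified to illustrate replay; not the project's actual reducer):

```typescript
// Simplified state and commands, just to show the replay property.
interface Obj { id: string; x: number; y: number }
type State = { objects: Obj[] };
type Cmd =
  | { op: "add_object"; object: Obj }
  | { op: "update_object"; id: string; patch: Partial<Obj> }
  | { op: "delete_object"; id: string };

// Pure function: same state + same command always yields the same next state.
function reduce(state: State, cmd: Cmd): State {
  switch (cmd.op) {
    case "add_object":
      return { objects: [...state.objects, cmd.object] };
    case "update_object":
      return { objects: state.objects.map(o => (o.id === cmd.id ? { ...o, ...cmd.patch } : o)) };
    case "delete_object":
      return { objects: state.objects.filter(o => o.id !== cmd.id) };
  }
}

// Replaying the full history reconstructs the scene exactly;
// undo is just replaying history minus the last command.
const replay = (history: Cmd[]): State => history.reduce(reduce, { objects: [] });
```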
4) Server-authoritative state + WebSockets
The server holds the canonical scene state and broadcasts updates to all clients. That made debugging and multi-tab behavior way more stable (and it’s demo-friendly).
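Stripped of any real socket library, the pattern is just: apply the command on the server, then push the new canonical state to every connected client. A sketch with a hypothetical `ClientLike` interface standing in for a WebSocket:

```typescript
// Hypothetical stand-in for a WebSocket connection.
interface ClientLike {
  send(msg: string): void;
}

// After the server's reducer produces a new canonical state,
// every client (including the sender) gets the same update.
function broadcast(clients: Set<ClientLike>, state: unknown): void {
  const msg = JSON.stringify({ type: "scene_update", state });
  for (const c of clients) c.send(msg);
}
```

Because clients only render what the server sends, two tabs can never disagree about the scene.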
5) Voice → text pipeline (always-on)
This was the hardest part.
I tried to rely on Gemini-native live voice early on, but in practice it wasn’t stable enough for my “always-on short commands” requirement.
I tested:
- Deepgram (it worked, but accuracy/segmentation wasn't reliable for directional words like left/right/beside)
- Picovoice Cheetah, which I finally adopted for the live transcription pipeline; it gave me much more consistent results for short command phrases.
6) Latency fixes with caching
Once the pipeline worked, the next enemy was latency.
The biggest win was caching:
- the system prompt
- the JSON schema
- the examples / formatting rules
So each turn doesn’t re-send the same heavy instruction payload. That kept live drawing feeling responsive.
What I learned (the real lessons)
✅ “Hardcoding is the enemy of creativity”
At first, my “tree/house/mountain templates” were too rigid. The more I hardcoded, the more the system broke when the user asked for something slightly different.
Switching to primitives made the system open-ended.
✅ Schema discipline beats clever prompting
The system only became stable when I treated the schema like a contract:
- validate everything
- reject anything malformed
- never “half apply” commands
- keep preview vs committed rules strict
✅ Preview workflows are a superpower
The “ghost object” workflow solved a real UX problem:
- users often don’t know exactly what they want until they see it
- so the system should propose something safe, let them tweak it, then commit
This turned voice drawing into a smooth conversation instead of a one-shot gamble.
✅ Determinism matters more than I expected
Even with AI in the loop, the app needs to behave like engineering, not magic. Deterministic replay made bugs debuggable and made the experience consistent.
Challenges I faced (and how I solved them)
1) Vector shapes didn’t “look right” at the beginning
Early shapes felt off (proportions, layering, composition). The fix wasn’t “more art” — it was better structure:
- clearer primitive rules
- tighter bounds (0–100 grid)
- better default composition logic (horizon band, depth scaling, anchoring)
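One such default can be sketched as a depth-scaling rule on the 0-100 grid (the horizon position and scale range here are assumed values, not the project's actual constants):

```typescript
// Assumed horizon line on the 0-100 logical grid.
const HORIZON_Y = 40;

// Objects placed nearer the horizon render smaller, which keeps depth believable.
// y in [HORIZON_Y, 100] maps to a scale factor in [0.3, 1.0]; values above the
// horizon clamp to the minimum so sky-layer objects stay small.
function depthScale(y: number): number {
  const t = Math.min(Math.max((y - HORIZON_Y) / (100 - HORIZON_Y), 0), 1);
  return 0.3 + 0.7 * t;
}
```

The same shape of rule handles anchoring ("feet on the ground band") and overlap avoidance: all of it is arithmetic over the logical grid, not model judgment.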
2) JSON outputs weren’t always schema-valid
Even good models sometimes return slightly wrong field names or inconsistent shapes. My solution:
- enforce strict validation
- add normalization/repair only when it’s deterministic and safe
- otherwise refuse + ask for clarification
3) Composition and object placement
A scene isn’t just objects — it’s relationships:
- “house on the left”
- “tree beside it”
- “sun behind mountains”
- “path winding forward”
I had to teach the system to behave like a layout director, not just a generator:
- avoid overlaps
- keep depth believable
- scale objects by layer and horizon distance
4) Voice STT accuracy and segmentation
This nearly killed the project.
For short commands, small mistakes ruin everything:
- “left” ↔ “right”
- “tree” ↔ “three”
- merged utterances
Deepgram worked, but didn't consistently match my command style. Picovoice Cheetah ended up being more reliable for this specific use case.
5) Latency (the “feels live” problem)
Even if the model is correct, slow feels broken. Caching the schema/prompt + keeping scene summaries compact was the difference between:
- a cool tech demo, and
- an actually usable live experience
A bit of math (because my canvas is a logical world)
I use a normalized logical coordinate system:
- x, y ∈ [0, 100]
Mapping to screen pixels is:
x_px = (x / 100) · W,   y_px = (y / 100) · H
This made layout rules consistent across any screen size, and it made “move it slightly left” predictable.
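In code, the mapping is one line per axis:

```typescript
// Map logical (0-100) coordinates to pixels for a W x H canvas.
function toPixels(x: number, y: number, W: number, H: number): [number, number] {
  return [(x / 100) * W, (y / 100) * H];
}
```

So "move it slightly left" can be defined as a fixed logical delta (say, 5 units) and it means the same thing on every screen size.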
What I’d do next (if I had more time)
- Better “clarify which object?” UX (selection + disambiguation prompts)
- Stronger realism/styling pass (more painterly textures + consistent strokes)
- Multi-speaker / classroom mode (teacher guiding, student watching)
- Accessibility polish: spoken confirmations, reduced cognitive load UI, bigger transcript overlay
Closing thought
The main thing I’m proud of is that this isn’t “AI draws one picture.”
It’s a conversation with a canvas, where the system remembers, edits, and stays consistent — and where voice becomes a real input method for visual thinking.
That’s the direction I want: make creativity accessible even when drawing by hand is difficult.
Built With
- contracts
- express.js
- gemini
- genai
- konva.js
- node.js
- typescript/javascript
- vite
- vue
- websockets