Inspiration
Every kid has monsters in their head: the bully at school, the dark hallway at night, the feeling that they're not good enough. I wanted to build a game where a child doesn't just play against their fears; they draw them, name them, and shout them down, and Gemini 3 brings it all to life in real time.
The idea started with a question: what if the rules of a game weren't pre-authored, but generated at runtime from the player's own imagination? Not "AI in a game," but a game where AI defines the mechanics, including what winning means.
Beat-em-ups are visually rich but structurally simple: the perfect canvas for this idea. Every sprite, every background, every boss personality could be a generation target. I saw Gemini 3's native image generation, structured output, and the Live multimodal API and realized I could build something that can't exist as traditional software: a game where the player names a fear, sketches a hero, picks a boss persona, and presses Play, and Gemini compiles all of that into a playable world with its own rules for how to win.
I drew inspiration from Undertale's dialogue-driven personality, the sketch-to-life magic of Drawn to Life, and the cathartic punch-fest of Streets of Rage, fused with the idea that confronting your anxieties can be empowering when you control the narrative.
What it does
Defeat the Darkness is a browser-based beat-em-up where every visual asset and the victory conditions themselves are generated in real time by Gemini 3.
A child sketches their hero on an HTML5 canvas. They name their greatest fear. They pick a boss personality. At that moment, Gemini doesn't just generate art. It compiles the player's abstract input into a structured Boss Spec: the monster's archetype, ego drivers, hidden weak spots, what it can shrug off, its signature comebacks, and the semantic conditions for victory. That spec is locked for the run and becomes the rules of the game.
Then Gemini generates the world to match: a hero sprite sheet from the sketch, enemies born from the fear, an arena from the description, and a boss that embodies the chosen persona.
The core mechanic proves it's AI-native: the final boss has an ego shield that can't be broken with punches. To win, the player has to fracture the boss's ego by understanding what it values and exposing the mask, by talking into the microphone. Gemini Live evaluates each insult against the run's spec (targeting the weak spots, coherence, in-world logic, PG-13 safety constraints) and returns a structured function call that updates game state deterministically.
Example: if the player picks "trust fund elitist," the boss becomes "Kaiser the Silver Spoon Snob." Punching won't work. The player wins by calling out his insecurities: "Daddy's money bought your suit but not your talent." Gemini scores that as a critical hit (19-30 damage) because it targets the generated weak spot. When the ego shield shatters, the boss becomes physically vulnerable and the player can finish the fight with their fists.
The meta-message is simple: you don't beat fears by overpowering them. You beat them by understanding and reframing them.
Full loop:
- Sketch your hero -> Gemini 3 Pro generates a 4x3 animated sprite sheet from your drawing
- Design your weapon -> Gemini composites it into a powered-up "super" sprite
- Name your fear -> Gemini generates enemy sprites that embody what scares you
- Choose your battlefield -> Gemini creates a side-scrolling pixel-art arena
- Pick the boss's personality -> Gemini builds a structured JSON spec (archetype, weak spots, shrug-offs, comebacks, critical hints)
- Fight 3 waves -> Punch through 18 enemies across escalating waves with a super meter mechanic
- Face the Final Boss -> A real-time voice battle via Gemini Live where you fracture the boss's ego, then finish him with your fists
Every playthrough is unique. No pre-made sprites. No canned dialogue. No scripted boss fights. No pre-authored victory conditions.
How I built it
Three Gemini 3 capabilities working together as one pipeline:
1. Native Image Generation (Gemini 3 Pro)
Five Vercel serverless endpoints transform player input into game-ready pixel art:
- Hero & Super sprites: Player sketch -> 4x3 sprite sheet with walk, attack, idle, hurt, defeated, and respawn frames
- Enemy & Boss sprites: Text description of a fear -> animated enemy sprite (boss variant is scaled up with enhanced detail)
- Backgrounds: Location description -> side-scrolling arena with enforced walkable zones
All image generation uses responseModalities: ['TEXT', 'IMAGE'] with specific aspect ratios (16:9 for sprites, 21:9 for backgrounds).
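As a concrete sketch of that configuration, here is how each endpoint might assemble its request. This assumes the `@google/genai`-style request shape described above; the helper name and model id are illustrative, not taken from the project source.

```typescript
// Sketch of per-endpoint Gemini image request configuration.
// `buildImageRequest` and the model id are hypothetical names.
type AssetKind = 'hero' | 'super' | 'enemy' | 'boss' | 'background';

function buildImageRequest(kind: AssetKind, prompt: string) {
  // Sprites use 16:9 sheets; backgrounds use a wider 21:9 arena strip.
  const aspectRatio = kind === 'background' ? '21:9' : '16:9';
  return {
    model: 'gemini-3-pro-image', // illustrative model id
    contents: prompt,
    config: {
      responseModalities: ['TEXT', 'IMAGE'],
      imageConfig: { aspectRatio },
    },
  };
}

// Usage (inside a serverless endpoint):
// const res = await ai.models.generateContent(buildImageRequest('hero', prompt));
```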
2. Structured JSON Output (Gemini 3 Flash)
Used in two critical systems:
LLM-as-a-Judge: Every generated sprite and background is validated by a second Gemini call that scores layout correctness, animation consistency, and art style. Failed assets (score < 70) trigger automatic retry with a cap of 3 attempts before falling back to pre-made assets. This creates a self-correcting pipeline: generate -> validate -> retry -> fallback.
Boss Spec Generation: Given the player's chosen personality, Gemini returns a structured JSON object via responseMimeType: 'application/json' with a responseSchema enforcing exact fields: archetype, voiceStyle, weakSpots[], shrugOffs[], signatureComebacks[], and criticalHints[]. This spec feeds directly into the Live session's system prompt, so the boss's personality, vulnerabilities, and victory conditions are all dynamically generated and locked for the run.
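The spec's shape, and the kind of runtime guard worth running before locking it for the run, might look like this. The field names come from the writeup; the schema literal is a plain-JSON approximation of what gets passed as `responseSchema`, and `isBossSpec` is an illustrative defensive check, not project source.

```typescript
// Illustrative Boss Spec shape and schema.
interface BossSpec {
  archetype: string;
  voiceStyle: string;
  weakSpots: string[];
  shrugOffs: string[];
  signatureComebacks: string[];
  criticalHints: string[];
}

// Passed alongside responseMimeType: 'application/json'.
const bossSpecSchema = {
  type: 'object',
  properties: {
    archetype: { type: 'string' },
    voiceStyle: { type: 'string' },
    weakSpots: { type: 'array', items: { type: 'string' } },
    shrugOffs: { type: 'array', items: { type: 'string' } },
    signatureComebacks: { type: 'array', items: { type: 'string' } },
    criticalHints: { type: 'array', items: { type: 'string' } },
  },
  required: ['archetype', 'voiceStyle', 'weakSpots', 'shrugOffs',
             'signatureComebacks', 'criticalHints'],
};

// Defensive runtime check before the spec is locked for the run.
function isBossSpec(x: unknown): x is BossSpec {
  const o = x as Record<string, unknown>;
  return !!o && typeof o.archetype === 'string' &&
    typeof o.voiceStyle === 'string' &&
    ['weakSpots', 'shrugOffs', 'signatureComebacks', 'criticalHints']
      .every((k) => Array.isArray(o[k]));
}
```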
3. Gemini Live API (Real-Time Multimodal)
The final boss fight runs on a live WebSocket session configured with the generated Boss Spec:
- The browser streams microphone audio to Gemini Live (16kHz PCM via ScriptProcessorNode)
- The boss responds in real time with audio playback (24kHz gapless queue)
- Gemini evaluates each insult against the spec's weak spots and shrug-offs, then calls apply_ego_damage(amount, playerText, moderation) as a structured function call
- Damage is tiered deterministically: weak (1-8), solid (9-18), critical (19-30), or blocked for PG-13 violations
- The boss's ego shield shrinks visually as mental HP decreases
- When mental HP hits 0, the boss becomes physically vulnerable to punches
- Connection resilience: exponential backoff reconnection (1s -> 2s -> 4s -> cap 10s)
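The deterministic damage tiering can be sketched as a small pure function. The tier ranges and the `blocked_pg13` flag come from the writeup; the function name and exact clamping logic are my own illustration of how a client might sanitize the model's function-call arguments.

```typescript
type Tier = 'blocked' | 'weak' | 'solid' | 'critical';

// Maps the amount from an apply_ego_damage function call to a tier,
// clamping so a hallucinated number can never break the fight.
// Ranges per the design: weak 1-8, solid 9-18, critical 19-30.
function tierFor(amount: number, moderation: string): { tier: Tier; damage: number } {
  if (moderation === 'blocked_pg13') return { tier: 'blocked', damage: 0 };
  const dmg = Math.min(30, Math.max(1, Math.round(amount)));
  const tier: Tier = dmg >= 19 ? 'critical' : dmg >= 9 ? 'solid' : 'weak';
  return { tier, damage: dmg };
}
```

On a critical hit the clamped damage is then subtracted from the boss's mental HP, which is what drives the shrinking ego shield.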
Architecture:
- React 18 + TypeScript frontend with a custom HTML5 Canvas sprite engine
- Vite 5 build system
- Vercel Serverless Functions for all Gemini API calls
- Parallel asset generation during the interview flow: by the time the player finishes answering questions, assets are already generated and validated
- Client-side + server-side content safety filters with safe fallback assets
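The parallel-generation trick is essentially fire-and-forget promises keyed by asset, awaited only at the end of the interview. A minimal sketch (function names are illustrative):

```typescript
// Each interview answer kicks off its generation immediately;
// the loading screen only awaits whatever is still in flight.
const pending: Record<string, Promise<unknown>> = {};

function startGeneration(key: string, task: () => Promise<unknown>): void {
  pending[key] = task(); // runs in the background during the interview
}

async function awaitAllAssets(): Promise<Record<string, unknown>> {
  const keys = Object.keys(pending);
  const values = await Promise.all(keys.map((k) => pending[k]));
  return Object.fromEntries(keys.map((k, i) => [k, values[i]]));
}
```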
Challenges I ran into
Sprite sheet consistency was my biggest technical hurdle. Gemini's image generation doesn't always produce correctly formatted 4x3 grids: frames might be misaligned, inconsistently sized, or stylistically divergent across the row. I solved this by building a two-layer validation system: a deterministic pixel-level check (correct dimensions, grid alignment) followed by an LLM-as-a-Judge call that scores the sprite as a whole. Failed sprites retry up to 3 times before falling back to a pre-built asset pack, so the game never breaks.
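The deterministic layer is cheap: before spending an LLM-judge call, check that the image can even be sliced into a 4x3 grid. A sketch, with illustrative thresholds:

```typescript
// Cheap pre-check for a 4x3 sprite sheet: dimensions must divide
// evenly into the grid, and frames shouldn't be wildly skewed
// (skewed frames usually mean a misaligned grid).
const COLS = 4;
const ROWS = 3;

function isValidSheet(width: number, height: number): boolean {
  if (width % COLS !== 0 || height % ROWS !== 0) return false;
  const frameW = width / COLS;
  const frameH = height / ROWS;
  const ratio = frameW / frameH;
  return ratio > 0.5 && ratio < 2;
}
```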
I also initially tried to correct generated images by having the LLM judge recommend fixes. My theory is that as the image grows more complex, with every frame having its own intricacies, it becomes harder for Nano-banana to keep track of the changes, so it just keeps hallucinating and consistently produces worse results. It's easier to cut my losses and attempt regeneration, providing a reference image of what the output should look like.
Audio streaming for the boss fight was harder than expected. Getting reliable 16kHz PCM capture from browser microphones, streaming it over WebSocket to Gemini Live, and playing back 24kHz boss audio without gaps or pops required a custom audio queue system and careful buffer management. I also had to handle reconnection gracefully, the boss needs to "remember" the conversation context when the WebSocket drops.
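The capture side boils down to one standard transform: Web Audio hands you float samples in [-1, 1], while the Live API wants 16-bit little-endian PCM. This is the conversion that would run inside the ScriptProcessorNode callback (a textbook version, not the project's exact code):

```typescript
// Convert Web Audio float samples (-1..1) to 16-bit PCM for streaming.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to avoid wrap-around
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The resulting buffer is what gets base64-encoded and sent over the WebSocket each audio-process tick.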
Making the boss feel like a character, not a chatbot. The Live session system prompt has to do a lot of heavy lifting: it gives the boss a name ("Kaiser"), a specific personality derived from the generated spec, explicit weak spots and shrug-offs, and strict behavior rules (always call the damage function, keep responses to 1-2 sentences, stay in character). Getting that balance right between free-form conversation and deterministic game state updates took many iterations.
Prompt engineering for pixel art took dozens of iterations. Getting Gemini to consistently produce retro-styled, transparent-background sprite sheets with distinct animation frames, from a child's rough sketch, required very specific aspect ratio constraints, detailed frame-by-frame descriptions in the prompt, and negative prompts to avoid photorealistic output.
Content safety was critical given the game's audience (children confronting fears). I implemented multi-layer filtering: client-side word lists, server-side prompt sanitization, Gemini's built-in safety settings, and a blocked_pg13 moderation flag in the damage function that zeroes out damage for inappropriate content. Safe fallback assets catch any generation that trips a filter.
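The cheapest layer, the client-side word list, is simple enough to sketch; the word list here is illustrative, and anything that passes it still goes through server sanitization, Gemini's safety settings, and the `blocked_pg13` flag downstream.

```typescript
// First filter layer: a local word-list check before anything
// reaches the server. BLOCKED_WORDS is illustrative; the real list
// is longer and maintained server-side as well.
const BLOCKED_WORDS = ['gore', 'example-slur'];

function passesClientFilter(text: string): boolean {
  const lower = text.toLowerCase();
  return !BLOCKED_WORDS.some((w) => lower.includes(w));
}
```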
Accomplishments that I'm proud of
It's genuinely AI-native. This isn't a traditional game with AI features bolted on. The victory conditions themselves are generated by Gemini. What counts as a "critical hit" changes every run based on the boss spec the model compiled from the player's input. You can't pre-author this game because the rules don't exist until the player defines their fear and Gemini interprets it.
The "wow" moment works. When a kid draws a stick figure and watches it transform into an animated pixel hero that walks, punches, and powers up, that reaction is everything. That creative loop where AI feels like a partner, not a tool, is exactly what I set out to build.
Three Gemini capabilities, one seamless pipeline. Image generation, structured output, and the Live API aren't used in isolation. They form a pipeline where each capability flows into the next. The structured boss spec (Flash) feeds directly into the Live session. The LLM judge (Flash) validates the image generator's (Pro) output. It all compounds.
The boss fight is genuinely interactive. Thanks to the Live API's function calling, the boss doesn't just listen. It evaluates each insult against the spec's explicit weak spots, scores damage deterministically, and responds in character, all in real time over voice. It feels like arguing with a real character, not talking at a speech-to-text API.
The self-correcting pipeline actually works. Generate -> validate -> retry -> fallback means the game never shows a broken sprite. The LLM judge catches bad outputs that would break the game engine, and the fallback system guarantees a playable experience every time.
What I learned
LLM-as-a-Judge is underrated for generative pipelines. Using a fast model (Gemini Flash) to validate a powerful model's (Gemini Pro) output creates a reliable self-correcting loop. This pattern (generate, judge, retry) could apply to any application where model output needs to meet structural constraints.
The Live API's function calling changes game design. Being able to have Gemini evaluate conversational context and trigger structured game actions (damage amounts, moderation flags) in real time over a voice channel opens up mechanics that simply weren't possible before. The boss fight isn't speech-to-text plus rules, it's a single model that understands the conversation, scores it against a spec, and updates game state in one atomic operation.
Structured output is the bridge between creative AI and deterministic systems. The boss spec generation is the linchpin of the whole project. Without responseSchema enforcing exact fields, I'd have no reliable way to feed a generated personality into a Live session's system prompt. Structured output turns Gemini's creativity into something a game engine can consume.
Parallel generation is essential for UX. If I had waited for each asset to generate sequentially, the loading screen would have been 60+ seconds. Running generation concurrently during the interview flow makes the wait almost invisible.
Content safety isn't optional when kids are involved. Multi-layer filtering (client -> server -> model -> fallback) is table stakes for any generative experience aimed at younger users.
What's next for Defeat-your-darkness
- Multiplayer co-op - Two players sketch their heroes and fight side by side, each with their own AI-generated character
- Dynamic difficulty - Use Gemini to analyze player performance mid-game and adjust enemy count, speed, and boss difficulty in real-time
- Story mode - Gemini generates a narrative arc across multiple boss fights, with each boss themed to a different fear the player describes
- Mobile support - Touch controls + mobile mic for boss fights, making the game accessible on tablets where kids are most likely to play
- Community gallery - Players can share their generated heroes and boss encounters, creating a community of AI-generated content