Inspiration
Game development has traditionally been like carving stone: rigid, hard-coded, and frozen once shipped. If you want to change a rule or a visual asset, you have to rewrite code, recompile binaries, and push updates. This "wall of code" stops creative people from building multiplayer worlds.
We wanted to build something "liquid." We asked: What if the game engine wasn't a static binary, but a living conversation?
We were inspired by Dungeons & Dragons, where a Dungeon Master can invent a rule or describe a new monster in seconds. We wanted to bring that level of flexibility to digital gaming. We envisioned an engine where the "CPU" isn't a silicon chip running compiled logic, but Gemini 3, processing natural language to simulate reality in real-time.
What it does
Clay is the world's first "Native Multimodal Game Engine." It treats the entire game state—from the sprite sheet to the rulebook—as a multimodal context window.
- Instant Studio: You type a prompt (e.g., "A sci-fi chess game on a space station"), and Clay generates the entire game from scratch: the background, the sprites, the collision maps, and the rules.
- AI Referee: It doesn't write code; it is the code. When you move a character, Gemini checks the rules in real-time to decide if the move is valid, acting as an impartial Dungeon Master.
- Real-Time "God Mode": This is our killer feature. Because the game is just language and images, players can rewrite reality mid-game. You can say, "Turn the floor into lava," and the engine will instantly generate new lava graphics, hot-swap the background, and update the rulebook to include fire damage—all without restarting the session.
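To make the "God Mode" idea concrete, here is a minimal sketch of a reality-rewrite as a state patch. The `GameState` shape and field names are illustrative assumptions, not Clay's actual schema; in Clay the patch itself would be produced by Gemini from the player's prompt, while here it is applied directly.

```typescript
// Hypothetical game-state shape (field names are assumptions for illustration).
type GameState = {
  backgroundUrl: string;
  rulebook: string[];
  sprites: { id: string; x: number; y: number }[];
};

// A "God Mode" rewrite arrives as a patch: new assets plus new rules.
// Applying it hot-swaps the background and extends the rulebook without
// touching the rest of the state, so the session never restarts.
function applyGodModePatch(
  state: GameState,
  patch: { backgroundUrl?: string; newRules?: string[] }
): GameState {
  return {
    ...state,
    backgroundUrl: patch.backgroundUrl ?? state.backgroundUrl,
    rulebook: [...state.rulebook, ...(patch.newRules ?? [])],
  };
}

const before: GameState = {
  backgroundUrl: "/assets/stone-floor.png",
  rulebook: ["Units move 1 tile per turn"],
  sprites: [{ id: "knight", x: 3, y: 4 }],
};

// "Turn the floor into lava" becomes a background swap plus a fire-damage rule.
const after = applyGodModePatch(before, {
  backgroundUrl: "/assets/lava-floor.png",
  newRules: ["Units standing on lava take 1 fire damage per turn"],
});
```

Because the patch is pure data, it can be written through a real-time backend like Convex and every connected client picks up the new reality on the next sync.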
How we built it
We moved beyond "Text-to-Code" and built a "Text-to-State" architecture. See the README for the full details.
The Tech Stack:
- Runtime: Gemini 3 Flash (for speed) and Pro (for complex reasoning).
- Backend: Convex (for real-time database syncing and function execution).
- Frontend: React (for the visual interface).
The "Instant Studio" Pipeline:
To solve the problem of AI asset consistency, we built a 6-step pipeline that works backwards:
- The Scene Agent: Generates a single image of the entire game first to ensure style consistency.
- The Set Designer: Surgically removes characters to create the background.
- The Casting Director: Extracts and cuts out characters into transparent sprites.
- The Vision Agent: Scans the sheet to identify objects and assign them coordinate data.
- The Cartographer: Analyzes the background to draw a "NavMesh" (walkable grid) automatically.
- The Architect: Compiles all this data into a JSON game state.
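The data flow through the six stages can be sketched as a chain of functions. This is a simplified illustration, not Clay's implementation: the real stages call Gemini's image and vision models, while here each one is stubbed with placeholder data so the hand-offs between stages are visible and testable.

```typescript
type Sprite = { name: string; x: number; y: number };

// Stubbed stages (real versions are Gemini image/vision calls).
const sceneAgent = (prompt: string) => `scene for: ${prompt}`;
const setDesigner = (scene: string) => `${scene}, characters removed`;
const castingDirector = (_scene: string) => ["knight.png", "queen.png"];
const visionAgent = (_scene: string): Sprite[] => [
  { name: "knight", x: 2, y: 5 },
  { name: "queen", x: 6, y: 1 },
];
const cartographer = (_background: string): boolean[][] => [
  [true, true],
  [true, false], // one blocked tile in the walkable grid
];
const architect = (
  background: string,
  cutouts: string[],
  sprites: Sprite[],
  navMesh: boolean[][]
) => ({ background, cutouts, sprites, navMesh });

function instantStudio(prompt: string) {
  const scene = sceneAgent(prompt);         // 1. Scene Agent: one image, one style
  const background = setDesigner(scene);    // 2. Set Designer: strip characters
  const cutouts = castingDirector(scene);   // 3. Casting Director: transparent sprites
  const sprites = visionAgent(scene);       // 4. Vision Agent: objects + coordinates
  const navMesh = cartographer(background); // 5. Cartographer: walkable grid
  return architect(background, cutouts, sprites, navMesh); // 6. Architect: JSON state
}

const state = instantStudio("A sci-fi chess game on a space station");
```

The key property the chain preserves is that every downstream asset derives from the one image produced in step 1, which is what keeps styles from drifting.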
The "Sandwich" Architecture:
We separated Intent (what the user wants to do) from Execution (what happens). When a user drags a unit, we send that intent to Gemini. Gemini acts as the logic layer, processing the intent against the current rule set and returning a state update.
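The intent/execution split can be sketched as a referee function. In Clay the referee is Gemini evaluating the intent against the natural-language rulebook; in this illustrative sketch it is stubbed with one hard-coded rule so the round trip is testable, and all type and field names are assumptions.

```typescript
type Intent = { unitId: string; to: { x: number; y: number } };
type Unit = { id: string; x: number; y: number };
type Verdict = { valid: boolean; update?: Unit };

// Stand-in for Gemini: checks the intent against the rulebook (text strings)
// and either rejects it or returns the resulting state update.
function referee(intent: Intent, units: Unit[], rules: string[]): Verdict {
  const unit = units.find((u) => u.id === intent.unitId);
  if (!unit) return { valid: false };
  const distance =
    Math.abs(intent.to.x - unit.x) + Math.abs(intent.to.y - unit.y);
  // Hard-coded stand-in for one natural-language rule.
  if (rules.includes("Units move 1 tile per turn") && distance > 1) {
    return { valid: false };
  }
  return { valid: true, update: { ...unit, ...intent.to } };
}

const units: Unit[] = [{ id: "knight", x: 3, y: 4 }];
const rules = ["Units move 1 tile per turn"];

const ok = referee({ unitId: "knight", to: { x: 3, y: 5 } }, units, rules);  // legal
const bad = referee({ unitId: "knight", to: { x: 7, y: 7 } }, units, rules); // too far
```

The client never mutates state itself: it only submits intents, and the logic layer decides what actually happens.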
Challenges we ran into
- Visual Hallucination & Scale: Early on, generating assets one by one resulted in mismatched styles (e.g., a realistic knight in a cartoon castle).
- Solution: We developed the "Scene-First" approach. By generating the final screenshot first and then deconstructing it, we guaranteed perfect lighting and scale consistency every time.
- Latency vs. Logic: Using an LLM as a game loop is risky.
- Solution: We optimized heavily with Gemini 3 Flash. We also minimized the context window by keeping the state as lean JSON, ensuring turns process in near real-time.
- Sprite Extraction: Getting clean cutouts of characters from a generated image is difficult. We had to chain multiple vision tasks to "clean" the background artifacts from the sprites.
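The "lean JSON" trick mentioned above can be sketched as a projection over the full state: asset URLs and history stay in the database, and only the rule-relevant slice goes into the model's context each turn. The state shape and field names here are illustrative assumptions.

```typescript
// Hypothetical full game state (field names are assumptions).
type FullState = {
  backgroundUrl: string;
  history: string[];
  rulebook: string[];
  units: { id: string; x: number; y: number; hp: number }[];
};

// The referee only needs rules and positions, so everything else is
// dropped before the state is serialized into the prompt.
function leanContext(state: FullState) {
  return { rulebook: state.rulebook, units: state.units };
}

const full: FullState = {
  backgroundUrl: "/assets/space-station.png",
  history: ["knight moved to (3,4)"],
  rulebook: ["Units move 1 tile per turn"],
  units: [{ id: "knight", x: 3, y: 4, hp: 10 }],
};

const lean = leanContext(full);
```

A smaller serialized state means fewer input tokens per turn, which is what keeps a Flash-class model fast enough to sit inside the game loop.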
Accomplishments that we're proud of
- True "God Mode": Successfully hot-swapping a game's background and ruleset in the middle of a multiplayer match without crashing the client.
- No Hard-Coded Logic: We successfully built a playable game where physics and combat are defined purely by text strings. If you write "Knights can jump walls," the game actually lets them do it immediately.
- The "Liquid" Feel: We achieved a workflow where building a game feels as fluid as playing it.
What we learned
- LLMs make excellent State Machines: We learned that with the right prompting, Gemini is incredibly good at maintaining complex game states and enforcing rules impartially.
- Multimodality is the future of Game Dev: Treating images and text as a single data stream allows for workflows that traditional engines (Unity/Unreal) simply cannot replicate.
- Prompt Engineering is Game Design: In Clay, writing a good prompt is game design. We learned how to structure prompts to create balanced, fun mechanics rather than just random chaos.
What's next for Clay Game Engine
- Complex Physics Agents: Implementing a physics agent that can calculate trajectories and collisions for more dynamic gameplay.
- 3D Support: Expanding the "Scene-First" pipeline to generate 2.5D or fully 3D assets.
- Voice Control: Integrating Gemini Live to allow players to shout commands like "Fireball!" to cast spells.
- Public Gallery: Building a platform for users to share their "prompts" so others can fork and remix their game worlds.
Built With
- convex
- cv
- gemini
- ml
- natural-language-processing
- next.js
- pixijs
- react
- tailwind-css
- typescript