Inspiration
A lot of AI tools can describe a world, show a world, or animate a world. Very few let you actually step inside one.
We were inspired by a simple idea: someone should be able to type a single sentence and immediately explore it as a 3D space.
There is also a deeper use case: fantastical worlds can be more than entertainment. They can be safe spaces for exploration, conversation, and confidence-building. We wanted to build something that feels magical, but also points toward a real future for interactive AI environments.
What it does
WorldGen turns a text prompt into a playable 3D world in the browser.
A user describes a setting in just a sentence or two. WorldGen plans the scene, assembles it from structured components, and renders it as a world you can move through.
Inside the scene, you can walk around, explore the environment, and talk to NPCs that exist in the world itself. These characters can speak, respond, move, and maintain memory across interactions.
The result is not just generated content. It is a place you can enter and explore.
How we built it
We built WorldGen as a full agentic 3D world pipeline. On the orchestration side, we used Python + FastAPI with OpenRouter-routed models: K2 Think V2 for scene planning, Gemini for structured world building, and Qwen3-VL-32B as a visual critic. Those models generate and refine a typed Scene DSL, which is checked with Zod-based schemas and passed through a deterministic validator/compiler layer for asset checks, spatial rules, overlap detection, walkability, and scene assembly.
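To give a flavor of the deterministic validator layer, here is a minimal sketch of one pass, overlap detection between placed assets on the ground plane. The field names (`id`, `position`, `size`) are illustrative, not the real Scene DSL, and the actual checks are richer than this:

```typescript
// Minimal sketch of one deterministic validator pass: AABB overlap detection.
// Field names are illustrative, not the real Scene DSL.
interface PlacedAsset {
  id: string;
  position: { x: number; z: number }; // ground-plane placement
  size: { x: number; z: number };     // footprint extents
}

function overlaps(a: PlacedAsset, b: PlacedAsset): boolean {
  return (
    Math.abs(a.position.x - b.position.x) < (a.size.x + b.size.x) / 2 &&
    Math.abs(a.position.z - b.position.z) < (a.size.z + b.size.z) / 2
  );
}

// Report every pair of placed assets whose footprints intersect,
// so the compiler can reject or repair the scene before runtime.
function findOverlaps(assets: PlacedAsset[]): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < assets.length; i++) {
    for (let j = i + 1; j < assets.length; j++) {
      if (overlaps(assets[i], assets[j])) {
        pairs.push([assets[i].id, assets[j].id]);
      }
    }
  }
  return pairs;
}
```

Running checks like this before anything reaches the renderer is what lets the models stay creative without the scene silently breaking.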
On the frontend, we used Next.js, React, TypeScript, React Three Fiber, and three.js to render the world in the browser, with Rapier for physics, recast-navigation-js for runtime navmesh generation, and Zustand for client state. For persistence and real-time sync, we used Convex.
For characters, we added OpenRouter-driven dialogue, Azure Speech SDK for STT, TTS, and viseme lip sync, plus a character generation pipeline using FLUX on fal.ai, Meshy for image-to-3D, and GLB facial rig post-processing. Everything is organized in a pnpm + Turborepo repository with shared scene schemas, asset registries, and world engine packages.
Challenges we ran into
The hardest part was keeping generation structured.
If you let models generate too freely, they start inventing assets, fields, and scene logic that do not actually exist. We had to design a tight scene format and validate everything before runtime.
Another challenge was navigation. Because scenes are built dynamically, we could not rely on pre-authored navmeshes. We had to generate walkable navigation data at runtime so NPCs could move correctly.
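Conceptually, the runtime walkability problem reduces to reachability over walkable space. The real system builds a navmesh at runtime with recast-navigation-js; this grid flood-fill is only a simplified stand-in for the idea:

```typescript
// Simplified sketch of a walkability check: flood-fill the cells an NPC can
// reach on a grid where `true` = walkable. The production path uses a
// runtime-generated navmesh instead of a grid.
function reachableCells(
  grid: boolean[][],
  start: [number, number],
): Set<string> {
  const seen = new Set<string>();
  const queue: [number, number][] = [start];
  while (queue.length > 0) {
    const [r, c] = queue.shift()!;
    const key = `${r},${c}`;
    if (seen.has(key)) continue;
    if (r < 0 || c < 0 || r >= grid.length || c >= grid[0].length) continue;
    if (!grid[r][c]) continue; // blocked cell
    seen.add(key);
    queue.push([r + 1, c], [r - 1, c], [r, c + 1], [r, c - 1]);
  }
  return seen;
}
```

A check like this can flag generated scenes where an NPC spawn point is walled off from the rest of the world.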
We also ran into a lot of issues getting our local ASUS GPU setup working reliably. There were multiple compatibility issues across the model serving stack, which made local multimodal inference and iteration more painful than expected.
Accomplishments that we're proud of
We are proud that WorldGen works as a full pipeline, not just as a mockup.
A user can type a prompt, get a generated 3D world, walk through it, and interact with characters inside it.
We are also proud of the core architecture. Instead of asking a model to directly generate an entire game scene, we built a system where models work over structure and the engine handles validation and rendering. That made the experience much more stable.
Most of all, we are proud that the output feels like a world, not just an image.
What we learned
We learned that the boundary between AI and deterministic systems matters a lot.
The model should handle imagination, planning, and refinement. The engine should handle structure, validation, and execution. That split made the whole system much more reliable.
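One way to picture that split, with a hypothetical and much-simplified scene shape: the engine treats model output as untrusted text until it passes a validation gate, and only validated scenes ever reach rendering.

```typescript
// Sketch of the AI/engine boundary: model output is untrusted JSON until the
// engine's validator accepts it. The shape and asset names are illustrative.
interface Scene {
  name: string;
  assets: string[];
}

// A known-asset registry stands in for the real asset/schema checks.
const KNOWN_ASSETS = new Set(["tree", "hut", "lantern"]);

function parseScene(modelOutput: string): Scene | null {
  let raw: unknown;
  try {
    raw = JSON.parse(modelOutput);
  } catch {
    return null; // model emitted malformed JSON
  }
  if (typeof raw !== "object" || raw === null) return null;
  const obj = raw as Record<string, unknown>;
  if (typeof obj.name !== "string" || !Array.isArray(obj.assets)) return null;
  // Reject hallucinated assets the engine cannot render.
  if (!obj.assets.every((a) => typeof a === "string" && KNOWN_ASSETS.has(a))) {
    return null;
  }
  return { name: obj.name, assets: obj.assets as string[] };
}
```

The model is free to imagine anything on its side of the gate; the engine only ever executes what survives validation.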
We also learned that interactive world generation is very different from static media generation. Once the user can move, talk, and come back later, persistence, navigation, and embodiment become just as important as visual generation.
What's next for WorldGen
Next, we want to make the experience more seamless so world generation and world interaction feel like one continuous flow.
We also want to expand the set of world families, improve the quality of NPC behavior and memory, and add stronger visual critique using multimodal models.
Longer term, we want WorldGen to support richer persistent worlds, better character continuity, and more dynamic environments that evolve as the player explores.
Built With
- azure
- fal.ai
- gemini
- gpu
- javascript
- k2
- meshy
- next.js
- openrouter
- python
- qwen
- rapier
- react
- typescript
- zod