Inspiration
I kept running into the same wall when trying to build 3D worlds. Professional tools have brutal learning curves. Voxel builders limit you to blocks. AI generators produce flat images you can't walk through. I wanted to see what would happen if you could describe the things in your world — a pirate ship, a snow fox, an entire medieval village — and have them materialize as real 3D objects you can pick up, rearrange, and explore.
What it does
elsewhere is a 3D world-building studio powered by Gemini 3. You type what you want — "a cartoon dragon with green scales" — and Gemini generates a real 3D asset built from geometric primitives. It shows up in your library. You drag it into your world, scale it, rotate it, place it next to the castle you made five minutes ago.
The core loop is: Generate, Arrange, Iterate.
- Generate — Describe any object. Gemini 3 Flash outputs a structured schema (materials, geometry, hierarchy), then a deterministic compiler turns that schema into Three.js code; a sample schema is sketched after this list. No textures, no imported models — everything is primitives, which gives a consistent low-poly aesthetic.
- Arrange — Place assets on a 400 m × 400 m terrain grid across five biomes. Sculpt elevation, transform objects, scale from tiny props to massive structures.
- Iterate — Regenerate, request variations, or modify assets in place ("make it red," "add wings"). The system applies schema transforms and shows a before/after preview.
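For a concrete sense of what the model returns, here is a hypothetical asset schema in the spirit of the real one (field names and geometry vocabulary are illustrative, not the actual format):

```js
// Hypothetical asset schema: a constrained JSON vocabulary the
// compiler can translate into Three.js primitives.
const dragonSchema = {
  name: "cartoon_dragon",
  materials: {
    scales: { color: "#3faa4f", roughness: 0.8 },
    belly:  { color: "#e8d9a0", roughness: 0.9 },
  },
  parts: [
    { id: "body", geometry: "sphere", material: "scales",
      position: [0, 1.0, 0], scale: [1.2, 1.0, 1.6] },
    { id: "head", geometry: "sphere", material: "scales", parent: "body",
      position: [0, 0.6, 1.2], scale: [0.6, 0.6, 0.7] },
    { id: "belly", geometry: "capsule", material: "belly", parent: "body",
      position: [0, -0.2, 0.3], scale: [0.9, 0.8, 1.0] },
  ],
};
```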
Theme Packs let you describe a world theme like "haunted carnival," and Gemini plans a coherent set of 6-10 assets — characters, buildings, props, nature — that all belong together. One prompt populates an entire library.
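In this sketch, a theme pack plan is just a themed list of asset prompts that feed the normal generation pipeline one by one (the shape is illustrative):

```js
// Hypothetical theme-pack plan returned by the planner call.
const hauntedCarnivalPlan = {
  theme: "haunted carnival",
  assets: [
    { category: "building",  prompt: "derelict big-top tent with torn striped canvas" },
    { category: "character", prompt: "ghostly ringmaster in a faded red coat" },
    { category: "prop",      prompt: "rusted ticket booth with a crooked sign" },
    { category: "nature",    prompt: "dead oak tree strung with broken lanterns" },
  ],
};
```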
Scene Generation goes further. Describe a scene — "a quiet fishing village at dawn" — and Gemini plans the full composition, generates every asset, and places them. Then the system captures multi-angle screenshots and sends them back to Gemini's vision API for evaluation. Gemini looks at what it built, scores the layout, identifies problems ("the fountain clips through the stall," "the trees are all the same scale"), and proposes fixes. The system applies those fixes and loops — up to five iterations or until scores plateau. The AI reviews its own work visually, the same way you'd step back from an arrangement to see if it looks right.
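A minimal sketch of that loop, with hypothetical helper names standing in for the real pipeline stages:

```js
// Hypothetical driver for the generate / see / correct loop.
async function generateScene(description) {
  const plan = await planScene(description);   // Gemini: text -> composition plan
  let scene = await generateAssets(plan);      // Gemini: plan -> placed 3D assets

  let lastScore = -Infinity;
  for (let i = 0; i < 5; i++) {                // hard cap of five iterations
    const shots = await captureViews(scene);   // overview, ground level, top-down
    const review = await evaluateScene(shots); // Gemini vision: score + issues + fixes

    if (review.score <= lastScore) break;      // scores plateaued; stop iterating
    lastScore = review.score;
    if (review.issues.length === 0) break;     // nothing left to fix
    scene = applyFixes(scene, review.fixes);   // deterministic layout edits
  }
  return scene;
}
```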
How we built it
Preact + Three.js + Vite, with Gemini 3 as the backbone. Almost every user-facing feature routes through the Gemini API — the app makes four distinct types of Gemini call, each using a different combination of the model's capabilities.
Structured text generation with system instructions is the foundation. Every Gemini call uses a systemInstruction to set the model's role — one for asset schema planning, a different one for scene composition, another for evaluation, another for refinement. This means the same gemini-3-flash-preview model acts as a 3D designer, a spatial planner, a visual critic, and a layout editor across a single generation run, just by swapping system instructions. For asset creation specifically, Gemini receives the user's prompt and outputs a JSON schema — materials, parts, geometry types, parent hierarchies — which a deterministic compiler turns into Three.js code. The LLM never writes executable code directly; it describes what it wants in a constrained format, and deterministic code handles the how.
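As a sketch of what one of those calls looks like (assuming the @google/genai JS SDK; the instruction text and helper are illustrative):

```js
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// One model, many roles: the role lives in the system instruction,
// not in the model choice. Instruction text is illustrative.
const DESIGNER =
  "You are a 3D asset designer. Output only JSON matching the asset " +
  "schema: materials, parts, geometry types, parent hierarchy.";

async function generateAssetSchema(userPrompt) {
  const response = await ai.models.generateContent({
    model: "gemini-3-flash-preview",
    contents: userPrompt,
    config: {
      systemInstruction: DESIGNER,
      responseMimeType: "application/json",  // constrain output to JSON
      thinkingConfig: { thinkingBudget: 0 }, // well-constrained task, skip reasoning
    },
  });
  return JSON.parse(response.text); // schema goes to the deterministic compiler
}
```

Swapping DESIGNER for a planner, critic, or editor instruction reuses the same call path for the other three roles.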
Multi-image vision is what makes scene generation work. After assets are placed in the 3D world, three screenshots are captured — bird's-eye overview, ground-level at eye height, orthographic top-down — and packed into a single API call as inlineData parts alongside a text prompt. Gemini sees all three angles simultaneously and returns a structured quality evaluation: an overall score, a list of specific spatial issues, and concrete corrections. Each camera angle catches different problems — the overview reveals composition gaps, the ground view catches scale mismatches, and the top-down exposes clipping and overlap. This multi-image input is the core of the refinement loop; without it, the system has no way to judge its own spatial output.
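A sketch of the capture-and-evaluate step, reusing the `ai` client from the previous snippet (the prompt text and `CRITIC` instruction are illustrative):

```js
// Grab the current Three.js frame as base64 PNG data. Render right
// before reading so the drawing buffer is fresh, or create the
// renderer with preserveDrawingBuffer: true.
function captureView(renderer, scene, camera) {
  renderer.render(scene, camera);
  return renderer.domElement.toDataURL("image/png").split(",")[1];
}

async function evaluateScene(shots) {
  const response = await ai.models.generateContent({
    model: "gemini-3-flash-preview",
    contents: [{
      role: "user",
      parts: [
        { text: "Evaluate this 3D scene. Return JSON: { score, issues, fixes }." },
        { inlineData: { mimeType: "image/png", data: shots.overview } }, // bird's-eye
        { inlineData: { mimeType: "image/png", data: shots.ground } },   // eye height
        { inlineData: { mimeType: "image/png", data: shots.topDown } },  // orthographic
      ],
    }],
    config: { systemInstruction: CRITIC, responseMimeType: "application/json" },
  });
  return JSON.parse(response.text);
}
```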
Thinking budget control shapes how the model reasons. Gemini 3's thinkingConfig lets you set how much internal reasoning the model does before responding. For asset schema generation and scene planning, thinking is set to zero — the tasks are well-constrained by the system instruction and don't benefit from extended reasoning, so disabling it cuts latency significantly. A single scene generation run makes 10-15 Gemini calls across planning, asset creation, evaluation, and refinement; keeping thinking off for the structured-output calls is what makes the whole loop feel interactive rather than glacial.
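In request-config terms the knob is small (a sketch assuming the thinkingConfig field from the @google/genai SDK; check current docs for the exact field name):

```js
// Structured-output calls: skip internal reasoning to cut latency.
const structuredConfig = {
  responseMimeType: "application/json",
  thinkingConfig: { thinkingBudget: 0 },
};
// Vision evaluation calls: omit thinkingConfig and let the model
// use its default reasoning budget, where judgment actually helps.
```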
The renderer is a full Three.js pipeline with shadow-mapped lighting, post-processing (color grading, vignette, SMAA), and frustum-culled animation updates.
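A minimal version of that chain using stock three.js addon passes (not the project's exact setup) might look like:

```js
import { EffectComposer } from "three/addons/postprocessing/EffectComposer.js";
import { RenderPass } from "three/addons/postprocessing/RenderPass.js";
import { ShaderPass } from "three/addons/postprocessing/ShaderPass.js";
import { SMAAPass } from "three/addons/postprocessing/SMAAPass.js";
import { ColorCorrectionShader } from "three/addons/shaders/ColorCorrectionShader.js";
import { VignetteShader } from "three/addons/shaders/VignetteShader.js";

function buildComposer(renderer, scene, camera) {
  const composer = new EffectComposer(renderer);
  composer.addPass(new RenderPass(scene, camera));         // base scene render
  composer.addPass(new ShaderPass(ColorCorrectionShader)); // color grading
  composer.addPass(new ShaderPass(VignetteShader));        // vignette
  // Older three releases take (width, height); newer ones ignore the args.
  composer.addPass(new SMAAPass(innerWidth, innerHeight));
  return composer;
}

// In the render loop, call composer.render() instead of renderer.render().
```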
Challenges we ran into
Getting Gemini to produce usable 3D geometry was hard. Early versions asked the model to write Three.js code directly — wildly inconsistent results. The schema compiler was the breakthrough: constrain output to a strict JSON vocabulary, and failure modes become predictable and fixable.
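In spirit, the compiler is a lookup table over a fixed geometry vocabulary. A minimal sketch, consuming the hypothetical schema shape from earlier (not the project's actual code):

```js
import * as THREE from "three";

// Unit-sized geometry builders; per-part size comes from mesh.scale.
// An unknown geometry type fails loudly instead of producing broken code.
const GEOMETRIES = {
  box:      () => new THREE.BoxGeometry(1, 1, 1),
  sphere:   () => new THREE.SphereGeometry(0.5, 16, 12),
  capsule:  () => new THREE.CapsuleGeometry(0.25, 0.5, 4, 8),
  cylinder: () => new THREE.CylinderGeometry(0.5, 0.5, 1, 16),
};

function compileAsset(schema) {
  const root = new THREE.Group();
  const nodes = new Map();
  for (const part of schema.parts) {
    const build = GEOMETRIES[part.geometry];
    if (!build) throw new Error(`unknown geometry: ${part.geometry}`);
    const mat = schema.materials[part.material] ?? {};
    const mesh = new THREE.Mesh(
      build(),
      new THREE.MeshStandardMaterial({
        color: mat.color ?? "#ffffff",
        roughness: mat.roughness ?? 0.8,
      })
    );
    mesh.position.set(...(part.position ?? [0, 0, 0]));
    mesh.scale.set(...(part.scale ?? [1, 1, 1]));
    nodes.set(part.id, mesh);
    (nodes.get(part.parent) ?? root).add(mesh); // parent hierarchy, root fallback
  }
  return root;
}
```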
Spatial reasoning was the hardest problem. I spent a decent amount of time trying to get Gemini to place assets at precise coordinates through text alone. Coordinates would overlap, distances were wrong, scale ratios were nonsensical. So I stopped asking the model to reason about space through numbers and let it see the result instead. Gemini's vision capabilities turned out to be pretty good at "that tower is way too small" or "there's a gap in the southeast corner" — spatial judgments that text-based reasoning couldn't handle.
Accomplishments that we're proud of
Typing "a steampunk octopus riding a unicycle" and watching it appear as a 3D object you can pick up and place in a world. The schema compiler makes this reliable because Gemini only describes geometry in a constrained format, and the compiler handles the wiring.
The scene generator's generate, see, correct loop feels like a real use of multi-modality. Gemini plans a scene through text, builds it, photographs it from three angles, evaluates what it sees, fixes the problems, and re-evaluates. That's not a gimmick — it's the only reason spatial composition works at all.
What we learned
Constrain the AI's output format before going down a prompt engineering rabbit hole. Shrink the space of possible outputs until the failure modes are finite, enumerable, and easy to handle.
Use the right modality for the job. Text generation is great for structured planning — JSON schemas, asset hierarchies. Text is terrible for spatial evaluation. Switching from "describe what's wrong with the layout" to "here are three camera angles, what do you see?" changed everything. Multi-modality isn't about using every capability; it's about matching each sub-task to the modality where the model is strongest.
System instructions are underrated. A single model can play four or five different roles in one pipeline if you give it the right system instruction for each call. This turned out to be more effective than trying to craft one giant prompt that covers planning, evaluation, and refinement all at once. Each system instruction constrains the output format and evaluation criteria for that specific phase — the model is never confused about what it's supposed to be doing.
Gemini 3 Flash was the right call. The instinct is to reach for the biggest model, but for structured JSON output, Flash is fast, cheap, and proved to be very reliable. Disabling thinking for structured-output calls and keeping it available for vision evaluation was a meaningful optimization — it cut per-call latency without sacrificing quality where reasoning actually matters.
What's next for elsewhere
The scene generator's vision loop is the right foundation — next is richer evaluation prompts that understand narrative and mood, so a "battlefield aftermath" feels different from a "peaceful garden" in density, spacing, and sight lines. Sharing and collaboration would turn elsewhere from a tool into a community. The long-term goal is a platform where anyone can build and share 3D worlds without touching a single piece of 3D software.
Built With
- cloud
- docker
- gemini
- indexeddb
- javascript
- preact
- three.js
- vite
