Inspiration
We wanted to see if you could bridge natural language and Minecraft building. Instead of spending hours placing blocks manually, just describe what you want and watch it appear in front of you.
What it does
CastleForge lets you type a command in Minecraft such as /build a medieval castle, and a matching structure gets placed in the world in real time. Under the hood it runs a retrieval system that finds the 3D voxel model closest to your prompt, converts it to Minecraft blocks using a perceptual color matching pipeline, and animates the placement live in-game. There's also a web UI that shows a 3D preview before anything gets built.
How we built it
Three pieces working together: a Fabric mod that intercepts the /build command and handles animated block placement, a FastAPI backend that encodes the prompt with CLIP, queries a FAISS index of 123k voxel embeddings from the BlockGen-3D dataset, and converts the result to a schematic using an Oklab color matching pipeline, and a Next.js frontend with a Three.js 3D preview.
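To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor search by cosine similarity. It is not our backend code: a brute-force scan in plain Python stands in for the FAISS index, the embeddings are toy 3-dimensional vectors instead of real CLIP outputs, and the model ids and the names normalize and top_k are made up for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, index, k=1):
    """Return the k (model_id, score) pairs most similar to the query.

    `index` maps a hypothetical model id to its embedding. This brute-force
    scan stands in for an inner-product search over a FAISS index.
    """
    q = normalize(query)
    scored = [
        (model_id, sum(a * b for a, b in zip(q, normalize(emb))))
        for model_id, emb in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real CLIP ViT-B/32 embeddings have 512 dimensions.
toy_index = {
    "castle_0001": [0.9, 0.1, 0.0],
    "tree_0042": [0.0, 0.2, 0.9],
    "house_0007": [0.6, 0.7, 0.1],
}

# Pretend this vector is the encoded prompt "a medieval castle".
best_id, best_score = top_k([1.0, 0.2, 0.0], toy_index, k=1)[0]
print(best_id, round(best_score, 3))
```

Normalizing both sides means the ranking depends only on direction, not on vector length, which is the usual setup when searching text and shape embeddings by inner product.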
Challenges we ran into
Getting the diffusion model to produce meaningful output in 24 hours was not realistic. We trained a 3D U-Net on the full 515k-sample dataset, but the outputs collapsed to black concrete blobs, so we pivoted to retrieval, which actually works. Coordinate transforms between voxel space and Minecraft world space also caused a lot of pain.
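For a sense of what the voxel-to-world mapping involves, here is a rough sketch. The conventions are assumptions for illustration, not a description of our actual transform: the grid is indexed (x, y, z) with y pointing up, the build is centered horizontally on an origin block, and its bottom layer sits at the origin's height. The function names voxel_to_world and reorder_zyx are hypothetical.

```python
def voxel_to_world(voxel, origin, size=32):
    """Map a voxel index to an absolute Minecraft block position.

    Assumed conventions (illustrative only): voxel is (x, y, z) with y up,
    the grid is centered on `origin` in x and z, and layer y=0 is placed
    at the origin's height.
    """
    vx, vy, vz = voxel
    ox, oy, oz = origin
    for c in (vx, vy, vz):
        if not 0 <= c < size:
            raise ValueError(f"voxel coordinate {c} outside 0..{size - 1}")
    half = size // 2
    return (ox + vx - half, oy + vy, oz + vz - half)


def reorder_zyx(voxel_zyx):
    """Convert an index stored in (z, y, x) array order to (x, y, z)."""
    z, y, x = voxel_zyx
    return (x, y, z)


# A 32x32x32 grid centered on the block at (100, 64, 200).
print(voxel_to_world((0, 0, 0), (100, 64, 200)))     # one bottom corner
print(voxel_to_world((31, 31, 31), (100, 64, 200)))  # the opposite top corner
```

Axis order is the easy thing to get wrong here: an array stored as (z, y, x) that is read as (x, y, z) still produces a plausible-looking build, just mirrored or lying on its side.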
Accomplishments that we're proud of
The full pipeline actually works end to end. You type a prompt, something gets placed in Minecraft. The color matching using Oklab is genuinely good, and the block variety from the palette extracted from the game JAR looks way better than we expected. The animated placement is a nice touch.
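The idea behind the Oklab matching can be sketched in a few lines: convert the voxel color and every palette color from sRGB to Oklab, then pick the block with the smallest Euclidean distance. This is a minimal illustration, not our pipeline. The conversion uses the published Oklab matrices, but the three-entry palette and its RGB values are placeholders rather than colors extracted from the game, and the function names are hypothetical.

```python
import math


def srgb_to_linear(c):
    """Undo the sRGB transfer curve for one 0..255 channel."""
    c = c / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def rgb_to_oklab(rgb):
    """Convert an 8-bit sRGB triple to Oklab (L, a, b)."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    l = 0.4122214708 * r + 0.5363325363 * g + 0.0514459929 * b
    m = 0.2119034982 * r + 0.6806995451 * g + 0.1073969566 * b
    s = 0.0883024619 * r + 0.2817188376 * g + 0.6299787005 * b
    # l, m, s are non-negative for in-gamut sRGB, so ** (1/3) is safe.
    l_, m_, s_ = (x ** (1.0 / 3.0) for x in (l, m, s))
    return (
        0.2104542553 * l_ + 0.7936177850 * m_ - 0.0040720468 * s_,
        1.9779984951 * l_ - 2.4285922050 * m_ + 0.4505937099 * s_,
        0.0259040371 * l_ + 0.7827717662 * m_ - 0.8086757660 * s_,
    )


def nearest_block(rgb, palette):
    """Return the palette block whose color is closest to rgb in Oklab."""
    target = rgb_to_oklab(rgb)
    return min(palette, key=lambda name: math.dist(target, rgb_to_oklab(palette[name])))


# Placeholder palette: the RGB values are illustrative, not extracted textures.
PALETTE = {
    "minecraft:white_concrete": (207, 213, 214),
    "minecraft:black_concrete": (8, 10, 15),
    "minecraft:red_concrete": (142, 33, 33),
}

print(nearest_block((250, 250, 250), PALETTE))
print(nearest_block((150, 30, 30), PALETTE))
```

Measuring distance in Oklab rather than raw RGB is the point of the exercise: equal steps in Oklab are closer to equal perceived differences, so the chosen block tends to look right instead of just being numerically close.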
What we learned
Low training loss does not mean good generation. We learned that the hard way. CLIP retrieval is underrated for this kind of problem, and honestly it produces more reliable results than a half-trained diffusion model would have.
What's next for CastleForge
Next up is swapping the retrieval step for actual diffusion model inference. We have the training pipeline running on the full dataset right now, and the backend is already wired so the model can drop in as a replacement. Better prompts, a bigger palette, and structure scaling beyond 32x32x32 are all on the list.
Built With
- blockgen-3d-dataset
- clip-(vit-b/32)
- colour-science
- fabric
- faiss
- fastapi
- huggingface-datasets
- java
- mcschematic
- minecraft
- nbtlib
- next.js
- python
- pytorch
- react
- scikit-learn
- tailwind-css
- three.js
- typescript