CastleForge

Inspiration

We wanted to see if you could bridge natural language and Minecraft building. Instead of spending hours placing blocks manually, just describe what you want and watch it appear in front of you.

What it does

CastleForge lets you type a command like "/build a medieval castle" in Minecraft, and a matching structure gets placed in the world in real time. It runs a retrieval system under the hood that finds the closest matching 3D voxel model to your prompt, converts it to Minecraft blocks using a perceptual color matching pipeline, and animates the placement live in game. There's also a web UI with a 3D preview before it gets built.

How we built it

Three pieces working together. A Fabric mod that intercepts the /build command and handles animated block placement. A FastAPI backend that encodes the prompt with CLIP, queries a FAISS index of 123k voxel embeddings from the BlockGen-3D dataset, and converts the result to a schematic using an Oklab color matching pipeline. And a Next.js frontend with a Three.js 3D preview.
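To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor search by cosine similarity. It is not our backend code: it uses brute-force search in plain Python instead of FAISS, and toy 3-dimensional vectors instead of real CLIP embeddings. The names normalize, top_k and TOY_INDEX are made up for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]

def top_k(query, index, k=1):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = [
        (name, sum(a * b for a, b in zip(q, normalize(emb))))
        for name, emb in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy stand-ins for real embeddings (hypothetical values, not from the dataset).
TOY_INDEX = {
    "castle": [0.9, 0.1, 0.0],
    "treehouse": [0.1, 0.9, 0.2],
    "boat": [0.0, 0.2, 0.9],
}

# Pretend this vector is the encoded prompt.
best = top_k([0.8, 0.2, 0.1], TOY_INDEX, k=2)
print(best[0][0])
```

In the real pipeline the query vector would come from the CLIP text encoder and the search would go through the FAISS index, but the ranking idea is the same: normalize, take dot products, keep the highest scores.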

Challenges we ran into

Getting the diffusion model to produce meaningful output in 24 hours was not realistic. We trained a 3D U-Net on the full 515k-sample dataset, but the outputs collapsed to black concrete blobs, so we pivoted to retrieval, which actually works. Coordinate transforms between voxel space and Minecraft world space also caused a lot of pain.
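For a sense of what the coordinate transform involves, here is a rough sketch. It rests on assumptions this writeup doesn't spell out: the voxel grid is indexed (x, y, z) with y pointing up (as in Minecraft), the structure's minimum corner lands on an anchor block, and any rotation happens in 90 degree steps around the vertical axis. The function voxel_to_world and its parameters are hypothetical.

```python
def voxel_to_world(voxel, anchor, size, quarter_turns=0):
    """Map a voxel index (x, y, z) to a Minecraft block position.

    voxel: index inside the grid, each component starting at 0
    anchor: world position where the structure's minimum corner goes
    size: grid dimensions (sx, sy, sz)
    quarter_turns: number of 90 degree rotations about the vertical (Y) axis
    """
    x, y, z = voxel
    sx, _, sz = size
    for _ in range(quarter_turns % 4):
        # Rotate within the footprint so coordinates stay non-negative.
        x, z = sz - 1 - z, x
        # The footprint's width and depth swap after each quarter turn.
        sx, sz = sz, sx
    ax, ay, az = anchor
    return (ax + x, ay + y, az + z)

# A 32x32x32 grid anchored at a hypothetical world position.
corner = voxel_to_world((0, 0, 0), (100, 64, -20), (32, 32, 32))
print(corner)
```

Keeping the rotation inside the footprint means the structure always occupies the same box relative to the anchor, which makes off-by-one errors easier to spot.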

Accomplishments that we're proud of

The full pipeline actually works end to end. You type a prompt, something gets placed in Minecraft. The color matching using Oklab is genuinely good, and the block variety from the JAR-extracted palette looks way better than we expected. The animated placement is a nice touch.
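Here is a small sketch of what matching a color to a block in Oklab space can look like. It writes out the published Oklab conversion by hand instead of calling the colour-science library, and the palette is a handful of approximate, illustrative colors, not the palette we extracted from the JAR. The names srgb_to_oklab, nearest_block and TOY_PALETTE are hypothetical.

```python
import math

def srgb_to_linear(c):
    """Undo the sRGB transfer curve for one channel in the range 0..1."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def srgb_to_oklab(rgb):
    """Convert an 8-bit sRGB triple to Oklab (L, a, b)."""
    r, g, b = (srgb_to_linear(v / 255.0) for v in rgb)
    # Linear sRGB to an LMS-like cone response.
    l = 0.4122214708 * r + 0.5363325363 * g + 0.0514459929 * b
    m = 0.2119034982 * r + 0.6806995451 * g + 0.1073969566 * b
    s = 0.0883024619 * r + 0.2817188376 * g + 0.6299787005 * b
    # Cube root nonlinearity (inputs are non-negative here).
    l_, m_, s_ = l ** (1 / 3), m ** (1 / 3), s ** (1 / 3)
    return (
        0.2104542553 * l_ + 0.7936177850 * m_ - 0.0040720468 * s_,
        1.9779984951 * l_ - 2.4285922050 * m_ + 0.4505937099 * s_,
        0.0259040371 * l_ + 0.7827717662 * m_ - 0.8086757660 * s_,
    )

# Illustrative palette with approximate colors (assumed values).
TOY_PALETTE = {
    "stone": (125, 125, 125),
    "oak_planks": (162, 130, 78),
    "red_wool": (160, 39, 34),
    "white_concrete": (207, 213, 214),
    "black_concrete": (8, 10, 15),
}

def nearest_block(rgb, palette=TOY_PALETTE):
    """Pick the palette block whose color is closest in Oklab space."""
    target = srgb_to_oklab(rgb)
    return min(palette, key=lambda name: math.dist(target, srgb_to_oklab(palette[name])))

print(nearest_block((130, 128, 126)))
```

Measuring distance in Oklab rather than raw RGB tends to track how different two colors look to a person, which is why the matched blocks feel right.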

What we learned

Low training loss does not mean good generation. We learned that the hard way. CLIP retrieval is underrated for this kind of problem and, honestly, produces more reliable results than a half-trained diffusion model would have.

What's next for CastleForge

Swapping the retrieval step for actual diffusion model inference. We have the training pipeline running on the full dataset right now and the architecture is already wired to drop in as a replacement. Better prompts, bigger palette, and structure scaling beyond 32x32x32 are all on the list.

Built With

blockgen-3d-dataset
clip-(vit-b/32)
colour-science
fabric
faiss
fastapi
huggingface-datasets
java
mcschematic
minecraft
nbtlib
next.js
python
pytorch
react
scikit-learn
tailwind-css
three.js
typescript

Updates

Aryan Kumar started this project — Feb 22, 2026 11:55 AM EST
