Inspiration
As generative AI moves into 3D, creating 3D environments from text prompts is becoming a reality. However, current generation models often suffer from "Janus artifacts"—where an object makes sense from one angle but completely falls apart geometrically when viewed from behind. We wanted a way to systematically evaluate and score the spatial coherence of these generated 3D worlds, taking the manual QA out of the 3D generation pipeline.
What it does
Auto-Eval3D is an end-to-end evaluation pipeline that tests the structural consistency of 3D Gaussian Splats. You enter a prompt (like "Cozy medieval library"). The system calls the World Labs Marble API to generate the 3D scene, then automatically places a camera at the exact center of the generated mesh.
A custom Three.js WebGL capture rig then automatically takes over, rotating the camera internally to capture four distinct 90° perspectives (Front, Right, Back, Left). Finally, it ships these multi-view screenshots to Google Gemini 2.5 Pro, which analyzes the images and generates a Spatial Coherence Score (out of 10) along with detailed reasoning about any geometric collapses or artifacts it observed.
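The four-view sweep can be sketched in a few lines of Python. This is only an illustration of the geometry, not our actual Three.js rig: the function name look_targets, the distance parameter, and the choice of -Z as "Front" (the direction a default Three.js camera faces) are assumptions made for the example.

```python
import math

# View names come from the write-up; the yaw angles and the choice of -Z as
# "Front" are assumptions that mirror the default Three.js camera direction.
VIEWS = [("Front", 0), ("Right", 90), ("Back", 180), ("Left", 270)]


def look_targets(center, distance=1.0):
    """Return {view name: look-at point} for a camera sitting at `center`.

    The camera stays in place and only yaws around the vertical axis,
    so the Y coordinate of every target equals the camera's own Y.
    """
    cx, cy, cz = center
    targets = {}
    for name, yaw_deg in VIEWS:
        yaw = math.radians(yaw_deg)
        dx = math.sin(yaw)    # 0 at Front, +1 at Right
        dz = -math.cos(yaw)   # -1 at Front, +1 at Back
        # Round to hide floating-point noise such as 6e-17.
        targets[name] = (round(cx + distance * dx, 9), cy,
                         round(cz + distance * dz, 9))
    return targets


if __name__ == "__main__":
    for view, target in look_targets((0.0, 1.5, 0.0)).items():
        print(view, target)
```

Each target would be handed to the camera's look-at call before a frame is captured, giving one screenshot per view.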
How we built it
We architected a clean separation of concerns:
- Backend: A Python FastAPI server that acts as our proxy and orchestration layer. It handles the World Labs API polling (including rate-limit backoffs) and manages the multimodal request formulation for Gemini 2.5 Pro using the Vertex AI SDK. It also uses SQLite to store historical evaluation runs.
- Frontend: A premium, dark-mode vanilla JS frontend featuring a glassmorphism UI. We integrated the <gaussian-splat> Web Component and built a custom requestAnimationFrame automation rig in Three.js to hijack the OrbitControls, position the camera perfectly within the Y-down coordinate space, and extract canvas.toDataURL() base64 images seamlessly.
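The polling-with-backoff behavior on the backend can be sketched roughly as follows. This is a minimal, hypothetical version: fetch_status, the RateLimited exception, and the "state" field are placeholder names invented for the example, not the World Labs API or our production code.

```python
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) fetch function on a rate-limit response."""


def poll_until_done(fetch_status, base_delay=1.0, max_delay=30.0,
                    max_attempts=20, sleep=time.sleep):
    """Poll `fetch_status` until it reports completion.

    `fetch_status` is any zero-argument callable returning a dict with a
    "state" key (an assumed shape). On RateLimited the wait doubles, capped
    at `max_delay`; after a normal response it resets to `base_delay`.
    """
    delay = base_delay
    for _ in range(max_attempts):
        try:
            status = fetch_status()
        except RateLimited:
            sleep(delay)                       # back off before retrying
            delay = min(delay * 2, max_delay)  # exponential growth, capped
            continue
        if status.get("state") == "done":
            return status
        sleep(base_delay)                      # still generating: short wait
        delay = base_delay
    raise TimeoutError("generation did not finish within max_attempts polls")
```

Passing sleep as a parameter keeps the loop easy to test without real waiting; in a FastAPI handler an asyncio-based variant would likely be a better fit.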
Challenges we ran into
Integrating cutting-edge 3D model formats with standard web viewers was mathematically challenging. The generative model outputs scenes using the COLMAP coordinate system (Y is down), while our standard rendering pipeline (Three.js) expects Y to be up! This resulted in our scenes initially rendering completely upside-down and the automated camera rig spinning around the outside of the mesh looking into empty space.
We had to implement a local-to-world transformation matrix that rotates the splat mesh 180° around the X-axis, then dynamically recalculate the geometric center of the bounding box in the rotated frame so the automated camera could correctly "step inside" the world. Additionally, our capture render loop clashed with OrbitControls, which updates asynchronously, so coordinating the two required a carefully managed isCapturing state flag.
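The math behind that fix is small enough to sketch in plain Python. A 180° rotation about the X-axis maps (x, y, z) to (x, -y, -z), and the bounding-box center has to be computed from the rotated points (or rotated itself) so the camera lands inside the scene. The helper names below are illustrative, not taken from our codebase.

```python
def rotate_x_180(point):
    """Rotate a point 180 degrees about the X-axis: (x, y, z) -> (x, -y, -z).

    This turns a Y-down frame (the COLMAP convention) into a Y-up one.
    """
    x, y, z = point
    return (x, -y, -z)


def bbox_center(points):
    """Center of the axis-aligned bounding box of a list of 3D points."""
    xs, ys, zs = zip(*points)
    return ((min(xs) + max(xs)) / 2,
            (min(ys) + max(ys)) / 2,
            (min(zs) + max(zs)) / 2)


if __name__ == "__main__":
    splat_points = [(0, 0, 0), (2, 4, 6)]            # toy Y-down data
    world_points = [rotate_x_180(p) for p in splat_points]
    print(bbox_center(splat_points))                 # center before the flip
    print(bbox_center(world_points))                 # center after the flip
```

Using the pre-rotation center after flipping the scene is one way to reproduce the bug described above: the camera ends up mirrored across the X-axis and looks at the scene from outside.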
Accomplishments that we're proud of
We built a fully automated "inside-out" camera rig. Taking a raw, untested 3D generation, dropping a camera directly into its heart, rotating it through 360°, and getting a detailed reasoning response back from an LLM in under 60 seconds is the result we are proudest of. The UI also looks polished and production-ready for a hackathon timeline!
What we learned
We learned a massive amount about the mathematics of 3D coordinate spaces, Euler rotations, bounding box calculations, and how to effectively prompt multimodal Large Language Models (like Gemini 2.5 Pro) to understand structural geometry across multiple 2D image perspectives.
What's next for Auto-Eval3D
Currently, the World Labs API has strict rate limits. Next, we would implement full asynchronous Webhooks instead of HTTP polling, allowing us to evaluate fleets of hundreds of generations in parallel. We'd also love to integrate ACES tone mapping to make the viewer even more beautiful and create a global leaderboard of the most "spatially coherent" 3D prompts!
Built With
- css
- fastapi
- google-cloud
- google-gemini
- html
- javascript
- python
- sqlite
- three.js
- vertex-ai
- world-labs-marble-api