Inspiration
Version control works perfectly for code — but completely breaks for 3D scenes.
When a .blend file changes, Git just says:
binary file changed
It doesn’t tell you:
- what changed
- where it changed
- or why the scene now looks different
In real production workflows (games, architecture, VFX), artists constantly tweak lighting, camera position, and geometry together. A building might be scaled up while the camera moves back, making the final render look identical — but Git treats it as an opaque binary change.
We realized the real problem wasn’t diffing files. It was understanding intent and perception in 3D scenes.
So we asked:
What if scene diffs worked like pull requests for 3D?
That idea became MeshMerge.
What it does
MeshMerge is an AI-powered scene diff engine that explains why a 3D scene changed — not just what changed.
Given two .blend files, MeshMerge:
- Extracts structured scene data (objects, transforms, lights, camera)
- Computes deterministic structural diffs
- Performs a visual image diff
- Detects ambiguous situations (camera vs geometry, lighting vs material)
- Uses Gemini to reason about causal relationships
- Generates:
  - a Git-style changelog
  - a semantic JSON report
  - a visual heatmap
  - a PDF review report
It can identify cases like:
- Object scaled but camera moved → visual size unchanged
- Lighting changed but materials same → perceptual shift
- Multiple objects moved together → parent transform
- Scene looks different but no geometry changed → camera/lighting cause
MeshMerge essentially adds a reasoning layer to 3D version control.
How we built it
MeshMerge is a modular multimodal pipeline built with Python and Blender automation.
1. Scene Extraction
We used Blender’s Python API (bpy) to export structured scene data:
- object transforms
- bounding boxes
- mesh stats
- lights
- camera
We also render viewport images for visual comparison.
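The extraction step can be sketched roughly as below. The helper and file layout are our own illustration; the `bpy` attributes (`location`, `rotation_euler`, `scale`) are the real Blender API, but the actual exporter captures more (bounding boxes, mesh stats, lights, camera).

```python
import json

def transform_record(name, location, rotation_euler, scale):
    """Serialize one object's transform as plain rounded floats
    so later diffs are deterministic and tolerance-friendly."""
    return {
        "name": name,
        "location": [round(v, 6) for v in location],
        "rotation_euler": [round(v, 6) for v in rotation_euler],
        "scale": [round(v, 6) for v in scale],
    }

def export_scene(path):
    """Run inside Blender (headless or not): dump every object's
    transform to a JSON snapshot for the diff engine."""
    import bpy  # only available inside Blender's bundled Python
    records = [
        transform_record(o.name, o.location, o.rotation_euler, o.scale)
        for o in bpy.data.objects
    ]
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
```

Rounding in `transform_record` keeps snapshots stable across Blender runs, so two identical scenes always diff clean.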
2. Deterministic Diff Engine
A structural diff compares:
- transforms
- scale
- mesh stats
- materials
- object additions/removals
This produces a ground-truth change list.
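A minimal sketch of the structural diff over two transform snapshots (function and field names are ours; the real engine also covers mesh stats and materials):

```python
def diff_transforms(before, after, tol=1e-5):
    """Compare two {name: transform_dict} snapshots and report
    additions, removals, and per-field changes above a tolerance."""
    changes = []
    for name in sorted(set(before) | set(after)):
        if name not in before:
            changes.append({"object": name, "kind": "added"})
        elif name not in after:
            changes.append({"object": name, "kind": "removed"})
        else:
            for field in ("location", "rotation_euler", "scale"):
                a, b = before[name][field], after[name][field]
                if any(abs(x - y) > tol for x, y in zip(a, b)):
                    changes.append({"object": name, "kind": field,
                                    "from": a, "to": b})
    return changes
```

The tolerance matters: floating-point noise from re-saving a .blend file should never show up as a change.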
3. Visual Diff
We run an image diff using NumPy + Pillow to detect changed regions and generate heatmaps.
This tells us where the scene looks different.
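The core of the image diff is a few lines of NumPy (a simplified sketch; our pipeline also uses Pillow for loading and heatmap rendering, and the threshold here is illustrative):

```python
import numpy as np

def visual_diff(img_a, img_b, threshold=0.05):
    """Per-pixel absolute difference of two float RGB arrays in [0, 1].
    Returns a grayscale heatmap and the fraction of changed pixels."""
    delta = np.abs(img_a.astype(np.float32) - img_b.astype(np.float32))
    heat = delta.mean(axis=-1)              # collapse RGB to one channel
    changed = float((heat > threshold).mean())
    return heat, changed
```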
4. Vision Correlation
We project object bounds into screen space and correlate structural changes with visual regions to confirm which changes are perceptually visible.
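The heuristic projection boils down to a pinhole model: divide camera-space x and y by depth, scale by the focal ratio, and map to pixels. A sketch under assumed defaults (50mm focal length, 36mm sensor; parameter names are ours):

```python
def project_point(p_cam, focal=50.0, sensor=36.0, res=(1920, 1080)):
    """Heuristic pinhole projection: map a camera-space point
    (Blender convention: z < 0 is in front of the camera) to pixels."""
    x, y, z = p_cam
    if z >= 0:
        return None  # behind the camera, not visible
    u = (focal / sensor) * (x / -z)   # normalized image-plane coords
    v = (focal / sensor) * (y / -z)
    px = (u + 0.5) * res[0]           # shift origin to image corner
    py = (0.5 - v) * res[1]           # flip y: screen y grows downward
    return px, py
```

Projecting the eight corners of each bounding box this way gives a screen-space rectangle to intersect with the heatmap regions.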
5. Ambiguity Detection Layer
This is where things get interesting.
We detect cases where deterministic logic cannot decide the cause:
- camera vs geometry
- lighting vs material
- perceptual-only changes
- cascading transforms
These are passed as hypotheses.
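In spirit, the ambiguity layer is a set of rules over the deterministic outputs. A simplified sketch (the change kinds and thresholds are illustrative, not the tuned production values):

```python
def ambiguity_hypotheses(structural_changes, visual_changed_fraction):
    """Flag cases the deterministic layer cannot resolve on its own;
    each hypothesis is handed to the reasoning engine downstream."""
    kinds = {c["kind"] for c in structural_changes}
    hyps = []
    if "camera" in kinds and kinds & {"location", "scale"}:
        hyps.append("camera_vs_geometry")
    if "light" in kinds and "material" in kinds:
        hyps.append("lighting_vs_material")
    if not kinds and visual_changed_fraction > 0.01:
        hyps.append("perceptual_only")   # looks different, no geometry moved
    return hyps
```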
6. Gemini Reasoning Engine
Gemini acts as a causal inference layer.
It receives:
- structural diffs
- visual regions
- ambiguity hypotheses
- camera depth metrics
- scene JSON
- before/after viewport images
It resolves contradictions and produces a structured semantic report explaining the scene changes.
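We omit the API call itself, but the evidence bundle handed to the model might be assembled like this (field names are illustrative; the real schema is stricter):

```python
def build_reasoning_payload(structural_diff, visual_regions, hypotheses):
    """Bundle all evidence plus a required output schema into one
    structured prompt, so the model must resolve the hypotheses
    rather than re-summarize the raw diff."""
    return {
        "task": "Explain the causal relationship behind these scene changes.",
        "structural_diff": structural_diff,
        "visual_regions": visual_regions,
        "hypotheses": hypotheses,
        "required_output_schema": {
            "cause": "string",
            "confidence": "low|medium|high",
            "explanation": "string",
        },
    }
```

Forcing a schema with an explicit `confidence` field is what lets the system admit uncertainty instead of always asserting a cause.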
7. Output Generation
Finally we generate:
- CHANGELOG.md (human-readable)
- semantic_scene_report.json
- annotated images
- PDF report
Challenges we ran into
1. 3D → 2D projection problem. Mapping object bounds to screen space without a full renderer was tricky. We built a heuristic projection system to approximate visual overlap.
2. Camera vs geometry ambiguity. A scaled object and a moved camera can cancel each other visually. Detecting this required computing camera distance deltas and reasoning about apparent size.
3. Multimodal reasoning. We didn't want Gemini to just summarize data; it needed to resolve contradictions. Designing the prompt + schema to force causal reasoning took several iterations.
4. Blender automation reliability. Running Blender headless across environments required careful scripting and path handling.
5. Report generation. Keeping long text contained in PDFs and ensuring visuals aligned correctly required custom layout handling.
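The camera-vs-geometry cancellation in challenge 2 reduces to a simple ratio: apparent (angular) size scales roughly as object size over camera distance. A sketch of the check (our own helper, not the production code):

```python
def apparent_size_ratio(size_before, dist_before, size_after, dist_after):
    """Ratio of apparent sizes before and after. A value near 1.0 means
    a scale change was visually cancelled by a camera distance change."""
    return (size_after / dist_after) / (size_before / dist_before)
```

So an object scaled 2x while the camera backs off from 10 to 20 units yields a ratio of 1.0: structurally changed, perceptually identical.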
Accomplishments that we're proud of
- Built a full end-to-end pipeline from .blend → AI changelog
- Created a system that explains perceptual vs structural changes
- Successfully detected camera vs scale compensation scenarios
- Generated human-readable scene diffs automatically
- Produced visual + semantic reports for review workflows
- Designed a clean, reproducible pipeline for judges to run locally
Most importantly:
We turned a binary file diff into an explainable scene narrative.
What we learned
- Multimodal reasoning works best when deterministic systems feed structured context
- LLMs are powerful when resolving ambiguity, not just summarizing
- 3D version control is a real unsolved problem in creative pipelines
- Clear schema design dramatically improves AI reliability
- Visual + structural data together unlock much richer insights
We also learned how to design AI systems that:
- admit uncertainty
- explain decisions
- and justify conclusions
What's next for MeshMerge
Short term
- Depth-accurate projection instead of heuristic projection
- Parent-child scene graph detection
- More robust ambiguity classification
- Side-by-side diff UI
Mid term
- Blender plugin for real-time scene diff
- GitHub integration for .blend pull requests
- Timeline diff for animation sequences
- Scene merge conflict detection
Long term vision
MeshMerge becomes:
“GitHub for 3D scenes”
Where artists can review scene changes like code diffs:
- visual
- semantic
- explainable
Final Thought
MeshMerge isn’t just a diff tool.
It’s a scene reasoning engine.
It answers the question:
Why does this scene look different?
And that’s something traditional version control has never been able to do.