SpatialLens - Bridging the Metric Spatial Gap in VLMs

Using brick dimensions as natural calibration to give Vision-Language Models the metric spatial understanding they lack.

The Problem

Vision-Language Models can see a brick wall, but they can't measure it. Ask GPT-5, Claude, or Gemini whether mortar joints in a construction image meet the 3/8" BIA specification, and they'll give you vague, often incorrect answers. This is a task any apprentice mason does by eye in seconds.

Why? Prior research reveals three compounding failures:

VLMs allocate only ~10% attention to image tokens during spatial tasks — they're barely looking (Chen et al., ICML 2025)
VLMs fail under occlusion, but structured coordinate context dramatically helps (CAPTURE, ICCV 2025)
VLMs lack metric spatial understanding — they can't estimate real-world distances from images (SpatialVLM, CVPR 2024)

Our Insight

Bricks are natural rulers.

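A standard modular brick is nominally 2.25 inches (2 1/4") tall. This is a known dimension visible in every frame of masonry footage. By detecting bricks in an image, we can compute a pixels-per-inch calibration with no depth camera and no LiDAR, just geometry, and convert pixel measurements to real-world inches. Because bricks and mortar joints lie in the same wall plane, the scale measured on a brick also applies to the adjacent joints, provided the view is close to head-on.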

Brick height in pixels ÷ 2.25" = Pixels per inch
Mortar joint width in pixels ÷ Pixels per inch = Joint thickness in inches
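
As a minimal sketch of the two formulas above, the following Python shows the arithmetic end to end. The function names and the 90 px / 15 px sample values are hypothetical, chosen only so the numbers come out round; they are not measurements from this project.

```python
# Nominal height of a standard modular brick, as stated in the text.
MODULAR_BRICK_HEIGHT_IN = 2.25


def pixels_per_inch(brick_height_px, brick_height_in=MODULAR_BRICK_HEIGHT_IN):
    """Scale factor: brick height in pixels divided by brick height in inches."""
    if brick_height_px <= 0 or brick_height_in <= 0:
        raise ValueError("heights must be positive")
    return brick_height_px / brick_height_in


def joint_thickness_in(joint_px, ppi):
    """Convert a joint width in pixels to inches using the calibration."""
    if ppi <= 0:
        raise ValueError("pixels per inch must be positive")
    return joint_px / ppi


# Hypothetical example: a brick that is 90 px tall and a joint that is 15 px wide.
ppi = pixels_per_inch(90.0)              # 40.0 pixels per inch
print(joint_thickness_in(15.0, ppi))     # 0.375, i.e. a 3/8" joint
```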

We then inject these measurements as structured text context into the VLM prompt, giving it the metric spatial grounding it otherwise lacks.
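
One way the injected context could be assembled is sketched below. The exact wording and layout of the text block are assumptions for illustration; the source does not specify the format it uses, and `build_context` is a hypothetical helper name.

```python
def build_context(ppi, joints_in, spec_in=0.375):
    """Render calibrated measurements as a plain-text block for the prompt.

    spec_in defaults to 0.375 (the 3/8" specification mentioned in the text).
    """
    lines = [
        "CALIBRATED MEASUREMENTS (computed from brick height, not estimated by eye):",
        f"- Scale: {ppi:.1f} pixels per inch",
        f"- Specified joint thickness: {spec_in:.3f} in",
    ]
    for i, thickness in enumerate(joints_in, start=1):
        deviation = thickness - spec_in
        lines.append(f"- Joint {i}: {thickness:.2f} in (deviation {deviation:+.3f} in)")
    return "\n".join(lines)


# Hypothetical measurements; the question text is also only an example.
context = build_context(40.0, [0.34, 0.58])
prompt = context + "\n\nUsing the measurements above, do the mortar joints meet spec?"
print(prompt)
```

Keeping the measurements in a labeled, line-per-joint layout makes it easy for the model to cite a specific joint by number in its verdict.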

Method

Frame -> Brick Detection (Gemini 2.5 Flash bounding boxes)
      -> Calibration (brick height px -> pixels/inch)  
      -> Joint Measurement (joint thickness px -> inches)
      -> Structured Context Injection (measurements as text)
      -> Enhanced VLM Assessment (image + measurements -> verdict)

Key design choice: We use the VLM itself for detection, then do simple geometry for calibration, then re-prompt the same VLM with injected measurements. No custom models, no training, no specialized hardware.
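
The "simple geometry" step might look like the sketch below. It assumes the detector returns each box as [ymin, xmin, ymax, xmax] normalized to a 0 to 1000 range (the convention Gemini's image-understanding documentation describes), and it assumes one representative box per course when measuring the gaps between courses. Both are assumptions for illustration, and the helper names are hypothetical; the source does not say how joints are located.

```python
from statistics import median


def box_to_pixels(box, img_w, img_h):
    """Convert a [ymin, xmin, ymax, xmax] box on a 0-1000 scale to pixel coordinates."""
    ymin, xmin, ymax, xmax = box
    return (ymin * img_h / 1000, xmin * img_w / 1000,
            ymax * img_h / 1000, xmax * img_w / 1000)


def median_brick_height_px(boxes, img_w, img_h):
    """Median brick height in pixels; the median damps a few bad detections."""
    heights = []
    for box in boxes:
        ymin, _, ymax, _ = box_to_pixels(box, img_w, img_h)
        heights.append(ymax - ymin)
    if not heights:
        raise ValueError("no bricks detected")
    return median(heights)


def bed_joint_gaps_px(course_boxes, img_w, img_h):
    """Vertical gaps between consecutive courses, given one box per course (assumption)."""
    px = sorted(box_to_pixels(b, img_w, img_h) for b in course_boxes)  # sorted by ymin
    gaps = [lower[0] - upper[2] for upper, lower in zip(px, px[1:])]
    return [g for g in gaps if g > 0]  # overlapping boxes give no usable gap


# Hypothetical detections for a 3000 x 2000 px frame.
boxes = [[205, 0, 295, 300], [100, 0, 190, 300], [310, 0, 400, 300]]
print(median_brick_height_px(boxes, img_w=3000, img_h=2000))  # 180.0
print(bed_joint_gaps_px(boxes, img_w=3000, img_h=2000))       # [30.0, 30.0]
```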

Results

Metric	Baseline (Raw VLM)	Enhanced (SpatialLens)
Gives specific verdict	X/N	X/N
Cites real measurements	X/N	X/N
Correct verdict (vs ground truth)	X/N	X/N

Key Findings

Baseline VLMs hedge and hallucinate measurements. When asked about mortar joints, they produce vague assessments ("joints appear roughly consistent") or fabricate numbers not grounded in actual pixel analysis.
Structured context injection produces specific, grounded analysis. With real measurements injected, the same model cites exact numbers, identifies specific problem joints, and gives actionable recommendations.
The brick-as-calibration principle generalizes. Any known-dimension object visible in frame (rebar, lumber, concrete blocks) can serve the same role.
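
To illustrate the generalization, the calibration can be keyed on a small lookup of reference dimensions. This is a sketch only: the table name, keys, and function are hypothetical, and the non-brick values are commonly cited nominal sizes that should be verified against the actual product on site.

```python
# Known dimensions in inches. The brick height comes from the text; the other
# entries are commonly cited nominal values and are assumptions here.
REFERENCE_DIMENSIONS_IN = {
    "modular_brick_height": 2.25,
    "cmu_block_height": 7.625,      # standard concrete block, actual height
    "lumber_2x4_thickness": 1.5,    # actual thickness of nominal 2x4 lumber
    "rebar_no4_diameter": 0.5,      # nominal diameter of #4 rebar
}


def scale_from_reference(kind, measured_px):
    """Pixels per inch from any reference object with a known dimension."""
    known_in = REFERENCE_DIMENSIONS_IN[kind]
    if measured_px <= 0:
        raise ValueError("measured size must be positive")
    return measured_px / known_in


# Hypothetical example: a 2x4 edge that spans 60 px in the frame.
print(scale_from_reference("lumber_2x4_thickness", 60.0))  # 40.0
```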

Example Output

Baseline (Raw Gemini 2.5 Flash)

"The mortar joints appear to be relatively consistent... approximately 3/8 inch... overall the wall appears to be well-constructed. PASS"

Enhanced (SpatialLens)

"Based on the calibrated measurements, Joint 3 at 0.58" exceeds the 1/2" maximum tolerance (deviation: +0.205" from standard). Joints 1, 2, and 4 are within spec at 0.34-0.41". FAIL — recommend rework of row 3 mortar joint before proceeding."

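The arithmetic behind a verdict like the one above can be checked independently of the model. The sketch below assumes a 3/8" nominal thickness and the 1/2" maximum quoted in the example output; no minimum is stated in the source, so the lower bound is left optional. The function name and the sample joint values are hypothetical.

```python
def assess_joints(joints_in, nominal_in=0.375, max_in=0.5, min_in=None):
    """Flag joints outside tolerance and return (verdict, per-joint report)."""
    report = []
    for i, thickness in enumerate(joints_in, start=1):
        too_thick = thickness > max_in
        too_thin = min_in is not None and thickness < min_in
        report.append({
            "joint": i,
            "thickness_in": thickness,
            "deviation_in": round(thickness - nominal_in, 3),
            "ok": not (too_thick or too_thin),
        })
    verdict = "PASS" if all(r["ok"] for r in report) else "FAIL"
    return verdict, report


# Hypothetical values consistent with the example output above.
verdict, report = assess_joints([0.34, 0.41, 0.58, 0.38])
print(verdict)                                   # FAIL
print([r["joint"] for r in report if not r["ok"]])  # [3]
print(report[2]["deviation_in"])                 # 0.205
```

Running a deterministic check like this alongside the model gives a simple way to score whether the model's verdict matches the injected measurements.
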
Research Context

This work synthesizes three recent findings into a novel application:

Paper	Key Finding	How We Apply It
CAPTURE (ICCV 2025)	Coordinate context improves VLM spatial reasoning	We inject calibrated measurements as structured context
Chen et al. (ICML 2025)	VLMs under-attend to image tokens for spatial tasks	Our text-based measurements compensate for visual inattention
SpatialVLM (CVPR 2024)	VLMs lack metric distance estimation	Brick calibration provides the missing metric grounding

Novel contribution: We demonstrate that in-frame known-dimension objects (bricks) can serve as zero-cost calibration references, enabling metric spatial reasoning in VLMs without any additional sensors, training, or fine-tuning.

Limitations & Future Work

Detection accuracy depends on image quality and viewing angle
Currently assumes standard modular brick dimensions (could be extended to detect brick type)
Calibration is per-frame; temporal consistency across video is future work
Extending calibration to other construction elements with known dimensions (rebar, lumber, steel beams) is future work
API calls are spaced 15 seconds apart (4 requests per minute) to stay well within the Gemini free-tier limit of 10 RPM; a paid tier removes this constraint
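
A minimal sketch of the call spacing described above, assuming a simple minimum-interval limiter is enough. The class name is hypothetical, and the clock and sleep functions are injectable so the logic can be exercised without actually waiting.

```python
import time


class MinIntervalLimiter:
    """Ensure at least min_interval_s seconds between consecutive calls."""

    def __init__(self, min_interval_s=15.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until the minimum interval since the last call has elapsed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage (call_vlm is a hypothetical function wrapping the API request):
#
#   limiter = MinIntervalLimiter(15.0)
#   for frame in frames:
#       limiter.wait()
#       result = call_vlm(frame)
```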