vima.

Inspiration

frontier VLMs can look at a construction site, but they still struggle to understand what happened in space over time.

we tested this on real hardhat/bodycam-style masonry footage. a raw model can usually say "worker," "scaffold," or "concrete block wall." but when you ask questions like:

was masonry work happening near the wall?
what changed between these frames?
was the worker productive, contributory, or non-contributory?
where did activity happen in the site?

the answers get shaky fast. the model guesses, loses temporal context, and has no persistent memory of what it just saw.

VIMA is our attempt to fix that without training a new foundation model. instead of asking a VLM to reason from raw video alone, we build an auditable spatial evidence layer first.

What it does

VIMA turns egocentric construction footage into structured spatial memory.

the system processes construction video into:

frame-level CII productivity labels: productive, contributory, non-contributory
object detections from construction frames
optional semantic boxes from Gemini Robotics-ER
merged object boxes
SAM-style masks
depth estimates
spatial zones from COLMAP-style camera pose clustering
event episodes with timestamps, evidence frames, tracks, depth facts, and retrieval text
cited answers from retrieved memory
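
the spatial zones item above can be pictured as grouping camera centers that sit close together. this is only a rough sketch under stated assumptions, not the repo's actual code: camera centers are assumed to come out of a COLMAP-style reconstruction as (x, y, z) points, and the `cluster_zones` name, the greedy strategy, and the 2.0 radius are all hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # camera center (x, y, z) from reconstructed poses


def cluster_zones(centers: List[Point], radius: float = 2.0) -> List[int]:
    """Assign each camera center to the first zone whose centroid is within `radius`."""
    centroids: List[List[float]] = []
    counts: List[int] = []
    labels: List[int] = []
    for p in centers:
        for zone, c in enumerate(centroids):
            if math.dist(p, c) <= radius:
                n = counts[zone]
                # update the running mean of the zone centroid
                centroids[zone] = [(ci * n + pi) / (n + 1) for ci, pi in zip(c, p)]
                counts[zone] = n + 1
                labels.append(zone)
                break
        else:
            # no existing zone is close enough, so start a new one
            centroids.append(list(p))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels
```

each returned label is a zone index per frame, which is enough to attribute frame-level events to a place on site.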

in the current demo, VIMA analyzes real masonry footage and exposes the evidence through a hosted dashboard, API, CLI, and MCP server.

headline demo results:

30 sampled masonry frames
26 productive frames
4 non-contributory frames
86.7% wrench time
0.939 mean confidence on productive frames
118 temporal episodes in the frontend evidence workspace
VLM-only spatial score: 0.600
VLM + memory spatial score: 0.792
+33.2% improvement across 5 spatial questions

How we built it

the core idea is evidence first, answer second.

hardhat footage
  -> sampled frames
  -> object boxes
  -> optional Gemini Robotics-ER boxes
  -> box merge
  -> masks
  -> depth
  -> object-event episodic memory
  -> retrieved evidence
  -> cited VLM answer
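
the box merge step is not spelled out here, so the following is only a sketch of one common approach: treat detector boxes and Gemini Robotics-ER boxes as (x1, y1, x2, y2) tuples, keep every detector box, and add a semantic box only when it does not overlap a kept box too much. the `iou` and `merge_boxes` names and the 0.5 threshold are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(detector: List[Box], semantic: List[Box], thresh: float = 0.5) -> List[Box]:
    """Keep all detector boxes, then add semantic boxes that are not near-duplicates."""
    merged = list(detector)
    for box in semantic:
        if all(iou(box, kept) < thresh for kept in merged):
            merged.append(box)
    return merged
```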

we built several layers:

1. backend spatial API

the backend is a FastAPI service with endpoints for health checks, frame analysis, CII summaries, frame-level evidence, spatial zones, and temporal evals.

important endpoints include:

GET /health
POST /analyze/frame
GET /cii/summary
GET /cii/frames
GET /spatial/zones
GET /eval
POST /temporal/run
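
as a hedged illustration of how a client might address these endpoints, the sketch below builds (but does not send) requests with the Python standard library. the base URL is a placeholder for a locally running backend, and the `frame_id` payload field is hypothetical, since the request schema is not described here.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, method: str, path: str,
                  payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request for one of the endpoints listed above."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


# placeholder base URL for a local backend, not the hosted address
BASE = "http://localhost:8000"
health = build_request(BASE, "GET", "/health")
summary = build_request(BASE, "GET", "/cii/summary")
# the payload field below is hypothetical; the real request schema may differ
analyze = build_request(BASE, "POST", "/analyze/frame", {"frame_id": "frame_0001"})

# to actually send one: urllib.request.urlopen(health, timeout=10)
```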

2. CII productivity classification

VIMA classifies each sampled frame into Construction Industry Institute (CII) style categories:

P: productive
C: contributory
NC: non-contributory

the demo run produced 26 productive frames out of 30, which gives 86.7% wrench time.
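
the arithmetic is just the share of frames labeled productive. a minimal sketch, assuming labels arrive as the strings "P", "C", and "NC" (the `wrench_time` helper is a name made up for this example):

```python
from collections import Counter
from typing import Iterable


def wrench_time(labels: Iterable[str]) -> float:
    """Percent of frames labeled productive ("P")."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    return 100.0 * counts["P"] / total


labels = ["P"] * 26 + ["NC"] * 4      # the demo split reported above
print(f"{wrench_time(labels):.1f}%")  # 86.7%
```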

3. spatial memory pipeline

we use object boxes, masks, and depth estimates to build object-event memory. the checked-in demo memory includes 10 frame-based episodes and 35 frame events, including:

masonry work candidates
scaffold zone visibility
safety edge context
foreground worker presence
material staging visibility
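
to make the memory structure concrete, here is a rough sketch of how frame events could be grouped into episodes. it is not the checked-in schema: the field names, the `build_episodes` helper, the 2-second gap, and the "scaffold_zone_visible" label used in the usage example are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FrameEvent:
    frame_idx: int
    timestamp_s: float
    event_type: str   # e.g. "masonry_work_candidate"
    depth_m: float    # rough depth of the object involved


@dataclass
class Episode:
    event_type: str
    start_s: float
    end_s: float
    evidence_frames: List[int] = field(default_factory=list)
    depth_facts: List[float] = field(default_factory=list)

    @property
    def retrieval_text(self) -> str:
        return (f"{self.event_type} from {self.start_s:.1f}s to {self.end_s:.1f}s, "
                f"frames {self.evidence_frames}")


def build_episodes(events: List[FrameEvent], max_gap_s: float = 2.0) -> List[Episode]:
    """Group same-type events that are close in time into one episode."""
    episodes: List[Episode] = []
    open_by_type: Dict[str, Episode] = {}
    for ev in sorted(events, key=lambda e: e.timestamp_s):
        ep = open_by_type.get(ev.event_type)
        if ep is None or ev.timestamp_s - ep.end_s > max_gap_s:
            # start a new episode when there is no open one or the gap is too long
            ep = Episode(ev.event_type, ev.timestamp_s, ev.timestamp_s)
            episodes.append(ep)
            open_by_type[ev.event_type] = ep
        ep.end_s = ev.timestamp_s
        ep.evidence_frames.append(ev.frame_idx)
        ep.depth_facts.append(ev.depth_m)
    return episodes
```

keeping the evidence frames on each episode is what lets a later answer cite specific frames instead of a bare label.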

4. temporal reasoning eval

we compare raw single-frame reasoning against memory-augmented reasoning. the memory version gives the model before/after context, evidence frames, and structured event claims instead of asking it to infer changes from one frame at a time.

our checked-in eval shows:

VLM-only:      0.600
VLM + memory: 0.792
improvement:  +33.2%
questions improved: 5 / 5
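
for reference, a relative improvement over a baseline score is usually computed as below (the `relative_improvement` helper is a name made up for this example). applied to the rounded aggregate scores above, this formula gives about +32.0%, so the reported +33.2% presumably comes from unrounded or per-question scores; that reading is an assumption.

```python
def relative_improvement(baseline: float, treated: float) -> float:
    """Percent change of the treated score over the baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return 100.0 * (treated - baseline) / baseline


print(f"{relative_improvement(0.600, 0.792):+.1f}%")  # +32.0%
```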

5. frontend and agent interfaces

we built a Next.js frontend with:

landing page
live demo workspace
frame ledger
CII stats
temporal proof page
review interface
depth / segmentation / reconstruction visuals

we also built a portable vima-agent CLI and MCP server so other agents can inspect the same evidence programmatically.

6. mobile field companion

we also built a mobile-first companion app for the people actually closest to the jobsite.

the mobile flow turns VIMA's scene ledger into quick verification cards. a worker or supervisor can swipe through claims like "masonry work continues along the open wall edge" or "worker at elevated masonry edge with no fall protection," then confirm, reject, or skip each claim.

we connected that flow to a lightweight Solana incentive layer: verified work and claim review can earn raffle progress, and the prototype includes SOL reward states plus a payout queue for wallet addresses. the intent is to make verified field feedback feel immediate while keeping the actual evidence trail tied back to VIMA's spatial ledger.

the prototype includes:

swipe-based claim review
xp, levels, and streaks for completed verification
raffle unlocks after repeated verified claims
a lottery wheel for xp or SOL rewards
payout queue flow for a wallet address
Capacitor config so the web prototype can become a native mobile shell

the point is not to make construction work feel like a game for its own sake. the point is to close the loop: VIMA produces spatial claims from video, and the mobile app gives field teams a fast way to verify or dispute those claims while keeping the audit trail alive.

What makes it different

most construction vision demos stop at detection or classification.

VIMA is focused on auditable spatial memory.

instead of saying:

"the worker is doing masonry."

VIMA can point to the supporting episodes, frames, tracks, and spatial facts that led to the answer.

for example, when asked:

was there masonry work happening near the wall?

the memory-backed Gemini answer retrieved 4 relevant episodes and cited specific frame evidence showing workers near a concrete block wall. it also preserved uncertainty by calling them "masonry_work_candidate" events instead of overstating the claim.
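
the retrieval method is not described here, so the following is only a sketch of the general pattern: score each episode's retrieval text against the question, keep the top few, and hand the model citation lines instead of raw frames. the word-overlap scoring, the dictionary keys, and the sample episodes are assumptions for illustration, not the repo's actual retriever or data.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, episodes: List[Dict], k: int = 4) -> List[Dict]:
    """Rank episodes by word overlap with the question; keep the top k nonzero hits."""
    q = tokens(question)
    scored = [(len(q & tokens(ep["retrieval_text"])), ep) for ep in episodes]
    scored = [(s, ep) for s, ep in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ep for _, ep in scored[:k]]


def format_citations(hits: List[Dict]) -> str:
    """Render retrieved episodes as citation lines to place in the VLM prompt."""
    return "\n".join(
        f"[{ep['id']}] {ep['event_type']} frames={ep['frames']}" for ep in hits
    )


# hypothetical sample memory
episodes = [
    {"id": "ep_01", "event_type": "masonry_work_candidate", "frames": [3, 4],
     "retrieval_text": "worker near concrete block wall, masonry work candidate"},
    {"id": "ep_02", "event_type": "material_staging_visible", "frames": [7],
     "retrieval_text": "pallet of blocks staged on ground"},
]
hits = retrieve("was there masonry work happening near the wall?", episodes)
print(format_citations(hits))  # [ep_01] masonry_work_candidate frames=[3, 4]
```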

that evidence trail is the whole point. construction teams do not just need a confident answer. they need a checkable one.

Challenges we ran into

construction footage is messy as hell.

the camera moves constantly, objects are partially visible, lighting changes, workers block each other, and the same wall can appear from several different angles. raw frame-level VLM answers were often plausible but ungrounded.

we also had to decide what story mattered most. the repo has several working pieces: CII productivity, spatial zones, temporal memory, masks, depth, agent tooling, and a Solana-based reward concept. the hard part was turning that into one coherent product:

VIMA builds spatial memory from hardhat footage, then uses that memory to make construction video answers auditable.

another challenge was making the demo practical. some local model paths, like Qwen-VL, were not reliable enough to depend on for the live flow, so we focused the demo around the artifacts and services that worked consistently.

Accomplishments that we're proud of

we got a real end-to-end system working across backend, frontend, evaluation, and agent surfaces.

we are proud of:

the hosted API and dashboard at vimaspatial.tech
30-frame masonry CII evidence with 86.7% wrench time
temporal memory with 118 frontend episodes
an A/B eval showing +33.2% improvement from memory augmentation
object-event memory with masks, depth, tracks, relations, and cited retrieval
a CLI and MCP server so agents can query the same evidence as humans
a demo that shows not just model output, but the evidence behind it

What we learned

the biggest lesson is that the model is not always the missing piece. sometimes the missing piece is memory.

VLMs are strong at recognizing objects, but construction intelligence needs more than object recognition. it needs temporal state, spatial grounding, and evidence that can be inspected later.

we also learned that uncertainty is valuable. a system that says "this looks like masonry work, and here are the frames that support that claim" is more useful than a system that confidently invents progress.

What's next for vima

next, we would extend VIMA into a fuller construction intelligence layer:

run the robotics box pipeline across many frames, not just selected frames
use registered COLMAP camera poses for production zone attribution
improve multi-visit change detection across different walkthroughs
add stronger object tracking over longer videos
expand the eval set beyond the current 5-question A/B comparison
make the dashboard a full evidence browser for frames, boxes, masks, depth, episodes, and final answers
connect VIMA to construction workflows for progress tracking, safety review, and productivity reporting