Inspiration
We were inspired by a simple gap in today’s computer vision systems. Modern models can recognize objects in images, but they struggle to understand what is actually happening in the physical world over time. In domains like construction, logistics, and field operations, value comes from understanding activity, movement, delays, and coordination, not just detecting tools or people in a single frame.
The hackathon prompt around spatial intelligence and world models pushed us to think beyond frame-level perception. We wanted to build a system that reasons about how work unfolds in space and time and turns unstructured video into operational insight.
This led to the idea for SiteSense: a system that models real-world activity from egocentric video and explains where productivity is gained or lost.
What it does
We built SiteSense, an end-to-end spatial intelligence pipeline that ingests egocentric construction video and produces structured world models of activity. The system segments video into temporal states such as WORKING, TRANSIT, IDLE, and DOWNTIME, computes productivity metrics and rankings, and generates time-of-day heatmaps and daily reports.
Beyond coarse activity segmentation, SiteSense selectively applies a vision-language model to long idle periods to infer likely blockers such as waiting on equipment, coordination delays, or task setup issues. This refinement step allows the system to move from perception to semantic reasoning while remaining compute-efficient.
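As a rough illustration of the metrics and heatmap step, the sketch below computes a productivity ratio and per-hour working time from labeled segments. The metric definition (working time divided by total observed time), the `(state, start_s, end_s)` tuple layout, and both function names are assumptions made for this example, not the formulas SiteSense actually uses.

```python
from collections import defaultdict

def productivity_ratio(segments):
    """Share of observed time spent in the WORKING state (assumed metric)."""
    total = sum(end - start for _, start, end in segments)
    working = sum(end - start for state, start, end in segments if state == "WORKING")
    return working / total if total else 0.0

def hourly_working_seconds(segments):
    """Split WORKING segments across hour-of-day buckets.

    Times are seconds since midnight; a segment that crosses an hour
    boundary contributes to each hour it overlaps.
    """
    buckets = defaultdict(float)
    for state, start, end in segments:
        if state != "WORKING":
            continue
        t = start
        while t < end:
            hour = int(t // 3600)
            chunk_end = min(end, (hour + 1) * 3600)
            buckets[hour] += chunk_end - t
            t = chunk_end
    return dict(buckets)

# Toy day: 08:00-09:30 working, 09:30-10:00 idle, 10:00-10:15 working.
segments = [
    ("WORKING", 8 * 3600, 9 * 3600 + 1800),
    ("IDLE", 9 * 3600 + 1800, 10 * 3600),
    ("WORKING", 10 * 3600, 10 * 3600 + 900),
]
ratio = productivity_ratio(segments)
heatmap_row = hourly_working_seconds(segments)
```

A row like `heatmap_row` per worker or per day could then be stacked into a grid for a time-of-day heatmap.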
The final output is a dashboard and report that explain what happened on site, when productivity peaked, and what factors likely slowed work down.
How we built it
We designed the system as a staged, modular architecture to balance speed, stability, and semantic depth. Video is sampled into frames and encoded using zero-shot visual embeddings. These embeddings are compared against a small descriptor bank of activity states to produce coarse per-frame predictions. Motion features and temporal smoothing are then applied to produce stable activity segments.
Only selected segments, such as long idle or pause periods, are passed to a vision-language model for refined labeling and blocker analysis. This selective policy lets us allocate GPU compute where deeper understanding adds the most value instead of running expensive models on every frame.
Each stage is cached and independently testable, which allowed us to iterate quickly during the hackathon while keeping the full pipeline reproducible. The analytics and reporting layers then aggregate segment-level information into metrics, rankings, heatmaps, and daily summaries, which are visualized in Streamlit.
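The descriptor-bank comparison can be sketched as a nearest-descriptor lookup by cosine similarity. This is a minimal illustration under stated assumptions: the embeddings are toy 3-dimensional vectors standing in for real encoder outputs, and `cosine`, `classify_frame`, and the bank layout are hypothetical names rather than the project's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_frame(frame_emb, descriptor_bank):
    """Return (state, score) for the best-matching descriptor in the bank."""
    best_state, best_score = None, -2.0  # cosine is always >= -1
    for state, descriptors in descriptor_bank.items():
        score = max(cosine(frame_emb, d) for d in descriptors)
        if score > best_score:
            best_state, best_score = state, score
    return best_state, best_score

# Toy descriptor bank: one embedding per activity state (assumed values).
bank = {
    "WORKING": [[1.0, 0.0, 0.0]],
    "TRANSIT": [[0.0, 1.0, 0.0]],
    "IDLE": [[0.0, 0.0, 1.0]],
    "DOWNTIME": [[-1.0, 0.0, 0.0]],
}
frames = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
labels = [classify_frame(f, bank)[0] for f in frames]
```

In a real pipeline the bank would more likely hold several text or image descriptors per state, which is why each state maps to a list.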
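One way to express such a selective policy is a simple filter over segments by state and duration. The sketch below is an assumption-laden example: the `Segment` type, the choice of IDLE and DOWNTIME as eligible states, and the 60-second threshold are all hypothetical and are not taken from the SiteSense implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    state: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s

def select_for_refinement(segments, states=("IDLE", "DOWNTIME"), min_duration_s=60.0):
    """Keep only segments worth sending to the expensive model."""
    return [
        s for s in segments
        if s.state in states and s.duration_s >= min_duration_s
    ]

segments = [
    Segment("WORKING", 0, 600),
    Segment("IDLE", 600, 630),     # short pause: skipped
    Segment("IDLE", 900, 1100),    # long idle: selected
    Segment("TRANSIT", 1100, 1300),
]
selected = select_for_refinement(segments)
```

Only the segments in `selected` would be forwarded to the vision-language model, so its cost grows with the number of long pauses rather than with the number of frames.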
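Per-stage caching of this kind can be sketched as a small helper that keys each stage's output on its name and parameters. This is a hypothetical illustration, not the project's caching layer: `cached_stage`, the JSON-on-disk format, and the example video name are assumptions.

```python
import hashlib
import json
import os
import tempfile

def cached_stage(cache_dir, stage_name, params, compute):
    """Return the cached result for (stage_name, params), computing it once."""
    key_src = json.dumps({"stage": stage_name, "params": params}, sort_keys=True)
    key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()[:16]
    path = os.path.join(cache_dir, f"{stage_name}-{key}.json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    result = compute()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(result, fh)
    return result

calls = []

def fake_sampling():
    # Stand-in for an expensive stage such as frame sampling.
    calls.append(1)
    return {"frames": [0, 30, 60]}

cache_dir = tempfile.mkdtemp()
params = {"video": "site_a.mp4", "fps": 1}  # hypothetical inputs
first = cached_stage(cache_dir, "sample", params, fake_sampling)
second = cached_stage(cache_dir, "sample", params, fake_sampling)
```

Because the key depends on the parameters, changing a setting for one stage invalidates only that stage's entry, which is what makes stages cheap to rerun in isolation.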
Challenges we ran into
One major challenge was balancing semantic richness with runtime and GPU budget. Running a vision-language model on every frame was prohibitively slow and unstable for an end-to-end system. We addressed this by designing a selective refinement policy that only triggers deeper reasoning on long idle or ambiguous segments.
Another challenge was producing stable temporal segments from noisy frame-level predictions. Raw zero-shot outputs fluctuate significantly across frames, so we implemented temporal smoothing and segment aggregation to make the outputs usable for analytics and reporting.
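A common way to do this is a sliding-window majority vote followed by run-length merging of identical labels. The sketch below shows that approach as an assumption; the window size, tie-breaking rule, and function names are illustrative, and the smoothing SiteSense actually uses may differ.

```python
from collections import Counter

def majority_smooth(labels, window=3):
    """Replace each label with the majority label in a centered window.

    On ties, the frame keeps its original label if it is among the winners.
    """
    half = window // 2
    smoothed = []
    for i, label in enumerate(labels):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        counts = Counter(labels[lo:hi])
        top = max(counts.values())
        if counts[label] == top:
            smoothed.append(label)
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed

def to_segments(labels, seconds_per_frame=1.0):
    """Merge consecutive identical labels into (state, start_s, end_s) tuples."""
    segments = []
    for i, label in enumerate(labels):
        end = (i + 1) * seconds_per_frame
        if segments and segments[-1][0] == label:
            segments[-1][2] = end
        else:
            segments.append([label, i * seconds_per_frame, end])
    return [tuple(s) for s in segments]

raw = ["WORKING", "WORKING", "IDLE", "WORKING", "WORKING",
       "IDLE", "IDLE", "IDLE", "WORKING", "IDLE", "IDLE"]
smoothed = majority_smooth(raw, window=3)
segments = to_segments(smoothed)
```

In this toy sequence the two single-frame flips are absorbed, leaving one WORKING segment followed by one IDLE segment instead of six fragments.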
Finally, designing a modular pipeline under hackathon time constraints required careful tradeoffs between flexibility and simplicity. We focused on clear stage boundaries so individual components could be swapped or tuned without breaking the system.
Accomplishments that we're proud of
We are proud that we delivered a fully working end-to-end system rather than a single model or isolated demo. SiteSense runs on real egocentric video, produces stable temporal segments, and generates interpretable analytics and reports in a live dashboard.
We are especially proud of the selective refinement design, which allows the system to scale by allocating heavy vision-language reasoning only where it provides meaningful gains. This made the system practical to run under limited GPU budgets while still demonstrating semantic understanding.
We are also proud of the modular architecture and caching strategy, which allowed rapid iteration during the hackathon and makes the system easy to extend with new models, policies, or analytics without rewriting the pipeline.
What we learned
We learned that spatial intelligence is less about building a single powerful model and more about system design. Combining fast, deterministic perception with selective semantic reasoning produced a system that was both practical and expressive.
We also learned the importance of temporal modeling for real-world understanding. Many meaningful insights only emerge when actions are aggregated over time rather than analyzed frame by frame.
Most importantly, this project reinforced that world models for physical environments require tight integration between perception, temporal reasoning, and human-interpretable outputs. Turning raw video into operational insight is fundamentally a systems problem, not just a model problem.
What's next for SiteSense
In the near term, we plan to expand SiteSense from coarse activity states to richer task-level understanding by learning site-specific activity vocabularies and workflows. This would allow the system to distinguish between different types of productive work and different categories of idle time rather than treating all work or all delays as the same.
We also plan to improve spatial reasoning by incorporating lightweight 3D structure from egocentric video, enabling the system to reason about proximity to tools, materials, and work zones. This would allow SiteSense to move from describing what happened to explaining where and why certain behaviors occurred in the physical layout of a site.
On the systems side, we aim to move from offline analysis to near-real-time feedback, enabling live alerts for prolonged idle periods, safety risks, or coordination issues. This opens the door to closed-loop operational interventions rather than post hoc reporting.
Finally, we plan to generalize SiteSense beyond construction to logistics, manufacturing, inspections, and robotics, and to evaluate the system on new datasets to better understand its robustness across environments and camera perspectives.