Inspiration
- Today, construction sites collect hours of POV video footage daily, but supervisors still have to manually sort through it to extract productivity and safety data. We were inspired by one question: how can we automatically turn raw footage into a structured report for supervisors? AI can describe what it sees, but it struggles to convert that description into measurable, actionable, time-based insight. Our goal was to bridge that gap and transform passive footage into supervisor-ready reports.
What it does
- Moonshot analyzes real construction site activity using video from the worker’s point of view or fixed cameras. It automatically detects what happens on the jobsite, identifies important events, and evaluates them across safety, productivity, and quality. The system turns raw footage into structured timelines, scores, and insights so supervisors can quickly understand how work is actually being done, spot risks, and make better operational decisions without watching hours of video.
How we built it
- We built Moonshot by designing a step-by-step AI pipeline that turns raw worksite video into clear, structured insights. We aggregated state-of-the-art models—Cosmos-Reason2-8B for visual understanding and dense captioning, Llama 3.1 8B for summarization and reasoning, and NVIDIA’s EmbedQA and RerankQA for retrieval—to power AI-driven analysis through the NVIDIA Video Search and Summarization (VSS) architecture. Google Gemini then structures the output into actionable reports with safety, productivity, and quality metrics.
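The flow above can be sketched as a simple staged pipeline. This is an illustrative outline only; the function names and stubs are hypothetical stand-ins, not the actual VSS or model APIs:

```python
# Hypothetical sketch of Moonshot's pipeline stages. Each stub marks
# where a real model call would go; names are illustrative only.

def caption_clip(clip_path: str) -> str:
    """Stage 1: the VLM (Cosmos-Reason2-8B) produces a dense caption."""
    return f"dense caption for {clip_path}"  # stand-in for the model call

def summarize(captions: list[str]) -> str:
    """Stage 2: the LLM (Llama 3.1 8B) condenses captions into events."""
    return " | ".join(captions)  # stand-in for the model call

def build_report(summary: str) -> dict:
    """Stage 3: structure the summary into scored report fields."""
    return {"summary": summary, "safety": None,
            "productivity": None, "quality": None}

def analyze(clips: list[str]) -> dict:
    """Run all stages: caption each clip, summarize, then structure."""
    return build_report(summarize([caption_clip(c) for c in clips]))
```

Each stage only consumes the previous stage's text output, which is what lets the models be swapped or added incrementally.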
Challenges we ran into
- Our backend architecture required a large amount of VRAM on a single GPU for maximum performance. Our calculations came out to over 80 GB, so we had to work with what Vast.ai offered to find an instance that balanced cost and feasibility. Many issues came up while deploying the VSS architecture, but we successfully deployed our backend by incrementally adding the models and other features.
- File upload sizes were limited to 4.5 MB on Vercel, so we used UploadThing to handle file storage and forwarding to our backend.
- The VLM could only parse videos of limited duration and size (around 2–3 minutes at a time), so when testing with the provided videos and the data we found, we wrote a Python script that splits every video into 2-minute clips.
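The splitting step boils down to computing fixed-length clip boundaries and then cutting at each one. A minimal sketch of the boundary math (the helper name is ours, and the ffmpeg command in the comment is one common way to do the actual cut, not necessarily what our script used):

```python
# Compute (start, end) offsets in seconds for ~2-minute clips.
# Each pair could then be cut with ffmpeg, e.g.:
#   ffmpeg -ss {start} -to {end} -i input.mp4 -c copy clip_{i}.mp4

def clip_bounds(duration_s: float, clip_len_s: float = 120.0) -> list[tuple[float, float]]:
    """Split a total duration into consecutive clips of at most clip_len_s."""
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + clip_len_s, duration_s)))
        start += clip_len_s
    return bounds
```

For a 5-minute video this yields two full 2-minute clips plus a final 1-minute remainder, keeping every clip within the VLM's limit.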
Accomplishments that we're proud of
- Turning raw job site footage into structured and useful intelligence
- Successfully layering VLM + LLM reasoning into a coherent pipeline
- Designing a scalable system around model constraints rather than fighting them
- Delivering a clean supervisor-focused dashboard within hackathon constraints
What we learned
- AI is extremely useful for recognition, but struggles with structuring its findings.
- Aggregating and bringing together different models is powerful for solving complex, multifaceted problems.
What's next for Moonshot
- Currently, Moonshot simply lets a user upload footage and get it summarized and graded. A relatively simple yet major improvement would be supporting live footage streamed into the Moonshot database. Creating multiple worker profiles to compare worker performance is also needed if we want the product to work in a real workspace. Standardizing how information is collected from the video stream and how it's processed, and finding a more accurate way to calculate quality, productivity, and safety scores, will also make the data more useful for Moonshot's users. On the technical side, we plan to improve the prompts used throughout our pipeline and replace AI with deterministic code in certain sections.