Inspiration
The construction industry recorded 1,075 preventable fatalities in 2023. Current monitoring solutions are prohibitively expensive, require dedicated manual oversight, or are limited to single-task detection (e.g., detecting only hard hats). We needed a scalable, automated solution that leverages existing camera infrastructure to provide comprehensive, end-of-shift safety analytics without human intervention.
What it does
The project is an automated video analysis pipeline that ingests footage from fixed wall cameras and POV body-cams and outputs structured, per-worker safety reports. Core capabilities include:
- PPE Detection: Tracks hard hats, vests, gloves, eyewear, and respirators with frame-accurate evidence.
- Ergonomic Analysis: Calculates REBA-inspired joint angles using pose estimation to flag overreaching, awkward postures, and hazardous lifting.
- Proximity & Behavior: Identifies restricted zone breaches and unsafe proximity to heavy machinery.
- OSHA Mapping: Automatically maps detected violations to specific OSHA standards (e.g., 1926.501) with severity scoring.
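As a rough illustration of the ergonomic analysis capability, the sketch below estimates trunk flexion from COCO keypoints. It is not the project's code. It assumes keypoints arrive as (x, y, confidence) tuples in image coordinates and approximates flexion as the 2D angle between the hip-to-shoulder line and vertical, which ignores camera perspective. The function names are hypothetical; the 0.65 confidence gate and 48° threshold are the values quoted under "Challenges we ran into".

```python
import math

# Indices in the standard 17-keypoint COCO layout.
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12

MIN_KPT_CONF = 0.65       # keypoint confidence gate quoted in the write-up
TRUNK_FLEXION_DEG = 48.0  # trunk flexion threshold quoted in the write-up


def trunk_flexion_deg(keypoints):
    """Return the 2D trunk angle from vertical in degrees, or None if unusable.

    keypoints: sequence of 17 (x, y, conf) tuples; image y grows downward.
    """
    needed = [keypoints[i] for i in (L_SHOULDER, R_SHOULDER, L_HIP, R_HIP)]
    # Skip the frame if any torso keypoint is below the confidence gate.
    if any(conf < MIN_KPT_CONF for _, _, conf in needed):
        return None
    (lsx, lsy, _), (rsx, rsy, _), (lhx, lhy, _), (rhx, rhy, _) = needed
    # Vector from the hip midpoint to the shoulder midpoint.
    dx = (lsx + rsx) / 2 - (lhx + rhx) / 2
    dy = (lsy + rsy) / 2 - (lhy + rhy) / 2
    if dx == 0 and dy == 0:
        return None  # degenerate pose: shoulders and hips coincide
    # "Up" in image space is (0, -1), so measure the angle against -dy.
    return math.degrees(math.atan2(abs(dx), -dy))


def is_hazardous_trunk(keypoints):
    """Flag a frame when the trunk angle exceeds the flexion threshold."""
    angle = trunk_flexion_deg(keypoints)
    return angle is not None and angle > TRUNK_FLEXION_DEG
```

A full REBA-style score would combine several such angles (neck, legs, upper arms); this sketch covers only the trunk term.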
How we built it
The system pairs a FastAPI backend with a Next.js 14/TypeScript frontend and processes 60-second video chunks at 10 FPS through a shared three-stage pipeline:
- Stage 1 (Detection & Tracking): Fine-tuned YOLO11 handles PPE, equipment, and scaffolding detection. YOLO Pose extracts 17 COCO keypoints per worker. BoT-SORT maintains persistent track IDs across frames.
- Stage 2 (Refinement): SAM 3 converts bounding boxes into pixel-accurate segmentation masks. We use center-point containment logic and temporal smoothing to accurately associate specific PPE with the correct worker in crowded frames.
- Stage 3 (Verification): To eliminate VLM hallucinations, we built a 3-pass adversarial Chain-of-Thought protocol using Qwen3-VL-8B-Instruct fine-tuned with LoRA. Pass 1 acts as a blind baseline, Pass 2 evaluates annotated frames independently, and Pass 3 reconciles the two outputs into structured JSON. As a confidence gate, any detection with a YOLO score below 0.40 requires independent confirmation before it is reported.
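The gating logic in Stage 3 can be sketched as follows. This is a deterministic stand-in for the gate only, not the VLM reconciliation pass itself, and the acceptance rule is an assumption: a detection at or above the 0.40 YOLO threshold is kept if the annotated pass confirms it, while a detection below the threshold must also have been reported by the blind pass. The `Finding` fields and the `reconcile` function are hypothetical names.

```python
import json
from dataclasses import dataclass, asdict

YOLO_GATE = 0.40  # confidence threshold quoted in the write-up


@dataclass
class Finding:
    track_id: int
    violation: str
    yolo_conf: float
    blind_pass: bool      # Pass 1: VLM reported it from the raw frame
    annotated_pass: bool  # Pass 2: VLM confirmed it on the annotated frame


def reconcile(findings):
    """Keep only findings whose evidence clears the (assumed) gate rule."""
    kept = []
    for f in findings:
        if f.yolo_conf < YOLO_GATE:
            # Weak detector evidence: require both VLM passes to agree.
            confirmed = f.blind_pass and f.annotated_pass
        else:
            confirmed = f.annotated_pass
        if confirmed:
            kept.append(asdict(f))
    return json.dumps({"violations": kept}, indent=2)
```

Emitting JSON from a fixed dataclass keeps the report schema stable regardless of how the upstream passes phrase their answers.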
Challenges we ran into
- VLM Hallucinations: Single-pass prompting led to unacceptably high false-positive rates. We had to engineer the multi-pass adversarial architecture to constrain the VLM to factual, evidence-based outputs.
- Spatial Attribution: Matching PPE to workers in dense crowds failed with basic bounding boxes. Implementing SAM 3 masks and center-point logic was required to fix misattribution.
- Ergonomic Thresholding: Distinguishing between safe movement and hazardous posture required strict mathematical thresholding (e.g., > 48° trunk flexion) and a hard 0.65 keypoint confidence gate to filter out noise caused by baggy clothing or occlusions.
- GPU Memory Management: Running YOLO, SAM 3, and an 8B VLM concurrently on 4x RTX PRO 6000 GPUs caused immediate OOM errors. We solved this via strict sequential inference processing, model pre-warming, and aggressive video chunking.
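To make the spatial attribution fix concrete, here is a minimal sketch of center-point containment with temporal smoothing. It assumes each worker mask is a 2D grid of 0/1 values indexed as `mask[y][x]` and that PPE detections are `(x1, y1, x2, y2)` boxes. The function names, the first-match rule for overlapping masks, and the 5-frame majority-vote window are illustrative assumptions, not details from the project.

```python
from collections import Counter


def box_center(box):
    """Return the (x, y) centre of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def owner_of(ppe_box, worker_masks):
    """Return the track ID whose mask contains the PPE box centre, else None.

    worker_masks: dict mapping track_id -> 2D list of 0/1 rows (mask[y][x]).
    If masks overlap at the centre pixel, the first match wins.
    """
    cx, cy = box_center(ppe_box)
    x, y = int(cx), int(cy)
    for track_id, mask in worker_masks.items():
        if 0 <= y < len(mask) and 0 <= x < len(mask[y]) and mask[y][x]:
            return track_id
    return None


def smooth(assignments, window=5):
    """Majority-vote each frame's owner over the trailing `window` frames."""
    smoothed = []
    for i in range(len(assignments)):
        start = max(0, i - window + 1)
        votes = Counter(a for a in assignments[start:i + 1] if a is not None)
        smoothed.append(votes.most_common(1)[0][0] if votes else None)
    return smoothed
```

Testing the centre pixel against a mask rather than a bounding box is what avoids misattribution when two workers' boxes overlap but their silhouettes do not.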
Accomplishments that we're proud of
- Neutralizing VLM Hallucinations: Successfully used our adversarial verification loop to prove that VLMs can be used reliably in safety-critical workflows.
- Actionable Metrics: Translating raw YOLO pose keypoints into actionable occupational health metrics (REBA scores) that a safety officer can actually use.
- Production-Ready Product: Building a complete product, from raw dual-camera video ingestion to a fully functional React dashboard, rather than just a Jupyter notebook proof-of-concept.
What we learned
- Engineering over Prompting: You cannot prompt away VLM hallucinations; they must be engineered out through architectural constraints like multi-agent reconciliation.
- OSHA Compatibility: OSHA regulations are surprisingly machine-friendly. Their violation taxonomy maps cleanly to computer vision tasks (object detection, spatial reasoning, temporal analysis).
- Practical Processing: Asynchronous, end-of-shift processing is significantly more practical for heavy ML pipelines than attempting real-time edge compute, providing higher accuracy without a six-figure infrastructure bill.
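The clean mapping from detections to OSHA standards can be illustrated with a simple lookup table. This is a sketch under stated assumptions: the detection labels, the 1-to-5 severity values, and the `to_report_entry` helper are hypothetical, and only the standard numbers and titles come from 29 CFR 1926. The timestamp conversion uses the 10 FPS sampling rate mentioned earlier.

```python
from dataclasses import dataclass

FPS = 10  # sampling rate quoted in the write-up


@dataclass(frozen=True)
class OshaRule:
    standard: str
    title: str
    severity: int  # hypothetical 1 (low) to 5 (critical) scale


# Hypothetical detection labels mapped to 29 CFR 1926 standards.
OSHA_RULES = {
    "missing_hard_hat": OshaRule("1926.100", "Head protection", 4),
    "missing_eyewear": OshaRule("1926.102", "Eye and face protection", 3),
    "missing_respirator": OshaRule("1926.103", "Respiratory protection", 3),
    "unprotected_edge": OshaRule("1926.501", "Duty to have fall protection", 5),
}


def to_report_entry(label, frame_index):
    """Build one report row; unmapped labels are kept with no citation."""
    rule = OSHA_RULES.get(label)
    return {
        "label": label,
        "timestamp_s": frame_index / FPS,
        "standard": rule.standard if rule else None,
        "title": rule.title if rule else None,
        "severity": rule.severity if rule else None,
    }
```

Keeping unmapped labels in the output, rather than dropping them, lets a safety officer see detections that still need a manual citation.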
What's next for Construction Site Safety Intelligence Dashboard
- Advanced Worker Re-identification: Implementing appearance embeddings and cosine similarity matching to maintain persistent worker IDs through long visual occlusions.
- Spatial Analytics: Generating site occupancy heatmaps using wall-cam coordinate data to optimize site layouts and proactively identify hazard zones.
- Real-Time Processing: Adding RTSP/WebRTC ingestion to support live monitoring alongside the existing batch analysis.
- Enterprise Integrations: Building export pipelines to push events directly into industry-standard EHS platforms like Procore and Autodesk Construction Cloud.
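The planned re-identification step could look roughly like the sketch below, which matches a new appearance embedding against a gallery of known workers by cosine similarity. Nothing here is implemented in the project yet: the embedding model is out of scope, the function names are hypothetical, and the 0.8 acceptance threshold is an arbitrary placeholder that would need tuning on real footage.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_worker(embedding, gallery, threshold=0.8):
    """Return the gallery ID most similar to `embedding`, or None.

    gallery: dict mapping worker_id -> reference embedding.
    A match is accepted only if its similarity reaches `threshold`.
    """
    best_id, best_score = None, threshold
    for worker_id, reference in gallery.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_id, best_score = worker_id, score
    return best_id
```

Returning None below the threshold lets the tracker open a new identity instead of forcing a weak match after a long occlusion.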

