Inspiration

The construction industry recorded 1,075 fatalities in 2023, many of them preventable. Current monitoring solutions are either prohibitively expensive, require dedicated manual oversight, or are limited to single-task detection (e.g., detecting only hard hats). We needed a scalable, automated solution that leverages existing camera infrastructure to provide comprehensive, end-of-shift safety analytics without human intervention.

What it does

An automated video analysis pipeline that ingests footage from fixed wall cameras and POV body-cams to output structured, per-worker safety reports. Core capabilities include:

  • PPE Detection: Tracks hard hats, vests, gloves, eyewear, and respirators with frame-accurate evidence.
  • Ergonomic Analysis: Calculates REBA-inspired joint angles using pose estimation to flag overreaching, awkward postures, and hazardous lifting.
  • Proximity & Behavior: Identifies restricted zone breaches and unsafe proximity to heavy machinery.
  • OSHA Mapping: Automatically maps detected violations to specific OSHA standards (e.g., 1926.501) with severity scoring.
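
The ergonomic analysis above rests on measuring angles between pose keypoints. A minimal sketch of that measurement, assuming 2D keypoints in the standard COCO ordering (5 = left shoulder, 7 = left elbow, 9 = left wrist); the function names are illustrative and not taken from the project code:

```python
import math

# COCO keypoint indices for the left arm (standard COCO-17 ordering).
L_SHOULDER, L_ELBOW, L_WRIST = 5, 7, 9


def joint_angle(a, b, c):
    """Angle in degrees at vertex b, formed by the segments b->a and b->c.

    Returns None when either segment has zero length (coincident points).
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0 or n2 == 0:
        return None
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def left_elbow_angle(keypoints):
    """keypoints: sequence of 17 (x, y) pairs in COCO order."""
    return joint_angle(
        keypoints[L_SHOULDER], keypoints[L_ELBOW], keypoints[L_WRIST]
    )
```

A fully extended arm gives an angle near 180°, and a right-angle bend gives 90°. Angles measured in the image plane are only an approximation of the true 3D joint angle, so any REBA-style score built on them inherits that limitation.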

How we built it

The system pairs a FastAPI backend with a Next.js 14/TypeScript frontend. It processes 60-second video chunks at 10 FPS (600 frames per chunk) through a shared three-stage pipeline:

  • Stage 1 (Detection & Tracking): Fine-tuned YOLO11 handles PPE, equipment, and scaffolding detection. YOLO Pose extracts 17 COCO keypoints per worker. BoT-SORT maintains persistent track IDs across frames.
  • Stage 2 (Refinement): SAM 3 converts bounding boxes into pixel-accurate segmentation masks. We use center-point containment logic and temporal smoothing to accurately associate specific PPE with the correct worker in crowded frames.
  • Stage 3 (Verification): To suppress VLM hallucinations, we built a 3-pass adversarial Chain-of-Thought protocol using a fine-tuned Qwen3-VL-8B-Instruct (with LoRA). Pass 1 acts as a blind baseline, Pass 2 evaluates annotated frames independently, and Pass 3 reconciles the outputs into structured JSON. The results are also gated against YOLO confidence scores: any detection scoring below 0.40 requires independent confirmation before it is reported.
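
The center-point containment and temporal smoothing in Stage 2 can be sketched as follows. This is an illustration under assumptions, not the project's implementation: masks are plain 2D lists of booleans indexed `[row][col]`, boxes are `(x1, y1, x2, y2)` in pixels, smoothing is a simple majority vote, and the names `owner_of`, `smooth_owner`, and `min_share` are hypothetical.

```python
from collections import Counter


def box_center(box):
    """Center (x, y) of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def owner_of(ppe_box, worker_masks):
    """Return the track ID whose mask contains the PPE box center, else None.

    worker_masks maps track ID -> 2D list of booleans (True = worker pixel).
    """
    cx, cy = box_center(ppe_box)
    col, row = int(cx), int(cy)
    for track_id, mask in worker_masks.items():
        in_bounds = 0 <= row < len(mask) and 0 <= col < len(mask[0])
        if in_bounds and mask[row][col]:
            return track_id
    return None


def smooth_owner(per_frame_owners, min_share=0.5):
    """Majority vote over a window of per-frame owners.

    Returns the most frequent track ID if it covers at least min_share
    of the frames in the window (including frames with no owner), else None.
    """
    votes = Counter(o for o in per_frame_owners if o is not None)
    if not votes:
        return None
    track_id, count = votes.most_common(1)[0]
    if count / len(per_frame_owners) >= min_share:
        return track_id
    return None
```

Testing the box center against a segmentation mask, rather than against another bounding box, is what separates two workers whose boxes overlap: the center of a hard hat falls on only one worker's pixels. The vote across frames then absorbs single-frame misses.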

Challenges we ran into

  • VLM Hallucinations: Single-pass prompting led to unacceptably high false-positive rates. We had to engineer the multi-pass adversarial architecture to constrain the VLM to factual, evidence-based outputs.
  • Spatial Attribution: Matching PPE to workers in dense crowds failed with basic bounding boxes. Implementing SAM 3 masks and center-point logic was required to fix misattribution.
  • Ergonomic Thresholding: Distinguishing between safe movement and hazardous posture required strict mathematical thresholding (e.g., > 48° trunk flexion) and a hard 0.65 keypoint confidence gate to filter out noise caused by baggy clothing or occlusions.
  • GPU Memory Management: Running YOLO, SAM 3, and an 8B VLM concurrently on 4x RTX PRO 6000 GPUs caused immediate OOM errors. We solved this via strict sequential inference processing, model pre-warming, and aggressive video chunking.
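
The ergonomic thresholding described above combines two checks: a keypoint confidence gate and an angle threshold. A minimal sketch, assuming COCO-ordered keypoints given as `(x, y, confidence)` in image coordinates (y grows downward) and trunk flexion approximated as the 2D angle between the hip-to-shoulder line and vertical. Only the 0.65 and 48° values come from the writeup; the function names and the 2D approximation are assumptions.

```python
import math

KEYPOINT_CONF_GATE = 0.65       # from the writeup
TRUNK_FLEXION_LIMIT_DEG = 48.0  # from the writeup

# COCO keypoint indices for shoulders and hips.
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12


def trunk_flexion_deg(keypoints):
    """Approximate trunk flexion in degrees, or None if it cannot be trusted.

    keypoints: sequence of 17 (x, y, conf) tuples in COCO order.
    """
    needed = [keypoints[i] for i in (L_SHOULDER, R_SHOULDER, L_HIP, R_HIP)]
    # Gate: skip the frame if any required keypoint is low-confidence.
    if any(conf < KEYPOINT_CONF_GATE for _, _, conf in needed):
        return None
    shoulder_x = (needed[0][0] + needed[1][0]) / 2.0
    shoulder_y = (needed[0][1] + needed[1][1]) / 2.0
    hip_x = (needed[2][0] + needed[3][0]) / 2.0
    hip_y = (needed[2][1] + needed[3][1]) / 2.0
    dx = shoulder_x - hip_x
    dy = hip_y - shoulder_y  # positive when shoulders are above hips
    if dx == 0 and dy == 0:
        return None  # degenerate pose: shoulders and hips coincide
    # 0 degrees = upright, 90 degrees = trunk horizontal.
    return math.degrees(math.atan2(abs(dx), dy))


def is_hazardous_posture(keypoints):
    """True only when the angle is trusted and exceeds the limit."""
    angle = trunk_flexion_deg(keypoints)
    return angle is not None and angle > TRUNK_FLEXION_LIMIT_DEG
```

Returning `None` for gated frames, rather than treating them as safe or unsafe, keeps noisy detections out of the statistics entirely. A downstream step would still need to require the threshold to hold over several consecutive frames before raising a flag.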

Accomplishments that we're proud of

  • Neutralizing VLM Hallucinations: Used our adversarial verification loop to show that a VLM can produce reliable output in a safety-critical workflow.
  • Actionable Metrics: Translating raw YOLO pose keypoints into actionable occupational health metrics (REBA scores) that a safety officer can actually use.
  • Production-Ready Product: Building a complete product, from raw dual-camera video ingestion to a fully functional React dashboard, rather than just a Jupyter notebook proof-of-concept.

What we learned

  • Engineering over Prompting: You cannot prompt away VLM hallucinations; they must be engineered out through architectural constraints like multi-pass reconciliation.
  • OSHA Compatibility: OSHA regulations are surprisingly machine-friendly. Their violation taxonomy maps cleanly to computer vision tasks (object detection, spatial reasoning, temporal analysis).
  • Practical Processing: Asynchronous, end-of-shift processing is significantly more practical for heavy ML pipelines than attempting real-time edge compute, providing higher accuracy without a six-figure infrastructure bill.

What's next for Construction Site Safety Intelligence Dashboard

  • Advanced Worker Re-identification: Implementing appearance embeddings and cosine similarity matching to maintain persistent worker IDs through long visual occlusions.
  • Spatial Analytics: Generating site occupancy heatmaps using wall-cam coordinate data to optimize site layouts and proactively identify hazard zones.
  • Real-Time Processing: Adding RTSP/WebRTC ingestion to support live monitoring alongside the existing batch analysis.
  • Enterprise Integrations: Building export pipelines to push events directly into industry-standard EHS platforms like Procore and Autodesk Construction Cloud.
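
The planned re-identification step could look roughly like the sketch below. Since this is future work, everything here is an assumption: embeddings are plain lists of floats from some appearance model, the gallery maps worker IDs to a reference embedding, and the 0.8 threshold is a placeholder rather than a tuned value.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_worker(embedding, gallery, threshold=0.8):
    """Return the gallery ID most similar to embedding, or None.

    A match is accepted only if its similarity reaches the threshold;
    otherwise the caller should treat the detection as a new worker.
    """
    best_id, best_sim = None, threshold
    for worker_id, reference in gallery.items():
        sim = cosine_similarity(embedding, reference)
        if sim >= best_sim:
            best_id, best_sim = worker_id, sim
    return best_id
```

When a track is lost to occlusion and a new one appears, the new track's embedding would be compared against the gallery; a match reattaches the old worker ID, and a miss registers a new worker.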
