Inspiration

We were inspired by the gap between drone footage and actionable city maintenance. Cities and contractors fly drones over roads and infrastructure, but turning that video into repair tickets, including location, severity, and context, is still manual and slow. We wanted a system that could automatically detect issues from drone video, attach GPS and context, and surface them on a map-based dashboard so teams can prioritize and act. The vision was to fly once and get a live map of issues and tickets without training custom detectors by using foundation models and a simple, explainable pipeline.

What it does

Kavi is an autonomous city repair detection system in two parts:

Detection pipeline (Python)

You give it a video file or a live stream, such as from a DJI drone via RTMP. It runs frames through Meta’s SAM3 (Segment Anything Model 3) with text prompts like "pothole" and "road damage," so we get instance segmentation, consisting of masks and boxes, and confidence scores without training. The pipeline supports GPS telemetry in SRT, CSV, or JSON formats from the drone, frame-level deduplication, and optional Google Gemini analysis to turn a cropped frame into a structured ticket including severity, effort, safety and traffic impact, and a reverse-geocoded address. Those tickets can be written to Supabase for database storage and frame hosting.
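The frame-level deduplication mentioned above can be sketched as a simple IoU check against recently seen boxes. This is an illustrative sketch, not the exact production logic; the 0.5 threshold and the (x1, y1, x2, y2) box format are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def deduplicate(detections, seen, threshold=0.5):
    """Keep only detections whose box does not overlap any previously seen
    box by more than `threshold` IoU; record kept boxes in `seen`."""
    kept = []
    for box in detections:
        if all(iou(box, s) < threshold for s in seen):
            kept.append(box)
            seen.append(box)
    return kept
```

In the real pipeline the comparison would also account for camera motion between frames; the point here is only the overlap test.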

Dashboard (Next.js)

A map-centric UI using Leaflet and a dark theme with "liquid glass" panels shows detection markers and allows filtering by issue type, severity, and status. It features a timeline for time-range filtering and a Live tab that displays the DJI HLS stream, proxied through the app, when the RTMP server is running. The dashboard reads from Supabase, so detections and tickets appear in real time as the pipeline runs.

The end-to-end process is: fly the drone (or load a recorded video), let SAM3 segment issues frame by frame, attach GPS from the interpolated telemetry, let Gemini turn each detection crop into a structured ticket, and watch the tickets appear on the dashboard map.

How we built it

Model stack: We use Hugging Face Transformers with SAM3 (facebook/sam3) for text-prompt instance segmentation. Frames are preprocessed with resizing and CLAHE (Contrast Limited Adaptive Histogram Equalization) for contrast, then sent to SAM3 where we filter by confidence and simple size or aspect rules. For video files, we also support SAM3’s native video-tracking API to track the same pothole across frames. Telemetry is interpolated by frame index so every detection gets a latitude and longitude when telemetry is available.
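The telemetry interpolation step above can be sketched as plain linear interpolation between sparse GPS samples keyed by frame index. A minimal sketch, assuming samples arrive as sorted (frame_index, lat, lon) tuples (the actual telemetry record layout may differ):

```python
from bisect import bisect_left

def interpolate_gps(samples, frame_idx):
    """Linearly interpolate a (lat, lon) pair for a frame index from sparse
    telemetry samples, given as a sorted list of (frame_index, lat, lon).
    Frames before the first or after the last sample clamp to the endpoints."""
    frames = [s[0] for s in samples]
    if frame_idx <= frames[0]:
        return samples[0][1], samples[0][2]
    if frame_idx >= frames[-1]:
        return samples[-1][1], samples[-1][2]
    i = bisect_left(frames, frame_idx)
    f0, lat0, lon0 = samples[i - 1]
    f1, lat1, lon1 = samples[i]
    t = (frame_idx - f0) / (f1 - f0)
    return lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0)
```

Linear interpolation is a reasonable fit here because drone GPS samples arrive far more often than the position changes meaningfully between them.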

Ticket creation: When we have a detection crop and GPS, we optionally call Gemini Vision to get issue type, severity, description, effort estimate, and risk. We reverse-geocode using Nominatim for street name and district, then upsert into Supabase with a tickets table and frame images in storage. The dashboard subscribes to Supabase and updates the map and sidebar.
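The Nominatim reverse-geocoding step can be sketched as a URL build plus a lookup into the response's address object. The endpoint and `format=jsonv2` parameter are from Nominatim's public API; the exact address keys (`road`, `suburb`, and their fallbacks) vary by region, so treat them as assumptions:

```python
from urllib.parse import urlencode

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"

def reverse_geocode_url(lat: float, lon: float) -> str:
    """Build a Nominatim reverse-geocoding request URL."""
    return f"{NOMINATIM_REVERSE}?{urlencode({'lat': lat, 'lon': lon, 'format': 'jsonv2'})}"

def extract_street_and_district(payload: dict):
    """Pull street and district from a Nominatim response payload. Key names
    like 'road' and 'suburb' differ by region, so fall back to empty strings."""
    addr = payload.get("address", {})
    street = addr.get("road") or addr.get("pedestrian") or ""
    district = addr.get("suburb") or addr.get("city_district") or ""
    return street, district
```

Fetching the URL and passing the parsed JSON to the extractor yields the street and district that go into the ticket row.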

Live stream: DJI Fly streams RTMP to a local nginx-rtmp Docker server, which outputs HLS on port 8080. The Next.js app has an HLS proxy route that fetches from that backend so the Live tab uses same-origin URLs. The Python pipeline can consume the same RTMP URL with --live to run SAM3 on the live feed for detections and saved frames, while the video shown in the UI is the raw HLS stream.
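The RTMP-to-HLS hop can be configured with a minimal nginx-rtmp setup along these lines; the application name `live`, the `/tmp/hls` path, and the fragment length are assumptions, while the 1935/8080 ports match the setup described above:

```nginx
rtmp {
    server {
        listen 1935;                # DJI Fly publishes RTMP here
        application live {
            live on;
            hls on;                 # convert incoming RTMP to HLS segments
            hls_path /tmp/hls;
            hls_fragment 2s;
        }
    }
}
http {
    server {
        listen 8080;                # the Next.js proxy fetches HLS from this port
        location /hls {
            types { application/vnd.apple.mpegurl m3u8; video/mp2t ts; }
            root /tmp;
            add_header Access-Control-Allow-Origin *;
        }
    }
}
```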

Frontend: Built with Next.js 14 App Router, Tailwind, Framer Motion, Leaflet, and a Supabase client. The map, timeline, filters, and Live panel are separate components; the main page composes them and passes filters and time range so the map and sidebar stay in sync.

Challenges we ran into

SAM3 API and shape handling: The initial code assumed an older detection API, so we had to align with the official post_process_instance_segmentation and pass target_sizes from the processor's original_sizes so that masks and boxes matched the image. We also hit encoder shape errors on large or odd-sized frames; adding a configurable max_side of 720px, so frames are resized before reaching SAM3, made those errors go away.
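The max_side rule reduces to a small aspect-preserving dimension calculation before the actual resize. A minimal sketch (the rounding behavior is an assumption; the real code may also snap to even dimensions for the encoder):

```python
def resize_dims(width: int, height: int, max_side: int = 720):
    """Return (new_w, new_h) so the longer side is at most `max_side`,
    preserving aspect ratio; frames already small enough are untouched."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```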

Live stream versus what the user sees: The RTMP server publishes one HLS stream, and the Python pipeline reads from that same RTMP source for detection. The live transmission in the UI is the raw feed because we do not yet composite SAM3 segmentation onto the stream. Doing that would require a separate encode and publish step to read RTMP, draw masks, re-encode, and publish a second HLS stream.

HLS proxy and ECONNREFUSED: The dashboard requests the playlist via /api/hls/..., which fetches from localhost:8080. If the RTMP/HLS server is not running, that fetch fails with ECONNREFUSED and we return 502. We added a clear error message and return a stub M3U8 when the backend returns 404 for the playlist so the player does not crash.
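The stub returned on 404 can be as small as a valid, already-ended HLS playlist, which lets the player terminate cleanly instead of crashing. A sketch of what such a stub might look like (the exact contents in our route may differ):

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-ENDLIST
```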

Cross-platform GPU: We wanted the model to run on Apple Silicon (MPS) as well as CUDA. We updated device selection to use MPS when available so SAM3 runs on Metal instead of CPU on M1, M2, and M3 Macs.
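The device-selection order (CUDA first, then MPS, then CPU) can be captured in a tiny helper. In the pipeline the two flags come from torch.cuda.is_available() and torch.backends.mps.is_available(); the helper below takes them as plain booleans so the priority logic is visible on its own:

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Pick the best torch device string: CUDA if present, else Apple MPS
    (Metal on M1/M2/M3 Macs), else plain CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```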

Accomplishments that we're proud of

Zero custom training: We get pothole and road-damage segmentation from SAM3 text prompts only, with no labeled dataset or fine-tuning required. That makes the system easy to extend to other concepts like "crack" or "faded marking" by changing prompts.

Full pipeline from video to tickets: One flow from drone video to SAM3 detections, optional Gemini analysis, Supabase tickets with location and severity, and a dashboard that shows them on a map with filters and a live feed. It is a working prototype of "fly once, get a ticket list."

Live DJI integration: We documented and wired DJI Air 3S and DJI Fly to RTMP, then nginx-rtmp to HLS, through a Next.js proxy to the Live tab. The Python pipeline can run SAM3 on that same stream.

Clear separation of concerns: The model (SAM3 and Gemini), pipeline (video, telemetry, and deduplication), ticket creation (Supabase), and dashboard (Next.js and Supabase) are separated so we can improve or swap pieces without rewriting the whole system.

What we learned

Foundation models for civic tech: Using a general-purpose segmentation model like SAM3 with text prompts is enough to get useful pothole and road-damage masks without training. The main work is preprocessing, filtering, telemetry alignment, and user experience.

HLS/RTMP in a web app: Proxying HLS through the Next.js API avoids CORS and keeps stream URLs same-origin; handling 404 with a stub playlist improves UX when the stream is not up yet. We also learned how DJI Fly’s RTMP and nginx-rtmp’s HLS output fit together regarding stream key, app name, and ports.

Device and API details matter: Getting SAM3 to run reliably meant matching the processor’s expected inputs and post-processing, and handling image dimensions to avoid encoder errors. Supporting MPS required explicitly checking torch.backends.mps.is_available() and using it when CUDA is not available.

What's next for Kavi

Segmentation overlay on live video: Encode a second HLS stream that draws SAM3 masks and boxes on the live feed so operators see detections directly on the video in the Live tab.

More issue types and prompts: Extend SAM3 prompts and Gemini descriptions to cracks, faded markings, standing water, and debris, and expose these as filterable types in the dashboard.

Prioritization and routing: Use severity, traffic impact, and effort from Gemini, along with optional historical data, to suggest priority and which crew or district to assign.
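One way such a priority heuristic could look, combining the Gemini fields mentioned above; the weights and the severity scale are purely illustrative, not a design we have tuned:

```python
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 4, "critical": 8}

def priority_score(severity: str, traffic_impact: float, effort_hours: float) -> float:
    """Hypothetical priority heuristic: severity and traffic impact raise the
    score, while larger effort lowers it slightly so quick wins surface first."""
    base = SEVERITY_WEIGHT.get(severity, 1) * 10
    return base + 5 * traffic_impact - 0.5 * effort_hours
```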

Batch and archival workflows: Improve batch video processing with a folder of flights, better handling of long videos for memory efficiency, and optional archival export in GeoJSON or CSV for GIS tools.

Auth and multi-tenant: Add authentication and scope tickets and streams by organization or city so multiple teams can use the same stack safely.

Health Scoring Formula

We implemented a risk-based city health scoring formula in which the final score is computed from the total accumulated risk points and a saturation constant. The formula keeps the score high when only minor issues are present and lets it decrease logarithmically as risk accumulates.
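A minimal sketch of one saturating score that fits this description, assuming the form S = 100 / (1 + ln(1 + R/k)) with total risk points R and saturation constant k; the exact expression in our implementation may differ:

```python
import math

def health_score(risk_points: float, k: float = 50.0) -> float:
    """Hypothetical saturating health score: exactly 100 when no risk has
    accumulated, decaying logarithmically as total risk points grow past the
    saturation constant k. The exact formula used by Kavi may differ."""
    return 100.0 / (1.0 + math.log1p(risk_points / k))
```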

Built With

Python, Hugging Face Transformers (SAM3), Google Gemini, Supabase, Next.js, Tailwind, Framer Motion, Leaflet, nginx-rtmp, Docker, RTMP/HLS