Inspiration
A dismounted soldier moving through an unfamiliar building or ridgeline is effectively blind beyond the next wall. The tools that could fix that — networked drones, cloud vision, GPS — all assume infrastructure that doesn't exist in a contested or disaster zone. Jamming, dead zones, and denied GPS are the norm, not the exception.
So we asked a narrower question: how much real-time aerial situational awareness can you give one person on the ground with zero connectivity? No cloud. No internet. No GPS. Just a laptop, two drones, and the operator who needs to see.
What it does
SkyGuardian is an offline-first, manned-unmanned drone team that gives a single operator live aerial recon and a running threat picture.
- A human-piloted Mavic (the Leader) streams video to a laptop "brain."
- A Tello (the Follower) station-keeps on the operator using an on-device visual "me" lock, so they can keep moving. A worn AprilTag lets the operator designate other targets — a vehicle, a position, another person — without ever losing the self-follow.
- The laptop runs the full perception stack locally: monocular SLAM for pose, depth estimation, and an open-vocabulary detector that drops every entity into a metre-scale local map.
- On top of perception sits on-device tactical reasoning — a local vision LLM (our offline answer to "Gemini Live") that produces a rolling threat assessment and answers operator questions in plain language.
- It all renders on an operator dashboard: live feed with detection overlays, a 2D/3D map with OSM building footprints, a threat board, an intel chat, and a radar inset tracking the follower drone.
One rule is hard-wired and non-negotiable: recon and situational awareness only. No engagement, ever. The stop and recall commands are always live and override every other mission state.
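To illustrate the override rule, here is a minimal sketch of a mission state machine in which stop and recall are checked before any ordinary transition. The state names, command strings, and transition table are assumptions made for this example, not the project's actual state machine.

```python
from enum import Enum, auto


class MissionState(Enum):
    # Hypothetical state names, chosen for illustration only.
    IDLE = auto()
    RECON = auto()
    FOLLOW = auto()
    RECALL = auto()
    STOPPED = auto()


# Commands that win regardless of the current state.
OVERRIDES = {"stop": MissionState.STOPPED, "recall": MissionState.RECALL}

# Ordinary transitions: (current state, command) -> next state.
TRANSITIONS = {
    (MissionState.IDLE, "recon"): MissionState.RECON,
    (MissionState.IDLE, "follow"): MissionState.FOLLOW,
    (MissionState.RECON, "follow"): MissionState.FOLLOW,
    (MissionState.FOLLOW, "recon"): MissionState.RECON,
    (MissionState.RECALL, "reset"): MissionState.IDLE,
    (MissionState.STOPPED, "reset"): MissionState.IDLE,
}


def next_state(current: MissionState, command: str) -> MissionState:
    """Return the next state; stop and recall are checked first."""
    if command in OVERRIDES:
        return OVERRIDES[command]
    # Unknown or disallowed commands leave the state unchanged. The table
    # contains no engagement command, so none can ever be reached.
    return TRANSITIONS.get((current, command), current)
```

Because the override check comes first, no entry in the transition table can shadow stop or recall, and a command that is not in the table is simply ignored.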
How we built it
The brain (laptop, FastAPI + asyncio). An RTMP receiver pulls the Mavic feed into a perception pipeline running monocular visual odometry (ORB / essential-matrix), an AprilTag metric anchor for scale, a YOLO-World open-vocabulary detector ensembled with a COCO model and a specialty weapons detector, and optional DepthAnything-V2 monocular depth. Detections are fused with SLAM pose into 3D entities living in one shared world model.
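The fusion step can be sketched roughly as follows. This is an illustrative example under stated assumptions, not the project's code: it assumes a pinhole camera, a detection depth that is already in metres, a SLAM pose given as a camera-to-world rotation plus a camera position in unscaled SLAM units, and a single AprilTag edge measured in both metres and SLAM units to recover scale. All function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def metric_scale(tag_edge_m: float, tag_edge_slam: float) -> float:
    """Metres per SLAM unit, from one tag edge measured in both units."""
    if tag_edge_slam <= 0:
        raise ValueError("tag edge in SLAM units must be positive")
    return tag_edge_m / tag_edge_slam


def pixel_to_camera(u: float, v: float, depth_m: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) at a metric depth into the camera frame."""
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return (x, y, depth_m)


def camera_to_world(p_cam: Vec3, R_wc: Sequence[Sequence[float]],
                    t_wc_slam: Vec3, scale: float) -> Vec3:
    """Rotate a camera-frame point into the world frame and add the
    camera position, converted from SLAM units to metres."""
    rotated = [sum(R_wc[i][j] * p_cam[j] for j in range(3)) for i in range(3)]
    return (rotated[0] + scale * t_wc_slam[0],
            rotated[1] + scale * t_wc_slam[1],
            rotated[2] + scale * t_wc_slam[2])
```

For example, with a scale of 4 metres per SLAM unit, a detection centred 100 pixels right of the principal point at 4 m depth (focal length 500 px) sits 0.8 m to the camera's right before the pose is applied.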
Reasoning, fully offline. A local Ollama vision model (Gemma 3) runs a rolling threat-assessment loop and powers an operator Q&A chat grounded in the current feed. No tokens ever leave the laptop.
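A rolling assessment loop of this kind might look like the sketch below. It is an assumption-laden illustration rather than the project's implementation: the prompt text, the `gemma3` model tag, and the function names are hypothetical, and the actual HTTP call is left to an injected `ask_model` coroutine. The request body follows the shape of Ollama's `/api/generate` endpoint (served on `localhost:11434` by default), where `images` carries base64-encoded frames and `stream: false` asks for one complete reply.

```python
import asyncio
import base64
import json
from collections import deque
from typing import Awaitable, Callable, Deque, Optional


def build_request(prompt: str, jpeg: Optional[bytes],
                  model: str = "gemma3") -> bytes:
    """Encode a JSON body for Ollama's /api/generate endpoint."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if jpeg is not None:  # the text-only fast path skips the image
        body["images"] = [base64.b64encode(jpeg).decode("ascii")]
    return json.dumps(body).encode("utf-8")


async def threat_loop(get_frame: Callable[[], Optional[bytes]],
                      ask_model: Callable[[bytes], Awaitable[str]],
                      history: Deque[str],
                      interval_s: float,
                      rounds: int) -> None:
    """Assess the newest frame every interval_s seconds.

    history should be a bounded deque so only recent assessments are kept.
    """
    for _ in range(rounds):
        request = build_request(
            "Summarise visible threats in one sentence.", get_frame())
        history.append(await ask_model(request))
        await asyncio.sleep(interval_s)
```

Passing `deque(maxlen=N)` as `history` keeps only the last N assessments. The interval and the frame resolution fed to `get_frame` are the knobs that trade responsiveness against load on a laptop without a GPU.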
The dashboard (Next.js 14 + Tailwind + Three.js). Polled-JPEG video with bounding-box overlays, a 2D top-down and a 3D R3F map, the intel summary and chat, the threat board, and a self-contained follow-radar inset.
The follower (SwiftUI iOS app). The phone is the primary Tello controller — it joins the drone's AP, runs the visual "me" follow loop on-device, and publishes mission intent, device location, and a relative follow_state back to the brain over WebSocket.
Contracts + safety. Every subsystem agrees on a shared Entity shape and a typed WebSocket protocol (Python / TypeScript / Swift mirrors). A software arming interlock guarantees only one controller drives the Tello at a time.
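As a rough illustration of these two ideas, the sketch below pairs a hypothetical Entity shape with a single-holder arming lock. The field names and the `ArmingLock` API are assumptions made for this example; the project's real contract and interlock may differ.

```python
import threading
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical shared entity shape; field names are illustrative."""
    id: str
    label: str
    x_m: float
    y_m: float
    z_m: float
    confidence: float


class ArmingLock:
    """Allow at most one named controller to hold the Tello at a time."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._holder: Optional[str] = None

    def arm(self, controller: str) -> bool:
        """Grant control if the drone is free or already held by the caller."""
        with self._mutex:
            if self._holder in (None, controller):
                self._holder = controller
                return True
            return False

    def disarm(self, controller: str) -> bool:
        """Release control; only the current holder may do so."""
        with self._mutex:
            if self._holder == controller:
                self._holder = None
                return True
            return False

    def may_command(self, controller: str) -> bool:
        """True only for the controller that currently holds the lock."""
        with self._mutex:
            return self._holder == controller
```

A command path would call `may_command` before forwarding anything to the drone, and `asdict(entity)` gives a plain dictionary that can be serialised to JSON for the TypeScript and Swift mirrors.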
Challenges we ran into
- No GPS meant no shortcuts. Pose had to come from monocular SLAM plus an AprilTag scale anchor. Getting metre-accurate entity placement from a single moving camera was the hardest part.
- Two uncoordinated frames. The phone's follow frame and the Mavic SLAM frame aren't co-registered, so we couldn't put the follower on the same map. We solved it by sending follow_state as a relative range and bearing, drawn on a radar, instead of faking map coordinates.
- One drone, two would-be pilots. The phone and the laptop can both command the Tello. We added an arming lock plus a TELLO_DISABLE mode so the phone is the sole controller in the live demo and the two can never fight for the link.
- A vision LLM that's actually usable offline. We tuned the inference interval, the input resolution, and a text-only fast path to keep the local model responsive without a GPU in the field.
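The relative radar idea can be sketched as below: a follower offset expressed in the operator's own frame becomes a range and bearing, which then maps to a point on a circular inset. This is an illustrative example; the function names, the forward/right convention, and the screen mapping are assumptions, not the project's follow_state schema.

```python
import math
from typing import Tuple


def to_range_bearing(forward_m: float, right_m: float) -> Tuple[float, float]:
    """Range in metres and bearing in degrees, measured clockwise
    from straight ahead and wrapped to 0-360."""
    rng = math.hypot(forward_m, right_m)
    bearing = math.degrees(math.atan2(right_m, forward_m)) % 360.0
    return rng, bearing


def radar_xy(rng_m: float, bearing_deg: float,
             max_range_m: float, radius_px: float) -> Tuple[float, float]:
    """Place a blip on a radar of radius_px pixels, centred on (0, 0).

    Straight ahead points up the screen (negative y); targets beyond
    max_range_m are clamped to the rim.
    """
    r = min(rng_m, max_range_m) / max_range_m * radius_px
    theta = math.radians(bearing_deg)
    return r * math.sin(theta), -r * math.cos(theta)
```

Because both values are relative to the operator, nothing here depends on the SLAM frame, which is why the radar can stay honest even when the two frames are not co-registered.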
Accomplishments we're proud of
- A complete perception → reasoning → dashboard pipeline that makes zero network calls.
- A working dual-end live demo: laptop-side recon plus a phone-flown follower on one Tello AP.
- A real safety architecture — single-controller arming lock, always-live recall priority, and a no-engagement constraint baked into the state machine, not bolted on.
What we learned
Offline is a design constraint that forces better engineering. Stripping out the cloud meant every assumption — pose, detection, reasoning — had to earn its place on one laptop, and the result is something that actually survives a denied environment instead of demoing well on hotel wifi.
What's next for SkyGuardian
- A data flywheel into Palantir Foundry — captured recon already exports to a Foundry ontology for back-at-base review and model retraining.
- Multiple followers, ATAK integration, and an edge accelerator for higher-FPS perception on lighter hardware.
Built With
- apriltags
- asyncio
- depth-anything-v2
- dji-mavic
- dji-tello
- djitellopy
- fastapi
- gemma
- mediamtx
- next.js
- ollama
- opencv
- openstreetmap
- palantir-foundry
- pytest
- python
- react
- react-three-fiber
- rtmp
- swift
- swiftui
- tailwindcss
- three.js
- typescript
- ultralytics
- vitest
- websockets
- yolo
- yolo-world