SkyGuardian

Inspiration

A dismounted soldier moving through an unfamiliar building or ridgeline is effectively blind beyond the next wall. The tools that could fix that — networked drones, cloud vision, GPS — all assume infrastructure that doesn't exist in a contested or disaster zone. Jamming, dead zones, and denied GPS are the norm, not the exception.

So we asked a narrower question: how much real-time aerial situational awareness can you give one person on the ground with zero connectivity? No cloud. No internet. No GPS. Just a laptop, two drones, and the operator who needs to see.

What it does

SkyGuardian is an offline-first, manned-unmanned drone team that gives a single operator live aerial recon and a running threat picture.

A human-piloted Mavic (the Leader) streams video to a laptop "brain."
A Tello (the Follower) station-keeps on the operator using an on-device visual "me" lock, so they can keep moving. A worn AprilTag lets the operator designate other targets — a vehicle, a position, another person — without ever losing the self-follow.
The laptop runs the full perception stack locally: monocular SLAM for pose, depth estimation, and an open-vocabulary detector that drops every entity into a metre-scale local map.
On top of perception sits on-device tactical reasoning — a local vision LLM (our offline answer to "Gemini Live") that produces a rolling threat assessment and answers operator questions in plain language.
It all renders on an operator dashboard: live feed with detection overlays, a 2D/3D map with OSM building footprints, a threat board, an intel chat, and a radar inset tracking the follower drone.

One rule is hard-wired and non-negotiable: recon and situational awareness only. No engagement, ever. The stop and recall commands are always live and override every other mission state.
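
As a rough illustration of that override rule, here is a minimal sketch of a mission state machine in which stop and recall win from any state. The state names, command strings, and transition table are assumptions made for the example, not the project's actual code. Note that the table simply has no engagement command to reach.

```python
from enum import Enum


class MissionState(Enum):
    IDLE = "idle"
    FOLLOW = "follow"
    DESIGNATE = "designate"
    RECALL = "recall"
    STOPPED = "stopped"


# Commands that are honoured from any state (hypothetical command names).
OVERRIDES = {"stop": MissionState.STOPPED, "recall": MissionState.RECALL}

# Ordinary commands: (states they are accepted from, resulting state).
# Illustrative only; there is deliberately no engagement command.
TRANSITIONS = {
    "follow": ({MissionState.IDLE, MissionState.DESIGNATE}, MissionState.FOLLOW),
    "designate": ({MissionState.FOLLOW}, MissionState.DESIGNATE),
    "reset": ({MissionState.STOPPED, MissionState.RECALL}, MissionState.IDLE),
}


def next_state(current: MissionState, command: str) -> MissionState:
    """Return the state after `command`; unknown or disallowed commands are ignored."""
    # Overrides are checked first, so no other state can block them.
    if command in OVERRIDES:
        return OVERRIDES[command]
    if command in TRANSITIONS:
        allowed_from, target = TRANSITIONS[command]
        if current in allowed_from:
            return target
    return current
```

Checking the overrides before the ordinary table is what keeps them always live: no branch added later can sit in front of them.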

How we built it

The brain (laptop, FastAPI + asyncio). An RTMP receiver pulls the Mavic feed into a perception pipeline running monocular visual odometry (ORB / essential-matrix), an AprilTag metric anchor for scale, a YOLO-World open-vocabulary detector ensembled with a COCO model and a specialty weapons detector, and optional DepthAnything-V2 monocular depth. Detections are fused with SLAM pose into 3D entities living in one shared world model.
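
The fusion step can be pictured with a small, dependency-free sketch: an AprilTag of known size gives metres per SLAM unit, a detection pixel plus depth is back-projected through a pinhole model, and the camera pose moves the point into the world frame. The function names, the pinhole intrinsics, and the assumption that depth arrives in SLAM units are all illustrative and not taken from the project.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def metric_scale(tag_edge_m: float, tag_edge_slam: float) -> float:
    """Metres per SLAM unit, from a tag whose physical edge length is known."""
    if tag_edge_slam <= 0:
        raise ValueError("tag edge length in SLAM units must be positive")
    return tag_edge_m / tag_edge_slam


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) at `depth` into the camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def camera_to_world(p_cam: Vec3, rotation: Sequence[Sequence[float]],
                    translation: Vec3) -> Vec3:
    """Apply a camera-to-world pose: p_world = R * p_cam + t."""
    x, y, z = (
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def place_entity(u: float, v: float, depth_slam: float,
                 intrinsics: Tuple[float, float, float, float],
                 rotation: Sequence[Sequence[float]],
                 translation_slam: Vec3, scale: float) -> Vec3:
    """Detection pixel + depth + pose -> position in metres in the world frame."""
    # Depth and camera translation are both scaled from SLAM units to metres.
    p_cam = backproject(u, v, depth_slam * scale, *intrinsics)
    t_m = (translation_slam[0] * scale,
           translation_slam[1] * scale,
           translation_slam[2] * scale)
    return camera_to_world(p_cam, rotation, t_m)
```

For example, with an identity rotation, a camera one SLAM unit along x, and a scale of 2 m per unit, a detection 100 px right of the principal point at 2.5 SLAM units of depth (focal length 500 px) lands at roughly (3, 0, 5) metres.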

Reasoning, fully offline. A local Ollama vision model (Gemma 3) runs a rolling threat-assessment loop and powers an operator Q&A chat grounded in the current feed. No tokens ever leave the laptop.
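
A rolling assessment loop of this kind might look like the sketch below. It is text-only and model-agnostic: `ask_model` is a hypothetical async callable that would wrap whatever local model client is in use, and the entity fields (`label`, `range_m`, `bearing_deg`) and prompt wording are assumptions for the example.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Deque, List


def build_prompt(entities: List[dict], history: Deque[str]) -> str:
    """Turn the current entities and the last assessment into a text prompt."""
    lines = ["Recon only. Summarise threats to the operator."]
    for e in entities:
        lines.append(
            f"- {e['label']} at {e['range_m']:.0f} m, bearing {e['bearing_deg']:.0f} deg"
        )
    if history:
        lines.append("Previous assessment: " + history[-1])
    return "\n".join(lines)


async def assessment_loop(
    get_entities: Callable[[], List[dict]],
    ask_model: Callable[[str], Awaitable[str]],
    history: Deque[str],
    interval_s: float,
    max_rounds: int,
) -> None:
    """Query the model every `interval_s` seconds, keeping a bounded history."""
    for _ in range(max_rounds):
        prompt = build_prompt(get_entities(), history)
        # A deque created with maxlen drops the oldest assessment automatically.
        history.append(await ask_model(prompt))
        await asyncio.sleep(interval_s)
```

Passing the model in as a callable keeps the loop testable with a stub and leaves the choice of interval and history length to the caller.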

The dashboard (Next.js 14 + Tailwind + Three.js). Polled-JPEG video with bounding-box overlays, a 2D top-down and a 3D R3F map, the intel summary and chat, the threat board, and a self-contained follow-radar inset.

The follower (SwiftUI iOS app). The phone is the primary Tello controller — it joins the drone's AP, runs the visual "me" follow loop on-device, and publishes mission intent, device location, and a relative follow_state back to the brain over WebSocket.
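
On the brain side, a typed follow_state message could be mirrored roughly as follows. The field names (`range_m`, `bearing_deg`, `locked`) and the `type`/`payload` envelope are assumptions for illustration; the project's real schema is not shown in this writeup.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FollowState:
    range_m: float      # distance from follower to operator, in metres (assumed field)
    bearing_deg: float  # relative bearing in degrees, normalised to [0, 360) on decode
    locked: bool        # whether the visual "me" lock is currently held (assumed field)


def encode(msg: FollowState) -> str:
    """Serialise a follow_state message into a JSON envelope."""
    return json.dumps({"type": "follow_state", "payload": asdict(msg)})


def decode(raw: str) -> FollowState:
    """Parse and validate a follow_state envelope; raise ValueError on other types."""
    data = json.loads(raw)
    if data.get("type") != "follow_state":
        raise ValueError(f"unexpected message type: {data.get('type')!r}")
    payload = data["payload"]
    return FollowState(
        range_m=float(payload["range_m"]),
        bearing_deg=float(payload["bearing_deg"]) % 360.0,
        locked=bool(payload["locked"]),
    )
```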

Contracts + safety. Every subsystem agrees on a shared Entity shape and a typed WebSocket protocol (Python / TypeScript / Swift mirrors). A software arming interlock guarantees only one controller drives the Tello at a time.
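
The single-controller guarantee can be sketched as an ownership lock: a controller must arm before commanding, and a second controller is refused until the first disarms. The class and method names below are hypothetical, and this is only one way such an interlock might be structured.

```python
import threading
from typing import Optional


class ArmingInterlock:
    """Allows at most one named controller to hold the drone at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owner: Optional[str] = None

    def arm(self, controller: str) -> bool:
        """Claim control; succeeds only if nobody else currently holds it."""
        with self._lock:
            if self._owner is None or self._owner == controller:
                self._owner = controller
                return True
            return False

    def disarm(self, controller: str) -> bool:
        """Release control; only the current owner may do so."""
        with self._lock:
            if self._owner == controller:
                self._owner = None
                return True
            return False

    def may_command(self, controller: str) -> bool:
        """Check whether `controller` is the one currently armed."""
        with self._lock:
            return self._owner == controller
```

Every command path would call `may_command` before touching the link, so an unarmed controller's commands are dropped rather than raced.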

Challenges we ran into

No GPS meant no shortcuts. Pose had to come from monocular SLAM plus an AprilTag scale anchor. Getting metre-accurate entity placement from a single moving camera was the hardest part.
Two uncoordinated frames. The phone's follow frame and the Mavic SLAM frame aren't co-registered, so we couldn't put the follower on the same map. We solved it by sending follow_state as a relative range and bearing, drawn on a radar inset, instead of inventing map coordinates.
One drone, two would-be pilots. The phone and the laptop can both command the Tello. We added an arming lock plus a TELLO_DISABLE mode so the phone is the sole controller in the live demo and the two can never fight for the link.
A vision LLM that's actually usable offline. We tuned the assessment interval, the input resolution, and a text-only fast path to keep the local model responsive without a GPU in the field.
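
To make the range/bearing radar idea concrete, here is a minimal sketch that maps a relative range and bearing to pixel offsets from the centre of a radar inset. The bearing convention (0 degrees straight ahead, clockwise positive), the clamping to the outer ring, and the function name are assumptions for the example.

```python
import math
from typing import Tuple


def radar_xy(range_m: float, bearing_deg: float,
             max_range_m: float, radius_px: float) -> Tuple[float, float]:
    """Map range/bearing to (x, y) pixel offsets from the radar centre.

    Bearing 0 points up on screen and increases clockwise; screen y grows
    downward, so a target straight ahead has a negative y offset.
    """
    if max_range_m <= 0:
        raise ValueError("max_range_m must be positive")
    # Targets beyond the display range are pinned to the outer ring.
    clamped = min(max(range_m, 0.0), max_range_m)
    r = clamped / max_range_m * radius_px
    theta = math.radians(bearing_deg)
    return (r * math.sin(theta), -r * math.cos(theta))
```

Because the output is relative to the radar centre, nothing here depends on the SLAM frame, which is the point of the approach.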

Accomplishments we're proud of

A complete perception → reasoning → dashboard pipeline that makes zero calls to the cloud or the internet.
A working dual-end live demo: laptop-side recon plus a phone-flown follower on one Tello AP.
A real safety architecture — single-controller arming lock, always-live recall priority, and a no-engagement constraint baked into the state machine, not bolted on.

What we learned

Offline is a design constraint that forces better engineering. Stripping out the cloud meant every assumption — pose, detection, reasoning — had to earn its place on one laptop, and the result is something that actually survives a denied environment instead of demoing well on hotel wifi.

What's next for SkyGuardian

A data flywheel into Palantir Foundry — captured recon already exports to a Foundry ontology for back-at-base review and model retraining.
Multiple followers, ATAK integration, and an edge accelerator for higher-FPS perception on lighter hardware.

Built With

apriltags
asyncio
depth-anything-v2
dji-mavic
dji-tello
djitellopy
fastapi
gemma
mediamtx
next.js
ollama
opencv
openstreetmap
palantir-foundry
pytest
python
react
react-three-fiber
rtmp
swift
swiftui
tailwindcss
three.js
typescript
ultralytics
vitest
websockets
yolo
yolo-world

Updates

Nicolas Dos Santos started this project — May 31, 2026 01:51 PM EDT
