Inspiration

Aerial footage has become the default language of real estate listings, event recaps, and brand campaigns — but getting good aerial footage still requires a skilled pilot who knows where to point the camera. The creative decisions (what's interesting, how to frame it, which angle tells the story) are made entirely by intuition, on the fly, with a 30-minute battery ticking down.

We wanted to ask: what if the drone had taste? What if it could look at a building, understand what's visually interesting about it, and plan a cinematic shot list — autonomously?


What it does

PRAL takes a single input — a pin dropped on a map — and produces a curated set of marketing-ready aerial footage clips, fully autonomously.

It runs a five-stage pipeline:

  1. Select — confirms the target by cross-referencing OpenStreetMap footprints and onboard vision detection.
  2. Survey — flies a lawnmower pass to measure building height and build a 3D obstacle map.
  3. Map — orbits the structure and reconstructs a 3D model, then adaptively fills coverage gaps with next-best-view close-ups.
  4. Analyze — scores every point on the 3D surface using a computer vision model's attention maps, producing a per-vertex interest field that captures what the model "finds visually compelling."
  5. Shoot — samples hundreds of candidate viewpoints, scores them against the interest field, selects 10 waypoints with enforced shot-type variety (orbit, reveal, push-in, top-down), and outputs a smooth trajectory checked against the obstacle map.

A Next.js frontend lets the human review and approve each stage, browse the curated clips, choose an output style, and export.


How we built it

The system is split into two decoupled halves joined by a single hand-off artifact — the Curated Footage Set (clips + camera poses + points of interest + quality scores).
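
As a rough illustration of what such a hand-off artifact could look like, here is a minimal sketch of the Curated Footage Set as Python dataclasses. Every field name below (clip_id, video_path, alt_m, and so on) is an assumption made for illustration; only the four top-level ingredients (clips, camera poses, points of interest, quality scores) come from the description above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class CameraPose:
    # Hypothetical pose fields: geodetic position plus heading and gimbal pitch.
    lat: float
    lon: float
    alt_m: float
    yaw_deg: float
    gimbal_pitch_deg: float


@dataclass
class Clip:
    clip_id: str
    shot_type: str  # e.g. "orbit", "reveal", "push_in", "top_down"
    video_path: str
    poses: List[CameraPose]
    quality_score: float


@dataclass
class PointOfInterest:
    label: str
    position: Tuple[float, float, float]  # local x, y, z in metres (assumed frame)
    interest: float


@dataclass
class CuratedFootageSet:
    clips: List[Clip] = field(default_factory=list)
    points_of_interest: List[PointOfInterest] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses through nested dataclasses, so one call
        # serializes the whole artifact for the frontend to read.
        return json.dumps(asdict(self), indent=2)
```

Keeping the artifact a plain, serializable structure like this is what would let either half be replaced by a mock that emits the same JSON.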

Pipeline (Python): Each stage is a self-contained runner with a full mock counterpart so the entire pipeline runs end-to-end without hardware. The core intelligence lives in a self-supervised computer vision model whose attention maps drive the interest field that the rest of the pipeline optimizes around.
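
The runner-plus-mock idea can be sketched as follows. This is not the project's actual code: the stage function names, the dictionary-based context, and the canned values are all assumptions chosen to show how five mock stages could be chained end-to-end with no hardware attached.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Stage = Callable[[Context], Context]


def mock_select(ctx: Context) -> Context:
    # Pretend a footprint lookup confirmed a 10 m x 10 m square building.
    return {**ctx, "footprint": [(0, 0), (10, 0), (10, 10), (0, 10)]}


def mock_survey(ctx: Context) -> Context:
    # Canned building height instead of a real lawnmower pass.
    return {**ctx, "height_m": 12.0}


def mock_map(ctx: Context) -> Context:
    # Canned vertex count instead of a real reconstruction.
    return {**ctx, "mesh_vertices": 4}


def mock_analyze(ctx: Context) -> Context:
    # One canned interest value per mesh vertex.
    return {**ctx, "interest": [0.1, 0.9, 0.4, 0.2]}


def mock_shoot(ctx: Context) -> Context:
    # Canned number of selected waypoints.
    return {**ctx, "waypoints": 10}


MOCK_STAGES: List[Tuple[str, Stage]] = [
    ("select", mock_select),
    ("survey", mock_survey),
    ("map", mock_map),
    ("analyze", mock_analyze),
    ("shoot", mock_shoot),
]


def run_pipeline(pin: Tuple[float, float], stages: List[Tuple[str, Stage]]) -> Context:
    """Run each stage in order, threading one context dict through all of them."""
    ctx: Context = {"pin": pin}
    completed: List[str] = []
    for name, stage in stages:
        ctx = stage(ctx)
        completed.append(name)
    ctx["completed"] = completed
    return ctx
```

Under this arrangement, swapping a mock for a real runner would only mean replacing one entry in the stage list, since every stage shares the same signature.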

Hardware integration (DJI Mobile SDK + Payload SDK): The pipeline's waypoint commands are translated into drone instructions via the DJI Mobile SDK (iOS) and DJI Payload SDK. Each stage outputs a structured mission — a list of waypoints with position, gimbal pitch, and camera trigger parameters — that is ingested by a lightweight iOS companion app. The DJI Waypoint Mission API executes the trajectory autonomously, handling low-level flight stabilization, obstacle avoidance (APAS), and return-to-home. The Payload SDK manages camera control and gimbal articulation as a tightly integrated layer, while the Mobile SDK handles mission upload, telemetry streaming, and flight-mode switching. For the final shoot stage, the companion app streams live telemetry (GPS pose, gimbal orientation, battery SOC) back to the pipeline so the quality gate can validate that each shot was captured from the planned viewpoint within tolerance. During development, every SDK call has a software mock so the full pipeline can be exercised without physical hardware on the bench.
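
To make the structured mission and the telemetry check concrete, here is a minimal pipeline-side sketch. It deliberately calls no DJI SDK functions. The Waypoint and Telemetry field names, the trigger strings, and the default tolerances (2 m, 3 degrees) are all assumptions, not values taken from the project.

```python
import json
from dataclasses import asdict, dataclass
from math import asin, cos, hypot, radians, sin, sqrt
from typing import List


@dataclass
class Waypoint:
    lat: float
    lon: float
    alt_m: float
    gimbal_pitch_deg: float
    trigger: str  # hypothetical values: "none", "photo", "start_video", "stop_video"


@dataclass
class Telemetry:
    lat: float
    lon: float
    alt_m: float
    gimbal_pitch_deg: float


def mission_to_json(waypoints: List[Waypoint]) -> str:
    """Serialize a mission so a companion app could ingest it."""
    return json.dumps([asdict(w) for w in waypoints])


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres on a spherical Earth."""
    r = 6_371_000.0
    p1, p2 = radians(lat1), radians(lat2)
    a = sin((p2 - p1) / 2) ** 2 + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(a))


def within_tolerance(planned: Waypoint, actual: Telemetry,
                     pos_tol_m: float = 2.0, pitch_tol_deg: float = 3.0) -> bool:
    """True if the shot was taken close enough to the planned viewpoint."""
    horizontal = haversine_m(planned.lat, planned.lon, actual.lat, actual.lon)
    vertical = abs(planned.alt_m - actual.alt_m)
    pitch_err = abs(planned.gimbal_pitch_deg - actual.gimbal_pitch_deg)
    return hypot(horizontal, vertical) <= pos_tol_m and pitch_err <= pitch_tol_deg
```

A check of this shape would let the quality gate reject a clip whose recorded pose drifted from the plan, independent of how the SDK layer executed the flight.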

Frontend (Next.js + Three.js): A human-in-the-loop approval UI where the FlightStudio component renders the full planned trajectory in 3D — survey ascent, orbit ring, and final shot path — against satellite imagery.


Challenges we ran into

Tying geometry to interest. Getting a vision model's attention maps to correctly accumulate onto a 3D surface required careful ray-casting, per-triangle interpolation, and incidence-angle weighting so head-on views don't get diluted by grazing-angle noise.
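
The incidence-angle weighting can be illustrated with a small sketch. It assumes the ray-casting and per-triangle interpolation have already produced, for each observation, a vertex index, an attention value, and a vector from the surface toward the camera; that input format and the function names are hypothetical. Each observation is weighted by the cosine between the surface normal and the view direction, so grazing views contribute little.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _normalize(v: Vec3) -> Vec3:
    n = sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return (v[0] / n, v[1] / n, v[2] / n)


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def accumulate_interest(
    normals: Sequence[Vec3],
    observations: Sequence[Tuple[int, float, Vec3]],
) -> List[float]:
    """Weighted mean of attention per vertex.

    normals:      unit surface normal for each vertex.
    observations: (vertex_index, attention, surface_to_camera_vector) triples.
    """
    weighted_sum = [0.0] * len(normals)
    weight_total = [0.0] * len(normals)
    for vid, attention, to_camera in observations:
        # Cosine of the incidence angle: 1 for head-on, near 0 for grazing,
        # clamped at 0 for views from behind the surface.
        w = max(0.0, _dot(normals[vid], _normalize(to_camera)))
        weighted_sum[vid] += w * attention
        weight_total[vid] += w
    return [s / t if t > 0 else 0.0 for s, t in zip(weighted_sum, weight_total)]
```

With a plain average, one head-on view scoring 1.0 and one grazing view scoring 0.0 would yield 0.5. With cosine weighting the result stays close to the head-on value, which is the dilution effect the weighting is meant to prevent.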

Coverage without flying forever. The next-best-view system balances information gain against battery life — ranking candidate viewpoints by unseen surface area divided by flight cost, with a hard battery floor.
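
The ranking rule described above can be sketched in a few lines. The Candidate fields and the percentage-based battery model are assumptions for illustration; the source specifies only the ratio (unseen surface area divided by flight cost) and the existence of a hard battery floor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    unseen_area_m2: float    # surface area this view would newly cover
    flight_cost_s: float     # time to reach the view (assumed > 0)
    battery_cost_pct: float  # battery the detour would consume


def next_best_view(candidates: List[Candidate], battery_pct: float,
                   battery_floor_pct: float) -> Optional[Candidate]:
    """Pick the view with the best gain per unit cost that respects the floor.

    Returns None when no candidate is both useful and affordable, which is
    the signal to stop filling gaps and move on.
    """
    feasible = [
        c for c in candidates
        if c.unseen_area_m2 > 0
        and battery_pct - c.battery_cost_pct >= battery_floor_pct
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.unseen_area_m2 / c.flight_cost_s)
```

Treating the battery floor as a filter rather than a penalty term means a very attractive but expensive view can never be chosen, no matter how high its ratio.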

Shot variety as a constraint. A pure value-greedy selection produces redundant shots (five push-ins of the same facade). We encoded shot-type quotas as hard slots and sequenced the result with a TSP solver.
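
A minimal sketch of quota-based selection follows. The Shot fields and quota values are hypothetical. For sequencing, the sketch substitutes a greedy nearest-neighbor ordering for the TSP solver mentioned above; it is a simple stand-in, not an optimal tour.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Shot:
    name: str
    shot_type: str  # "orbit", "reveal", "push_in", "top_down"
    score: float    # value against the interest field
    position: Vec3


def select_with_quotas(candidates: List[Shot], quotas: Dict[str, int]) -> List[Shot]:
    """Fill each shot-type slot with the best-scoring candidates of that type."""
    chosen: List[Shot] = []
    for shot_type, slots in quotas.items():
        pool = sorted(
            (c for c in candidates if c.shot_type == shot_type),
            key=lambda c: c.score,
            reverse=True,
        )
        chosen.extend(pool[:slots])
    return chosen


def sequence_nearest_neighbor(shots: List[Shot], start: Vec3) -> List[Shot]:
    """Greedy ordering: always fly to the closest remaining shot."""
    remaining = list(shots)
    order: List[Shot] = []
    here = start
    while remaining:
        nxt = min(remaining, key=lambda s: math.dist(here, s.position))
        remaining.remove(nxt)
        order.append(nxt)
        here = nxt.position
    return order
```

Because each type has its own slot count, a low-scoring orbit still makes the cut over a third high-scoring push-in, which is exactly the behavior a pure value-greedy selection lacks.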

Quality gates. Knowing when the 3D model is good enough to hand off required combining multiple signals — reprojection error, reconstruction quality, and surface coverage — with any failure degrading gracefully rather than crashing.
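
One way to combine the three signals without crashing on bad input is sketched below. The metric keys and the default thresholds are assumptions, not the project's values. A missing or out-of-range metric is recorded as a failure reason instead of raising an exception.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class GateResult:
    passed: bool
    failures: List[str] = field(default_factory=list)


def quality_gate(
    metrics: Dict[str, Optional[float]],
    max_reprojection_px: float = 1.5,   # hypothetical threshold
    min_reconstruction: float = 0.6,    # hypothetical threshold
    min_coverage: float = 0.85,         # hypothetical threshold
) -> GateResult:
    """Check every signal and report all failures rather than stopping at the first."""
    checks: List[Tuple[str, Callable[[float], bool]]] = [
        ("reprojection_error_px", lambda v: v <= max_reprojection_px),
        ("reconstruction_quality", lambda v: v >= min_reconstruction),
        ("surface_coverage", lambda v: v >= min_coverage),
    ]
    failures: List[str] = []
    for name, ok in checks:
        value = metrics.get(name)
        if value is None:
            # A signal that could not be computed fails the gate gracefully.
            failures.append(f"{name}: missing")
        elif not ok(value):
            failures.append(f"{name}: {value} out of range")
    return GateResult(passed=not failures, failures=failures)
```

Returning the list of reasons, rather than a bare boolean, gives the approval UI something to show the reviewer when a model is held back.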


Accomplishments that we're proud of

  • A fully traced pipeline from pin drop to Curated Footage Set, with every stage documented from design spec through implementation and acceptance criteria.
  • A per-vertex interest field that fires on semantically meaningful geometry (apertures, signage, architectural detail) using only self-supervised signals — no labeled training data.
  • Mock infrastructure rigorous enough that every stage runs without a drone, a GPU, or a real 3D workspace.
  • A 3D flight visualization that makes abstract concepts (coverage maps, interest fields, trajectory splines) immediately legible to a non-technical reviewer.

What we learned

Self-supervised computer vision models encode enough about visual saliency that you can build a principled "interest field" without any task-specific labels. The hard part isn't the scoring — it's grounding that score in 3D space accurately enough to drive a physical trajectory.

Shot variety isn't a soft preference; it's a constraint that has to be encoded explicitly. Left unconstrained, any value-maximizing optimizer will cluster on the single most photogenic corner of the building.

The most valuable engineering decision we made was defining the Curated Footage Set schema early and strictly. It let the frontend and pipeline be developed in parallel with zero coordination overhead.


What's next for PRAL

  • Live 3D refinement — updating the scene model as the drone flies rather than processing offline.
  • Semantic shot targeting — accepting natural-language briefs ("emphasize the rooftop terrace") and biasing the interest field via text embeddings.
  • Multi-structure missions — extending the planner to cover several buildings in a single flight.
  • Automated edit — cutting the Curated Footage Set into a finished reel using shot-type metadata and a music-sync beat detector.
