Inspiration

We wanted to help ensure fair refereeing decisions in sports. In fast-paced matches, it is incredibly difficult for human camera operators to track every sudden sprint, pass, or foul, which can lead to missed angles during critical video reviews. We set out to build a robotic camera that never loses focus, tracking the game seamlessly and providing referees with the perfect angle every single time.

What it does

The Autonomous Cameraman is a real-time, AI-driven tracking system that locks onto a specific moving target (like a ball) and physically actuates a robotic arm to keep it perfectly centered in the frame. Instead of relying on a human operator, the system calculates spatial offsets and directly commands the pan, tilt, and lift joints of an SO101 Follower robotic arm to follow the action dynamically.

How we built it

We built the system as a highly multithreaded Python application that bridges computer vision with physical hardware. We used the ultralytics library to run a YOLOv8 segmentation model for precise object detection.

We implemented a MiDaS monocular depth estimation model to understand the 3D space and prevent the camera from locking onto background objects. To achieve real-time performance, we heavily optimized the AI to run on AMD Ryzen AI GPUs via Vitis AI and onnxruntime. We used on-the-fly INT8 dynamic quantization and pre-allocated zero-copy memory buffers to eliminate processing bottlenecks.

The physical actuation is handled by a custom control loop that translates pixel offsets into mechanical joint movements (shoulder pan, shoulder lift, and elbow flex) for the robotic arm.
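The offset-to-joint mapping described above can be sketched as a simple proportional step. The gains, frame size, and joint names here are illustrative assumptions, not the actual SO101 interface:

```python
# Minimal sketch of the pixel-offset -> joint-delta control step.
# KP_* gains and the joint names are assumed tuning values for illustration;
# the real SO101 Follower interface and calibration will differ.

FRAME_W, FRAME_H = 640, 480
KP_PAN, KP_LIFT = 0.05, 0.04   # assumed proportional gains (degrees per pixel)

def pixel_offset_to_joint_deltas(cx, cy):
    """Map the target centroid's offset from frame center to joint deltas."""
    dx = cx - FRAME_W / 2   # +dx: target is right of center -> pan right
    dy = cy - FRAME_H / 2   # +dy: target is below center   -> lower the lift
    return {
        "shoulder_pan": KP_PAN * dx,
        "shoulder_lift": -KP_LIFT * dy,  # raise the arm when target is above center
    }

deltas = pixel_offset_to_joint_deltas(400, 200)
```

In practice a real loop would also clamp the deltas to joint limits and rate-limit the commands sent to the arm.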

Challenges we ran into

One of our biggest hurdles was the computational latency of running heavy deep learning models on every single frame, which caused the robotic tracking to lag behind the target's physical movements. We solved this by decoupling the architecture: the heavy AI runs asynchronously to identify the target, while a lightweight, lightning-fast contour refinement and template-matching loop tracks the target frame-by-frame. We also had to implement Exponential Moving Average (EMA) smoothing to eliminate centroid jitter, ensuring the arm's movements were fluid rather than jerky.
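The EMA smoothing step is conceptually simple; a minimal sketch (the `alpha` value is an assumed tuning parameter):

```python
# EMA smoothing of the tracked centroid to damp frame-to-frame jitter.
class EmaCentroid:
    def __init__(self, alpha=0.3):
        # alpha close to 1 -> responsive but jittery; close to 0 -> smooth but laggy
        self.alpha = alpha
        self.cx = self.cy = None

    def update(self, cx, cy):
        if self.cx is None:
            # First observation: initialize directly, no smoothing possible yet.
            self.cx, self.cy = float(cx), float(cy)
        else:
            a = self.alpha
            self.cx = a * cx + (1 - a) * self.cx
            self.cy = a * cy + (1 - a) * self.cy
        return self.cx, self.cy
```

Feeding the smoothed centroid (rather than the raw detection) into the control loop is what keeps the commanded joint deltas from oscillating.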

Moreover, when the object left the frame, the template-matching algorithm would produce false-positive matches because the background was still present. To solve this, our system uses MiDaS to constantly estimate the object's inverse depth, rejecting tracking matches if the depth drops drastically or falls below the previously known reference depth, which indicates the target has moved away or behind an obstacle.
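The depth gate can be sketched as a single check against the last confident reference. The drop threshold is an assumed tuning value, and `inv_depth` stands in for the MiDaS inverse-depth reading at the matched location:

```python
# Depth-based rejection of template matches.
# MiDaS outputs inverse depth: larger values mean closer objects, so a sharp
# drop suggests the match is actually background. drop_ratio is an assumed
# threshold, not a value from the project.

def depth_match_is_valid(inv_depth, ref_inv_depth, drop_ratio=0.6):
    """Accept a template match only if the inverse depth at the match is
    still reasonably close to the reference recorded while the target was
    confidently tracked."""
    return inv_depth >= drop_ratio * ref_inv_depth
```

When the gate fails, the tracker can declare the target lost and fall back to the asynchronous YOLO pass to re-acquire it.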

Accomplishments that we're proud of

We are incredibly proud of the hardware-level optimizations we achieved. By writing a custom execution pipeline that compiles and runs the ONNX models directly on the NPU, we bypassed the usual CPU bottlenecks; we later moved the model to the GPU after measuring even better performance there. We also successfully implemented depth-based background rejection: if a player briefly steps behind an obstacle or the depth suddenly drops, the tracker is smart enough to know the object has left the frame rather than wildly snapping to the background. Our two-step object determination approach, combining YOLO image segmentation with template matching, performed approximately 3x better than a plain YOLO implementation.

What we learned

We gained deep, practical experience in hardware-accelerated AI on an AMD Mini PC, specifically learning how to tune execution providers, manage shared memory arenas, and use quantization to speed up inference. We also learned how to manage complex, multithreaded state in Python. Coordinating separate background workers for camera ingestion, depth mapping, AI segmentation, and motor control required careful use of thread locks and events to prevent race conditions and deadlocks.
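The lock-and-event pattern we relied on looks roughly like this toy sketch (the shared state and worker body are simplified stand-ins, not the actual pipeline):

```python
import threading

# Toy version of the coordination pattern: a background worker publishes the
# latest result under a lock, and an Event signals shutdown to all workers.
latest = {"centroid": None}
lock = threading.Lock()
stop = threading.Event()

def ai_worker():
    frame_id = 0
    while not stop.is_set():
        frame_id += 1
        with lock:                      # guard shared state against races
            latest["centroid"] = (frame_id, frame_id)
        if frame_id >= 5:
            stop.set()                  # tell every worker to wind down

t = threading.Thread(target=ai_worker)
t.start()
t.join()

with lock:
    result = latest["centroid"]
```

The same lock protects every reader and writer of the shared detection state, and using a single `Event` for shutdown avoids the deadlocks that per-thread flags can cause.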

What's next for Autonomous Cameraman

Moving forward, we want to deploy the system's "brain" logic on an FPGA to achieve faster inference and more deterministic calculations. We would also like to integrate predictive kinematics to preemptively move the arm toward where the action is heading.

Built With

  • python
  • robotics
  • so101
  • video-pipelines
  • vitis-ai