Inspiration

Autonomous harvesters are already in the fields, but the people working alongside them are still at risk. We kept coming back to one uncomfortable truth: a missed detection isn't a bug, it's a fatality. We wanted to see how far we could push camera-only perception to close that gap, using nothing more than the hardware already bolted to the machine.

What it does

Our solution detects people in real time and estimates their distance from the vehicle in metres: no LiDAR, no radar, just cameras. If anyone comes within 15 metres, the system triggers a stop signal and flags the detection in red. Above that threshold, the machine keeps moving. It also cross-validates its own depth estimates between two independent models, so it knows when it's uncertain and can be extra cautious in those moments.
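The stop logic itself is simple. A minimal sketch of the decision step, assuming per-person distances have already been estimated (the constant and function names here are illustrative, not our actual API):

```python
# Illustrative stop-signal logic. STOP_DISTANCE_M matches the 15 m
# threshold described above; everything else is a hypothetical sketch.
STOP_DISTANCE_M = 15.0

def decide_action(person_distances_m):
    """Return ("STOP", nearest) if any detected person is within the stop
    radius, otherwise ("CONTINUE", nearest). Distances are in metres."""
    if not person_distances_m:
        return ("CONTINUE", None)          # no people in view
    nearest = min(person_distances_m)
    if nearest <= STOP_DISTANCE_M:
        return ("STOP", nearest)           # flag the detection in red, halt
    return ("CONTINUE", nearest)
```

In the real pipeline this runs once per frame, after depth estimation and cross-validation.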

How we built it

We combined three models running in sequence on each camera frame. YOLO11x-seg gives us pixel-precise person masks: not just bounding boxes but exact silhouettes, so we sample depth from the person and not the wheat behind them. UniDepth V2 provides metric depth for the whole frame, real metres without any calibration target. MoGe-2 runs alongside as a second opinion, and where the two models disagree we surface that as an uncertainty heatmap. The whole pipeline runs on a single GPU and outputs annotated frames, 3D point clouds, and video.
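The masked depth sampling and two-model cross-check can be sketched in a few lines of NumPy. This is a simplified illustration, not our exact code: the median statistic, the averaging of the two estimates, and the 0.5 m disagreement threshold are our assumptions here.

```python
import numpy as np

def person_distance(mask, depth_a, depth_b, disagree_thresh_m=0.5):
    """Sample depth only inside the person's segmentation mask and
    cross-validate two independent monocular depth models.

    mask      : (H, W) bool array from the segmentation model
    depth_a/b : (H, W) metric depth maps in metres from the two models
    Returns (distance_m, uncertain): the averaged masked-median distance,
    plus a flag set when the models disagree beyond the threshold.
    """
    a = np.median(depth_a[mask])           # robust to mask-edge outliers
    b = np.median(depth_b[mask])
    uncertain = abs(a - b) > disagree_thresh_m
    return float((a + b) / 2), uncertain
```

The per-pixel version of the same disagreement, `abs(depth_a - depth_b)`, is what feeds the uncertainty heatmap.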

Challenges we ran into

Getting metric depth you can actually trust in the middle of a dusty field is harder than any benchmark suggests. Strong backlight, dust clouds, and partial occlusion in tall crops all push the models into disagreement. We also spent a painful amount of time on segmentation mask alignment. YOLO generates masks at a lower resolution, and without careful resizing you end up sampling background depth right through the edge of a person's silhouette.
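One way to handle the alignment, sketched in pure NumPy under stated assumptions: nearest-neighbour upsampling of the low-resolution mask, followed by a small binary erosion so depth is sampled strictly inside the silhouette rather than at its edge. The function name and the erosion radius are illustrative.

```python
import numpy as np

def align_mask(mask_lowres, frame_shape, erode_px=2):
    """Upsample a low-resolution segmentation mask to frame resolution
    (nearest neighbour), then erode it a few pixels so depth samples
    never land on background bleeding through the silhouette edge."""
    H, W = frame_shape
    h, w = mask_lowres.shape
    rows = np.arange(H) * h // H           # nearest-neighbour index maps
    cols = np.arange(W) * w // W
    mask = mask_lowres[rows[:, None], cols]
    for _ in range(erode_px):              # 4-connected binary erosion
        p = np.pad(mask, 1, constant_values=False)
        mask = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                & p[1:-1, :-2] & p[1:-1, 2:])
    return mask
```

Shrinking the mask slightly costs a few valid depth pixels but eliminates the background contamination that skews distance estimates.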

Accomplishments that we're proud of

The system works. On real field imagery, person detections are consistent, distance estimates from the two depth models agree to within half a metre in normal conditions, and the pipeline correctly ignores crop structures, machinery, and other non-person objects without a single false stop.

What we learned

Monocular depth estimation is mature enough to be useful, but for the safety requirements of autonomous machinery it needs a second model watching over its shoulder, and it needs to know when to say it's not sure. We also learned that precision matters just as much as recall in safety systems: a system that stops the harvester every time it sees a shadow will be switched off by the operator on day one.

What's next for HackHPI - CLAAS Challenge - Team LOL

The obvious next step is the edge. We're targeting sub-100 ms per frame on an NVIDIA Jetson Orin using a quantised YOLO variant and reduced-resolution depth inference: fast enough for a harvester at field speed. Beyond that, multi-camera fusion would let us triangulate person positions across overlapping fields of view, turning monocular estimates into something closer to stereo ground truth. And every danger event the system flags in the field is a training sample; over a fleet of thousands of machines, that becomes a self-improving safety system that gets materially better every season.
