Real-Time Object Detection and Distance Estimation Using WebRTC and AI

Overview

This project enables real-time object detection and distance estimation from a live video stream using WebRTC and AI models such as YOLOv8 and MediaPipe Pose. The system processes video frames, detects objects and humans, estimates their distances, and transmits the results via a Socket.io server.

Key Features

WebRTC-Based Video Streaming: Uses aiortc to stream video in real time.
YOLOv8 Object Detection: Detects common objects in video frames.
Distance Estimation: Estimates the distance to each detected object from its bounding box height and a table of reference object heights.
MediaPipe Pose Tracking: Identifies human poses and estimates distances based on shoulder width.
Asynchronous Socket Communication: Uses socketio for real-time transmission of detected objects and their distances.

Technologies Used

Python & Asyncio: For non-blocking, real-time processing.
OpenCV: For image processing.
YOLOv8 (Ultralytics): Object detection model.
MediaPipe Pose: Human pose estimation.
WebRTC (aiortc): Video streaming framework.
Socket.io: Real-time client-server communication.
aiohttp: Asynchronous HTTP server framework.

How It Works

Client Connection:
- The client sends a WebRTC offer to establish a peer-to-peer connection.
- The server responds with an answer to complete the WebRTC handshake.
Video Processing:
- The VideoTransformTrack class processes incoming video frames.
- YOLOv8 detects objects, and the distance to each one is estimated from its bounding box height and a predefined reference height for its class.
- MediaPipe Pose detects human poses and estimates distances based on shoulder width.
Real-Time Data Transmission:
- Detection results, including object labels, confidence scores, and estimated distances, are sent to the client via Socket.io.
- FPS (frames per second) is calculated and included in the response.
Connection Management:
- The server manages multiple peer connections and cleans up stale connections periodically.
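
The transmission and connection-management steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the class FpsCounter, the functions build_payload and prune_stale, the payload field names, and the 30-second timeout are all hypothetical.

```python
import json
import time


class FpsCounter:
    """Rolling FPS estimate from the timestamps of recent frames."""

    def __init__(self, window=30):
        self.window = window          # number of recent frames to average over
        self.timestamps = []

    def tick(self, now=None):
        """Record one frame and return the current FPS estimate."""
        now = time.monotonic() if now is None else now
        self.timestamps.append(now)
        self.timestamps = self.timestamps[-self.window:]
        if len(self.timestamps) < 2:
            return 0.0
        elapsed = self.timestamps[-1] - self.timestamps[0]
        return (len(self.timestamps) - 1) / elapsed if elapsed > 0 else 0.0


def build_payload(detections, fps):
    """Serialize one frame's results as JSON (field names are assumptions)."""
    return json.dumps({
        "fps": round(fps, 1),
        "detections": [
            {"label": label, "confidence": round(conf, 2), "distance_m": dist}
            for label, conf, dist in detections
        ],
    })


def prune_stale(last_seen, now, timeout_s=30.0):
    """Remove connection ids not seen within timeout_s; return the removed ids."""
    stale = [cid for cid, seen in last_seen.items() if now - seen > timeout_s]
    for cid in stale:
        del last_seen[cid]
    return stale
```

In a server built on python-socketio, the string returned by build_payload (or the equivalent dictionary) would typically be passed to the server's emit call once per processed frame, and prune_stale would run from a periodic asyncio task that also closes the matching peer connections.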

Distance Estimation Approach

For objects:
\[ \text{distance} = \frac{\text{real height} \times \text{focal length}}{\text{bounding box height}} \]
- Uses a reference height table for common objects.
- The focal length and the bounding box height are both measured in pixels, so the distance comes out in the units of the real height.
For humans:
\[ \text{distance} = \frac{\text{shoulder width (40 cm)} \times \text{focal length}}{\text{shoulder width (pixels)}} \]
- Adjusted with a calibration factor for accuracy.
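
Both formulas are the same pinhole-camera relation and can be sketched in a few lines. The following is a minimal sketch under stated assumptions: the reference heights in REFERENCE_HEIGHTS_M are illustrative values rather than the project's actual table, the function names are hypothetical, and the calibration factor is assumed to be a simple multiplier (the project may apply it differently).

```python
# Illustrative reference heights in metres; the project's real table may differ.
REFERENCE_HEIGHTS_M = {"person": 1.7, "chair": 0.9, "bottle": 0.25}

SHOULDER_WIDTH_M = 0.40  # 40 cm average shoulder width, as stated above


def estimate_object_distance(label, bbox_height_px, focal_length_px):
    """Distance in metres from bounding box height, or None if unknown."""
    real_height_m = REFERENCE_HEIGHTS_M.get(label)
    if real_height_m is None or bbox_height_px <= 0:
        return None  # no reference height for this class, or a degenerate box
    return real_height_m * focal_length_px / bbox_height_px


def estimate_person_distance(shoulder_width_px, focal_length_px, calibration=1.0):
    """Distance in metres from shoulder width in pixels.

    The calibration factor is assumed to scale the result multiplicatively.
    """
    if shoulder_width_px <= 0:
        return None
    return calibration * SHOULDER_WIDTH_M * focal_length_px / shoulder_width_px
```

For example, with a focal length of 800 pixels, a bottle whose bounding box is 100 pixels tall would be estimated at about 2 m, and shoulders spanning 160 pixels would also give about 2 m. Classes missing from the reference table return None rather than a guessed distance.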