CCTV footages are often very long in duration but short in content. One usually spend a lot of time skipping most parts of footage before arriving at the content of interest. Therefore, a lot of time can be saved if the content of interest can be smartly summarized using artificial intelligence.
What it does
CCTV footage is read and in each frame, the object detection algorithm is used to identify a list of pre-defined objects of interest (car, person etc.). If any of the objects of interest is present in a given frame, the frame is kept and otherwise, it is discarded.
How I built it
Single Shot MultiBox Detector (SSD) model for object detection by Nvidia is used to identify the objects in each frame of the footage. In this work, the pre-trained model which is trained on COCO dataset, support GPU which speeds up the inference process.
Challenges I ran into
There are two main challenges. The first challenge is about reading footage frames and writing the summarized frame to a new video using CV2 because the input format of CV2 is different from that of the object detection model. So, input and output tensors must be properly formated. The second challenge is that running inference on each frame is slower than expected and does not take full advantage of the GPU. To resolve, the batch inference (multiple frames) is implemented to take advantage of matrix multiplication.
Accomplishments that I'm proud of
I am proud of the batch inference (multiple frames) implementation since this requires a significant understanding of how inputs/output works and how to write multiple frames in open-cv.
What I learned
I have learnt in-depth about how object detection, tensors, PyTorch and open-cv works.
What's next for CCTV Footage Summarization Using Pre-trained SSD
From this, we can extend: 1- to extract specific content from the footage (ex. counting persons/vehicles ) 2- to work with real-time video stream 3- to allow user to define the regions of interest to detect the objects.