Often, when watching a lecture video, the content that the lecturer writes on the board is quite representative of the content of the lecture as a whole. This content, if provided in the form of images, can become a helpful tool for students revising concepts.

What it does

My project takes lecture videos as input and generates summary images containing all the content that was written on the board. A student can also click on the board content to jump the video to the timestamp at which that particular content was written.

How I built it

The video is first split into frames by sampling at 1 fps.

Challenges I ran into

Since I did not have access to a GPU locally, testing the whole pipeline was difficult: inference time on my MacBook Air was around 8-10 seconds per frame, and I needed to annotate over a thousand images. I ended up relying on Google Colab to generate the annotations and then using them locally.



posted an update

At the time of submission, due to some issues, I wasn't able to completely write the description section for the project, so I am posting it as an update.

Accomplishments that we're proud of

I was able to create a dataset diverse enough that my model could annotate videos taken under different lighting conditions and containing different types of handwritten content, such as geometric shapes and formulas.

Reconstructing images from the image crops in a temporal group proved to be a non-trivial task, since the bounding boxes are at slightly different positions. I accomplished it by first determining the coordinates of a reconstructed bounding box that encapsulates all bounding boxes within the group. The image crops from the bounding boxes were added to a sum matrix (with the dimensions of the reconstructed bounding box) while simultaneously maintaining a count matrix recording the number of times an addition was made at each pixel index. Dividing the sum matrix by the count matrix gives the average, which is the reconstructed image.
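A minimal NumPy sketch of this averaging step (the `(box, crop)` data layout is assumed for illustration, not taken from the project's code):

```python
import numpy as np

def reconstruct_group(detections):
    """Average the crops in one temporal group into a single image.

    `detections` is a list of (box, crop) pairs, where box = (x0, y0, x1, y1)
    in frame coordinates and crop is a 2-D grayscale array of that size.
    (Hypothetical data layout for illustration.)
    """
    # Enclosing box that encapsulates every detection in the group.
    x0 = min(b[0] for b, _ in detections)
    y0 = min(b[1] for b, _ in detections)
    x1 = max(b[2] for b, _ in detections)
    y1 = max(b[3] for b, _ in detections)

    total = np.zeros((y1 - y0, x1 - x0), dtype=np.float64)  # sum matrix
    count = np.zeros_like(total)                            # additions per pixel

    for (bx0, by0, bx1, by1), crop in detections:
        total[by0 - y0:by1 - y0, bx0 - x0:bx1 - x0] += crop
        count[by0 - y0:by1 - y0, bx0 - x0:bx1 - x0] += 1

    # Average only where at least one crop contributed; leave the rest 0.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)
```

Pixels covered by only one crop keep that crop's value, while overlapping regions are averaged, which is what suppresses transient occlusions.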

What we learned

Developing a Python-based GUI application.

The ease with which an object detection model can be trained with PyTorch.

What's next for Whiteboard content summarizer

When a dataset of annotated video lectures with different colored boards becomes available, the project can be scaled to any kind of board with minor tweaks to the binarization section of the pipeline.

Camera movement can also be accounted for by changing the way we define spatial groups.


posted an update

At the time of submission, due to some issues, I wasn't able to completely write the description section for the project, so I am posting it as an update.

How I built it

First, the video is split into frames by sampling at 1 fps.
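This sampling step reduces to keeping every Nth frame index; with OpenCV one would read the native frame rate from the capture object (`cv2.VideoCapture(path).get(cv2.CAP_PROP_FPS)`) and keep only these indices. `sample_indices` is a hypothetical helper for illustration, not the project's code:

```python
def sample_indices(total_frames, native_fps, target_fps=1):
    """Indices of the frames to keep when downsampling a video to `target_fps`."""
    step = max(1, round(native_fps / target_fps))
    return list(range(0, total_frames, step))
```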


The frames are then annotated using an object detection model that I trained. Using Facebook's Detectron2 system, I trained a Faster R-CNN model on a dataset of annotated whiteboard content. The dataset was prepared by selecting 550 annotated images from a public repository, making sure that different lighting conditions were included. The model was trained for 1500 iterations, after which it gave an average precision of 0.83, which was enough for the project because other techniques applied later make up for any errors. After annotation, each image has an associated list of bounding boxes for written content on the board.
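A training-configuration sketch of this step, assuming the usual Detectron2 fine-tuning setup; the dataset name `whiteboard_train` and the single-class head are illustrative placeholders, not the author's exact settings:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# Start from a Faster R-CNN config and checkpoint in the Detectron2 model zoo.
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("whiteboard_train",)  # assumed registered dataset name
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1         # one class: board writing
cfg.SOLVER.MAX_ITER = 1500                  # matches the 1500 iterations above

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```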


The frames are then binarized by first applying bilateral and median filters to them to generate a background mask. This mask is then subtracted from the image, which gets rid of most of the background and of the lecturer in a frame. The resulting image is binarized using Otsu's technique.
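Otsu's technique picks the threshold that maximises the between-class variance of the grayscale histogram. A self-contained NumPy sketch of the threshold selection (in practice `cv2.threshold` with the `THRESH_OTSU` flag does this in one call):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: choose the threshold maximising between-class variance.

    `gray` is a 2-D uint8 array; returns threshold t such that pixels > t
    are treated as foreground.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)  # total intensity mass

    best_t, best_var = 0, -1.0
    w0 = 0.0    # background pixel count so far
    sum0 = 0.0  # background intensity sum so far
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```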

Creating spatiotemporal groups:

Spatial groups refer to parts of the written content that occupy the same space on the board. They are determined by putting all bounding boxes with an IoU greater than 0.5 into the same spatial group. Bounding boxes in the same spatial group are further split into temporal groups if there is a gap of more than 10 seconds between them.
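A sketch of the two grouping rules, assuming each detection carries a timestamp and an `(x0, y0, x1, y1)` box (hypothetical data layout for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def split_temporal(group, gap=10):
    """Split one spatial group, given as a time-sorted list of
    (timestamp, box) pairs, into temporal groups wherever consecutive
    detections are more than `gap` seconds apart."""
    groups = [[group[0]]]
    for det in group[1:]:
        if det[0] - groups[-1][-1][0] > gap:
            groups.append([])  # start a new temporal group
        groups[-1].append(det)
    return groups
```

Boxes with IoU above the 0.5 threshold would be assigned to the same spatial group before `split_temporal` is applied within each group.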


There are errors associated with object detection, and there are also frames in which the lecturer obstructs the written content. To tackle this problem, the image crops from the bounding boxes in the same temporal group are averaged to give one reconstructed image for every temporal group.


Another problem is that the same content can get split into different temporal groups if it is obstructed by the lecturer for over 10 seconds. This is rectified by comparing the reconstructed images of consecutive temporal groups using perceptual hashing. If the two images are similar, the groups are merged.
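As an illustration, a simple perceptual hash (average hash) can stand in for whichever hashing library the project used: downsample each reconstructed image to a small grid, threshold at the mean, and compare hashes by Hamming distance:

```python
import numpy as np

def average_hash(img, size=8):
    """Average hash: block-mean downsample to size x size, threshold at the
    mean. Returns a flat boolean array of size*size bits. (A stand-in for
    whatever perceptual-hashing library was actually used.)"""
    h, w = img.shape
    # Crude block-mean downsampling to a size x size grid.
    small = img[:h - h % size, :w - w % size].reshape(
        size, h // size, size, w // size).mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def similar(img_a, img_b, max_bits=5):
    """Merge candidates: hashes differing in at most `max_bits` bits."""
    return int(np.sum(average_hash(img_a) != average_hash(img_b))) <= max_bits
```

The bit budget `max_bits` is an assumed tolerance; in practice it would be tuned on real reconstructed images.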

Conflict minimization:

Finally, to generate the summary I used the approach proposed by this paper. Any two temporal groups within the same spatial group are considered to be in conflict. A split interval is a time interval such that splitting the video within it resolves a conflict. All split intervals are found, the interval that resolves the most conflicts is chosen for splitting the video, and a split index is recorded. This is repeated until there are no more conflicts. One summary image is then generated for each interval marked by the split indices. A reconstructed image is added to a summary image if it has not already been added to any image or if it exists for over 50% of the duration of the interval.
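A simplified sketch of the greedy selection, representing each split interval by a single candidate time and each conflict by the set of times that would resolve it (this data layout is assumed for illustration and is not the paper's exact formulation; every conflict is assumed to have at least one candidate):

```python
def choose_splits(conflicts):
    """Greedy split selection.

    `conflicts` maps each conflicting pair of temporal groups to the set of
    candidate split times that would resolve it. Repeatedly pick the time
    resolving the most remaining conflicts until none remain.
    """
    remaining = dict(conflicts)
    splits = []
    while remaining:
        # Count how many remaining conflicts each candidate time resolves.
        votes = {}
        for times in remaining.values():
            for t in times:
                votes[t] = votes.get(t, 0) + 1
        best = max(votes, key=votes.get)
        splits.append(best)
        # Drop every conflict that this split resolves.
        remaining = {pair: ts for pair, ts in remaining.items()
                     if best not in ts}
    return sorted(splits)
```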
