How did this idea come to life?

It just came to mind while I was sitting at the opening ceremony, and I thought it would be a cool challenge (with good applications, too) to take on with my friends.

How does this work?

The user gives us a video link through our website. The algorithm then generates a set of scenes, each with a list of words that describe it, ordered by descending relevance within that scene. Our ultimate goal, however, is a single paragraph that describes the entire video: its theme, so to speak.
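
As a rough illustration of the per-scene output, here is a minimal sketch of how the words of one scene could be ranked. It assumes each frame comes with a word-to-score mapping and that relevance is the average score across the frames of the scene; the function name rank_scene_words and the averaging rule are assumptions made for this example, not necessarily what our scripts do.

```python
from collections import defaultdict


def rank_scene_words(frame_labels):
    """Rank the words of one scene by descending relevance.

    frame_labels: list of dicts, one per frame, mapping word -> score.
    Relevance here is the score averaged over all frames in the scene
    (an assumption for this sketch); a word missing from a frame counts as 0.
    """
    totals = defaultdict(float)
    for labels in frame_labels:
        for word, score in labels.items():
            totals[word] += score

    frame_count = len(frame_labels)
    averaged = ((word, total / frame_count) for word, total in totals.items())
    # Highest relevance first; ties are broken alphabetically.
    return sorted(averaged, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical scores for a two-frame scene.
frames = [{"beach": 0.9, "sand": 0.8}, {"beach": 0.7, "sky": 0.6}]
for word, relevance in rank_scene_words(frames):
    print(f"{word}: {relevance:.2f}")
```

Averaging favors words that persist across the whole scene over words that show up strongly in a single frame.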

How was it done, at least in the pre-alpha version?

We sample the video every n frames, where n is a step we specify. Each sampled frame is fed into the Google Cloud Vision API to detect features, which gives us a list of words describing that frame. After that, we group frames that share common features. The procedure: each frame starts in its own "scene," then two adjacent scenes are compared and their error (the amount of mismatch between their words) is calculated. If the error is significant, we treat it as a scene change; otherwise, it is only a relative change (camera movement, illumination, ...), so the two scenes are combined. This process continues until the end of the list of scenes.
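
The grouping step can be sketched roughly as follows. This is a simplified illustration, not our actual scripts: it skips the Vision API call and starts from word sets that are already available, it uses the Jaccard distance as the error, and it picks 0.6 as the threshold. The names sample_indices, mismatch, and group_into_scenes, the error formula, and the threshold value are all assumptions made for this example.

```python
def sample_indices(total_frames, step):
    """Indices of the frames we keep: every `step`-th frame, starting at 0."""
    return list(range(0, total_frames, step))


def mismatch(words_a, words_b):
    """Jaccard distance between two word sets: 0.0 = identical, 1.0 = disjoint."""
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def group_into_scenes(frame_words, threshold=0.6):
    """Merge adjacent frames into scenes in a single left-to-right pass.

    frame_words: iterable of word collections, one per sampled frame.
    A frame joins the current scene when its mismatch with that scene is
    below `threshold`; otherwise it starts a new scene.
    """
    scenes = []
    for words in frame_words:
        words = set(words)
        if scenes and mismatch(scenes[-1], words) < threshold:
            scenes[-1] |= words      # relative change: combine the two
        else:
            scenes.append(words)     # significant error: scene change
    return scenes


# Hypothetical word sets for four sampled frames.
frames = [
    {"beach", "sand", "sea"},
    {"beach", "sand", "sky"},
    {"car", "road"},
    {"car", "road", "tree"},
]
for number, scene in enumerate(group_into_scenes(frames), start=1):
    print(number, sorted(scene))
```

Because a merged scene keeps the union of its frames' words, a long scene accumulates vocabulary and becomes harder to match against a single frame; comparing only against the most recent frame would be an alternative.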

Challenges I ran into

Python was new to some members of the group, and the hard part is always putting things together and then dividing them again into reusable modules. Still, we overcame most of the obstacles. Our goal was met, but there are lots of issues (or ideas, I would say) on GitHub that we have yet to solve.

Accomplishments that we're proud of

Bringing everyone's code together and rewriting it into clean, easy-to-read scripts. As a team, we were happy that we met our goal on time.

What we learned

We learned to work as a team, communicate, and support each other when needed. We also saw the reality of programming, where many problems arise; it is fairly different from studying in class, where the homework is well-structured and easy to understand.

Semantic bugs!!

  • Outliers, e.g. logos or watermarks, might appear throughout the video and become the dominant words for each scene. This will create problems when we get to the stage of turning groups of words into context, or a sentence, since the outliers don't belong. However, simply counting the most frequent word and eliminating it won't work, because a common setting, e.g. sand (if shot in a desert) or green or tree (if shot in a forest), would be eliminated too. It would be ironic for the words that describe the entire video to get removed from the sets.
  • Scenes are not actually scenes at the moment, since we only have one level of grouping, which is putting frames together. We cannot really call that group a scene yet, since shots within a scene can differ too! Thus, several more layers of processing should be applied, stopping once a target number of scenes is reached.
  • How many frames make a good interval? Some action movies might require a small frame step, but a slow movie might prefer a bigger one.
  • How high should the threshold be to decide whether two scenes, or shots, or frames, are significantly different, such that they are two completely unique scenes?
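
To make the outlier problem from the first bullet concrete, here is a small sketch that measures, for each word, the fraction of scenes it appears in. The function name scene_coverage and the sample data are made up for this example. A watermark word and a setting word both end up with full coverage, so frequency alone cannot tell them apart.

```python
from collections import Counter


def scene_coverage(scenes):
    """Map each word to the fraction of scenes in which it appears."""
    counts = Counter()
    for words in scenes:
        counts.update(set(words))  # count each word at most once per scene
    total = len(scenes)
    return {word: count / total for word, count in counts.items()}


# Hypothetical desert video with a channel logo burned into every frame.
scenes = [
    {"logo", "sand", "camel"},
    {"logo", "sand", "tent"},
    {"logo", "sand", "car"},
]
coverage = scene_coverage(scenes)
# Both words appear in every scene, so dropping the most frequent word
# could remove "sand" just as easily as "logo".
print(coverage["logo"], coverage["sand"], coverage["camel"])
```

Telling the two apart would need some extra signal beyond word counts, which we have not worked out yet.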


There are plenty of applications our project could expand to offer:

  • Video descriptions can be auto-generated for blind people.
  • Detect videos for copyright concerns.
  • Create auto-generated tags for YouTube, since clickbait (when the title and video don't correlate) happens very often now. As far as I know, the tags that YouTube uses at the moment are specified by the users. Thus, we could help reduce this kind of situation.
  • Give a rough auto-generated description of an input movie.
  • And many more... These are as far as I can imagine.