• In schools, this system can be used on CCTV footage to identify cases of bullying, alienation and ragging among students.
  • In offices/factories/warehouses, a similar analysis of CCTV videos can help identify potential collaborators or well-matched teams based on their positive emotional chemistry.
  • For researchers in academia who want to understand the social dynamics of characters in media.
  • Above all, it is something of a by-product of our desire to work with Azure services, which we used extensively.

What it does

Given a video from the TV series F.R.I.E.N.D.S, the system caters to queries like:

  • Which characters are present in each video frame?
  • What is the emotional profile of each character when they are interacting with each other?
  • How does each character’s emotional profile impact their relationships with the other characters?
  • What is the flow of sentiment across the timeline?
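
As a rough sketch of how queries like these can be answered, per-frame face detections (a character name plus the per-face emotion scores the Face API returns) can be averaged into an emotional profile for each character. The `frame_detections` structure and the function name below are a hypothetical simplification of our intermediate data, not the exact format used in the project:

```python
from collections import defaultdict

def emotion_profiles(frame_detections):
    """Aggregate per-frame detections into an average emotional profile
    per character.

    `frame_detections` is a list with one dict per video frame, mapping a
    character name to a dict of emotion scores (hypothetical layout,
    loosely modelled on the Face API's per-face emotion output).
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for frame in frame_detections:
        for character, emotions in frame.items():
            counts[character] += 1
            for emotion, score in emotions.items():
                totals[character][emotion] += score
    # Average each emotion score over the frames the character appears in.
    return {
        character: {e: total / counts[character] for e, total in emos.items()}
        for character, emos in totals.items()
    }
```

The same per-frame records, kept in timeline order, also give the flow of sentiment across the episode.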

How we built it

  • Using Azure’s Custom Vision service, we trained a classifier to identify the lead characters of the T.V. series.
  • Using OpenCV, we extracted each frame of the video.
  • Using Azure’s Face API, we detected the faces in each frame. The service returns the coordinates of each face and its respective expression.
  • Next, we fed the detected faces into our Custom Vision classifier to map each face to one of the characters.
  • In parallel, we extracted the audio from the video using ffmpeg (invoked from Python). The audio file is then fragmented using the heuristic that an average human speaks an English sentence in approximately 5 seconds.
  • Using Azure’s Speech API, we generated a transcript for each audio fragment.
  • Using Azure’s Sentiment service, we extracted the sentiment associated with each dialogue.
  • We aggregated this information to answer the desired queries.
  • We developed a JavaScript-based front-end to visualise the results.
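
A minimal sketch of the audio preprocessing steps above, assuming ffmpeg is available on the PATH; the function names and the 16 kHz mono WAV settings are our illustrative choices:

```python
import subprocess

def extract_audio(video_path, wav_path):
    """Pull the audio track out of a video with ffmpeg (must be on PATH).
    16 kHz mono WAV is a common input format for speech-to-text services."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )

def audio_chunks(duration_s, chunk_s=5.0):
    """Split an audio duration into ~5 s windows, following the heuristic
    that an average human speaks an English sentence in about 5 seconds."""
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks
```

Each `(start, end)` window can then be cut out with a second ffmpeg call (using its `-ss`/`-to` options) and sent to the Speech API for transcription.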

Challenges we ran into

  • Generating training data for the image classifier.
  • The TV series F.R.I.E.N.D.S ran for 10 years, and the characters underwent significant physical changes over that span, requiring a more complex model. Solution: active learning.
  • The Custom Vision API doesn’t let the user specify which region of an image maps to which tag. Hence, for character identification, the model ended up using non-facial features such as body structure.
  • Nobody in the team had prior experience in front-end development.

Accomplishments that we're proud of

  • Developed a highly accurate classifier for identifying characters.
  • Being able to collaborate with people with similar passions.
  • Finishing a decently large project without falling asleep.

What we learned

  • The coolness and accuracy of Azure APIs.
  • How to integrate various front-end and back-end technologies into a coherent, practical project.
  • How to utilise the strengths of each team member to make the most of the hackathon.

What's next for ViCSoN: Video-based Character Social Network

  • Using Azure’s Speaker Recognition to map dialogues to characters.
  • Exploration of semi-supervised learning to boost the training process.
  • Trying a similar analysis on longer videos and CCTV footage.
  • Exposing the backend services as APIs.