Inspiration

Virtual meetings remove many of the subtle social cues that help people understand how a conversation is going. Facial expressions, tone shifts, and conversational pacing are often harder to interpret through a screen. For individuals who struggle with reading emotions—such as people with social communication difficulties or those on the autism spectrum—this can make online conversations especially challenging.

We wanted to build a tool that acts as a real-time emotional interpreter during video calls, helping users better understand how the other person might be feeling and how to respond appropriately.

Our goal with EmotiScan was to make digital communication more accessible, intuitive, and socially navigable.

What it does

EmotiScan is a Google Meet extension that analyzes facial expressions, voice tone, and conversational dynamics to estimate the emotional state of the person you are speaking with.

The system then provides live suggestions to help guide the user's response and maintain a smoother conversation.

Key features include:

Facial expression recognition to detect emotional signals

Audio tone analysis to identify stress, excitement, or frustration

Conversation metrics such as speaking time and word output

Real-time response suggestions to guide tone and phrasing

Compact Mode for minimal real-time feedback

Full Mode for deeper conversational insights

Post-call summaries that highlight emotional patterns and provide improvement suggestions for future conversations

After each meeting, EmotiScan generates a conversation report summarizing emotional shifts and communication balance.

How we built it

EmotiScan was built entirely in Python, which powers both the backend processing and the core analysis pipeline. The system combines computer vision, audio emotion recognition, and conversational analysis to estimate emotional signals during a video call and provide real-time feedback.

Facial Expression Recognition

For facial emotion detection, we used YOLOv8 with PyTorch to process video frames and detect faces in real time. The model analyzes facial features to estimate expressions such as happiness, confusion, frustration, or neutrality. YOLOv8 kept detection fast enough for live analysis.
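As a rough illustration, the detection step could look something like the sketch below. It is not EmotiScan's actual code. It assumes the `ultralytics` package and a YOLOv8 weights file trained to detect faces (the stock YOLOv8 weights are trained on COCO, which has no face class). The helper names, the 0.5 confidence cutoff, and the "largest box is the main speaker" rule are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def load_detector(weights_path: str):
    """Load a YOLOv8 model from a weights file (hypothetical face-trained weights)."""
    # Imported lazily so the pure-Python helpers below work without ultralytics.
    from ultralytics import YOLO
    return YOLO(weights_path)


def detect_boxes(model, frame, min_conf: float = 0.5) -> List[Box]:
    """Run one frame through the detector and keep boxes scoring at least min_conf."""
    result = model(frame)[0]             # one Results object per input image
    boxes = result.boxes.xyxy.tolist()   # corner coordinates per detection
    confs = result.boxes.conf.tolist()   # confidence score per detection
    return [tuple(b) for b, c in zip(boxes, confs) if c >= min_conf]


def largest_box(boxes: Sequence[Box]) -> Optional[Box]:
    """Pick the box with the largest area, assumed here to be the main speaker."""
    if not boxes:
        return None
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
```

The cropped region from `largest_box` would then be passed to whatever expression classifier is in use; the writeup does not say how that stage is implemented, so it is left out here.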

Audio Emotion Detection

To analyze emotional tone in speech, we trained a model using scikit-learn on voice emotion datasets from Kaggle. We extracted acoustic features such as pitch, energy, and speech patterns, which the model uses to estimate emotional tone during the conversation.
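To make the feature extraction step concrete, here is a minimal standard-library sketch of two of the features mentioned above: energy (as root-mean-square amplitude) and pitch (via a simple autocorrelation search). The function names, the 75 to 400 Hz search range, and the choice of autocorrelation are assumptions for illustration; the writeup does not say how the features were actually computed.

```python
import math
from typing import Dict, Sequence


def rms_energy(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of one audio frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples: Sequence[float], sample_rate: int,
                      fmin: float = 75.0, fmax: float = 400.0) -> float:
    """Estimate pitch as the lag with the highest autocorrelation in [fmin, fmax]."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # Returns 0.0 for silence or frames too short to search.
    return sample_rate / best_lag if best_lag else 0.0


def extract_features(samples: Sequence[float], sample_rate: int) -> Dict[str, float]:
    """Bundle the per-frame features into one record."""
    return {
        "pitch_hz": estimate_pitch_hz(samples, sample_rate),
        "rms_energy": rms_energy(samples),
    }
```

Feature records like these, computed per frame and labeled with the dataset's emotion tags, are the kind of input a scikit-learn classifier could be trained on. The specific estimator is not named in the writeup, so none is shown.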

Conversation Analysis

In addition to facial and vocal signals, EmotiScan tracks conversation dynamics like speaking time, word output, and response length. These metrics help identify whether the conversation is balanced or if one participant may be dominating or disengaging.
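The metrics described above are simple to compute once the call has been split into timed, speaker-labeled utterances. The sketch below assumes such a transcript already exists; the `Utterance` record and the `talk_share` measure are hypothetical names chosen for the example, not part of EmotiScan.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Utterance:
    speaker: str
    start_s: float  # seconds since call start
    end_s: float
    text: str


def conversation_metrics(utterances: Iterable[Utterance]) -> Dict[str, Dict[str, float]]:
    """Per-speaker speaking time, word count, average turn length, and talk share."""
    totals = defaultdict(lambda: {"speaking_time_s": 0.0, "words": 0, "turns": 0})
    for u in utterances:
        t = totals[u.speaker]
        t["speaking_time_s"] += max(0.0, u.end_s - u.start_s)
        t["words"] += len(u.text.split())
        t["turns"] += 1

    overall = sum(t["speaking_time_s"] for t in totals.values())
    report = {}
    for speaker, t in totals.items():
        report[speaker] = {
            "speaking_time_s": t["speaking_time_s"],
            "words": t["words"],
            "avg_words_per_turn": t["words"] / t["turns"],
            # Fraction of all speaking time taken by this speaker.
            "talk_share": t["speaking_time_s"] / overall if overall else 0.0,
        }
    return report
```

A `talk_share` far above an even split suggests one participant is dominating, while very short average turns from the other side may hint at disengagement. Any thresholds for those calls would need tuning and are not specified in the writeup.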

Real-Time Feedback

All signals are combined to estimate the overall emotional context of the conversation. During the call, users receive feedback through two interface modes:

Compact Mode for simple real-time cues

Full Mode for deeper conversational insights
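One simple way to combine per-modality estimates, shown here only as an illustration, is a weighted average of emotion scores with a "mixed" label when the top two candidates are too close to call. The weights, the 0.15 margin, and the function names are assumptions; the writeup does not describe the fusion method EmotiScan uses.

```python
from typing import Dict, Mapping, Tuple

Scores = Mapping[str, float]  # emotion label -> score in [0, 1]


def fuse_scores(per_modality: Mapping[str, Scores],
                weights: Mapping[str, float]) -> Dict[str, float]:
    """Weighted sum of per-modality emotion scores, renormalized to sum to 1."""
    fused: Dict[str, float] = {}
    for modality, scores in per_modality.items():
        w = weights.get(modality, 0.0)  # modalities without a weight are ignored
        for emotion, p in scores.items():
            fused[emotion] = fused.get(emotion, 0.0) + w * p
    total = sum(fused.values())
    return {e: v / total for e, v in fused.items()} if total > 0 else {}


def top_emotion(fused: Scores, min_margin: float = 0.15) -> Tuple[str, float]:
    """Return (label, score); label is 'mixed' if the top two are within min_margin."""
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return ("unknown", 0.0)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return ("mixed", ranked[0][1])
    return ranked[0]
```

For example, a face model leaning "happy" and a voice model leaning "frustrated" would come out as "mixed" rather than forcing a single label, which is one way to avoid showing the user a confident but misleading cue.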

After the call ends, EmotiScan generates a summary report highlighting emotional patterns and suggestions for improving future conversations.
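A summary of this kind can be derived from a timeline of emotion labels sampled during the call. The sketch below assumes the timeline is a list of `(seconds, label)` pairs; that data shape and the report fields are hypothetical, chosen only to show the idea.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, str]  # (seconds since call start, emotion label)


def emotional_shifts(timeline: Sequence[Sample]) -> List[Tuple[float, str, str]]:
    """List (time, previous, new) for every label change between consecutive samples."""
    shifts = []
    for (_, prev), (t, cur) in zip(timeline, timeline[1:]):
        if cur != prev:
            shifts.append((t, prev, cur))
    return shifts


def summarize_call(timeline: Sequence[Sample]) -> Dict[str, object]:
    """Dominant emotion, share of samples per emotion, and the list of shifts."""
    counts = Counter(label for _, label in timeline)
    n = len(timeline)
    return {
        "dominant_emotion": counts.most_common(1)[0][0] if n else None,
        "emotion_share": {label: c / n for label, c in counts.items()},
        "shifts": emotional_shifts(timeline),
    }
```

The shift list gives the moments worth reviewing after the call, such as the point where the other person moved from "happy" to "frustrated".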

Challenges we ran into

One of the biggest challenges was interpreting emotions accurately in real time. Facial expressions alone can be misleading, so we needed to combine multiple signals.

Other challenges included:

Processing video frames quickly enough for live feedback

Extracting useful audio features in noisy environments

Designing a UI that is helpful but not distracting during a call

Handling cases where emotions are ambiguous or mixed

Balancing speed, accuracy, and usability was one of the hardest parts of the project.
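A common way to trade accuracy for speed in live video is to analyze only a subset of frames. The writeup does not say whether EmotiScan does this, so the sketch below is purely illustrative. It assumes frames arrive as `(timestamp_ms, frame)` pairs and that a fixed analysis rate (`max_fps`, a hypothetical parameter) is acceptable.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")


def throttle_frames(frames: Iterable[Tuple[int, Frame]],
                    max_fps: float) -> Iterator[Tuple[int, Frame]]:
    """Yield only frames at least 1000/max_fps ms after the last yielded frame."""
    if max_fps <= 0:
        raise ValueError("max_fps must be positive")
    min_gap_ms = 1000.0 / max_fps
    last_ms = None
    for t_ms, frame in frames:
        # Always keep the first frame; afterwards drop frames that arrive too soon.
        if last_ms is None or t_ms - last_ms >= min_gap_ms:
            last_ms = t_ms
            yield t_ms, frame
```

With a camera delivering 10 frames per second and `max_fps=5`, every other frame is analyzed, halving the detection workload at the cost of slower reaction to brief expressions.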

Accomplishments that we're proud of

With EmotiScan, we created a real-time Google Meet extension that helps users interpret emotional cues during video calls—something especially useful for individuals who struggle with reading emotions or social context.

We successfully combined facial expression recognition, voice tone analysis, and conversational metrics to generate actionable feedback during live calls. Users can see simple cues in Compact Mode or switch to Full Mode for deeper insights. After each call, EmotiScan produces a summary report that highlights emotional trends and offers recommendations to improve future interactions.

Through this project, we demonstrated the ability to:

Build a multimodal AI system integrating computer vision, audio analysis, and conversation tracking

Provide live, context-aware feedback to guide social interactions

Deliver a polished, usable interface within a real-world platform (Google Meet)

Ultimately, we did more than detect emotions: we made digital communication more accessible and intuitive, giving users tools to navigate conversations confidently and empathetically.

What we learned

Through building EmotiScan, we learned:

How to integrate computer vision and audio analysis pipelines

How emotional signals can be multimodal, combining facial and vocal cues

The challenges of building real-time AI systems

The importance of designing AI tools that assist rather than overwhelm users

We also gained experience building browser extensions and live feedback systems for real-world applications.

What's next for EmotiScan

Future improvements could include:

Personalized emotion models that adapt to specific users

Support for more video platforms

Better contextual NLP analysis of conversation content

Training models on larger emotional datasets

Our long-term vision is to make EmotiScan a tool that helps people build stronger communication skills and navigate digital conversations with confidence.
