Inspiration
Virtual meetings remove many of the subtle social cues that help people understand how a conversation is going. Facial expressions, tone shifts, and conversational pacing are often harder to interpret through a screen. For individuals who struggle with reading emotions—such as people with social communication difficulties or those on the autism spectrum—this can make online conversations especially challenging.
We wanted to build a tool that acts as a real-time emotional interpreter during video calls, helping users better understand how the other person might be feeling and how to respond appropriately.
Our goal with EmotiScan was to make digital communication more accessible, intuitive, and socially navigable.
What it does
EmotiScan is a Google Meet extension that analyzes facial expressions, voice tone, and conversational dynamics to estimate the emotional state of the person you are speaking with.
The system then provides live suggestions to help guide the user's response and maintain a smoother conversation.
Key features include:
- Facial expression recognition to detect emotional signals
- Audio tone analysis to identify stress, excitement, or frustration
- Conversation metrics such as speaking time and word output
- Real-time response suggestions to guide tone and phrasing
- Compact Mode for minimal real-time feedback
- Full Mode for deeper conversational insights
- Post-call summaries that highlight emotional patterns and provide improvement suggestions for future conversations
After each meeting, EmotiScan generates a conversation report summarizing emotional shifts and communication balance.
How we built it
EmotiScan's backend processing and core analysis pipeline were built entirely in Python. The system combines computer vision, audio emotion recognition, and conversational analysis to estimate emotional signals during a video call and provide real-time feedback.
Facial Expression Recognition
For facial emotion detection, we used YOLOv8 with PyTorch to process video frames and detect faces in real time. The model analyzes facial features to estimate expressions such as happiness, confusion, frustration, or neutrality. YOLOv8 let us run detection at the speed required for live analysis.
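Per-frame expression labels from a live detector tend to flicker between neighboring frames. One common way to steady them before showing feedback is a sliding-window majority vote. The sketch below is an assumption about how that could look, not a description of EmotiScan's actual code: the `ExpressionSmoother` class, the window size, and the sample labels are all hypothetical.

```python
from collections import Counter, deque


class ExpressionSmoother:
    """Stabilize noisy per-frame expression labels with a sliding-window vote."""

    def __init__(self, window_size=5):
        # Only the most recent `window_size` labels take part in the vote.
        self.window = deque(maxlen=window_size)

    def update(self, label):
        """Add the newest per-frame label and return the current majority label."""
        self.window.append(label)
        counts = Counter(self.window)
        best = max(counts.values())
        # Break ties in favor of the most recent label among the leaders.
        for recent in reversed(self.window):
            if counts[recent] == best:
                return recent


# Hypothetical per-frame labels, as a detector might emit them.
frame_labels = [
    "neutral", "neutral", "confusion", "neutral",
    "confusion", "confusion", "confusion",
]
smoother = ExpressionSmoother(window_size=5)
smoothed = [smoother.update(label) for label in frame_labels]
```

With this input the smoothed label stays "neutral" through the isolated "confusion" frame and switches only once "confusion" holds a majority of the window.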
Audio Emotion Detection
To analyze emotional tone in speech, we trained a model using scikit-learn on voice emotion datasets from Kaggle. We extracted acoustic features such as pitch, energy, and speech patterns, which the model uses to estimate emotional tone during the conversation.
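To make the feature extraction step concrete, here is a minimal standard-library sketch of two of the acoustic features mentioned above: energy (as root-mean-square amplitude) and pitch (via a basic autocorrelation search). The function names, the 80 to 400 Hz search range, and the synthetic test tone are assumptions for illustration. The project's real feature extraction and its scikit-learn model are not shown and may differ.

```python
import math


def rms_energy(samples):
    """Root-mean-square amplitude of a non-empty audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples, sample_rate, min_hz=80.0, max_hz=400.0):
    """Estimate the fundamental frequency by picking the autocorrelation peak."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # None means no positive correlation was found in the search range.
    return sample_rate / best_lag if best_lag else None


# Synthetic 0.1 s frame: a 200 Hz tone at amplitude 0.5, sampled at 16 kHz.
SAMPLE_RATE = 16000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(1600)]

features = {
    "energy": rms_energy(frame),
    "pitch_hz": estimate_pitch_hz(frame, SAMPLE_RATE),
}
```

A feature dictionary like this, computed per short frame, is the kind of input a classifier could then be trained on. Pure-Python loops are used here only to keep the sketch dependency-free.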
Conversation Analysis
In addition to facial and vocal signals, EmotiScan tracks conversation dynamics like speaking time, word output, and response length. These metrics help identify whether the conversation is balanced or if one participant may be dominating or disengaging.
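The conversation metrics described above can be sketched in a few lines, assuming each utterance arrives with a speaker label, start and end times in seconds, and a transcript. The `Utterance` type, the field names, and the sample dialogue are hypothetical. How EmotiScan actually obtains speaker turns and transcripts is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start_s: float
    end_s: float
    text: str


def conversation_metrics(utterances):
    """Aggregate speaking time, word output, and response length per speaker."""
    stats = {}
    for u in utterances:
        s = stats.setdefault(
            u.speaker, {"speaking_time_s": 0.0, "word_count": 0, "turns": 0}
        )
        s["speaking_time_s"] += u.end_s - u.start_s
        s["word_count"] += len(u.text.split())
        s["turns"] += 1
    total_time = sum(s["speaking_time_s"] for s in stats.values())
    for s in stats.values():
        # Average response length, measured in words per turn.
        s["avg_words_per_turn"] = s["word_count"] / s["turns"]
        # Fraction of all talk time taken by this speaker.
        s["share_of_talk_time"] = s["speaking_time_s"] / total_time if total_time else 0.0
    return stats


call = [
    Utterance("user", 0.0, 12.0, "I think we should move the deadline to next week because testing is behind"),
    Utterance("other", 12.0, 14.0, "Okay"),
    Utterance("user", 14.0, 24.0, "Great I will update the plan and send it around today"),
    Utterance("other", 24.0, 26.0, "Sure"),
]
metrics = conversation_metrics(call)
```

In this example the user holds most of the talk time while the other participant answers in single words, which is the kind of imbalance these metrics are meant to surface.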
Real-Time Feedback
All signals are combined to estimate the overall emotional context of the conversation. During the call, users receive feedback through two interface modes:
- Compact Mode for simple real-time cues
- Full Mode for deeper conversational insights
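One simple way to combine signals is a weighted average of per-modality probabilities, with a fallback label when the top two emotions are too close to call. The sketch below assumes the face and voice models each output a probability per emotion label. The weights, the `mixed_margin` threshold, and both function names are hypothetical choices for illustration. Conversation metrics are left out for brevity, and the project's real fusion logic may differ.

```python
def fuse_signals(face_probs, voice_probs, face_weight=0.6, voice_weight=0.4, mixed_margin=0.1):
    """Weighted late fusion of two non-empty {label: probability} dictionaries."""
    labels = set(face_probs) | set(voice_probs)
    combined = {
        label: face_weight * face_probs.get(label, 0.0)
        + voice_weight * voice_probs.get(label, 0.0)
        for label in labels
    }
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    top_label, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    # If the two strongest emotions are nearly tied, report the state as mixed.
    if top_score - runner_up_score < mixed_margin:
        top_label = "mixed"
    return top_label, combined


def render_feedback(label, combined, mode="compact"):
    """Compact mode shows only the label; full mode adds the per-label scores."""
    if mode == "compact":
        return label
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    details = ", ".join(f"{name}: {score:.2f}" for name, score in ranked)
    return f"{label} ({details})"


face = {"happiness": 0.7, "neutral": 0.2, "frustration": 0.1}
voice = {"happiness": 0.5, "neutral": 0.3, "frustration": 0.2}
label, combined = fuse_signals(face, voice)
compact_text = render_feedback(label, combined, mode="compact")
full_text = render_feedback(label, combined, mode="full")
```

Keeping the fused scores alongside the label lets the same result feed both interface modes without recomputing anything.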
After the call ends, EmotiScan generates a summary report highlighting emotional patterns and suggestions for improving future conversations.
Challenges we ran into
One of the biggest challenges was interpreting emotions accurately in real time. Facial expressions alone can be misleading, so we needed to combine multiple signals.
Other challenges included:
- Processing video frames quickly enough for live feedback
- Extracting useful audio features in noisy environments
- Designing a UI that is helpful but not distracting during a call
- Handling cases where emotions are ambiguous or mixed
Balancing speed, accuracy, and usability was one of the hardest parts of the project.
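One common way to trade accuracy for speed in a live pipeline is to analyze only a subset of incoming frames. The sketch below caps analysis at a target rate and skips frames that arrive too soon after the last analyzed one. The `FrameThrottle` class, the 5-per-second cap, and the 30 fps stream are illustrative assumptions, not EmotiScan's actual settings.

```python
class FrameThrottle:
    """Decide which incoming frames to analyze so analysis stays under a target rate."""

    def __init__(self, max_analyses_per_second=5):
        # Integer milliseconds avoid floating-point drift when comparing timestamps.
        self.min_interval_ms = int(1000 / max_analyses_per_second)
        self.last_analyzed_ms = None

    def should_analyze(self, timestamp_ms):
        """Return True if enough time has passed since the last analyzed frame."""
        if (
            self.last_analyzed_ms is None
            or timestamp_ms - self.last_analyzed_ms >= self.min_interval_ms
        ):
            self.last_analyzed_ms = timestamp_ms
            return True
        return False


throttle = FrameThrottle(max_analyses_per_second=5)
# One second of a 30 fps stream, with timestamps rounded to whole milliseconds.
timestamps_ms = [round(i * 1000 / 30) for i in range(30)]
analyzed_ms = [t for t in timestamps_ms if throttle.should_analyze(t)]
```

Here only 5 of the 30 frames reach the model. Lowering the cap frees more processing time per frame at the cost of slower reaction to expression changes.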
Accomplishments that we're proud of
With EmotiScan, we created a real-time Google Meet extension that helps users interpret emotional cues during video calls—something especially useful for individuals who struggle with reading emotions or social context.
We successfully combined facial expression recognition, voice tone analysis, and conversational metrics to generate actionable feedback during live calls. Users can see compact, simple cues in real time or dive into a full mode for deeper insights. After each call, EmotiScan produces a summary report, highlighting emotional trends and offering recommendations to improve future interactions.
Through this project, we demonstrated the ability to:
- Build a multimodal AI system integrating computer vision, audio analysis, and conversation tracking
- Provide live, context-aware feedback to guide social interactions
- Deliver a polished, usable interface within a real-world platform (Google Meet)
Ultimately, we accomplished more than just detecting emotions—we made digital communication more accessible and intuitive, giving users tools to navigate conversations confidently and empathetically.
What we learned
Through building EmotiScan we learned:
- How to integrate computer vision and audio analysis pipelines
- How emotional signals can be multimodal, combining facial and vocal cues
- The challenges of building real-time AI systems
- The importance of designing AI tools that assist rather than overwhelm users
We also gained experience building browser extensions and live feedback systems for real-world applications.
What's next for EmotiScan
Future improvements could include:
- Personalized emotion models that adapt to specific users
- Support for more video platforms
- Better contextual NLP analysis of conversation content
- Training models on larger emotional datasets
Our long-term vision is to make EmotiScan a tool that helps people build stronger communication skills and navigate digital conversations with confidence.
Built With
- chrome
- computer-vision
- flask
- gemini-api
- google-gemini
- google-genai
- google-meet
- kaggle
- machine-learning
- opencv
- python
- pytorch
- scikit-learn
- yolov8