Inspiration

Understanding human emotions accurately has become crucial for applications like mental health monitoring, personalized recommendations, and human-computer interaction. Purely CNN-based models capture local facial details well but often struggle with long-range, global dependencies across the face. This inspired us to design a hybrid architecture that combines Convolutional Neural Networks (CNNs) for local feature extraction with a Vision Transformer (ViT) for global context, which reached 96.25% validation accuracy in our experiments.

What it does

Our model analyzes facial images and predicts the underlying emotional state, reaching 96.25% accuracy on our validation set. It classifies emotions such as happy, sad, angry, neutral, stressed, and more in real time. The system is optimized for both speed and accuracy, making it suitable for real-world applications like smart assistants, mental health tools, emotion-driven music players, and customer experience analysis.

How we built it

Dataset: Curated and preprocessed a large-scale facial emotion dataset, applying techniques like histogram equalization, data augmentation, and normalization to improve generalization.
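
To make the histogram equalization step concrete, here is a minimal pure-Python sketch for a flat list of 8-bit grayscale pixel values. The function name `equalize_histogram` and the flat-list input format are assumptions for illustration; a real pipeline would more likely call a library routine such as OpenCV's `cv2.equalizeHist`.

```python
def equalize_histogram(pixels, levels=256):
    """Return a histogram-equalized copy of a flat list of gray levels.

    Assumes every value is an integer in the range [0, levels - 1].
    """
    if not pixels:
        return []

    # Count how often each gray level occurs.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Build the cumulative distribution of gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return list(pixels)

    # Map each level so the darkest present level goes to 0
    # and the brightest present level goes to levels - 1.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [lut[p] for p in pixels]


# Example: a low-contrast patch gets stretched to the full range.
patch = [50, 50, 100, 150]
stretched = equalize_histogram(patch)
```

Equalizing before augmentation and normalization reduces the effect of lighting differences between images, which is one plausible reason it helps generalization.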

Architecture:

CNN Layers for extracting fine-grained facial details (edges, contours, expressions).

Vision Transformer (ViT) for capturing long-range dependencies and global context.

Combined them into a hybrid architecture with a classification head.
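
The writeup does not spell out how the CNN output is handed to the ViT. One common pattern is to treat each spatial position of the final CNN feature map as a token. The sketch below works through the resulting sequence length under that assumption; the 48x48 input size, the two stride-2 convolutions, and the embedding width of 128 are all hypothetical values, not the project's actual configuration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a square convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def hybrid_token_shape(image_size, stem_layers, embed_dim):
    """Shape (num_tokens, embed_dim) of the sequence fed to the transformer.

    stem_layers is a list of (kernel, stride, padding) tuples describing
    the CNN stem. Each position of the final feature map becomes one token,
    plus one extra learnable classification token.
    """
    side = image_size
    for kernel, stride, padding in stem_layers:
        side = conv_output_size(side, kernel, stride, padding)
    num_tokens = side * side + 1  # +1 for the class token
    return num_tokens, embed_dim


# Hypothetical configuration: 48x48 face crop, two 3x3 stride-2 convolutions.
stem = [(3, 2, 1), (3, 2, 1)]
shape = hybrid_token_shape(48, stem, embed_dim=128)
```

Because self-attention cost grows with the square of the token count, downsampling in the CNN stem before the transformer is one way a hybrid design can stay cheaper than a ViT applied to raw pixel patches.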

Training:

Optimized with AdamW and a learning rate schedule.

Applied early stopping and dropout to prevent overfitting.
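
The training items above can be sketched without any framework. The exact schedule and stopping rule are not stated in this writeup, so the warmup-plus-cosine schedule, the patience of 3, and the names `warmup_cosine_lr` and `EarlyStopping` below are assumptions chosen for illustration.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup followed by cosine decay from base_lr to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Example with made-up validation losses.
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate([1.0, 0.8, 0.9, 0.85, 0.95]):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In a real training loop the learning rate returned by the schedule would be written into the optimizer each step, and the weights from the best epoch would be restored when the stop signal fires.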

Evaluation: Achieved 96.25% accuracy on the validation set, outperforming traditional CNN and standalone ViT models.
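
For reference, overall accuracy is the fraction of validation images whose predicted label matches the true label. The sketch below also reports per-class recall, which is worth checking alongside accuracy when some emotions have few samples. The function names and the toy labels are illustrative assumptions, not the project's evaluation code or data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy example: a model that always predicts the majority class.
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "happy"]
overall = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
```

In the toy example the overall accuracy looks respectable while the minority class is never recognized, which is why a per-class breakdown complements a single accuracy figure.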

Challenges we ran into

Data Imbalance: Some emotions had significantly fewer samples, which required class rebalancing techniques.

Model Complexity: Combining CNN and ViT introduced architectural complexity, requiring hyperparameter tuning for stability.

Computational Resources: Training ViTs is resource-intensive, so we had to optimize the model size and batch processing for our hardware.
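
The data imbalance item above does not say which rebalancing technique was used. One common option is to weight each class inversely to its frequency in the loss function, sketched below. The function name `balanced_class_weights` and the sample counts are hypothetical; the formula (total samples divided by the number of classes times the class count) is the same one scikit-learn uses for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and frequent classes below 1,
    so each class contributes roughly equally to a weighted loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label distribution with one under-represented emotion.
labels = ["happy"] * 6 + ["sad"] * 2
weights = balanced_class_weights(labels)
```

These weights would typically be passed to a weighted cross-entropy loss. Oversampling rare classes or targeted augmentation are alternative ways to reach a similar effect.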

Accomplishments that we're proud of

Achieved 96.25% validation accuracy, which is higher than many published emotion recognition results.

Successfully integrated CNN and ViT into a single pipeline without sacrificing speed or scalability.

Built a model that is deployable for real-world applications like AI assistants, emotion-driven content recommendation, and wellness platforms.

What we learned

How transformer-based architectures outperform traditional CNNs in capturing global context.

Importance of data augmentation and preprocessing in improving model robustness.

Strategies for optimizing large models under limited computational resources.

How hybrid architectures can achieve a balance between accuracy and efficiency.

What's next for the Emotion Recognition Model

Real-Time Deployment: Convert the model to TensorFlow Lite / ONNX for integration into mobile or edge devices.

Multimodal Emotion Detection: Combine facial expressions with voice and text analysis for better accuracy.

Dataset Expansion: Include cross-cultural and multi-environment datasets for more generalized performance.

Integration with Applications: Use in emotion-based music players, AI companions, therapy bots, and customer experience analytics.
