Inspiration

Human–AI interaction often breaks down not because the AI is incapable, but because it cannot correctly read what a person means or how they feel. Gestures alone tell only half the story, since the same gesture can be performed in very different emotional contexts. At the same time, many RL agents rely on predefined reward functions that fail to capture human preference or comfort.

We were inspired by the idea of creating an AI system that learns the way humans teach: through both actions and emotions. By combining gesture recognition with facial affect, we wanted to build a model that doesn't just classify what a user is doing, but adapts to how the user reacts. This led us to explore multimodal feedback and emotion-driven reinforcement learning as a way to make AI systems more intuitive, responsive, and aligned with human intent.

What it does

We built a multimodal human–AI interaction system that learns from both what a user does and how they feel. The pipeline takes an image of a hand gesture, classifies the gesture, and then analyzes the user's facial expression to determine their emotional reaction. These two signals feed into a reinforcement learning agent: the predicted gesture becomes the agent's action, while the detected emotion becomes the reward signal (positive for smiles, negative for frowns, and neutral otherwise). Over time, the RL agent learns to interpret and respond to gestures in ways that maximize the user's emotional approval, so the system adapts its behavior to individual users and creates a more intuitive, emotionally aligned interaction experience.
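The reward rule described above (positive for smiles, negative for frowns, neutral otherwise) can be sketched as a small mapping from a valence score to a reward. This is only an illustration: the function name `valence_to_reward`, the assumed valence range of [-1, 1], and the `threshold` value are hypothetical choices, not details taken from the project.

```python
def valence_to_reward(valence: float, threshold: float = 0.2) -> float:
    """Map a facial valence score to a discrete reward.

    Assumptions (not from the project): valence lies in [-1, 1], and a
    hypothetical dead zone of +/- threshold counts as a neutral face.
    """
    if valence > threshold:
        return 1.0   # smile-like expression: positive feedback
    if valence < -threshold:
        return -1.0  # frown-like expression: negative feedback
    return 0.0       # anything in between is treated as neutral


if __name__ == "__main__":
    for score in (0.8, -0.5, 0.05):
        print(score, "->", valence_to_reward(score))
```

A dead zone around zero is one way to keep small, noisy valence predictions from being read as approval or disapproval.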

How we built it

We trained three separate models:

  1. CNN classifier for hand gestures
  2. Valence regression model for facial expressions
  3. RL reinforce model

We then connected all three models and used simulated facial expressions to reward the model's behavior.
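To make the training loop concrete, here is a minimal sketch of a REINFORCE-style policy-gradient update driven by a simulated emotional reward. It assumes the "reinforce" model refers to the REINFORCE algorithm, and it replaces the CNN and valence models with stand-ins: the tabular softmax policy, `simulated_valence`, and all hyperparameters are hypothetical, chosen only to keep the example self-contained.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulated_valence(observed_gesture, chosen_gesture):
    """Hypothetical stand-in for the valence model: the simulated user
    approves (+1) when the agent's interpretation matches the gesture
    and disapproves (-1) otherwise."""
    return 1.0 if chosen_gesture == observed_gesture else -1.0


def train_reinforce(num_gestures=3, episodes=2000, lr=0.1, seed=0):
    """Train a tabular softmax policy with single-step REINFORCE updates."""
    rng = random.Random(seed)
    # One row of action logits per observed gesture (the "state").
    logits = [[0.0] * num_gestures for _ in range(num_gestures)]
    for _ in range(episodes):
        observed = rng.randrange(num_gestures)
        probs = softmax(logits[observed])
        action = rng.choices(range(num_gestures), weights=probs)[0]
        reward = simulated_valence(observed, action)
        # Gradient of log pi(action | observed) w.r.t. the logits is
        # one_hot(action) - probs; scale it by the reward and step uphill.
        for a in range(num_gestures):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[observed][a] += lr * reward * grad
    return logits


def greedy_policy(logits):
    """Return the highest-scoring action for each observed gesture."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


if __name__ == "__main__":
    print(greedy_policy(train_reinforce()))
```

In a full pipeline, the observed gesture would come from the CNN classifier and the reward from the valence model applied to a face image, rather than from the simulated function used here.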

Challenges we ran into

  • Dealing with noisy datasets, as facial expression datasets are inherently noisy
  • Dealing with large amounts of data (the CNN dataset was ~132 GB)
  • Narrowing down reasons for low model accuracy

Accomplishments that we're proud of

  • Getting the model up and running!
  • High CNN classifier accuracy (~90%)
  • Our teamwork :)

What we learned

  • How to use remote servers such as Oscar to train models
  • How to integrate different models

What's next for Affect-Driven Gesture RL

  • Handling video data for real-time model feedback (the model is currently trained on static images only)
