Affect-Driven Gesture RL

Inspiration

Human–AI interaction often breaks down not because the AI is incapable, but because it cannot correctly read what a person means or how they feel. Gestures alone tell only half the story: the same gesture can be performed in very different emotional contexts. At the same time, many reinforcement learning (RL) agents rely on predefined reward functions that fail to capture human preference or comfort.

We were inspired by the idea of creating an AI system that learns the way humans teach: through both actions and emotions. By combining gesture recognition with facial affect, we wanted to build a model that doesn't just classify what a user is doing, but adapts to how the user reacts. This led us to explore multimodal feedback and emotion-driven reinforcement learning as a way to make AI systems more intuitive, responsive, and aligned with human intent.

What it does

We built a multimodal human–AI interaction system that learns from both what a user does and how they feel. The pipeline takes an image of a hand gesture, classifies the gesture, and then analyzes the user's facial expression to determine their emotional reaction. These two signals feed into a reinforcement learning agent: the predicted gesture becomes the agent's action, while the detected emotion becomes the reward signal. The reward is positive for smiles, negative for frowns, and neutral otherwise. Over time, the RL agent learns to interpret and respond to gestures in ways that maximize the user's emotional approval, allowing the system to adapt its behavior to individual users and create a more intuitive, emotionally aligned interaction experience.
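The reward rule described above can be sketched in a few lines. This is only an illustration: the function name `emotion_to_reward`, the assumption that valence is a score in [-1, 1], and the 0.3 threshold are all hypothetical and not taken from the project.

```python
def emotion_to_reward(valence: float, threshold: float = 0.3) -> float:
    """Map a facial valence score to a discrete reward.

    Assumptions (hypothetical, for illustration only):
    - valence lies in [-1, 1], with positive values for smiles
      and negative values for frowns
    - anything within +/- threshold counts as a neutral expression
    """
    if valence >= threshold:
        return 1.0   # smile: positive feedback
    if valence <= -threshold:
        return -1.0  # frown: negative feedback
    return 0.0       # neutral expression: no feedback


# Example: one reward per observed reaction
rewards = [emotion_to_reward(v) for v in (0.8, -0.6, 0.1)]
```

A discrete three-level reward keeps the signal easy to interpret; passing the raw valence through as a continuous reward would be an equally plausible design.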

How we built it

We trained three separate models:

CNN gesture classifier
Valence regression model for facial expressions
REINFORCE RL agent

We then connected all three models and used simulated facial expressions to reward the agent's behavior.
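To make the training loop concrete, here is a minimal sketch of a REINFORCE-style update driven by a simulated reaction. It is a toy under stated assumptions, not the project's implementation: the tabular softmax policy, the three gesture classes, the use of the true gesture index as a stand-in for image features, and the `simulated_reward` rule (smile on a correct prediction, frown otherwise) are all hypothetical.

```python
import math
import random

N_GESTURES = 3  # hypothetical number of gesture classes


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulated_reward(true_gesture, predicted_gesture):
    """Assumed stand-in for a simulated facial expression:
    smile (+1) when the prediction matches, frown (-1) otherwise."""
    return 1.0 if predicted_gesture == true_gesture else -1.0


def train(episodes=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Tabular policy: one row of action logits per observed gesture.
    theta = [[0.0] * N_GESTURES for _ in range(N_GESTURES)]
    for _ in range(episodes):
        state = rng.randrange(N_GESTURES)       # gesture shown by the user
        probs = softmax(theta[state])
        action = rng.choices(range(N_GESTURES), weights=probs)[0]
        reward = simulated_reward(state, action)
        # REINFORCE: theta += lr * reward * grad log pi(action | state),
        # where the gradient w.r.t. the logits is one_hot(action) - probs.
        for a in range(N_GESTURES):
            grad = (1.0 if a == action else 0.0) - probs[a]
            theta[state][a] += lr * reward * grad
    return theta


theta = train()
# Greedy action per gesture after training
greedy = [max(range(N_GESTURES), key=lambda a: row[a]) for row in theta]
```

Each episode here is a single step, so the return is just the immediate reward. In the full pipeline, the state would presumably come from the CNN classifier's output and the reward from the valence model rather than from a hand-written rule.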