📘 Project Overview

Inspired by Iron Man's virtual assistant, Jarvis is a Unity-based, AI-powered virtual character that listens to your voice, generates natural-language responses using an LLM, detects the emotional tone of the conversation, and physically reacts with facial expressions and body movement driven by a neural network.

By integrating speech recognition, emotional modeling, neural network prediction, and 3D animation, Jarvis aims to bring emotional intelligence into virtual assistants and, eventually, humanoid robotics.

💡 My Inspiration

All summer, I've been dreaming of building a network of neural networks to simulate a human-like brain and connect it to a robotics build. Jarvis is the first step, with speech and emotion features. Throughout the summer I have been learning how to build neural networks and integrate LLMs into code, making this my final summer project. I am inspired by the creativity of films such as Big Hero 6 and Iron Man, and I hope to one day take this system to the next level, adding more neural networks to expand its learning capacity and eventually building it into a mechatronic character.

🔧 Architecture

  • Once Picovoice Porcupine picks up the wake phrase "Hey, Jarvis" from my microphone, it activates Whisper
  • Whisper converts my speech to text
  • This text is then sent via the OpenRouter API to an LLM (DeepSeek) in two separate modules
  • The first module generates a response phrase
  • The response is then read aloud using Edge TTS
  • The second module determines Jarvis's emotional response and outputs the corresponding Arousal, Valence, and Dominance (AVD) values
  • These values are fed into a regression neural net I trained, which outputs the joint angles the Unity character needs to express that emotion
  • Finally, a Flask API sends these angles to the C# code attached to my rigged Jarvis character in Unity, updating the joint positions (minimal sketches of these stages follow below)
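
A minimal sketch of the listening stage, assuming the `pvporcupine` and `openai-whisper` Python packages; `next_audio_frame`, `record_until_silence`, and `handle_request` are hypothetical helpers standing in for the real mic-capture and downstream code:

```python
import pvporcupine
import whisper  # openai-whisper

# Porcupine watches the mic for the wake phrase; Whisper transcribes what follows.
porcupine = pvporcupine.create(
    access_key="YOUR_PICOVOICE_KEY",       # assumption: your Picovoice access key
    keyword_paths=["hey-jarvis.ppn"],      # hypothetical custom wake-word model file
)
stt = whisper.load_model("base")

def listen_loop(next_audio_frame, record_until_silence, handle_request):
    """next_audio_frame() yields one PCM frame of porcupine.frame_length samples;
    record_until_silence() records the user's request to a WAV file."""
    while True:
        if porcupine.process(next_audio_frame()) >= 0:   # wake phrase detected
            wav_path = record_until_silence()
            text = stt.transcribe(wav_path)["text"]
            handle_request(text)                          # hand off to the LLM stage
```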
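
The two LLM modules could look something like this sketch over OpenRouter's OpenAI-compatible chat endpoint; the model slug and prompts here are assumptions, not the project's exact ones:

```python
import json
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_OPENROUTER_KEY"}

def ask(system_prompt: str, user_text: str) -> str:
    resp = requests.post(OPENROUTER_URL, headers=HEADERS, json={
        "model": "deepseek/deepseek-chat",   # assumption: exact DeepSeek slug may differ
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    })
    return resp.json()["choices"][0]["message"]["content"]

def respond_and_emote(user_text: str):
    # Module 1: the spoken reply
    reply = ask("You are Jarvis, a witty and helpful assistant.", user_text)
    # Module 2: the emotional read, returned as AVD values in [0, 1]
    avd = json.loads(ask(
        'Rate the emotional tone of the message as JSON only: '
        '{"arousal": 0-1, "valence": 0-1, "dominance": 0-1}',
        user_text,
    ))
    return reply, (avd["arousal"], avd["valence"], avd["dominance"])
```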
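
Reading the reply aloud with the `edge-tts` package takes only a few lines; the voice name here is an arbitrary choice:

```python
import asyncio
import edge_tts

async def speak(text: str, out_path: str = "reply.mp3"):
    # Synthesize the reply to an MP3, then play it back with any audio library.
    tts = edge_tts.Communicate(text, voice="en-GB-RyanNeural")  # assumption: voice choice
    await tts.save(out_path)

asyncio.run(speak("At your service."))
```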

🔧 How It Was Made

  • Most code was written in Python, with the exception of the C# code for the Unity character animation
  • The robot character was found on Spline, and rigging was done in Blender; final joint-movement coding was done in Unity
  • The neural network was custom-made and trained on 10,000 pieces of data. After importing my rig into Unity, I manually determined the min and max joint angles of each bone, then created parameters for how the character should move to express certain emotions (for example, happier = head tilted higher). I then fed these parameters into a Smart Synthetic Data Generator I built for a previous hackathon: the base data was created by an LLM, then expanded by a CTGAN generator. The data was normalized before being used to train the neural network (a training sketch follows this list)
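
A minimal sketch of that AVD-to-joint-angle regression, assuming PyTorch and min-max normalization; the custom network's real size, framework, and joint count may differ:

```python
import torch
import torch.nn as nn

NUM_JOINTS = 20   # assumption: stand-in for the rig's actual bone count

# 3 inputs (arousal, valence, dominance) -> one angle per rigged joint
model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, NUM_JOINTS),
)

def train(avd: torch.Tensor, angles: torch.Tensor, epochs: int = 200):
    """avd: (N, 3) values in [0, 1]; angles: (N, NUM_JOINTS) raw joint angles."""
    # Min-max normalize targets so every joint trains on the same scale.
    lo, hi = angles.min(0).values, angles.max(0).values
    targets = (angles - lo) / (hi - lo)

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(avd), targets).backward()
        opt.step()
    return lo, hi   # keep these to de-normalize predictions back to degrees
```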

🤯 What Makes This Innovative

  • Emotion-Driven Movement: Most virtual assistants only speak. Jarvis reacts to how you're feeling.

  • AVD-to-3D Mapping: The system translates emotional metrics directly into animation parameters using a trained neural network.

  • Modular AI System: Each component (speech, emotion, motion) is independently swappable or trainable—paving the way for scalable neural networks in robotics.

  • Unity as a Real-Time Engine: Jarvis uses Unity not just for animation, but as a live emotional feedback renderer.

  • Extensible: Entire pipeline can be adapted to other characters, robots, or games.

💥 Challenges Faced

Real-Time Unity Integration: Ensuring live updates between the neural network output in Python and the C# code in Unity proved challenging and required a Flask server (sketched below).
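
A minimal sketch of such a bridge, assuming the Unity side polls a `/pose` endpoint each frame (e.g. with UnityWebRequest) rather than the server pushing; the endpoint and field names are assumptions:

```python
from flask import Flask, jsonify

app = Flask(__name__)
latest_angles = {}   # joint name -> angle in degrees, written by the emotion net

@app.route("/pose")
def pose():
    # The C# side requests this endpoint and applies the angles to the rig's bones.
    return jsonify(latest_angles)

def publish_pose(angles: dict):
    """Called by the Python pipeline whenever the regression net emits a new pose."""
    latest_angles.update(angles)

if __name__ == "__main__":
    app.run(port=5000)
```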

Natural Movement Generation: Mapping emotion into natural pose changes required fine-tuning the neural net architecture and extensive rig testing. Insufficient clean data led to poor outputs and made normalization before training essential; in earlier stages the Unity model was often warped in strange ways.

Latency Optimization: With so many models in the chain (Whisper, the LLM, the emotion net, Edge TTS), keeping response time fluid was a challenge.

API Limits: Working with rate-limited free APIs like OpenRouter and ElevenLabs required caching and fallback strategies.

🧠 What I Learned

Building this system gave me deep hands-on knowledge of neural networks, Blender rigging, and real-time emotion modeling.

I learned how to train regression networks to generate pose data from abstract inputs like emotion.

Most importantly, I learned how to connect multiple AI systems into a single believable character—bridging the gap between digital tools and real personality.

I also learned C# for this project.

🚀 Future Vision

Jarvis is just the beginning.

I plan to:

Create a second module that interprets voice commands into physical actions. For example, telling Jarvis to "Do a dance" would cause the character to actually perform this action.

Expand the brain of Jarvis into a network of modular neural networks: one for memory, one for planning, one for conversation, one for motion, etc.

Port these systems into a physical robotic arm or humanoid robot using Raspberry Pi, MG996R servos, and onboard inference.

Add vision systems using OpenCV and YOLO to allow Jarvis to recognize facial expressions, body language, and even objects in its environment.

Eventually, I aim to create a human-level AI companion: an emotionally intelligent assistant that learns and adapts, inspired by characters such as Baymax.
