Inspiration
There is a child in our neighborhood who has autism. He communicates, reacts, and feels, but the adults around him often have to guess what he is feeling: happy, scared, overwhelmed, or something else entirely. Sometimes the guess is wrong, and the support that was already there never lands.
That gap is why we built Heard. Not to fix the child and not to replace real conversation, but to give caregivers a clearer signal when words and faces are hard to read.
Generic emotion AI makes this harder. Models trained on typical speech can sound confident while being wrong on atypical expression. We wanted the opposite: honest scores, abstention when unsure, and a path toward learning each child as an individual.
What it does
Heard is a voice-first emotion reader built for the USAII Make Support Obvious challenge.
Muhammad's stack (audio wearable path)
- Microphone captures pitch, energy, rhythm, and how the voice changes over time, not the words themselves
- A CNN + GRU model trained with face-guided learning on autism-relevant data (MobileNetV2 teacher on FER Autism, audio student at inference)
- ~81% test accuracy on five emotion classes with a speaker-independent split
- Exported to ONNX for edge use, with a hardware sketch (Arduino + 8×8 LED matrix) to show emotion patterns
Cora's stack (research and fusion path)
- wav2vec2 speech emotion baseline, 64% speaker-independent accuracy on RAVDESS (chance is 12.5%)
- Per-child calibration with a few samples per emotion: measured ~67% → ~77%, with the hardest speakers gaining the most (about +19 points)
- Multimodal demo on video clips with sound: a face branch (~66.4% on six FER Autism classes) is fused with voice when both are present
- Abstention when confidence is low instead of guessing
- Live Colab so judges can try mic or upload without installing anything
Together this is one project with two runnable demos that meet in the same idea. Understand the child first, then support can actually arrive.
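The fusion and abstention behavior listed under Cora's stack can be sketched roughly as below. This is a minimal illustration, not the code in either repo: the label set, the equal voice/face weighting, and the 0.6 threshold are all assumptions chosen for the example.

```python
from typing import List, Optional, Sequence

# Hypothetical label set for illustration; the real demos may use other classes.
EMOTIONS = ["happy", "sad", "angry", "fearful", "neutral"]


def fuse(voice: Sequence[float],
         face: Optional[Sequence[float]] = None,
         voice_weight: float = 0.5) -> List[float]:
    """Blend voice and face probabilities; fall back to voice if no face is present."""
    if face is None:
        return list(voice)
    mixed = [voice_weight * v + (1.0 - voice_weight) * f
             for v, f in zip(voice, face)]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize so the scores sum to 1


def read_emotion(probs: Sequence[float], threshold: float = 0.6) -> Optional[str]:
    """Return the top label, or None (abstain) when confidence is below threshold."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return None  # "I am not sure, please check in with them."
    return EMOTIONS[best]
```

With this sketch, a confident voice-only read returns a label, while a flat distribution, or a voice and face that disagree, returns None so the caregiver is prompted to check in rather than shown a guess.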
Try it
- Team repo (Muhammad): https://github.com/MuhammadBinary/HeardAutism---USAII-Muhd-Cora-
- Fusion + personalization repo (Cora): https://github.com/xqscora/usaii-autism-emotion
- Colab: https://colab.research.google.com/github/xqscora/usaii-autism-emotion/blob/master/demo_colab.ipynb
How we built it
Data (public only)
- RAVDESS for labeled speech emotion
- FER Autism and FER2013 for face emotion
- ASDSpeech features for autism speech distribution (no emotion labels, useful for future domain work)
Muhammad
- Built MFCC, zero crossing rate, and RMS feature pipeline
- Trained CNN + GRU audio model with auxiliary face supervision during training
- Evaluated speaker independently and exported inference_model.onnx
- Documented the wearable flow: mic → laptop/edge → ONNX → Arduino LED
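Two of the features named above, RMS energy and zero crossing rate, are simple enough to sketch without any audio library. The version below is a standard-library illustration only; the 16 kHz sample rate, 25 ms frames, and 10 ms hop are assumptions, and MFCCs are left out because they would normally come from a library such as librosa.

```python
import math
from typing import List, Sequence


def frame_signal(samples: Sequence[float],
                 frame_len: int = 400,
                 hop: int = 160) -> List[Sequence[float]]:
    """Split audio into overlapping frames (400/160 samples = 25 ms/10 ms at 16 kHz)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame: a rough loudness measure."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def frame_features(samples: Sequence[float]) -> List[List[float]]:
    """One [rms, zcr] pair per frame, ready to stack alongside other features."""
    return [[rms(f), zero_crossing_rate(f)] for f in frame_signal(samples)]
```

Because these features describe how the voice sounds rather than what is said, they fit the "not the words themselves" design of the audio path.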
Cora
- Trained wav2vec2 embedding + logistic regression SER baseline with held out speakers
- Wrote personalization and sweep scripts to show gains from K samples per emotion
- Built demo_emotion.py, demo_face_emotion.py, and demo_multimodal.py for audio only, face only, and fused video
- Added Colab notebook with mic, upload, and optional webcam paths
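To make "K samples per emotion" concrete, here is one simple way per-child calibration could work: average a child's few labeled embeddings into one prototype per emotion, then label new audio by the nearest prototype. This nearest-prototype approach is a stand-in chosen for illustration and may differ from what the personalization scripts actually do; the function names and the toy 2-D embeddings are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def fit_child_prototypes(
    samples: Sequence[Tuple[Sequence[float], str]]
) -> Dict[str, List[float]]:
    """Average the K labeled embeddings for each emotion into one prototype."""
    grouped = defaultdict(list)
    for embedding, label in samples:
        grouped[label].append(embedding)
    return {
        label: [sum(column) / len(embeddings) for column in zip(*embeddings)]
        for label, embeddings in grouped.items()
    }


def predict(prototypes: Dict[str, List[float]], embedding: Sequence[float]) -> str:
    """Return the emotion whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], embedding))


# Toy usage with K = 2 samples per emotion (real embeddings would be much larger).
child_samples = [
    ([0.0, 0.0], "calm"), ([0.2, 0.0], "calm"),
    ([1.0, 1.0], "happy"), ([1.0, 0.8], "happy"),
]
child_prototypes = fit_child_prototypes(child_samples)
```

In a caregiver-assisted setup, the labeled samples would come from moments where an adult already knows how the child is feeling.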
Demo video
- Combined slide walkthrough, Muhammad's audio story, Cora's personalization results, and a live multimodal clip (recording/Heard_USAII_demo_v1.mp4 in Cora's repo)
Challenges we ran into
Almost no public datasets label autistic children's emotional speech. We had to be upfront that our emotion labels mostly come from neurotypical adults, while autism-specific voice data stays under-labeled.
Cross modal training helped audio learn richer emotion cues, but deployment stays audio only by design so the wearable stays simple and privacy friendly.
Personalization needs a few labeled samples per child, which is realistic for a caregiver assisted setup but not magic. We focused on showing that the speakers generic models fail on are exactly where personalization helps.
Hardware was limited on one teammate's side (laptop and phone issues), so we leaned on Colab, ONNX, and a screen-recorded demo with voiceover instead of a live on-stage device.
Accomplishments that we're proud of
- End-to-end audio emotion pipeline with ~81% speaker-independent test accuracy and ONNX export
- Face guided training that stays audio only at inference
- Personalization curve that improves with small K and helps the worst cases most
- Multimodal fusion demo on real video with both face and voice
- Abstention when confidence is low instead of confidently wrong labels
- Runnable Colab for judges and teammates who cannot install locally
What we learned
Support fails when understanding fails first. A slightly uncertain but honest read beats a wrong label delivered with full confidence.
Better AI for this community is not always louder AI. Sometimes the most useful output is "I am not sure, please check in with them."
Every child expresses differently. A generic model is a starting point. A model that learns this child is the direction we care about (we frame that long term work as Cerome).
What's next for Heard
- More autism labeled speech if dataset requests come through
- Deeper domain adaptation on ASDSpeech
- Smaller edge board running the ONNX model with the LED wearable prototype
- Per child memory over time instead of one shot calibration
- Caregiver in the loop UI so every read can be confirmed or corrected
Built With
- arduino
- jupyter-colab
- librosa
- onnx
- opencv
- python
- pytorch
- scikit-learn
- tensorflow
- transformers-(wav2vec2)