SenseAI — Giving Back What Silence Takes Away
Inspiration
Eleven million Americans wake up every day in silence. They miss the fire alarm. They miss someone calling their name. They miss the feeling of music at a concert when everyone around them is lost in the sound.
We found a Facebook post from a visually impaired woman who said the thing she missed most was not being able to see her grandchildren's faces. Another person said they missed the stars. These were not complaints about technology. They were quiet griefs about being left out of the human experience.
"I miss seeing peoples faces, my own independence, seeing nature, fall color change, birds, butterflies." — Sheila, visually impaired community member
That is what SenseAI is about. Not features. Not a demo. A genuine attempt to give something back.
What It Does
SenseAI is a unified iOS accessibility hub with three AI-powered modules:
| Module | Who It Helps | What It Does |
|---|---|---|
| 🤟 BridgeAI | Deaf and mute users | Translates ASL hand signs into text in real time using the phone camera |
| 🚨 QuietAlert | Deaf and hard-of-hearing users | Detects critical sounds and delivers instant haptic alerts |
| 🎵 HarmoniAI | Deaf music lovers | Separates songs into stems and maps each one to synchronized visuals and haptics |
How We Built It
The Machine Learning Pipeline
We trained two custom neural networks entirely from scratch, following a full research-grade pipeline from raw data to on-device inference.
QuietAlert uses a ResNet-style CNN that learns to classify environmental sounds from mel spectrograms. A mel spectrogram converts raw audio into a 2D image where the x-axis is time, the y-axis is frequency on a perceptual scale, and brightness represents loudness. Our model takes an input of shape $[1, 1, 64, 216]$: a batch of one, a single channel, 64 mel bands, and 216 time steps, corresponding to 5 seconds of audio at 22,050 Hz with a hop length of 512 samples.
With librosa's default centered framing, which pads the signal by $n_{\text{fft}}/2$ samples on each side, the frame count works out to:

$$\text{frames} = \left\lfloor \frac{N}{\text{hop\_length}} \right\rfloor + 1 = \left\lfloor \frac{110250}{512} \right\rfloor + 1 = 216$$
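Roughly, the preprocessing looks like this. This is a sketch using librosa defaults; the function and constant names are illustrative, not the exact ones from our pipeline:

```python
import librosa
import numpy as np

SR = 22050
N_FFT = 1024
HOP = 512
N_MELS = 64
CLIP_SECONDS = 5

def audio_to_model_input(wav_path: str) -> np.ndarray:
    """Load a 5 s clip and return a [1, 1, 64, 216] log-mel spectrogram."""
    y, _ = librosa.load(wav_path, sr=SR, mono=True, duration=CLIP_SECONDS)
    y = np.pad(y, (0, max(0, SR * CLIP_SECONDS - len(y))))  # pad short clips to 110,250 samples

    # librosa pads by n_fft // 2 on both sides (center=True), giving
    # floor(110250 / 512) + 1 = 216 frames.
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS
    )
    log_mel = librosa.power_to_db(mel)                        # brightness ~ loudness (dB)

    return log_mel[np.newaxis, np.newaxis, :, :]              # shape [1, 1, 64, 216]
```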
We trained on ESC-50 and UrbanSound8K using PyTorch on Google Colab Pro, then exported to Apple Core ML using coremltools with the spectrogram computation baked directly into the model so Swift only needs to pass raw audio samples.
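The export step looked roughly like this: wrap the mel front-end and the trained CNN in a single module, trace it, and convert. This is a sketch rather than our exact script; it assumes coremltools can lower the STFT inside torchaudio's `MelSpectrogram`, and `trained_cnn` stands in for the trained classifier:

```python
import torch
import torchaudio
import coremltools as ct

class QuietAlertDeployable(torch.nn.Module):
    """Spectrogram front-end + CNN in one graph, so Swift only passes raw samples."""
    def __init__(self, cnn: torch.nn.Module):
        super().__init__()
        # These parameters must match training exactly (see the
        # training-inference contract section below).
        self.frontend = torchaudio.transforms.MelSpectrogram(
            sample_rate=22050, n_fft=1024, hop_length=512, n_mels=64,
            norm="slaney", mel_scale="slaney",   # align with librosa defaults
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB(stype="power")
        self.cnn = cnn

    def forward(self, audio: torch.Tensor) -> torch.Tensor:   # audio: [1, 110250]
        mel = self.to_db(self.frontend(audio))                 # [1, 64, 216]
        return self.cnn(mel.unsqueeze(1))                      # [1, num_classes]

model = QuietAlertDeployable(trained_cnn).eval()
traced = torch.jit.trace(model, torch.randn(1, 110250))
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="audio", shape=(1, 110250))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,   # let Core ML schedule onto the Neural Engine
)
mlmodel.save("QuietAlert.mlpackage")
```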
BridgeAI takes a different approach. Rather than training a CNN on raw hand images, we use MediaPipe to extract 21 hand landmarks per frame, each with x, y, and z coordinates, giving us a 63-dimensional feature vector. We then train a Multi-Layer Perceptron on these normalized coordinates:
$$\vec{v}_{\text{normalized}} = \frac{\vec{v} - \vec{v}_{\text{wrist}}}{\max\left(\left|\vec{v} - \vec{v}_{\text{wrist}}\right|\right)}$$
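In code, the normalization is only a few lines. A numpy sketch; the function name is illustrative:

```python
import numpy as np

def normalize_landmarks(landmarks: np.ndarray) -> np.ndarray:
    """Wrist-relative normalization of 21 MediaPipe hand landmarks.

    `landmarks` has shape (21, 3): (x, y, z) per landmark; index 0 is the wrist.
    Returns a flat 63-dimensional feature vector.
    """
    relative = landmarks - landmarks[0]     # translate so the wrist is the origin
    scale = np.max(np.abs(relative))        # largest coordinate magnitude
    if scale > 0:
        relative = relative / scale         # scale-invariant: values in [-1, 1]
    return relative.flatten()
```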
This normalization makes the classifier invariant to hand position and size, so it works regardless of where the hand appears in frame or how far it is from the camera. The MLP architecture is $63 \rightarrow 256 \rightarrow 256 \rightarrow 128 \rightarrow 26$, achieving over 95% validation accuracy on the ASL alphabet.
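A PyTorch sketch of that classifier; the layer sizes come from above, while the activation and dropout choices shown here are assumptions:

```python
import torch.nn as nn

class SignMLP(nn.Module):
    """63-dimensional landmark vector -> logits over the 26 ASL alphabet letters."""
    def __init__(self, num_classes: int = 26):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(63, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),    # raw logits; softmax applied at inference
        )

    def forward(self, x):
        return self.net(x)
```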
HarmoniAI uses Demucs, Meta AI Research's state-of-the-art source separation model, to isolate drums, bass, vocals, and melody from any song. We built a Google Colab pipeline that computes frame-accurate RMS energy envelopes at 43 fps and detects strong onsets using librosa, exporting everything as a structured JSON file. The iOS app then reads this data and drives a cinematic visual engine built in SwiftUI Canvas.
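The analysis side of that pipeline can be sketched with librosa as below; the stem file names and JSON keys are illustrative, and 22,050 Hz with a 512-sample hop gives the roughly 43 analysis frames per second mentioned above:

```python
import json
import librosa
import numpy as np

SR = 22050
HOP = 512          # 22050 / 512 ≈ 43 analysis frames per second

def analyze_stem(stem_path: str) -> dict:
    """Per-stem energy envelope and onset times for the iOS visual engine."""
    y, _ = librosa.load(stem_path, sr=SR, mono=True)
    rms = librosa.feature.rms(y=y, hop_length=HOP)[0]                 # frame-accurate energy
    onsets = librosa.onset.onset_detect(y=y, sr=SR, hop_length=HOP, units="time")
    return {
        "fps": SR / HOP,
        "rms": (rms / (rms.max() + 1e-9)).round(4).tolist(),          # normalized to [0, 1]
        "onsets": np.round(onsets, 3).tolist(),                       # strong-onset times (s)
    }

song = {stem: analyze_stem(f"{stem}.wav") for stem in ["drums", "bass", "vocals", "other"]}
with open("harmoni_song.json", "w") as f:
    json.dump(song, f)
```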
The iOS App
Everything runs natively on iPhone 15/16 using Swift and SwiftUI. The stack includes AVFoundation for audio and camera capture, Vision framework for real-time hand pose detection, Core ML with the Apple Neural Engine for model inference, Core Haptics for nuanced haptic patterns, the Accelerate framework for FFT-based audio processing, and SwiftUI Canvas with GraphicsContext for the HarmoniAI visual engine rendering comets, shockwaves, and aurora streaks at 60fps.
Challenges We Ran Into
The Training-Inference Contract
The hardest lesson we learned was about what we came to call the invisible contract between training and inference. When we first deployed QuietAlert to an iPhone, it detected nothing. The model was achieving over 90% accuracy in Python but was completely blind on the device.
The culprit was a subtle mismatch in mel filterbank parameters. librosa and torchaudio use slightly different defaults for normalization and mel scale type. A single parameter difference made the spectrograms look completely different to the model even though they sounded identical to a human ear. We rebuilt the entire audio preprocessing pipeline twice before the detections started working.
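The fix came down to pinning both libraries to the same mel settings. A sketch of the mismatch, where the parameter names are real librosa/torchaudio arguments and the comparison itself is illustrative:

```python
import numpy as np
import torch
import torchaudio
import librosa

y = np.random.randn(110250).astype(np.float32)   # stand-in for any 5 s clip at 22,050 Hz

# librosa defaults: Slaney mel scale, Slaney filterbank normalization.
mel_ref = librosa.feature.melspectrogram(
    y=y, sr=22050, n_fft=1024, hop_length=512, n_mels=64
)

# torchaudio defaults: HTK mel scale, no normalization -> a different "image" entirely.
mel_default = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=512, n_mels=64,
)(torch.from_numpy(y)).numpy()

# Pinning two parameters closes most of the gap (edge padding can still differ slightly).
mel_matched = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=512, n_mels=64,
    norm="slaney", mel_scale="slaney",
)(torch.from_numpy(y)).numpy()

print(np.abs(mel_ref - mel_default).max())   # large mismatch
print(np.abs(mel_ref - mel_matched).max())   # near zero except at clip edges
```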
Coordinate System Mismatch in BridgeAI
Apple's Vision framework reports hand landmark coordinates with the Y-axis flipped relative to how the ASL training images were captured. On top of that, the front camera mirrors left and right. Without correcting both transformations, the model confidently predicted the wrong letter every single time. Getting both hands to work required:
- Flipping Y coordinates: $y' = 1.0 - y$
- Detecting hand chirality from Vision
- Conditionally mirroring X for left hands: $x' = 1.0 - x$
- Then applying wrist-relative normalization
The order of operations matters. Getting it wrong in any step produces nonsense predictions even though the model loaded successfully.
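For reference, the same correction expressed in Python; the app applies it in Swift on Vision's output, and the function name here is illustrative:

```python
import numpy as np

def correct_vision_landmarks(landmarks: np.ndarray, is_left_hand: bool) -> np.ndarray:
    """Order-sensitive fixes applied before the landmarks reach the classifier.

    `landmarks` has shape (21, 3) in Vision's coordinate space; index 0 is the wrist.
    """
    pts = landmarks.copy()
    pts[:, 1] = 1.0 - pts[:, 1]        # 1. flip Y: Vision's origin differs from the training images
    if is_left_hand:
        pts[:, 0] = 1.0 - pts[:, 0]    # 2. un-mirror X for left hands (front-camera mirroring)

    # 3. wrist-relative normalization, only after the flips above
    pts -= pts[0]
    scale = np.max(np.abs(pts))
    return (pts / scale if scale > 0 else pts).flatten()
```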
Accomplishments That We're Proud Of
We are proud that it works. Not in theory. On a real iPhone, in a real room, with real sounds and real hands. QuietAlert detects a dog barking and a fire alarm and sends a haptic to your wrist. BridgeAI reads B, A, B, Y from a hand it has never seen before. HarmoniAI plays a song and the screen erupts in light that moves with the music.
We are proud that every model runs entirely on device. No server. No internet connection required. No one ever sees your audio or your hands. For a population that has often had their privacy and dignity compromised by systems that were not built for them, that matters deeply.
What We Learned
We learned that accessibility is not a feature you add at the end. It is a decision you make at the beginning about who your product is for.
We learned that the gap between a model that works in a notebook and a model that works in someone's hand is enormous, and that gap is where most real engineering lives.
We learned that the deaf and hard-of-hearing community does not need pity. They need tools that actually work.
What's Next for SenseAI
- Lip reading — the module is architected but not yet trained. We believe it could be the most impactful feature of all, since the majority of deaf people do not know ASL
- Custom sound training in QuietAlert — let users record their own sounds so the app learns their specific doorbell, microwave, or baby's cry
- Push notifications so QuietAlert works when the phone is across the room and the screen is off
- HarmoniAI upload mode — a full server-side Demucs pipeline so any song can be processed and experienced, not just pre-processed demos
The only version of this project that matters is the one that makes someone's life genuinely better. We are not done yet.