SenseAI — Giving Back What Silence Takes Away
Inspiration
Eleven million Americans wake up every day in silence. They miss the fire alarm. They miss someone calling their name. They miss the feeling of music at a concert when everyone around them is lost in the sound.
We found a Facebook post from a visually impaired woman who said the thing she missed most was not being able to see her grandchildren's faces. Another person said they missed the stars. These were not complaints about technology. They were quiet griefs about being left out of the human experience.
"I miss seeing peoples faces, my own independence, seeing nature, fall color change, birds, butterflies." — Sheila, visually impaired community member
That is what SenseAI is about. Not features. Not a demo. A genuine attempt to give something back.
What It Does
SenseAI is a unified iOS accessibility hub with three AI-powered modules:
| Module | Who It Helps | What It Does |
|---|---|---|
| 🤟 BridgeAI | Deaf and mute users | Translates ASL hand signs into text in real time using the phone camera |
| 🚨 QuietAlert | Deaf and hard-of-hearing users | Detects critical sounds and delivers instant haptic alerts |
| 🎵 HarmoniAI | Deaf music lovers | Separates songs into stems and maps each one to synchronized visuals and haptics |
How We Built It
The Machine Learning Pipeline
We trained two custom neural networks entirely from scratch, following a full research-grade pipeline from raw data to on-device inference.
QuietAlert uses a ResNet-style CNN that learns to classify environmental sounds from mel spectrograms. A mel spectrogram converts raw audio into a 2D image where the x-axis is time, the y-axis is frequency on a perceptual scale, and brightness represents loudness. Our model takes an input of shape $[1, 1, 64, 216]$: a batch of one, a single channel, 64 mel bands, and 216 time steps, corresponding to 5 seconds of audio at 22,050 Hz with a hop length of 512 samples.
With librosa's default centered framing, which pads the signal by $n_{\text{fft}}/2$ samples on each side, the frame count works out to:

$$\text{frames} = \left\lfloor \frac{N}{\text{hop\_length}} \right\rfloor + 1 = \left\lfloor \frac{110250}{512} \right\rfloor + 1 = 216$$
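Roughly, the preprocessing looks like this. This is a sketch using librosa defaults; the function and constant names are illustrative, not the exact ones from our pipeline:

```python
import librosa
import numpy as np

SR = 22050
N_FFT = 1024
HOP = 512
N_MELS = 64
CLIP_SECONDS = 5

def audio_to_model_input(wav_path: str) -> np.ndarray:
    """Load a 5 s clip and return a [1, 1, 64, 216] log-mel spectrogram."""
    y, _ = librosa.load(wav_path, sr=SR, mono=True, duration=CLIP_SECONDS)
    y = np.pad(y, (0, max(0, SR * CLIP_SECONDS - len(y))))  # pad short clips to 110,250 samples

    # librosa pads by n_fft // 2 on both sides (center=True), giving
    # floor(110250 / 512) + 1 = 216 frames.
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS
    )
    log_mel = librosa.power_to_db(mel)                        # brightness ~ loudness (dB)

    return log_mel[np.newaxis, np.newaxis, :, :]              # shape [1, 1, 64, 216]
```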
We trained on ESC-50 and UrbanSound8K using PyTorch on Google Colab Pro, then exported to Apple Core ML using coremltools with the spectrogram computation baked directly into the model so Swift only needs to pass raw audio samples.
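The export step looked roughly like this: wrap the mel front-end and the trained CNN in a single module, trace it, and convert. This is a sketch rather than our exact script; it assumes coremltools can lower the STFT inside torchaudio's `MelSpectrogram`, and `trained_cnn` stands in for the trained classifier:

```python
import torch
import torchaudio
import coremltools as ct

class QuietAlertDeployable(torch.nn.Module):
    """Spectrogram front-end + CNN in one graph, so Swift only passes raw samples."""
    def __init__(self, cnn: torch.nn.Module):
        super().__init__()
        # These parameters must match training exactly (see the
        # training-inference contract section below).
        self.frontend = torchaudio.transforms.MelSpectrogram(
            sample_rate=22050, n_fft=1024, hop_length=512, n_mels=64,
            norm="slaney", mel_scale="slaney",   # align with librosa defaults
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB(stype="power")
        self.cnn = cnn

    def forward(self, audio: torch.Tensor) -> torch.Tensor:   # audio: [1, 110250]
        mel = self.to_db(self.frontend(audio))                 # [1, 64, 216]
        return self.cnn(mel.unsqueeze(1))                      # [1, num_classes]

model = QuietAlertDeployable(trained_cnn).eval()
traced = torch.jit.trace(model, torch.randn(1, 110250))
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="audio", shape=(1, 110250))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,   # let Core ML schedule onto the Neural Engine
)
mlmodel.save("QuietAlert.mlpackage")
```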
BridgeAI takes a different approach. Rather than training a CNN on raw hand images, we use MediaPipe to extract 21 hand landmarks per frame, each with x, y, and z coordinates, giving us a 63-dimensional feature vector. We then train a Multi-Layer Perceptron on these normalized coordinates:
$$\vec{v}_{\text{normalized}} = \frac{\vec{v} - \vec{v}_{\text{wrist}}}{\max\left(\left|\vec{v} - \vec{v}_{\text{wrist}}\right|\right)}$$
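In code, the normalization is only a few lines. A numpy sketch; the function name is illustrative:

```python
import numpy as np

def normalize_landmarks(landmarks: np.ndarray) -> np.ndarray:
    """Wrist-relative normalization of 21 MediaPipe hand landmarks.

    `landmarks` has shape (21, 3): (x, y, z) per landmark; index 0 is the wrist.
    Returns a flat 63-dimensional feature vector.
    """
    relative = landmarks - landmarks[0]     # translate so the wrist is the origin
    scale = np.max(np.abs(relative))        # largest coordinate magnitude
    if scale > 0:
        relative = relative / scale         # scale-invariant: values in [-1, 1]
    return relative.flatten()
```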
This normalization makes the classifier invariant to hand position and size, so it works regardless of where the hand appears in frame or how far it is from the camera. The MLP architecture is $63 \rightarrow 256 \rightarrow 256 \rightarrow 128 \rightarrow 26$, achieving over 95% validation accuracy on the ASL alphabet.
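A PyTorch sketch of that classifier; the layer sizes come from above, while the activation and dropout choices shown here are assumptions:

```python
import torch.nn as nn

class SignMLP(nn.Module):
    """63-dimensional landmark vector -> logits over the 26 ASL alphabet letters."""
    def __init__(self, num_classes: int = 26):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(63, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),    # raw logits; softmax applied at inference
        )

    def forward(self, x):
        return self.net(x)
```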
HarmoniAI uses Demucs, Meta AI Research's state-of-the-art source separation model, to isolate drums, bass, vocals, and melody from any song. We built a Google Colab pipeline that computes frame-accurate RMS energy envelopes at 43 fps and detects strong onsets using librosa, exporting everything as a structured JSON file. The iOS app then reads this data and drives a cinematic visual engine built in SwiftUI Canvas.
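The analysis side of that pipeline can be sketched with librosa as below; the stem file names and JSON keys are illustrative, and 22,050 Hz with a 512-sample hop gives the roughly 43 analysis frames per second mentioned above:

```python
import json
import librosa
import numpy as np

SR = 22050
HOP = 512          # 22050 / 512 ≈ 43 analysis frames per second

def analyze_stem(stem_path: str) -> dict:
    """Per-stem energy envelope and onset times for the iOS visual engine."""
    y, _ = librosa.load(stem_path, sr=SR, mono=True)
    rms = librosa.feature.rms(y=y, hop_length=HOP)[0]                 # frame-accurate energy
    onsets = librosa.onset.onset_detect(y=y, sr=SR, hop_length=HOP, units="time")
    return {
        "fps": SR / HOP,
        "rms": (rms / (rms.max() + 1e-9)).round(4).tolist(),          # normalized to [0, 1]
        "onsets": np.round(onsets, 3).tolist(),                       # strong-onset times (s)
    }

song = {stem: analyze_stem(f"{stem}.wav") for stem in ["drums", "bass", "vocals", "other"]}
with open("harmoni_song.json", "w") as f:
    json.dump(song, f)
```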
The iOS App
Everything runs natively on iPhone 15/16 using Swift and SwiftUI. The stack includes AVFoundation for audio and camera capture, Vision framework for real-time hand pose detection, Core ML with the Apple Neural Engine for model inference, Core Haptics for nuanced haptic patterns, the Accelerate framework for FFT-based audio processing, and SwiftUI Canvas with GraphicsContext for the HarmoniAI visual engine rendering comets, shockwaves, and aurora streaks at 60fps.
Challenges We Ran Into
The Training-Inference Contract
The hardest lesson we learned was about what we came to call the invisible contract between training and inference. When we first deployed QuietAlert to an iPhone, it detected nothing. The model was achieving over 90% accuracy in Python but was completely blind on the device.
The culprit was a subtle mismatch in mel filterbank parameters. librosa and torchaudio use slightly different defaults for normalization and mel scale type. A single parameter difference made the spectrograms look completely different to the model even though they sounded identical to a human ear. We rebuilt the entire audio preprocessing pipeline twice before the detections started working.
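The fix came down to pinning both libraries to the same mel settings. A sketch of the mismatch, where the parameter names are real librosa/torchaudio arguments and the comparison itself is illustrative:

```python
import numpy as np
import torch
import torchaudio
import librosa

y = np.random.randn(110250).astype(np.float32)   # stand-in for any 5 s clip at 22,050 Hz

# librosa defaults: Slaney mel scale, Slaney filterbank normalization.
mel_ref = librosa.feature.melspectrogram(
    y=y, sr=22050, n_fft=1024, hop_length=512, n_mels=64
)

# torchaudio defaults: HTK mel scale, no normalization -> a different "image" entirely.
mel_default = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=512, n_mels=64,
)(torch.from_numpy(y)).numpy()

# Pinning two parameters closes most of the gap (edge padding can still differ slightly).
mel_matched = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=512, n_mels=64,
    norm="slaney", mel_scale="slaney",
)(torch.from_numpy(y)).numpy()

print(np.abs(mel_ref - mel_default).max())   # large mismatch
print(np.abs(mel_ref - mel_matched).max())   # near zero except at clip edges
```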
Coordinate System Mismatch in BridgeAI
Apple's Vision framework reports hand landmark coordinates with the Y-axis flipped relative to how the ASL training images were captured. On top of that, the front camera mirrors left and right. Without correcting both transformations, the model confidently predicted the wrong letter every single time. Getting both hands to work required:
- Flipping Y coordinates: $y' = 1.0 - y$
- Detecting hand chirality from Vision
- Conditionally mirroring X for left hands: $x' = 1.0 - x$
- Then applying wrist-relative normalization
The order of operations matters. Getting it wrong in any step produces nonsense predictions even though the model loaded successfully.
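For reference, the same correction expressed in Python; the app applies it in Swift on Vision's output, and the function name here is illustrative:

```python
import numpy as np

def correct_vision_landmarks(landmarks: np.ndarray, is_left_hand: bool) -> np.ndarray:
    """Order-sensitive fixes applied before the landmarks reach the classifier.

    `landmarks` has shape (21, 3) in Vision's coordinate space; index 0 is the wrist.
    """
    pts = landmarks.copy()
    pts[:, 1] = 1.0 - pts[:, 1]        # 1. flip Y: Vision's origin differs from the training images
    if is_left_hand:
        pts[:, 0] = 1.0 - pts[:, 0]    # 2. un-mirror X for left hands (front-camera mirroring)

    # 3. wrist-relative normalization, only after the flips above
    pts -= pts[0]
    scale = np.max(np.abs(pts))
    return (pts / scale if scale > 0 else pts).flatten()
```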
Accomplishments That We're Proud Of
We are proud that it works. Not in theory. On a real iPhone, in a real room, with real sounds and real hands. QuietAlert detects a dog barking and a fire alarm and sends a haptic to your wrist. BridgeAI reads B, A, B, Y from a hand it has never seen before. HarmoniAI plays a song and the screen erupts in light that moves with the music.
We are proud that every model runs entirely on device. No server. No internet connection required. No one ever sees your audio or your hands. For a population that has often had their privacy and dignity compromised by systems that were not built for them, that matters deeply.
What We Learned
We learned that accessibility is not a feature you add at the end. It is a decision you make at the beginning about who your product is for.
We learned that the gap between a model that works in a notebook and a model that works in someone's hand is enormous, and that gap is where most real engineering lives.
We learned that the deaf and hard-of-hearing community does not need pity. They need tools that actually work.
What's Next for SenseAI
- Lip reading — the module is architected but not yet trained. We believe it could be the most impactful feature of all, since the majority of deaf people do not know ASL
- Custom sound training in QuietAlert — let users record their own sounds so the app learns their specific doorbell, microwave, or baby's cry
- Push notifications so QuietAlert works when the phone is across the room and the screen is off
- HarmoniAI upload mode — a full server-side Demucs pipeline so any song can be processed and experienced, not just pre-processed demos
The only version of this project that matters is the one that makes someone's life genuinely better. We are not done yet.