Phone With Hands

Inspiration

I have a cousin who cannot hear and has difficulty speaking, so communicating with her, and truly understanding what she means and how she feels, has always been difficult. When I moved to the States, I realized the problem is much worse for her community, because so much of daily life runs through a phone call. Booking a doctor's appointment, calling a pharmacy, reaching customer service, sorting out an issue with a bank: the answer is always "just give us a call."

But for people who are deaf, hard of hearing, or have a speech disability, a phone call is exactly the thing they can't easily do. Even with relay services or texting, something essential gets lost: there's no quick way to speak in their own voice, in real time, with the emotion and tone that make a conversation feel human. They're often forced to depend on someone else to make the call for them, or to communicate with people who cannot understand their own "voice" or language.

Phone With Hands started with a simple, personal question: what if my cousin could pick up the phone, sign naturally, and the person on the other end would simply hear her, in her own words, tone and all?

What it does

Phone With Hands lets people who use sign language make and take real phone calls — live, in both directions.

Sign → Speech: You sign into the camera, and the app turns your signs into a natural sentence and speaks it aloud to the other person in a real, expressive voice with meaningful tone — so you're actually talking, not just sending text.
Speech → Sign: When they reply, the app instantly shows you the meaning, tone, and key info of what they said, and renders the response as ASL gloss with a 3D signing avatar, so the conversation comes back visually rather than as plain text.
A real call: Your phone is the actual handset (mic + speaker) while the app handles the camera, translation, and avatar. It also includes a guided doctor's-appointment scenario, a free testing ground, quick-phrase buttons, and live captions.
In short: it gives people who sign a way to talk on the phone in their own voice and emotion, and to receive replies back visually through sign — independently and in real time.

How I built it

Phone With Hands is a Next.js (React + TypeScript) web app combining on-device computer vision, a lightweight classifier, an AI "brain," and a real-time audio bridge.

Sign recognition (on-device): MediaPipe Hands extracts 21 hand landmarks from the webcam in the browser (video never leaves the device) and feeds them into a custom KNN classifier I wrote. I built an in-browser trainer and recorded ~1,400 samples across 18 signs as a seed model.
Customized handshapes: For sign → speech I deliberately trained simplified, distinct handshapes instead of motion-heavy ASL — single-frame recognition can't handle real ASL's movement/two hands, so custom shapes train fast and run reliably. (Authentic ASL is reserved for the avatar output.)
The brain (ASI:1): ASI:1 (Fetch.ai) sits in the middle — turning choppy sign-gloss into natural, tone-aware sentences, and distilling caller speech into meaning, tone, and key info.
Voice: ElevenLabs for expressive TTS and Scribe STT, with the browser's Web Speech API as a free fallback.
3D avatar: A Three.js / React-Three-Fiber Mixamo avatar signs replies from the gloss, with text shown alongside.
Real phone handset: A custom Node WebSocket bridge + ngrok turns a phone into the call's mic/speaker (16 kHz PCM streaming) while the Mac runs the camera + avatar.
AI tools: As a solo dev and fairly new hacker, I used Claude Code for heavy feature builds and Simular's Sai agent for orchestration, debugging, and infrastructure, with tightly scoped tasks, type-checked changes, and one commit per phase.
Reliability by design: Every link has a fallback (rule-based AI, Web Speech, typed input, text under the avatar) so a live call never hard-fails.
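
As a rough illustration of the sign-recognition step described above, the sketch below normalizes 21 hand landmarks into a position- and scale-invariant vector and classifies it with a plain k-nearest-neighbors vote. It is not the project's actual code: the names Landmark, Sample, toFeatures, and classify, the wrist-centered normalization, and the default k = 5 are all assumptions made for the example.

```typescript
// One hand landmark: x and y are normalized image coordinates, z is relative depth.
interface Landmark { x: number; y: number; z: number; }

// A labeled training example (hypothetical shape for this sketch).
interface Sample { label: string; features: number[]; }

// Convert 21 landmarks into a position- and scale-invariant feature vector:
// shift so the wrist (index 0) is the origin, then divide by the largest
// wrist-to-landmark distance.
function toFeatures(landmarks: Landmark[]): number[] {
  const wrist = landmarks[0];
  if (landmarks.length !== 21 || wrist === undefined) {
    throw new Error(`expected 21 landmarks, got ${landmarks.length}`);
  }
  let scale = 0;
  for (const p of landmarks) {
    scale = Math.max(scale, Math.hypot(p.x - wrist.x, p.y - wrist.y, p.z - wrist.z));
  }
  if (scale === 0) scale = 1; // degenerate input: avoid dividing by zero
  const features: number[] = [];
  for (const p of landmarks) {
    features.push((p.x - wrist.x) / scale, (p.y - wrist.y) / scale, (p.z - wrist.z) / scale);
  }
  return features;
}

// Squared Euclidean distance; ordering by it matches ordering by true distance.
function squaredDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = (a[i] ?? 0) - (b[i] ?? 0);
    sum += d * d;
  }
  return sum;
}

// Majority vote among the k nearest samples; confidence is the winning vote share.
function classify(
  samples: Sample[],
  query: number[],
  k = 5,
): { label: string; confidence: number } | null {
  if (samples.length === 0) return null;
  const nearest = samples
    .map((s) => ({ label: s.label, dist: squaredDistance(s.features, query) }))
    .sort((x, y) => x.dist - y.dist)
    .slice(0, Math.max(1, k));
  const votes: Record<string, number> = {};
  let best = "";
  let bestVotes = 0;
  for (const n of nearest) {
    const count = (votes[n.label] ?? 0) + 1;
    votes[n.label] = count;
    if (count > bestVotes) {
      best = n.label;
      bestVotes = count;
    }
  }
  return { label: best, confidence: bestVotes / nearest.length };
}
```

In an app like this, a confidence threshold on the returned vote share would be one way to ignore frames where the hand is between shapes; the right threshold would have to be tuned on real data.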

Challenges I ran into

Animating real ASL on the 3D avatar: This was the hardest part. My first approach tried to retarget live MediaPipe hand-tracking onto the avatar's 3D skeleton, but the mapping kept dropping the arms off-screen ("invisible hands"), and two-handed signs broke because the left hand was barely tracked. I made a pragmatic call to drive the avatar with clean Mixamo animations instead. In practice, only HELLO is fully animated (a wave) right now, while every other reply falls back to ASL gloss + text.
Single-frame recognition can't see motion: Real ASL depends on movement, location, and two hands, but my KNN classifier only sees one frame at a time. I worked around it with custom static handshapes and a tiered strategy to keep recognition fast and reliable.
Solo two-handed capture for training the model: I couldn't hold a key while signing with both hands, so I built a hands-free auto-capture for the trainer: press Enter, and recording starts after a countdown.
Real calls without Twilio: Turning a phone into a real handset meant building a WebSocket audio bridge and getting around iOS Safari's strict secure-origin requirement for mic access — solved with an ngrok tunnel and a single-origin proxy.
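
The hands-free capture idea from the list above can be sketched as a small state machine that the per-frame tracking loop drives. This is only an illustration under assumptions: the class name CountdownCapture, the 3-second countdown, and the 30-frame target are hypothetical and not taken from the project.

```typescript
type CaptureState = "idle" | "countdown" | "recording" | "done";

// Collects frames hands-free: start() begins a countdown, then tick() records
// frames until the target count is reached. Time is passed in explicitly so the
// logic stays independent of any particular timer or browser API.
class CountdownCapture<Frame> {
  private state: CaptureState = "idle";
  private startedAt = 0;
  private frames: Frame[] = [];
  private readonly countdownMs: number;
  private readonly targetFrames: number;

  constructor(countdownMs = 3000, targetFrames = 30) {
    this.countdownMs = countdownMs;
    this.targetFrames = targetFrames;
  }

  // Call when the user presses Enter.
  start(nowMs: number): void {
    this.state = "countdown";
    this.startedAt = nowMs;
    this.frames = [];
  }

  // Call once per video frame; pass null when no hand is detected.
  tick(nowMs: number, frame: Frame | null): CaptureState {
    if (this.state === "countdown" && nowMs - this.startedAt >= this.countdownMs) {
      this.state = "recording";
    }
    if (this.state === "recording" && frame !== null) {
      this.frames.push(frame);
      if (this.frames.length >= this.targetFrames) this.state = "done";
    }
    return this.state;
  }

  // Whole seconds remaining, for an on-screen countdown.
  secondsLeft(nowMs: number): number {
    if (this.state !== "countdown") return 0;
    return Math.max(0, Math.ceil((this.countdownMs - (nowMs - this.startedAt)) / 1000));
  }

  // Copy of the frames captured so far.
  result(): Frame[] {
    return this.frames.slice();
  }
}
```

In a browser, one would presumably call start(performance.now()) from a keydown handler for Enter and tick(performance.now(), features) from the hand-tracking callback, leaving both hands free once the countdown begins.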

Accomplishments that I am proud of

A genuinely two-way ASL ↔ speech call that runs live. Not a one-direction gimmick — you sign and they hear you, they speak and you see it come back as sign. Both directions work in real time.
A real phone handset, not a screen demo. I turned an actual iPhone into the call's mic and speaker over my own WebSocket bridge — proving the conversation happens on a real phone, without needing Twilio or any telephony account.
Expressive, human-sounding speech. The user's voice comes through with real tone and emotion via ElevenLabs, instead of flat robotic text-to-speech — which is the whole point of letting someone talk expressively.
A from-scratch, in-browser sign recognizer. I built my own trainer and KNN classifier on top of MediaPipe, trained it on ~1,400 samples across 18 signs, and bundled the result as a seed model that loads instantly on any machine. Everything runs on-device, and no video ever leaves it.
Shipping this much, solo, in a weekend. Combining computer vision, an AI language brain, voice, a 3D avatar, and live telephony into one working app as a single developer — by orchestrating AI tools effectively rather than cutting scope.
Built to never break. Layered fallbacks at every step mean the app keeps working even when a mic, model, or API isn't available.
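
One common way to get this kind of layered fallback is to try each provider in order and move on when one throws. The sketch below shows that pattern in general terms; withFallbacks, Step, and FallbackResult are hypothetical names for this example, and the project's real fallback logic may be structured differently.

```typescript
// One candidate way of doing a task, e.g. a cloud API or a local fallback.
interface Step<T> {
  name: string;
  run: () => Promise<T>;
}

interface FallbackResult<T> {
  value: T;
  usedStep: string; // which step finally succeeded
  errors: string[]; // why the earlier steps failed
}

// Try each step in order and return the first success.
// Only throws if every step fails.
async function withFallbacks<T>(steps: Step<T>[]): Promise<FallbackResult<T>> {
  const errors: string[] = [];
  for (const step of steps) {
    try {
      const value = await step.run();
      return { value, usedStep: step.name, errors };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${step.name}: ${message}`);
    }
  }
  throw new Error(`all fallbacks failed: ${errors.join("; ")}`);
}
```

For text-to-speech, for example, the steps might be an ElevenLabs request, then the browser's Web Speech API, then simply showing the text, with the last step written so that it cannot fail.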

What I learned

How to actually orchestrate AI tools and not just prompt them: The biggest lesson was learning to use AI development tools efficiently: scoping tight, well-defined tasks, verifying every change with type-checks, committing per phase, and knowing when to let Claude Code build heavy features versus when to make a surgical fix myself. Used well, AI tools let one person ship the surface area of a whole team.
Computer vision in the browser. I learned how hand-landmark detection works with MediaPipe, how to turn raw landmarks into normalized, position-invariant features, and why on-device inference matters for both latency and privacy.
Training a light, practical model: Building my own KNN classifier taught me that the right-sized model often beats the fanciest one — a simple, explainable classifier I could train live in minutes was far more reliable for a demo than a heavy deep model, as long as I designed the input (clean, distinct handshapes) to play to its strengths.
Real-time audio is hard: Streaming microphone audio between two devices taught me about sampling rates, PCM downsampling, voice-activity detection, and the strict security rules browsers enforce around microphone access.
Design for failure: Working on accessibility software made it concrete that reliability beats novelty. Every feature needs a graceful fallback, because the people who'd actually depend on this can't afford for it to break.
Setting up the technical tools and basics: I got hands-on with the unglamorous-but-essential plumbing: configuring a Next.js project and dev environment, managing API keys and environment variables securely (.env.local, server-side proxy routes so secrets never reach the client), understanding API key permissions and scopes, wiring up multiple third-party APIs (ElevenLabs, ASI:1), and standing up local servers + an ngrok tunnel for cross-device testing.
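
To make the PCM downsampling mentioned under real-time audio concrete, here is a minimal sketch that converts floating-point microphone samples to 16 kHz, 16-bit PCM. It assumes mono input at a higher rate such as 48 kHz and uses simple block averaging as a crude low-pass filter; the function name downsampleToInt16 is hypothetical, and the project's actual bridge may resample differently.

```typescript
// Downsample mono Float32 audio (values in [-1, 1]) to a lower sample rate by
// averaging each block of input samples, then convert to signed 16-bit PCM.
function downsampleToInt16(
  input: Float32Array,
  inputRate: number,
  outputRate = 16000,
): Int16Array {
  if (outputRate > inputRate) {
    throw new Error("outputRate must not exceed inputRate");
  }
  const ratio = inputRate / outputRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    // Input samples that map onto output sample i.
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j] ?? 0;
    const avg = end > start ? sum / (end - start) : 0;
    // Clamp, then scale to the asymmetric int16 range [-32768, 32767].
    const clamped = Math.max(-1, Math.min(1, avg));
    out[i] = clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff);
  }
  return out;
}
```

The resulting Int16Array's underlying buffer can be sent as a binary WebSocket message. Block averaging keeps the code short, but a proper low-pass filter before decimation would reduce aliasing.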

What's next for Phone With Hands

Real phone calls (PSTN): Integrate Twilio so the app can dial any real phone number — not just a paired device — and work as a true relay replacement for everyday calls.
Authentic, motion-aware sign language: Move beyond single-frame custom handshapes to temporal models (LSTM / MediaPipe Holistic) that recognize real ASL with movement, location, and two hands.
More and richer training data: Expand the vocabulary far beyond 18 signs with a larger, more diverse dataset — different signers, lighting, and camera angles — for robust real-world accuracy.
A fully signing 3D avatar: Build out the avatar so it can sign complete responses (not just HELLO), bringing authentic ASL to the speech → sign direction.
On-device and private: Move more of the pipeline on-device for fully private, offline-capable calls.
Multi-language support: Extend to other sign languages and spoken languages so it works beyond English/ASL.

Built With

accessibility
asi1
asl
claude-code
computer-vision
css-framework/ui
elevenlabs
fetch.ai
framer-motion
javascript
knn
machine-learning
mediapipe
next.js
ngrok
node.js
react
react-three-fiber
sign-language
speech-to-text
tailwind-css
text-to-speech
three.js
twilio
typescript
web-audio-api
web-speech-api
webgl
websocket

Updates

Quan Le started this project — Jun 21, 2026 12:41 PM EDT
