What ARIA is

ARIA (Adaptive Real-time Intelligence Assistant) is a universal accessibility assistant built as a React Native (Expo) mobile client and a FastAPI backend tuned for edge deployment on an NVIDIA Jetson Orin Nano Super. It unifies two experiences that are rarely shipped as one product:

  • SIGN mode — interpret American Sign Language from the phone camera using short bursts of frames (not a single still), combine that with facial emotion, complete the user’s intent into a natural sentence, optionally translate, and speak it with expressive text-to-speech.
  • GUIDE mode — keep the camera pointed forward for obstacle awareness using local, GPU-accelerated detection, and pair that with walking directions so navigation is spoken and situational, not only “turn left on Main Street.”

Our north star is in the repo itself: everyone deserves a voice — literally for users who need spoken output, and metaphorically for a design philosophy that treats accessibility as first-class product engineering, not an afterthought.


What inspired us

We started from a frustratingly common observation: accessibility tools are often fragmented. One app for captions, another for navigation, another for emergency workflows — each with different accounts, latency profiles, and failure modes. Meanwhile, multimodal AI (vision + language + audio) finally made it plausible to build one coherent loop: see → understand → act → speak.

We were especially moved by two asymmetries:

  1. Expressive communication — A deaf or mute user’s signing is rich; a naive “letter-by-letter” pipeline can erase grammar, prosody, and emotion. We wanted output that sounds human, not like a robot reading out a telegram.
  2. Safety under motion — For blind users, cloud-only obstacle checks can lag or fail offline. That’s why our stack treats local vision as the right tool for fast, repeated spatial questions, while reserving large multimodal models for language-heavy interpretation where a little latency is acceptable if the sentence is right.

The project was built for Kent Hack Enough 2026 at Kent State University, where we could stress-test the idea end-to-end: real phones, real networks, real-time demos.


What we learned

Real-time is a budget, not a feature

If the phone sends frames at rate r (e.g. r ≈ 10 fps), each frame has a nominal spacing Δt = 1/r. End-to-end latency L (capture + upload + inference + TTS + playback) must stay well below the time users expect for “live” interaction. A useful mental model is how many frames slip past while waiting:

  n_lag = L · r

If L = 1.2 s and r = 10 fps, then n_lag = 12: the world has moved on by a dozen frames. That’s why batching short sequences and tight timeouts matter as much as model accuracy.
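The frame-lag budget above is a one-liner, but it is worth making executable (the function name and the sample values are illustrative):

```python
def frames_of_lag(latency_s: float, fps: float) -> float:
    """Frames that slip past while one end-to-end round trip completes."""
    return latency_s * fps

# 1.2 s of end-to-end latency at 10 fps:
print(frames_of_lag(1.2, 10.0))  # → 12.0
```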

Reliability beats a single “best” model

We implemented a vision manager that walks an ordered provider list (e.g. Gemini → OpenAI → Claude → optional AWS Bedrock → local), with per-provider health and logging. The probability that at least one provider works in a given window is higher when failures are not perfectly correlated:

  P(system OK) = 1 − p_1 · p_2 · … · p_k

where p_i is the probability that provider i fails, treated as independent (a simplification, since real outages correlate, but the lesson holds: fallback chains turn demos into systems).
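A minimal sketch of that fallback walk, together with the availability math (the provider interface and names are hypothetical, not the repo’s actual classes):

```python
from math import prod

def first_healthy(providers, request):
    """Walk an ordered (name, call) provider list; return the first success.

    Hypothetical stand-in for the real vision manager: each `call` raises
    on timeout, rate limit, or outage, and we simply move to the next one.
    """
    errors = {}
    for name, call in providers:
        try:
            return name, call(request)
        except Exception as exc:
            errors[name] = exc  # would also mark the provider unhealthy
    raise RuntimeError(f"all providers failed: {errors}")

def p_system_ok(fail_probs):
    """P(at least one provider works), assuming independent failures."""
    return 1 - prod(fail_probs)
```

With three providers that each fail 10% of the time, `p_system_ok([0.1, 0.1, 0.1])` comes out at ≈ 0.999: the system only goes dark when all three fail at once.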

Obstacles deserve a different architecture than ASL

ASL interpretation benefits from sequence understanding and richer models; obstacle scanning benefits from low jitter and predictable compute. Our backend routes obstacle detection through local YOLO on the Jetson when available — trading a bit of semantic richness for speed and fewer moving parts on the network path.
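That routing decision is simple to state in code (a sketch with hypothetical callables, not the backend’s actual interface):

```python
def route_obstacle_request(frame, local_yolo=None, cloud_vision=None):
    """Prefer the local, low-jitter detector; fall back to cloud vision.

    `local_yolo` and `cloud_vision` are hypothetical callables standing in
    for YOLO on the Jetson and a hosted multimodal model respectively.
    """
    if local_yolo is not None:
        return "local", local_yolo(frame)    # fast, predictable compute
    if cloud_vision is not None:
        return "cloud", cloud_vision(frame)  # richer, but higher latency
    raise RuntimeError("no obstacle detector available")
```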

Accessibility is also ops

We learned to care about cross-network access (tunneling when the phone isn’t on the same LAN), permission UX (camera, mic, location), SOS logging with coordinates, and profiles/preferences so users aren’t stuck with one default behavior.


How we built it

Mobile (Expo)

  • SIGN and GUIDE flows with camera capture, authentication, profile, and server URL configuration.
  • WebSockets for streaming sign frames and navigation/location updates; REST for obstacles, navigation bootstrap, TTS endpoints, and user data.
  • Haptics and audio for urgent cases (e.g. SOS).

Backend (FastAPI on the Jetson)

  • JWT auth, user profile/preferences, transcripts, navigation logs, SOS events, and API usage telemetry (exact persistence varies: SQLite in the main README path; Postgres/Mongo/Redis in the Docker quickstart — same product idea, different deployment scale).
  • Vision abstraction with provider implementations and a fallback orchestrator (timeouts, unhealthy marking, usage logging).
  • ElevenLabs for emotion-aware speech in SIGN mode; Google Maps for walking directions in GUIDE mode.
  • Unit tests around auth, sign/guide flows, ASL helpers, and vision fallback behavior.

DevOps and demo readiness

  • Docker Compose option with nginx fronting the API for a cleaner deployment story.
  • Health endpoints for quick “is the edge brain alive?” checks from the phone or a laptop.

Challenges we faced

  1. Latency vs. accuracy for ASL — Single frames lie. Signs are motion. We moved toward buffered multi-frame analysis with a longer timeout for sequence calls, which stabilizes recognition but complicates UX (“processing…” states, backpressure).
  2. Provider flakiness — Timeouts and rate limits taught us to treat cloud APIs as best-effort: log, degrade, and fail over without taking down the whole demo.
  3. Obstacle semantics — We needed short, actionable warnings (“chair on your left”) rather than essays, and we needed them fast — hence local inference for that path.
  4. Networking reality — Campus Wi-Fi and phone hotspots don’t care about your architecture diagram; tunneling and clear error surfaces in the app became part of the feature set.
  5. Ethical scope — ASL is a full language; a weekend/hackathon build can’t replace interpreters or claim completeness. We focused on transparent limitations, user control, and assistive positioning rather than “solved ASL.”
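The buffered multi-frame capture from challenge (1) can be sketched as an async collector that flushes early when the stream stalls (buffer size, timeout, and names are illustrative):

```python
import asyncio

async def collect_burst(frame_queue: asyncio.Queue, size: int = 8,
                        timeout_s: float = 1.5) -> list[bytes]:
    """Gather up to `size` frames, flushing early if the stream stalls.

    Returns whatever arrived within the window, so a slow network degrades
    to a shorter burst instead of a frozen "processing..." state.
    """
    burst: list[bytes] = []
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while len(burst) < size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            burst.append(await asyncio.wait_for(frame_queue.get(), remaining))
        except asyncio.TimeoutError:
            break
    return burst
```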

Where we go next

  • Deeper sign language coverage beyond demo scope, with community validation.
  • On-device or hybrid models where privacy or bandwidth demands it.
  • Personalization (voice, vocabulary, sensitivity of obstacle alerts) driven by real user testing.
