Inspiration
Most vision accessibility tools focus on reading text aloud. During early user testing with a visually impaired volunteer, a simple but important insight emerged:
“I don’t need the phone to read the text. I need it to tell me what to do with it.”
That changed the direction of the project.
Instead of another OCR reader, the goal became an AI vision agent that understands context and triggers actions instantly.
Another major issue with existing tools is that they rely heavily on cloud processing, which introduces latency, privacy concerns, and complete failure when internet connectivity is poor.
ArmVision Assist was created to answer a simple question:
What if a phone could see real-world text and immediately help users act on it — entirely offline?
What it does
ArmVision Assist is an offline AI vision agent that observes text through a smartphone camera and converts it into immediate, actionable commands.
The system processes live camera frames and performs three tasks simultaneously:
Text Understanding: On-device OCR extracts text from the camera feed in real time.
Context Inference: A lightweight inference engine determines the meaning of detected text, such as:
phone numbers
URLs
email addresses
payment links
medicine labels
warning signs
Action Generation: Based on the detected context, the agent suggests actions such as:
Call detected phone numbers
Open websites
Compose emails
Trigger UPI payments
Save contacts
Issue safety alerts for hazardous warnings
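As a rough illustration of the action-generation step, a detected context can map straight to an Android intent URI. The type names and URI templates below are a simplified sketch, not the app's actual code; on Android, each non-null URI would be handed to an `ACTION_VIEW` intent.

```kotlin
// Sketch of context-to-action mapping; ContextType and the URI
// templates are illustrative, not the app's real types.
enum class ContextType { PHONE, URL, EMAIL, UPI, HAZARD }

data class SuggestedAction(val label: String, val intentUri: String?)

fun suggestAction(type: ContextType, text: String): SuggestedAction = when (type) {
    ContextType.PHONE  -> SuggestedAction("Call", "tel:$text")
    ContextType.URL    -> SuggestedAction(
        "Open Link",
        if (text.startsWith("http")) text else "https://$text"
    )
    ContextType.EMAIL  -> SuggestedAction("Email", "mailto:$text")
    ContextType.UPI    -> SuggestedAction("Pay", "upi://pay?pa=$text")
    ContextType.HAZARD -> SuggestedAction("Safety Alert", null) // haptics + TTS, no intent
}
```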
The system runs fully offline on ARM devices, providing fast responses, improved privacy, and reliability even in airplane mode.
How we built it
ArmVision Assist combines several on-device systems designed for efficient mobile performance.
Vision Pipeline
CameraX live frame processing
Zero-copy ImageAnalysis pipeline using YUV_420_888
Google ML Kit Text Recognition v2 for on-device OCR
Lighting histogram to detect low-light conditions and trigger the flashlight automatically
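The low-light check can be sketched in a few lines: build a luma histogram from the Y plane of a YUV_420_888 frame and turn on the torch when the mean falls below a threshold. The 40/255 cutoff here is an assumed value, not the app's tuned number.

```kotlin
// Illustrative low-light detection over the Y (luma) plane bytes.
// JVM bytes are signed, so mask with 0xFF to recover 0..255 values.
fun isLowLight(yPlane: ByteArray, threshold: Int = 40): Boolean {
    if (yPlane.isEmpty()) return false
    // 256-bin luma histogram.
    val hist = IntArray(256)
    for (b in yPlane) hist[b.toInt() and 0xFF]++
    // Mean luma from the histogram; dark scenes trigger the flashlight.
    var sum = 0L
    for (v in 0..255) sum += v.toLong() * hist[v]
    return sum / yPlane.size < threshold
}
```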
ARM-Optimized Hazard Classifier
Custom lightweight model trained on hazard keywords and safety patterns
Converted to TensorFlow Lite INT8 quantized format
Optimized for ARM NEON instructions
Inference latency of approximately 14–20 ms on mid-range ARM devices
Context Intelligence Engine
A hybrid reasoning system combining:
keyword density scoring
regex pattern detection
domain-specific rules (finance, safety, medical)
This allows the system to interpret extracted text and determine possible actions.
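A minimal sketch of that hybrid reasoning, assuming illustrative regexes and a toy hazard vocabulary (the real rule set is richer):

```kotlin
// Toy hybrid classifier: regex pattern detection plus keyword-density
// scoring, with a domain rule that safety outranks other matches.
val phoneRegex = Regex("""\+?\d[\d\s-]{7,}\d""")
val urlRegex   = Regex("""(https?://)?[\w-]+(\.[\w-]+)+(/\S*)?""")
val emailRegex = Regex("""[\w.+-]+@[\w-]+\.[\w.]+""")
val hazardKeywords = setOf("flammable", "toxic", "poison", "danger", "corrosive")

fun classify(text: String): String {
    val tokens = text.lowercase().split(Regex("""\W+""")).filter { it.isNotBlank() }
    // Keyword density: fraction of tokens in the hazard vocabulary.
    val density = if (tokens.isEmpty()) 0.0
                  else tokens.count { it in hazardKeywords }.toDouble() / tokens.size
    return when {
        density >= 0.2                   -> "hazard"  // safety rule wins
        emailRegex.containsMatchIn(text) -> "email"
        phoneRegex.containsMatchIn(text) -> "phone"
        urlRegex.containsMatchIn(text)   -> "url"
        else                             -> "plain_text"
    }
}
```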
AR Action Interface
Real-time AR overlays mapped to camera coordinates
Interactive action chips (Call, Email, Open Link, Pay)
Haptic alerts for safety warnings
Speech throttling to prevent repetitive announcements
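The speech-throttling idea can be sketched as a small cooldown map keyed by utterance. The 5-second window is an assumed value, and time is injected so the logic stays testable off-device.

```kotlin
// Suppress TTS repeats of the same utterance inside a cooldown window.
class SpeechThrottle(private val cooldownMs: Long = 5_000) {
    private val lastSpoken = HashMap<String, Long>()

    /** Returns true if the utterance should be spoken now, false if throttled. */
    fun shouldSpeak(utterance: String, nowMs: Long): Boolean {
        val last = lastSpoken[utterance]
        if (last != null && nowMs - last < cooldownMs) return false
        lastSpoken[utterance] = nowMs
        return true
    }
}
```

In the app, a gate like this would sit in front of Android's `TextToSpeech.speak` calls.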
The entire application was built as a native Kotlin Android app with no cloud dependencies.
Challenges we ran into
Developing a real-time vision system on mobile hardware presented several challenges.
Frame latency and buffer conversion: Initial OCR pipelines caused frame drops due to unnecessary image buffer conversions. Implementing a zero-copy pipeline significantly improved processing speed.
Thermal throttling: Sustained OCR workloads ran on the big CPU cores and caused thermal throttling. Profiling with Perfetto helped move non-critical work to the LITTLE cores.
Low-contrast hazard detection: Early versions of the hazard classifier struggled with faded labels. Synthetic training data was expanded with blurred and warped text samples to improve robustness.
AR overlay alignment: Different devices have different camera sensor aspect ratios. A normalized coordinate mapping layer was implemented to ensure stable overlay positioning across devices.
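That mapping layer can be sketched as a scale-and-offset transform, assuming a center-crop (FILL_CENTER) preview; this is the standard aspect-ratio correction, not the app's exact code.

```kotlin
// Map an OCR bounding box from analysis-frame pixels to preview-view
// pixels under a center-crop preview: scale by the larger axis ratio,
// then offset so the crop stays centered.
data class Box(val left: Float, val top: Float, val right: Float, val bottom: Float)

fun mapToView(box: Box, frameW: Int, frameH: Int, viewW: Int, viewH: Int): Box {
    val scale = maxOf(viewW / frameW.toFloat(), viewH / frameH.toFloat())
    val dx = (viewW - frameW * scale) / 2f  // <= 0 when cropped horizontally
    val dy = (viewH - frameH * scale) / 2f  // <= 0 when cropped vertically
    fun x(v: Float) = v * scale + dx
    fun y(v: Float) = v * scale + dy
    return Box(x(box.left), y(box.top), x(box.right), y(box.bottom))
}
```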
Speech spam from continuous OCR: Constant frame analysis triggered repeated TTS output. A speech throttling and context grouping mechanism was added to improve user experience.
Accomplishments that we're proud of
Achieved real-time on-device OCR with sub-200ms latency
Built a fully functional offline AI agent with zero cloud dependency
Implemented ARM-optimized INT8 ML inference
Developed a context-aware action engine that turns text into real-world actions
Created AR overlays and haptic feedback for accessibility
Successfully deployed an installable Android APK for public testing
The project demonstrates that powerful AI assistance can run entirely on-device, without relying on cloud infrastructure.
What we learned
Building ArmVision Assist required exploring multiple aspects of mobile AI engineering.
Key learnings included:
Performance optimization for ARM big.LITTLE CPU architectures
Practical effects of INT8 quantization on latency and battery usage
Using profiling tools like Perfetto and Systrace to diagnose performance bottlenecks
Designing AR overlays that remain stable during camera jitter
Balancing AI inference workloads with mobile thermal limits
The biggest takeaway was that accessibility tools benefit more from speed and reliability than from feature complexity.
What's next for ArmVision Assist — Offline Vision Action Agent
Several improvements are planned to expand the system's capabilities.
Multilingual support: Add OCR and context understanding for languages such as Hindi, Tamil, Bengali, and Arabic.
GPU / ARM NN acceleration: Integrate GPU delegates or ARM NN for faster inference on supported devices.
Symbol-based hazard detection: Extend the hazard classifier to detect visual warning symbols, not just text.
Personalized learning: Introduce on-device learning to adapt to individual user preferences.
Accessibility SDK: Create a lightweight SDK so NGOs and accessibility device manufacturers can integrate the technology into low-cost assistive devices.
Try it out yourself! (Google Drive link with APK provided)
Built With
- Android (native Kotlin)
- CameraX real-time camera frames
- Google ML Kit Text Recognition v2 (on-device OCR)
- TensorFlow Lite custom model (INT8 quantization, ARM NEON optimization)
- Regex-based parsing and rule-based contextual suggestions
- AR overlay rendering with coordinate mapping
- Haptic feedback
- Perfetto and Systrace performance profiling
- APK distribution

