Inspiration
We were inspired by the daily challenges faced by blind and visually impaired individuals in navigating their surroundings, recognizing people, and understanding sign language communication. We realized that while powerful AI technologies exist, they often operate in silos. Our vision was to create an intelligent system that could understand context and dynamically provide the right type of assistance—whether that's identifying a loved one's face, interpreting a hand gesture, or describing obstacles ahead. We wanted to give blind users a comprehensive "AI companion" that adapts to their environment in real-time.
What it does
Percepteye is an AI-powered assistive navigation system that helps blind and visually impaired users understand their surroundings through audio feedback. Using a Raspberry Pi with a built-in camera, the system captures real-time visual data and sends it to our intelligent semantic router. The router analyzes each frame and dynamically routes requests to one of three specialized services:
- Face Recognition + TTS: Identifies people in view and provides natural voice descriptions using ElevenLabs text-to-speech, helping users recognize known and unknown individuals
- Sign Language Detection: Interprets hand gestures and sign language alphabets using our custom-trained deep learning model
- Scene Description: When no faces or gestures are detected, uses Google Gemini 2.5 Flash to describe nearby objects, obstacles, and the surrounding environment

The system delivers real-time audio feedback, enabling users to navigate independently and interact confidently with their environment.
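The routing priority above can be sketched as a simple dispatch function. The function and flag names here are hypothetical, and in the real system the flags come from Gemini's analysis of each frame rather than being passed in directly:

```python
def route_frame(faces_detected: bool, gesture_detected: bool) -> str:
    """Pick the specialized service for a frame, mirroring the priority above."""
    if faces_detected:
        return "face_recognition_tts"
    if gesture_detected:
        return "sign_language_detection"
    # Fallback when nothing specific is in view: describe the scene.
    return "scene_description"
```

For example, a frame with a hand gesture but no face would be sent to the sign-language service: `route_frame(False, True)` returns `"sign_language_detection"`.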
How we built it
We built Percepteye as a microservices architecture deployed on Digital Ocean, with three core components:
Semantic Router (FastAPI)
- Intelligent middleware that analyzes image and audio frames using the Google Gemini API
- Routes requests to the appropriate specialized service based on context
- Handles all client communication from the Raspberry Pi

Sign Language Detection API
- Custom-trained deep learning model using transfer learning on ResNet (pre-trained on ImageNet)
- Fine-tuned on a custom hand-gesture alphabet dataset stored in Digital Ocean Spaces
- Trained using Digital Ocean's GPU AI Playground (Gradient AI)
- Dockerized and deployed as a REST API endpoint

Face Recognition + TTS API
- Face detection and recognition system
- Integrated with ElevenLabs for natural text-to-speech conversion
- Provides audio descriptions of identified individuals

Scene Description Service
- Powered by Google Gemini 2.5 Flash
- Analyzes scenes to detect objects, spatial relationships, text, and potential safety hazards
- Provides contextual audio descriptions
Technology Stack:
- Python, FastAPI, Docker
- Google Gemini API for intelligent routing and scene analysis
- ElevenLabs for natural voice synthesis
- PyTorch/ResNet for sign language model
- Raspberry Pi for edge device client
- Digital Ocean for cloud infrastructure (VMs + Spaces)
We employed a hybrid AI approach, combining pre-trained models (for efficiency) with custom-trained models (for specialized tasks), ensuring both accuracy and scalability.
Challenges we ran into
- Intelligent Context-Aware Routing
- Model Training & Transfer Learning
- API Rate Limits & Quota Management
- Real-Time Performance
- Multi-Service Coordination
- Edge Device Constraints
Accomplishments that we're proud of
- We successfully deployed a fully functional multi-service architecture on Digital Ocean with real endpoints that work end-to-end
- Built a context-aware routing system that makes smart decisions about which AI service to invoke based on visual analysis
- Trained and deployed our own deep learning model using transfer learning, achieving accurate gesture recognition
- Successfully combined multiple AI technologies (custom models + pre-trained APIs) into a cohesive system
- Unlike single-purpose assistive devices, Percepteye provides face recognition, sign language interpretation, AND scene description in one unified platform
- Integrated natural-sounding text-to-speech that makes the experience feel conversational and human
- Built with Docker and cloud-native principles, making it easy to scale and maintain
- Leveraged affordable Raspberry Pi hardware, making the solution cost-effective and accessible
What we learned
- Building the semantic router taught us that intelligent routing based on context is more valuable than having powerful individual services
- Standardizing response formats across different services early on saved us countless hours of integration debugging
- Combining custom models with pre-trained APIs (Gemini, ElevenLabs) gave us the best of both worlds—control and cutting-edge capabilities
- Distributing workloads between Raspberry Pi (capture) and cloud services (processing) optimized both cost and performance
- Working on assistive technology gave us deep insights into how AI can genuinely improve quality of life for disabled individuals
- Managing multiple services taught us valuable lessons about API versioning, error handling, and graceful degradation
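The lessons about standardized response formats and graceful degradation can be sketched together. The envelope fields here are hypothetical, not the project's actual schema:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ServiceResponse:
    """Hypothetical shared envelope that every service returns to the router."""
    service: str      # which backend produced the result
    ok: bool          # False when the service failed or degraded
    speech_text: str  # text handed to TTS for audio feedback
    detail: dict      # service-specific payload

def degrade(service: str, reason: str) -> ServiceResponse:
    """Graceful fallback: the user still hears something useful."""
    return ServiceResponse(service, False,
                           "Sorry, that service is unavailable right now.",
                           {"error": reason})

# A timed-out face-recognition call still yields a well-formed response.
resp = degrade("face_recognition_tts", "timeout")
payload = json.dumps(asdict(resp))
```

Because every service emits the same envelope, the router and the Raspberry Pi client never need service-specific parsing, which is exactly what made integration debugging cheaper.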
What's next for Percepteye
- Extend beyond alphabets to full ASL/BSL word recognition and contextual sentence interpretation
- Implement edge-optimized models for basic functionality when internet connectivity is unavailable
- Add voice-activated controls so users can request specific types of information ("Who is in front of me?" or "What objects are nearby?")
- Extend TTS and scene descriptions to multiple languages