Inspiration
We were inspired by the daily challenges faced by blind and visually impaired individuals in navigating their surroundings, recognizing people, and understanding sign language communication. We realized that while powerful AI technologies exist, they often operate in silos. Our vision was to create an intelligent system that could understand context and dynamically provide the right type of assistance—whether that's identifying a loved one's face, interpreting a hand gesture, or describing obstacles ahead. We wanted to give blind users a comprehensive "AI companion" that adapts to their environment in real-time.
What it does
Percepteye is an AI-powered assistive navigation system that helps blind and visually impaired users understand their surroundings through audio feedback. Using a Raspberry Pi with a built-in camera, the system captures real-time visual data and sends it to our intelligent semantic router. The router analyzes each frame and dynamically routes requests to one of three specialized services:
- Face Recognition + TTS: Identifies people in view and provides natural voice descriptions using ElevenLabs text-to-speech, helping users recognize known and unknown individuals
- Sign Language Detection: Interprets hand gestures and sign language alphabets using our custom-trained deep learning model
- Scene Description: When no faces or gestures are detected, uses Google Gemini 2.5 Flash to describe nearby objects, obstacles, and the surrounding environment

The system delivers real-time audio feedback, enabling users to navigate independently and interact confidently with their environment.
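The routing priority above can be sketched as a simple dispatch function. The function and flag names here are hypothetical, and in the real system the flags come from Gemini's analysis of each frame rather than being passed in directly:

```python
def route_frame(faces_detected: bool, gesture_detected: bool) -> str:
    """Pick the specialized service for a frame, mirroring the priority above."""
    if faces_detected:
        return "face_recognition_tts"
    if gesture_detected:
        return "sign_language_detection"
    # Fallback when nothing specific is in view: describe the scene.
    return "scene_description"
```

For example, a frame with a hand gesture but no face would be sent to the sign-language service: `route_frame(False, True)` returns `"sign_language_detection"`.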
How we built it
We built Percepteye as a microservices architecture deployed on Digital Ocean, with three core components:
Semantic Router (FastAPI)
- Intelligent middleware that analyzes image and audio frames using the Google Gemini API
- Routes requests to the appropriate specialized service based on context
- Handles all client communication from the Raspberry Pi

Sign Language Detection API
- Custom-trained deep learning model using transfer learning on ResNet (pre-trained on ImageNet)
- Fine-tuned on a custom hand-gesture alphabet dataset stored in Digital Ocean Spaces
- Trained using Digital Ocean's GPU AI Playground (Gradient AI)
- Dockerized and deployed as a REST API endpoint

Face Recognition + TTS API
- Face detection and recognition system
- Integrated with ElevenLabs for natural text-to-speech conversion
- Provides audio descriptions of identified individuals

Scene Description Service
- Powered by Google Gemini 2.5 Flash
- Analyzes scenes to detect objects, spatial relationships, text, and potential safety hazards
- Provides contextual audio descriptions
Technology Stack:
- Python, FastAPI, Docker
- Google Gemini API for intelligent routing and scene analysis
- ElevenLabs for natural voice synthesis
- PyTorch/ResNet for sign language model
- Raspberry Pi for edge device client
- Digital Ocean for cloud infrastructure (VMs + Spaces)
We employed a hybrid AI approach, combining pre-trained models (for efficiency) with custom-trained models (for specialized tasks), ensuring both accuracy and scalability.
Challenges we ran into
- Intelligent Context-Aware Routing
- Model Training & Transfer Learning
- API Rate Limits & Quota Management
- Real-Time Performance
- Multi-Service Coordination
- Edge Device Constraints
Accomplishments that we're proud of
- We successfully deployed a fully functional multi-service architecture on Digital Ocean with real endpoints that work end-to-end
- Built a context-aware routing system that makes smart decisions about which AI service to invoke based on visual analysis
- Trained and deployed our own deep learning model using transfer learning, achieving accurate gesture recognition
- Successfully combined multiple AI technologies (custom models + pre-trained APIs) into a cohesive system
- Unlike single-purpose assistive devices, Percepteye provides face recognition, sign language interpretation, AND scene description in one unified platform
- Integrated natural-sounding text-to-speech that makes the experience feel conversational and human
- Built with Docker and cloud-native principles, making it easy to scale and maintain
- Leveraged affordable Raspberry Pi hardware, making the solution cost-effective and accessible
What we learned
- Building the semantic router taught us that intelligent routing based on context is more valuable than having powerful individual services
- Standardizing response formats across different services early on saved us countless hours of integration debugging
- Combining custom models with pre-trained APIs (Gemini, ElevenLabs) gave us the best of both worlds—control and cutting-edge capabilities
- Distributing workloads between Raspberry Pi (capture) and cloud services (processing) optimized both cost and performance
- Working on assistive technology gave us deep insights into how AI can genuinely improve quality of life for disabled individuals
- Managing multiple services taught us valuable lessons about API versioning, error handling, and graceful degradation
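The lessons about standardized response formats and graceful degradation can be sketched together. The envelope fields here are hypothetical, not the project's actual schema:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ServiceResponse:
    """Hypothetical shared envelope that every service returns to the router."""
    service: str      # which backend produced the result
    ok: bool          # False when the service failed or degraded
    speech_text: str  # text handed to TTS for audio feedback
    detail: dict      # service-specific payload

def degrade(service: str, reason: str) -> ServiceResponse:
    """Graceful fallback: the user still hears something useful."""
    return ServiceResponse(service, False,
                           "Sorry, that service is unavailable right now.",
                           {"error": reason})

# A timed-out face-recognition call still yields a well-formed response.
resp = degrade("face_recognition_tts", "timeout")
payload = json.dumps(asdict(resp))
```

Because every service emits the same envelope, the router and the Raspberry Pi client never need service-specific parsing, which is exactly what made integration debugging cheaper.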
What's next for Percepteye
- Extend beyond alphabets to full ASL/BSL word recognition and contextual sentence interpretation
- Implement edge-optimized models for basic functionality when internet connectivity is unavailable
- Add voice-activated controls so users can request specific types of information ("Who is in front of me?" or "What objects are nearby?")
- Extend TTS and scene descriptions to multiple languages