Inspiration

Hearo was inspired by the need for accessible technology for the visually impaired and blind community. We wanted to create an app that allows users to learn more about their surroundings through voice, without having to rely on others for assistance.

What it does

Hearo is a voice-first accessibility app that empowers visually impaired and blind users to understand their surroundings through AI. Here's what users can do:

Core Features

  • 📸 Capture photos using a simple, accessible camera interface with large touch targets
  • 🤖 AI-powered image analysis that describes what's in the photo using Google Gemini Vision
  • 🎤 Voice commands to ask questions about captured images
  • 🔊 Natural text-to-speech responses using ElevenLabs API with fallback to native TTS
  • 🎧 Tutorial that explains how to use the app via voice guidance

User Flow

  1. Open the app → Tap the large camera button on the home screen
  2. Take a photo → Simple camera interface captures your surroundings
  3. Ask questions → Use voice commands to inquire about what's in the image
  4. Get AI responses → Hearo analyzes the photo and speaks back detailed descriptions
  5. Hands-free navigation → Entire experience designed for voice-first interaction

Accessibility Features

  • Voice-first design → All interactions can be completed without seeing the screen
  • Large touch targets → Easy-to-tap buttons (320×320 px camera button)
  • Audio feedback → Clear voice guidance for all actions
  • Tutorial mode → Built-in "How to Use" feature explains the app via TTS
  • Question-based interaction → Natural conversation with the AI about images

How we built it

Frontend

  • React Native + Expo for cross-platform mobile development
  • Audio: expo-av for playback and recording, expo-speech for native TTS fallback
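
The native-TTS fallback mentioned above boils down to a try/catch around the cloud voice. A sketch with the platform calls injected so the logic runs anywhere; in the app, `cloudSpeak` would wrap ElevenLabs playback via expo-av and `nativeSpeak` would wrap `Speech.speak` from expo-speech (the helper names here are illustrative, not the app's actual code):

```typescript
// Sketch of the TTS fallback strategy: try the high-quality cloud voice
// first, and fall back to the device's native TTS if the request fails,
// so the app always stays audible.
type SpeakFn = (text: string) => Promise<void>;

async function speakWithFallback(
  text: string,
  cloudSpeak: SpeakFn,   // e.g. ElevenLabs TTS played back via expo-av
  nativeSpeak: SpeakFn,  // e.g. Speech.speak from expo-speech
): Promise<"cloud" | "native"> {
  try {
    await cloudSpeak(text);
    return "cloud";
  } catch {
    // Network error, quota exhausted, or API failure: fall back rather
    // than leaving a blind user with silence.
    await nativeSpeak(text);
    return "native";
  }
}
```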

Backend & Cloud Infrastructure

  • Node.js + Express backend server deployed on Google Cloud Run
  • Google Cloud Storage for image file storage and management
  • RESTful API endpoints:
    • Image upload and storage
    • Generate signed URLs for secure uploads
    • Health check endpoint
  • Service Account authentication for secure Google Cloud access
  • Environment variable management for API keys and configuration
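
The signed-upload-URL endpoint above can be sketched as follows. The handler logic is written framework-agnostically with the storage client injected so it can be shown without credentials; in the real backend the bucket would come from `new Storage().bucket(...)` in `@google-cloud/storage`, mounted behind an Express route, and the bucket name here is hypothetical:

```typescript
// Sketch of the "generate signed upload URL" endpoint logic. The bucket
// object is injected so the function can run without GCP credentials.
interface BucketLike {
  file(name: string): {
    getSignedUrl(opts: {
      version: "v4";
      action: "write";
      expires: number;
      contentType: string;
    }): Promise<[string]>;
  };
}

async function createUploadUrl(
  bucket: BucketLike,
  bucketName: string,
  fileName: string,
  contentType = "image/jpeg",
): Promise<{ uploadUrl: string; gcsUri: string }> {
  const [uploadUrl] = await bucket.file(fileName).getSignedUrl({
    version: "v4",
    action: "write",
    expires: Date.now() + 15 * 60 * 1000, // client has 15 minutes to upload
    contentType,
  });
  // The gs:// URI is what the image-analysis pipeline later hands to Gemini.
  return { uploadUrl, gcsUri: `gs://${bucketName}/${fileName}` };
}
```

Because the client uploads straight to the signed URL, the image bytes never have to pass through the Express server itself.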

AI & Voice Services

  • ElevenLabs TTS API for natural voice synthesis with fallback to native TTS
    • High-quality voice models for accessible audio
  • ElevenLabs STT API for speech-to-text transcription (English-only)
    • Real-time voice command processing
    • Language-specific configuration to prevent misclassification
  • Google Gemini Vision Model for AI-powered image understanding and description
    • Multimodal AI combining vision + text for contextual responses
  • Google Cloud Vision API integration for advanced image analysis

Data Flow & Processing Pipelines

Image Analysis Pipeline:

  1. User captures photo with expo-camera
  2. Image converted to Base64 encoding
  3. Uploaded to backend Express server
  4. Stored in Google Cloud Storage bucket
  5. GCS URI (gs://) passed to Gemini Vision API
  6. AI generates detailed description
  7. Response sent via TTS to user
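
Step 5, handing the gs:// URI to Gemini, amounts to one multimodal request pairing an image part with a text part. A sketch of building that request body (the `fileData`/`fileUri` field names follow the Vertex AI Gemini REST shape; the bucket path and prompt below are illustrative, not the app's actual values):

```typescript
// Build a Gemini generateContent request that pairs a GCS-hosted image
// with a text prompt, so the model answers questions about that image.
function buildGeminiRequest(gcsUri: string, question: string) {
  return {
    contents: [
      {
        role: "user",
        parts: [
          { fileData: { fileUri: gcsUri, mimeType: "image/jpeg" } },
          { text: question },
        ],
      },
    ],
  };
}

const body = buildGeminiRequest(
  "gs://hearo-images/photo-123.jpg",
  "Describe this scene for a blind user, mentioning any obstacles.",
);
```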

Audio Pipeline:

  1. User records voice command
  2. Audio converted: Recording → ArrayBuffer → Uint8Array → Base64
  3. Sent to ElevenLabs STT API
  4. Transcribed text processed
  5. Response generated via Gemini (if image-related)
  6. TTS speaks response back to user
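
Step 2's conversion chain can be sketched as a small helper. Because Blob is unavailable in React Native, the bytes are walked manually (`btoa` is global in modern RN runtimes and in Node ≥ 16):

```typescript
// Step 2 of the audio pipeline: turn a recorded ArrayBuffer into Base64
// without using Blob, which React Native does not support.
// ArrayBuffer -> Uint8Array -> binary string -> Base64.
function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = "";
  // A simple byte-by-byte walk; fine for short voice-command clips.
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
```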

Voice-First Navigation:

  • Hands-free app operation through voice commands
  • Audio feedback for all actions
  • Accessible UI design with large touch targets

Development & Deployment

  • Expo Go for rapid development and testing
  • TypeScript interfaces for type-safe API responses
  • Error handling & fallbacks for offline/API failure scenarios
  • Google Cloud Run for serverless backend deployment
  • File system caching with Expo FileSystem for audio files
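
The audio cache above needs a stable mapping from synthesized text to a file name. A minimal sketch of such a key scheme, assuming a content hash as the cache key (the app would use expo-crypto and `FileSystem.cacheDirectory`; `node:crypto` is used here only so the sketch is self-contained):

```typescript
import { createHash } from "node:crypto";

// Map a (voice, text) pair to a deterministic cache file name, so a
// previously synthesized clip can be replayed from disk instead of
// re-calling the TTS API.
function ttsCachePath(cacheDir: string, voiceId: string, text: string): string {
  const key = createHash("sha256").update(`${voiceId}:${text}`).digest("hex");
  return `${cacheDir}/tts-${key.slice(0, 16)}.mp3`;
}
```

Deriving the key from both voice and text means changing the voice never serves a stale clip for the same phrase.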

Challenges we ran into

Binary Audio Handling

  • Problem: Blob objects are not supported in React Native
  • Solution: Implemented conversion chain: ArrayBuffer → Uint8Array → Base64 for TTS playback
  • Required React Native-specific audio handling different from web approaches

STT Language Misclassification

  • Problem: Short phrases were misinterpreted as non-English languages
  • Solution: Forced language parameter to "en" in ElevenLabs API requests
  • Added input validation and controlled language constraints
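
The fix amounts to pinning the language in the request body instead of letting the service auto-detect it. A sketch of the form construction (the `model_id` and `language_code` field names follow the ElevenLabs speech-to-text API as we used it; verify them against the current docs before relying on them):

```typescript
// Always pin language_code to "en" when building the ElevenLabs STT
// request, so short clips are never auto-detected as another language.
// "audio" stands in for the recorded clip; in the app this would be the
// React Native file descriptor ({ uri, name, type }) appended to FormData.
function buildSttForm(audio: string): FormData {
  const form = new FormData();
  form.append("file", audio);
  form.append("model_id", "scribe_v1");
  form.append("language_code", "en"); // force English; skip auto-detection
  return form;
}
```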

Google Cloud Authentication

  • Problem: Complex service account setup and credential management
  • Solution: Implemented secure service account authentication with proper scoping
  • Environment variable management for Google Cloud credentials

Expo Deprecation Warnings

  • Problem: expo-av shows deprecation notices in newer Expo SDK releases
  • Solution: Kept expo-av usage isolated so it can later be migrated to its successor (expo-audio)
  • Implemented fallback strategies for deprecated features

Accomplishments that we're proud of

  • Fully functional voice-first navigation, allowing hands-free operation
  • Seamless integration of ElevenLabs TTS and STT APIs with React Native, including error handling and fallbacks
  • AI-powered image understanding using Google Gemini Vision API for detailed scene descriptions
  • End-to-end serverless architecture with Google Cloud Run backend
  • Secure image storage with Google Cloud Storage and signed URLs
  • Clean, TypeScript-safe codebase with React Native-compatible audio handling
  • Cross-platform mobile app working in both Expo Go and production builds
  • Robust error handling with graceful fallbacks for API failures
  • Accessible UI design optimized for voice interaction

What we learned

Technical Learnings

  • React Native requires different approaches to handle binary data compared to the web; Blob cannot be used
  • Multimodal AI prompts require careful crafting to get accurate and useful responses
  • Service account authentication is critical for secure cloud service integration
  • Serverless deployment (Google Cloud Run) simplifies infrastructure management but requires proper environment configuration

Accessibility Insights

  • Voice-first design is challenging but rewarding; proper audio feedback is critical for accessibility
  • Real-time STT can be inaccurate for short phrases, highlighting the importance of controlled input and language constraints
  • Large touch targets and clear audio cues are essential for visually impaired users

Development Best Practices

  • Environment variables (process.env) and proper handling of API keys are essential for secure API integration
  • TypeScript interfaces significantly improve code quality and developer experience
  • Fallback strategies are crucial for production apps (native TTS when API fails, offline mode, etc.)

What's next for Hearo

Voice & Audio Enhancements

  • Continuous voice listening mode for fully hands-free interaction
  • Multi-language STT support, allowing users to dictate in multiple languages
  • Voice activity detection for better speech recognition triggers

Vision & AI Features

  • Real-time object detection using camera stream (not just static photos)
  • Scene understanding with spatial awareness and depth perception
  • Document OCR for reading text from images (receipts, signs, documents)
  • Face recognition for identifying people (with privacy controls)

Storage & Sync

  • Cross-device synchronization for accessing content across multiple devices

Accessibility Improvements

  • Gesture controls for common actions (swipe to delete, etc.)
  • Multiple voice options from ElevenLabs library
  • Screen reader integration for system-wide accessibility

Built With

react-native · expo · typescript · node.js · express · google-cloud · gemini · elevenlabs
