Inspiration

WorldLens was born from a simple observation: for people who are visually impaired, the world needs to be understood in context, not just seen. Static AI vision tools exist, but they often act as passive answer machines. I wanted to build a true "digital companion" that maintains a persistent memory of the user's surroundings and proactively assists them. Imagine walking down a grocery aisle and having a friend whisper, "Hey, that low-sugar cereal you were looking for is three feet to your left." That level of proactive, situationally aware support is what inspired WorldLens.

What it does

WorldLens is a real-time multimodal assistant that sees the world through a mobile camera and explains it conversationally.

  • Real-Time Voice Interaction: Natural, bidirectional conversations powered by Amazon Nova Sonic.
  • Persistent World Memory: Unlike traditional vision apps, WorldLens builds a cumulative "world model" of what it has seen, allowing it to remember objects and context across camera frames.
  • Proactive Assistance: It alerts users to hazards or relevant items (like a specific grocery product) based on their predefined goals.
  • Context-Specific Modes: High-precision pipelines for Grocery Shopping, Document Reading, and Medication Safety.
  • Smart Sampling: Captures frames only when motion or speech is detected, which keeps the app responsive and reduces battery drain.
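
The Smart Sampling gate can be illustrated with a minimal sketch. It assumes frames arrive as flat sequences of grayscale pixel values and that a separate VAD supplies a speech flag; the function names and the threshold are hypothetical, and the real app does this client-side in the browser rather than in Python.

```python
from typing import Optional, Sequence


def mean_abs_diff(prev: Sequence[int], curr: Sequence[int]) -> float:
    """Average per-pixel absolute difference between two grayscale frames."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same number of pixels")
    if not curr:
        return 0.0
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


def should_capture(
    prev: Optional[Sequence[int]],
    curr: Sequence[int],
    speech_active: bool,
    motion_threshold: float = 12.0,  # hypothetical value, would need tuning
) -> bool:
    """Decide whether the current frame is worth sending for analysis."""
    if speech_active:
        return True  # the user is talking, so fresh visual context is useful
    if prev is None:
        return True  # first frame of the session
    return mean_abs_diff(prev, curr) >= motion_threshold
```

A static scene with no speech is skipped, so frames are only uploaded when something has changed or the user is asking a question.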

How I built it

WorldLens is built on the Amazon Nova model suite through Amazon Bedrock, with a web frontend and a serverless AWS backend:

  • Amazon Nova Sonic: Acts as the central voice orchestrator, handling bidirectional speech-to-speech interaction and native tool use.
  • Amazon Nova Lite: Performs the heavy lifting for multimodal scene understanding, OCR, and complex reasoning over historical session data. It also handles visual grounding and fact verification in the MVP.
  • Amazon Nova Act (Simulated): The architecture is designed to integrate Nova Act for deep external grounding; in the current MVP, this is simulated through Nova Lite-powered reasoning.
  • Frontend: A Next.js mobile web app utilizing the MediaDevices API and real-time Voice Activity Detection (VAD).
  • Backend: AWS Lambda and DynamoDB for session state and memory management.
  • Infrastructure: Deployed via AWS CDK with a "Zero-Touch IAM" philosophy for seamless setup.
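
To make the orchestration idea concrete, here is a minimal sketch of how tool requests from the voice session might be routed to vision handlers. It assumes a tool request arrives as a tool name plus JSON-encoded arguments; the tool names `describe_scene` and `read_text` and the stub handlers are hypothetical, and no Bedrock call is made here.

```python
import json
from typing import Any, Callable, Dict

ToolHandler = Callable[[Dict[str, Any]], Dict[str, Any]]


def describe_scene(args: Dict[str, Any]) -> Dict[str, Any]:
    # Stub: a real handler would send the latest frame to the vision model.
    mode = args.get("mode", "general")
    return {"summary": f"(stub) scene description in {mode} mode"}


def read_text(args: Dict[str, Any]) -> Dict[str, Any]:
    # Stub: a real handler would run OCR on the latest frame.
    return {"text": "(stub) OCR output"}


# Hypothetical registry of tools exposed to the voice model.
TOOLS: Dict[str, ToolHandler] = {
    "describe_scene": describe_scene,
    "read_text": read_text,
}


def dispatch(tool_name: str, raw_args: str) -> str:
    """Route one tool request and return a JSON string for the voice session."""
    handler = TOOLS.get(tool_name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {tool_name}"})
    try:
        args = json.loads(raw_args) if raw_args else {}
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    if not isinstance(args, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    return json.dumps(handler(args))
```

Returning errors as JSON instead of raising keeps the conversation alive: the voice model can apologize or retry rather than having the stream drop.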

Challenges I ran into

  • The Latency Barrier: Achieving sub-1.5-second end-to-end latency from "seeing" to "speaking" required careful optimization of the bidirectional stream.
  • Cognitive Overload: Balancing proactive alerts so the AI is helpful but not annoying required careful tuning of the "Proactive Guardrails" and cooldown timers.
  • Smart Sampling: Designing a client-side motion detection and VAD system so that only high-quality, relevant frames are sent to Bedrock, which saves on costs and tokens.
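
The cooldown side of the guardrails can be sketched as follows. This is an illustration under assumptions, not the project's actual logic: the class name, the two-level cooldown (one between any two alerts, one per topic), the hazard bypass, and the default durations are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ProactiveGuardrail:
    """Rate-limits proactive alerts so the assistant does not nag."""

    global_cooldown_s: float = 8.0   # hypothetical gap between any two alerts
    topic_cooldown_s: float = 30.0   # hypothetical gap before repeating a topic
    _last_any: Optional[float] = field(default=None, init=False)
    _last_by_topic: Dict[str, float] = field(default_factory=dict, init=False)

    def allow(self, topic: str, now: float, is_hazard: bool = False) -> bool:
        """Return True and record the alert if it may be spoken at time `now`."""
        last_topic = self._last_by_topic.get(topic)
        if last_topic is not None and now - last_topic < self.topic_cooldown_s:
            return False  # this topic was announced too recently
        if (
            not is_hazard
            and self._last_any is not None
            and now - self._last_any < self.global_cooldown_s
        ):
            return False  # non-urgent alert arriving too soon after another
        self._last_any = now
        self._last_by_topic[topic] = now
        return True
```

In this sketch hazards skip the global cooldown but still respect the per-topic one, so a step or obstacle is announced promptly without being repeated every few seconds.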

Accomplishments that I'm proud of

  • Native Sonic Orchestration: Successfully replacing a traditional STT -> LLM -> TTS pipeline with a single, high-speed Nova Sonic session.
  • Grounded Reasoning: Using Nova Lite to verify visual observations against general knowledge and context, providing a baseline for safety.
  • The "Aha!" Moment: Seeing the AI proactively chime and offer a suggestion based on an object it had seen 30 seconds earlier in a different part of the shelf.
  • Accessibility Integration: Implementing a system of "Earcons" (audio cues) that provide state feedback to visually impaired users without interrupting the conversation.

What I learned

  • Native Tool Use is a Game Changer: Nova Sonic’s ability to invoke vision tools mid-conversation drastically simplifies the architecture of real-time AI agents.
  • Memory vs. Noise: "Context Compression" matters. Periodically summarizing the world model lets the AI maintain situational awareness without being overwhelmed by every single frame.
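
As an illustration of context compression, the sketch below collapses raw per-frame observations into one entry per object. It is a deterministic stand-in under assumptions: the `Observation` and `MemoryEntry` types, the age limit, and the entry cap are hypothetical, and the actual summarization step may well be model-driven.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Observation:
    label: str       # what was seen, e.g. "cereal"
    position: str    # coarse location description
    seen_at: float   # seconds since session start


@dataclass(frozen=True)
class MemoryEntry:
    label: str
    last_position: str
    last_seen_at: float
    sightings: int


def compress(
    observations: List[Observation],
    now: float,
    max_age_s: float = 120.0,  # hypothetical: forget sightings older than this
    max_entries: int = 20,     # hypothetical cap on the summary size
) -> List[MemoryEntry]:
    """Keep one entry per label with its latest position, newest first."""
    merged: Dict[str, MemoryEntry] = {}
    for obs in sorted(observations, key=lambda o: o.seen_at):
        if now - obs.seen_at > max_age_s:
            continue  # too old to be useful for situational awareness
        prev = merged.get(obs.label)
        count = prev.sightings + 1 if prev else 1
        merged[obs.label] = MemoryEntry(obs.label, obs.position, obs.seen_at, count)
    newest_first = sorted(merged.values(), key=lambda e: e.last_seen_at, reverse=True)
    return newest_first[:max_entries]
```

The compressed list stays small no matter how many frames were processed, which is what keeps the prompt short while still letting the assistant recall where an object was last seen.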

What's next for WorldLens

  • Long-Term Spatial Memory: Moving beyond single sessions to "remember" the layout of the user's home or local pharmacy.
  • Haptic Feedback: Integrating spatial haptics to guide a user's hand toward a specific item on a shelf.
  • Nova Act Expansion: Deeper integration with Nova Act for complex tasks like "Find the best deal for this medicine online and check if my insurance covers it."
