WorldLens

Gallery: WorldLens logo, architecture diagram, hazard detection, extensive debugging.

Inspiration

WorldLens was born from a simple observation: for people who are visually impaired, the world doesn't just need to be seen; it needs to be understood in context. Static AI vision tools exist, but they often act as passive answer-machines. I wanted to build a true "digital companion" that maintains a persistent memory of the user's surroundings and proactively assists them. Imagine walking down a grocery aisle and having a friend whisper, "Hey, that low-sugar cereal you were looking for is three feet to your left." That level of proactive, situationally aware support is what inspired WorldLens.

What it does

WorldLens is a real-time multimodal assistant that sees the world through a mobile camera and explains it conversationally.

Real-Time Voice Interaction: Natural, bidirectional conversations powered by Amazon Nova Sonic.
Persistent World Memory: Unlike traditional vision apps, WorldLens builds a cumulative "world model" of what it has seen, allowing it to remember objects and context across camera frames.
Proactive Assistance: It alerts users to hazards or relevant items (like a specific grocery product) based on their predefined goals.
Context-Specific Modes: High-precision pipelines for Grocery Shopping, Document Reading, and Medication Safety.
Smart Sampling: Captures frames only when motion or speech is detected, which cuts the number of frames sent to Bedrock and reduces battery drain.
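
The Smart Sampling idea can be sketched roughly as follows. This is a minimal illustration, not the WorldLens implementation: the FrameSampler class, its thresholds, and the use of a mean absolute pixel difference as the motion signal are all assumptions. It expects grayscale frames as byte arrays and takes the VAD result as a plain boolean.

```typescript
// Hypothetical sketch: decide whether a camera frame is worth sending upstream.
// Frames are grayscale pixel arrays (0-255). All names and thresholds are assumed.

interface SamplerConfig {
  motionThreshold: number; // mean absolute pixel difference that counts as motion
  minIntervalMs: number;   // minimum gap between two captured frames
}

class FrameSampler {
  private config: SamplerConfig;
  private lastFrame: Uint8Array | null = null;
  private lastCaptureMs = -Infinity;

  constructor(config: SamplerConfig) {
    this.config = config;
  }

  // Mean absolute difference between two equally sized grayscale frames.
  static meanAbsDiff(a: Uint8Array, b: Uint8Array): number {
    if (a.length !== b.length || a.length === 0) {
      throw new Error("frames must be non-empty and the same size");
    }
    let total = 0;
    for (let i = 0; i < a.length; i++) {
      total += Math.abs(a[i] - b[i]);
    }
    return total / a.length;
  }

  // Returns true when the frame should be captured and sent.
  shouldCapture(frame: Uint8Array, speechActive: boolean, nowMs: number): boolean {
    const previous = this.lastFrame;
    this.lastFrame = frame; // always compare against the most recently seen frame

    // Rate limit: never capture more often than minIntervalMs.
    if (nowMs - this.lastCaptureMs < this.config.minIntervalMs) {
      return false;
    }

    // The very first frame has nothing to compare against, so treat it as motion.
    const moved =
      previous === null ||
      FrameSampler.meanAbsDiff(previous, frame) >= this.config.motionThreshold;

    if (moved || speechActive) {
      this.lastCaptureMs = nowMs;
      return true;
    }
    return false;
  }
}
```

In a browser, the frame bytes would typically come from drawing the video element onto a canvas and reading back a downscaled image. That part is omitted here so the decision logic stays testable on its own.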

How I built it

I built WorldLens on the Amazon Nova model suite through Amazon Bedrock, with a Next.js frontend and a serverless AWS backend:

Amazon Nova Sonic: Acts as the central voice orchestrator, handling bidirectional speech-to-speech interaction and native tool use.
Amazon Nova Lite: Performs the heavy lifting for multimodal scene understanding, OCR, and complex reasoning over historical session data. In the MVP it also handles visual grounding and fact verification.
Amazon Nova Act (Simulated): The architecture is designed to integrate Nova Act for deep external grounding; in the current MVP, this is simulated through Nova Lite-powered reasoning.
Frontend: A Next.js mobile web app utilizing the MediaDevices API and real-time Voice Activity Detection (VAD).
Backend: AWS Lambda and DynamoDB for session state and memory management.
Infrastructure: Deployed via AWS CDK with a "Zero-Touch IAM" philosophy for seamless setup.
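
To illustrate how a voice orchestrator with native tool use might hand work to a vision model, here is a minimal dispatch sketch. Everything in it is hypothetical: the ToolRequest and ToolResult shapes are not the Bedrock or Nova Sonic event types, the tool name describe_scene is invented, and the handlers are synchronous stand-ins (a real call to a vision model would be asynchronous).

```typescript
// Hypothetical shapes for illustration only. These are NOT the Bedrock or
// Nova Sonic event types; a real integration would map its events onto them.
interface ToolRequest {
  toolUseId: string;
  toolName: string;
  input: Record<string, unknown>;
}

interface ToolResult {
  toolUseId: string;
  status: "success" | "error";
  content: string;
}

// Synchronous stand-in; a real vision handler would return a Promise.
type ToolHandler = (input: Record<string, unknown>) => string;

class ToolRegistry {
  private handlers = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler): void {
    this.handlers.set(name, handler);
  }

  // Route a tool request to its handler and always return a result, so the
  // voice session can keep talking even when a tool fails.
  dispatch(request: ToolRequest): ToolResult {
    const handler = this.handlers.get(request.toolName);
    if (handler === undefined) {
      return {
        toolUseId: request.toolUseId,
        status: "error",
        content: `unknown tool: ${request.toolName}`,
      };
    }
    try {
      return {
        toolUseId: request.toolUseId,
        status: "success",
        content: handler(request.input),
      };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { toolUseId: request.toolUseId, status: "error", content: message };
    }
  }
}

// Example registration with an invented tool name and a canned answer.
const registry = new ToolRegistry();
registry.register("describe_scene", (input) => {
  const mode = typeof input.mode === "string" ? input.mode : "general";
  return `scene description requested in ${mode} mode`;
});
```

Returning an error result instead of throwing keeps a failed tool call from tearing down the whole conversation.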

Challenges I ran into

The Latency Barrier: Achieving sub-1.5 second end-to-end latency from "seeing" to "speaking" required careful optimization of the bidirectional stream.
Cognitive Overload: Balancing proactive alerts so the AI is helpful but not annoying required fine-tuning the "Proactive Guardrails" and cooldown timers.
Smart Sampling: Designing a client-side motion detection and VAD system so that only high-quality, relevant frames are sent to Bedrock, which saves cost and tokens.
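
A cooldown-based guardrail of the kind described under Cognitive Overload might look like the sketch below. The class name, the priority levels, the cooldown values, and the rule that only hazards may interrupt the user are assumptions for illustration, not the project's actual settings.

```typescript
// Hypothetical sketch of a proactive-alert guardrail with per-priority cooldowns.
type AlertPriority = "hazard" | "goal" | "info";

interface GuardrailConfig {
  cooldownMs: Record<AlertPriority, number>; // assumed values, tuned by hand
}

class ProactiveGuardrail {
  private config: GuardrailConfig;
  private lastSpokenMs = new Map<string, number>();

  constructor(config: GuardrailConfig) {
    this.config = config;
  }

  // key identifies the alert topic, e.g. "hazard:curb" or "goal:cereal".
  // Returns true if the alert may be spoken now, and records it if so.
  allow(key: string, priority: AlertPriority, nowMs: number, userSpeaking: boolean): boolean {
    // Assumption: only hazards are allowed to interrupt the user mid-sentence.
    if (userSpeaking && priority !== "hazard") {
      return false;
    }
    const last = this.lastSpokenMs.get(key);
    if (last !== undefined && nowMs - last < this.config.cooldownMs[priority]) {
      return false; // same topic was announced too recently
    }
    this.lastSpokenMs.set(key, nowMs);
    return true;
  }
}
```

Keying the cooldown on the alert topic rather than on the priority alone means a new hazard is still announced even if a different hazard was announced a moment ago.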

Accomplishments that I'm proud of

Native Sonic Orchestration: Successfully replacing a traditional STT -> LLM -> TTS pipeline with a single, high-speed Nova Sonic session.
Grounded Reasoning: Using Nova Lite to verify visual observations against general knowledge and context, providing a baseline for safety.
The "Aha!" Moment: Seeing the AI proactively chime and offer a suggestion based on an object it saw 30 seconds ago in a different part of the shelf.
Accessibility Integration: Implementing a system of "Earcons" (audio cues) that provide state feedback to visually impaired users without interrupting the conversation.
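
As a rough illustration of the Earcon idea, the sketch below maps assistant states to short tone sequences. The state names, frequencies, and durations are invented for this example. Playback itself (for instance through the Web Audio API in the browser) is left out.

```typescript
// Hypothetical earcon table: assistant state -> short tone sequence.
// States, frequencies, and durations are illustrative assumptions.
type AssistantState = "listening" | "thinking" | "speaking" | "alert" | "error";

interface Tone {
  frequencyHz: number;
  durationMs: number;
}

const EARCONS: Record<AssistantState, Tone[]> = {
  listening: [{ frequencyHz: 440, durationMs: 80 }, { frequencyHz: 660, durationMs: 80 }], // rising pair
  thinking: [{ frequencyHz: 520, durationMs: 60 }],
  speaking: [], // no cue: the speech itself signals the state
  alert: [{ frequencyHz: 880, durationMs: 100 }, { frequencyHz: 880, durationMs: 100 }],
  error: [{ frequencyHz: 330, durationMs: 150 }, { frequencyHz: 220, durationMs: 200 }], // falling pair
};

// Only play a cue when the state actually changes, to avoid repeating sounds.
function earconFor(previous: AssistantState, next: AssistantState): Tone[] {
  return previous === next ? [] : EARCONS[next];
}

// Total length of a cue, useful for keeping cues short enough not to mask speech.
function totalDurationMs(tones: Tone[]): number {
  return tones.reduce((sum, tone) => sum + tone.durationMs, 0);
}
```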

What I learned

Native Tool Use is a Game Changer: Nova Sonic’s ability to invoke vision tools mid-conversation drastically simplifies the architecture of real-time AI agents.
Memory vs. Noise: I learned the importance of "Context Compression": periodically summarizing the world model so the AI maintains situational awareness without being overwhelmed by every single frame.
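
One way to picture Context Compression is the sketch below. It uses a deterministic merge by object label as a stand-in for model-based summarization, which is an assumption made for illustration: the write-up does not say how the summary is produced. The Observation and CompressedEntry types and both function names are hypothetical.

```typescript
// Hypothetical sketch: collapse per-frame observations into a compact world model.
// A deterministic merge by label stands in for model-based summarization.
interface Observation {
  label: string;   // e.g. "cereal"
  detail: string;  // e.g. "low-sugar, left side"
  seenAtMs: number;
}

interface CompressedEntry {
  label: string;
  latestDetail: string;
  firstSeenMs: number;
  lastSeenMs: number;
  sightings: number;
}

// Merge repeated sightings of the same label, keeping the newest detail.
function compressObservations(observations: Observation[]): CompressedEntry[] {
  const byLabel = new Map<string, CompressedEntry>();
  for (const obs of observations) {
    const existing = byLabel.get(obs.label);
    if (existing === undefined) {
      byLabel.set(obs.label, {
        label: obs.label,
        latestDetail: obs.detail,
        firstSeenMs: obs.seenAtMs,
        lastSeenMs: obs.seenAtMs,
        sightings: 1,
      });
      continue;
    }
    existing.sightings += 1;
    existing.firstSeenMs = Math.min(existing.firstSeenMs, obs.seenAtMs);
    if (obs.seenAtMs >= existing.lastSeenMs) {
      existing.lastSeenMs = obs.seenAtMs;
      existing.latestDetail = obs.detail;
    }
  }
  // Most recently seen objects first.
  return Array.from(byLabel.values()).sort((a, b) => b.lastSeenMs - a.lastSeenMs);
}

// Render the most recent entries as a short text block for the model's context.
function toPromptContext(entries: CompressedEntry[], nowMs: number, maxEntries: number): string {
  return entries
    .slice(0, maxEntries)
    .map((e) => {
      const ageSec = Math.round((nowMs - e.lastSeenMs) / 1000);
      return `${e.label}: ${e.latestDetail} (seen ${e.sightings}x, last ${ageSec}s ago)`;
    })
    .join("\n");
}
```

Capping the rendered context at maxEntries bounds the text sent with each turn no matter how many frames have been processed.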

What's next for WorldLens

Long-Term Spatial Memory: Moving beyond single sessions to "remember" the layout of the user's home or local pharmacy.
Haptic Feedback: Integrating spatial haptics to guide a user's hand toward a specific item on a shelf.
Nova Act Expansion: Deeper integration with Nova Act for complex tasks like "Find the best deal for this medicine online and check if my insurance covers it."

Built With

amazon-bedrock
amazon-cognito
amazon-dynamodb
amazon-lambda
amazon-nova
amazon-web-services
next
next.js
nextjs
nova
nova-2-lite
nova-2-sonic

Updates

Lasse Stilvang started this project — Mar 16, 2026 06:32 PM EDT
