Inspiration

We've all stood in front of a landmark or a historic building and wished we could understand what we were seeing — not just a name on a sign, but the story behind it. Audio guides are static. Search results are disconnected. I wanted to build something that feels closer to a real guide: an agent that can look where you are looking, hear what you are asking, speak back immediately, and stay grounded in real sources instead of confident guesswork.

The Gemini Live API was the unlock. Once real-time multimodal streaming and interruption became possible, LensIQ stopped being a search UI and became a live, camera-native experience.

What it does

LensIQ combines five experiences into one live exploration workflow:

  • Live Voice Agent: the user points the camera, speaks naturally, and receives spoken responses. LensIQ supports real interruption, so the user can cut in mid-answer and redirect the conversation without waiting for the model to finish.
  • Explain: LensIQ analyzes the current frame, identifies the scene or landmark, and returns a grounded summary with citations and confidence metadata.
  • Time Travel: LensIQ reconstructs what a place or scene may have looked like in earlier eras, overlays the result on the live camera view, and labels whether the result is archival, inferred, or reconstructed.
  • Nearby: LensIQ surfaces nearby places with distance, route context, and quick navigation handoff.
  • Creative Lab: LensIQ can generate images, generate short videos, and analyze media using the same camera-first context.

The result is an agent that does more than answer questions. It sees, hears, speaks, grounds its claims, and keeps the interaction live.

How I built it

LensIQ is a mobile-first React and Vite application backed by an Express server in TypeScript. The frontend handles camera capture, audio capture, playback, AR-style overlays, and the overall live exploration interface. The backend owns every provider integration so no API keys are exposed in the browser.
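
To make that split concrete, here is a minimal sketch of the backend-owned pattern. The `/api/explain` route and the `callGemini` placeholder are illustrative names, not LensIQ's actual code:

```ts
import express from 'express';

// Hypothetical route: the browser posts a camera frame, and only this
// server ever reads a provider key from the environment.
const app = express();
app.use(express.json({ limit: '10mb' })); // base64 camera frames are large

app.post('/api/explain', async (req, res) => {
  const { frame, question } = req.body as { frame: string; question: string };
  try {
    const answer = await callGemini(frame, question); // server-side SDK call
    res.json({ answer });
  } catch {
    // Graceful degradation: report the limitation instead of faking success.
    res.status(502).json({ error: 'explain_unavailable' });
  }
});

// Placeholder for the SDK helper sketched below.
declare function callGemini(frame: string, question: string): Promise<string>;

app.listen(Number(process.env.PORT ?? 8080));
```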

For AI, I used the Google GenAI SDK with Gemini across multiple workflows:

  • Gemini Live for real-time voice and vision sessions
  • Gemini Flash / Pro for explanation, chat, and reasoning flows
  • Gemini image generation for historical reconstruction and creative output
  • Veo for short-form generated video
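
Roughly, a server-side Explain call looks like this; `explainFrame` is an illustrative name rather than the exact production helper, and the model id comes from the Built With list:

```ts
import { GoogleGenAI } from '@google/genai';

// Server-side only: the key comes from the environment, never the browser.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Illustrative Explain helper: one camera frame plus one question in,
// model text out.
async function explainFrame(base64Jpeg: string, question: string) {
  const response = await ai.models.generateContent({
    model: 'gemini-3-flash-preview',
    contents: [
      { inlineData: { mimeType: 'image/jpeg', data: base64Jpeg } },
      { text: question },
    ],
  });
  return response.text; // citations and confidence are attached separately
}
```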

The system is coordinated by a centralized multimodal orchestrator. Instead of separate hooks trying to manage camera state, voice state, grounding, nearby results, and time travel independently, LensIQ routes those interactions through a single reducer/effects loop. That makes the experience more coherent and easier to test.
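
In simplified form, the orchestrator is an ordinary reducer. The state and event names below are illustrative, not LensIQ's real types; the point is that every interaction funnels through one transition function instead of parallel hooks:

```ts
type OrchestratorState = {
  camera: 'idle' | 'streaming';
  voice: 'idle' | 'listening' | 'speaking';
  overlay: 'none' | 'timeTravel' | 'nearby';
};

type OrchestratorEvent =
  | { type: 'USER_SPEECH_STARTED' }
  | { type: 'MODEL_RESPONSE_DONE' }
  | { type: 'TIME_TRAVEL_REQUESTED' };

function reduce(state: OrchestratorState, event: OrchestratorEvent): OrchestratorState {
  switch (event.type) {
    case 'USER_SPEECH_STARTED':
      // Barge-in: user speech always wins, whatever the model is doing.
      return { ...state, voice: 'listening' };
    case 'MODEL_RESPONSE_DONE':
      return { ...state, voice: 'idle' };
    case 'TIME_TRAVEL_REQUESTED':
      return { ...state, overlay: 'timeTravel' };
  }
}
```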

On Google Cloud, the app is containerized and deployed to Cloud Run. It supports Cloud SQL for persistence and Google Cloud Storage for generated assets. The repo also includes deployment and proof materials so judges can verify the cloud setup directly.

Data sources used

LensIQ combines model output with external sources so the experience stays useful and grounded:

  • Google Places API
  • Google Routes API
  • Wikipedia
  • Wikidata
  • Library of Congress
  • Internet Archive

Those sources are surfaced back to the user through citations, source labels, and confidence metadata rather than hidden behind a single model answer.
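
Roughly, a grounded answer carries its sources as data. The schema below is an illustrative sketch, not the production one, paired with a real call against the Wikipedia REST API's page-summary endpoint:

```ts
// Illustrative source schema (not LensIQ's actual types).
interface SourceRef {
  provider: 'wikipedia' | 'wikidata' | 'loc' | 'internet-archive';
  title: string;
  url: string;
}

async function wikipediaSource(title: string): Promise<SourceRef | null> {
  const res = await fetch(
    `https://en.wikipedia.org/api/rest_v1/page/summary/${encodeURIComponent(title)}`,
  );
  if (!res.ok) return null; // degrade gracefully rather than invent a citation
  const page = await res.json();
  return {
    provider: 'wikipedia',
    title: page.title,
    url: page.content_urls?.desktop?.page ?? '',
  };
}
```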

Challenges I ran into

1. Making voice interaction feel genuinely live

It is easy to fake a voice demo by waiting for the model to finish talking. It is much harder to make interruption feel natural. I had to detect speech onset quickly, stop local playback immediately, signal activity changes to Gemini Live, and recover cleanly when late audio chunks arrived after an interruption.
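
Stripped of the audio plumbing, the barge-in path looks roughly like this. The session interface is reduced to the one call used here, and the generation counter is an illustrative device rather than the exact implementation:

```ts
// Minimal slice of a Live session: client-signaled activity boundaries
// (used when automatic voice activity detection is disabled).
interface LiveSessionLike {
  sendRealtimeInput(input: { activityStart?: object }): void;
}

const playbackQueue: string[] = [];
let generation = 0; // bumped every time the user interrupts

function onUserSpeechOnset(session: LiveSessionLike) {
  generation++;              // audio still in flight for the old answer is stale
  playbackQueue.length = 0;  // cut local playback immediately
  session.sendRealtimeInput({ activityStart: {} });
}

function onModelAudioChunk(base64Pcm: string, chunkGeneration: number) {
  if (chunkGeneration !== generation) return; // late chunk after an interruption
  playbackQueue.push(base64Pcm);
}
```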

2. Keeping browser audio stable

Real-time audio in mobile browsers is unforgiving. I had to deal with AudioContext suspension, resampling, buffering, jitter, and playback continuity while still keeping latency low enough for a live conversation.
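
One representative piece of that work, sketched with the standard Web Audio API: mobile browsers suspend the AudioContext aggressively, so playback has to be re-armed on user gestures.

```ts
const audioCtx = new AudioContext({ sampleRate: 24000 }); // playback rate used here

async function ensureAudioRunning() {
  if (audioCtx.state === 'suspended') {
    await audioCtx.resume();
  }
}

// Suspension can recur after tab switches, so re-check on every interaction.
document.addEventListener('pointerdown', ensureAudioRunning);
```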

3. Grounding a visual agent without killing speed

A camera agent that answers quickly but invents facts is not useful. A camera agent that waits too long for perfect retrieval also feels broken. I had to balance fast multimodal responses with source retrieval, confidence scoring, and truth-in-labeling so the system stayed both responsive and credible.
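
The pattern that resolved the tension, in simplified form: answer fast, attach sources only when retrieval beats a fixed budget, and label the result honestly. All names below are assumptions for illustration:

```ts
declare function fastAnswer(frame: string, question: string): Promise<string>;
declare function retrieveSources(question: string): Promise<string[]>;

async function answerWithBudget(frame: string, question: string) {
  const answer = fastAnswer(frame, question);          // start immediately
  const budget = new Promise<null>((resolve) => setTimeout(() => resolve(null), 1500));
  const sources = await Promise.race([retrieveSources(question), budget]);

  return {
    text: await answer,
    sources: sources ?? [],
    grounded: sources !== null, // truth-in-labeling when retrieval misses the budget
  };
}
```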

4. Making Time Travel honest

Historical reconstruction is compelling, but it is also risky. I had to distinguish clearly between archival material, model inference, and generated imagery so users understand what is verified and what is reconstructed.
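
In practice that means provenance is a first-class field rather than a caption. An illustrative (not exact) shape:

```ts
// Provenance as data, not decoration (names illustrative).
type TimeTravelProvenance = 'archival' | 'inferred' | 'reconstructed';

interface TimeTravelResult {
  imageUrl: string;
  era: string;                      // e.g. "1920s"
  provenance: TimeTravelProvenance; // always surfaced to the user
  source?: string;                  // citation when provenance is 'archival'
}
```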

Accomplishments that I'm proud of

  • True barge-in: LensIQ supports mid-response interruption, which makes the experience feel much closer to a real conversation than a push-to-talk demo.
  • Grounded explanations: answers are paired with citations, provider metadata, and confidence signals instead of being presented as opaque model output.
  • A camera-native Time Travel experience: instead of showing a separate static before-and-after screen, LensIQ keeps the camera live and layers history onto the current scene.
  • Backend-owned provider integration: Gemini, Maps, storage, auth, and database credentials stay on the server.
  • Graceful degradation: when a capability is unavailable, the app reports the limitation instead of pretending the feature worked.

What I learned

  • The biggest UX shift in multimodal AI is not just better output quality. It is removing the need to translate the physical world into text prompts.
  • Real-time voice UX depends as much on buffering, playback, and interruption handling as it does on the model itself.
  • Grounding should be part of the core product experience, not an afterthought. If an agent is making claims about the real world, users need provenance.
  • Centralized orchestration pays off quickly in multimodal products. Once camera, voice, grounding, and overlays start interacting, ad hoc state management becomes fragile.

What's next for LensIQ

  • Multilingual live exploration so travelers can ask questions and hear answers in their preferred language
  • Stronger scene memory so LensIQ can remember places, follow-up questions, and prior discoveries across sessions
  • Collaborative sessions so multiple users can explore the same live scene together
  • Deeper AR anchoring for historical overlays and nearby guidance
  • Offline-aware behavior for travel scenarios with unstable connectivity

Built With

  • **AI Models & SDKs**: Gemini Live API (`gemini-2.5-flash-native-audio-preview`), Gemini Flash (`gemini-3-flash-preview`), Gemini Pro (`gemini-3-pro-preview`), Gemini Image (`gemini-3-pro-image-preview`), Veo (`veo-3.1-fast-generate-preview`), Google GenAI SDK (`@google/genai`)
  • **Cloud Services**: Google Cloud, Google Cloud Run, Cloud Build, Google Cloud SQL (PostgreSQL), Secret Manager, Google OAuth
  • **APIs & Data Sources**: Google Places API (New), Google Routes API, Wikipedia REST API, Wikidata API, Library of Congress Prints & Photographs API, Internet Archive Advanced Search API
  • **Web APIs & Protocols**: Web Audio API (AudioWorklet, AudioContext), MediaDevices, WebSockets (streaming binary audio/video)
  • **Frameworks & Tooling**: React 19, Vite 6, Express.js, Node.js, Tailwind CSS, Framer Motion