Inspiration

San Francisco's Fillmore District once had forty jazz clubs on a single street. The Barbary Coast was the most notorious waterfront in the world. Today, most people walk past these places with no idea what happened there. Gentrification doesn't just displace people — it erases memory. Flaneur exists because that erasure felt wrong, and because the technology to fight it finally exists.

What it does

Flaneur is a voice and vision agent that gives San Francisco neighborhoods a living, speaking memory. Point your phone at a street, a building, or a photograph — Amos, a character embodying a 90-year-old long-time resident, narrates what was there before. Three specialist sub-agents work in parallel: a Historian queries a real-time knowledge base, an Archivist surfaces archival photographs, and a Cartographer resolves your location to historical AR pins.

How we built it

The core is Gemini Live for real-time voice and vision, and the Gemini Deep Research Interactions API for autonomous background research. Three sub-agents — Historian, Archivist, Cartographer — orchestrate via asyncio, writing to a live trace queue the audience can watch (the orchestration pattern and the trace relay are sketched after this write-up). The knowledge base runs on DigitalOcean Gradient. API access is managed by Unkey. The frontend is a pure HTML/JS PWA using camera, microphone, GPS, and device orientation simultaneously.

Challenges we ran into

Wikimedia Commons image licensing was our first wall — Panoramio-sourced images 403'd universally, requiring a license filter and canonical thumbnail URL reconstruction (sketched below). Nominatim geocoding failed silently without a properly identified User-Agent (also sketched below). Claude's entity extraction from image captions pulled photographer credits as place names. Each failure required surgical debugging under time pressure, with the research agent running overnight and the morning revealing what worked.

Accomplishments that we're proud of

Building a genuinely multi-agent system — not a chatbot with tools, but an orchestrator with named specialist agents running concurrently — in 5.5 hours as a solo engineer. The overnight research agent running without intervention and producing a rich knowledge base. The Gemini Deep Research integration firing autonomously at session start. And Amos himself: a character grounded in real archival research who speaks with specificity and grief about what was lost.

What we learned

Gemini Live's native audio processing — collapsing the STT → LLM → TTS stack — changes the feel of voice agents fundamentally. The latency difference is visceral. We also learned that the most technically interesting part of a multimodal agent isn't the model — it's the data pipeline underneath it. The overnight research agent, the geocoding, the image licensing — that unglamorous infrastructure is what makes Amos credible rather than generic.

What's next for Your Flaneur Tour du Jour

Flaneur should cover every historically significant neighborhood in every city — not just San Francisco. The AR layer needs true building-level anchoring using depth sensors and visual positioning. Amos should evolve with each conversation, building a persistent memory of what users have asked and discovered. And the character model itself — the idea of an embodied neighborhood elder as interface — deserves a design language of its own.
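Appendix: a few sketches of the pieces described above. First, the orchestration pattern: a minimal sketch of three specialist coroutines gathered concurrently and writing to a shared trace queue. Function names, signatures, and return shapes here are illustrative stand-ins for the real Gemini and knowledge-base calls, not the actual implementation.

```python
import asyncio

# Shared trace queue: every sub-agent pushes progress events here so the
# audience-facing frontend can stream them as they happen.
trace: asyncio.Queue = asyncio.Queue()

async def emit(agent: str, message: str) -> None:
    """Push a named trace event onto the live queue."""
    await trace.put({"agent": agent, "message": message})

# The three specialists are plain coroutines; bodies are elided placeholders
# for the real Gemini / DigitalOcean Gradient knowledge-base calls.
async def run_historian(query: str) -> dict:
    await emit("Historian", f"querying knowledge base for {query!r}")
    ...  # retrieve passages from the knowledge base
    return {"agent": "Historian", "facts": []}

async def run_archivist(query: str) -> dict:
    await emit("Archivist", "searching Wikimedia Commons for archival photos")
    ...  # licensed-image search (see the Commons sketch below)
    return {"agent": "Archivist", "photos": []}

async def run_cartographer(lat: float, lon: float) -> dict:
    await emit("Cartographer", "resolving location to historical AR pins")
    ...  # reverse-geocode and match against pinned historical sites
    return {"agent": "Cartographer", "pins": []}

async def orchestrate(query: str, lat: float, lon: float) -> list[dict]:
    """Run all three specialists concurrently and collect their results."""
    return await asyncio.gather(
        run_historian(query),
        run_archivist(query),
        run_cartographer(lat, lon),
    )

if __name__ == "__main__":
    results = asyncio.run(orchestrate("Fillmore District jazz clubs", 37.784, -122.433))
    print(results)
```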
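The "live trace queue the audience can watch" is that same queue drained over a WebSocket. A single-consumer FastAPI sketch; the endpoint path and naming are illustrative, and the real app shares the queue with the orchestrator instead of redefining it.

```python
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

# In the real app this is the same queue the Historian/Archivist/Cartographer
# coroutines write to; defined here so the sketch stands alone.
trace: asyncio.Queue = asyncio.Queue()

@app.websocket("/trace")
async def trace_feed(websocket: WebSocket) -> None:
    """Relay agent trace events to the audience's browser as they arrive."""
    await websocket.accept()
    while True:
        event = await trace.get()  # blocks until a sub-agent emits something
        await websocket.send_json(event)
```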
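The Commons licensing fix came in two parts: filter files by their license metadata, and rebuild thumbnail URLs from the canonical MD5-based path scheme rather than trusting the (often dead, Panoramio-era) source URLs. A sketch assuming the standard Commons imageinfo/extmetadata API; the `ALLOWED_LICENSES` allow-list is illustrative, not the exact filter we shipped.

```python
import hashlib
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
# Illustrative allow-list of license short names (normalized to lowercase-hyphenated).
ALLOWED_LICENSES = {"cc-by-sa-4.0", "cc-by-4.0", "cc-by-sa-3.0", "cc-by-sa-2.0", "cc0"}

def fetch_license(filename: str) -> str | None:
    """Look up a file's license short name via the Commons extmetadata API."""
    resp = requests.get(COMMONS_API, params={
        "action": "query",
        "titles": f"File:{filename}",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "format": "json",
    }, timeout=10)
    pages = resp.json()["query"]["pages"]
    info = next(iter(pages.values())).get("imageinfo", [{}])[0]
    return info.get("extmetadata", {}).get("LicenseShortName", {}).get("value")

def is_allowed(filename: str) -> bool:
    """Keep only files whose license is on the allow-list."""
    name = fetch_license(filename)
    return bool(name) and name.lower().replace(" ", "-") in ALLOWED_LICENSES

def thumb_url(filename: str, width: int = 800) -> str:
    """Rebuild the canonical upload.wikimedia.org thumbnail URL from the file name."""
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return (
        "https://upload.wikimedia.org/wikipedia/commons/thumb/"
        f"{digest[0]}/{digest[:2]}/{name}/{width}px-{name}"
    )
```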
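The geocoding failure came down to Nominatim's usage policy: requests without an identifying User-Agent get rejected or throttled, which surfaced as silently empty results. A minimal sketch of the fix; the app name and contact address in the header are placeholders.

```python
import requests

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
# Nominatim requires a User-Agent that identifies the application and a contact.
HEADERS = {"User-Agent": "Flaneur/0.1 (contact: hello@example.com)"}

def geocode(place: str) -> tuple[float, float] | None:
    """Resolve a place name to (lat, lon), or None if nothing matched."""
    resp = requests.get(
        NOMINATIM_URL,
        params={"q": f"{place}, San Francisco, CA", "format": "jsonv2", "limit": 1},
        headers=HEADERS,
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])

# Usage: geocode("Fillmore Auditorium")
```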
Built With
- anthropic-claude-api
- chromadb
- commons
- digitalocean-gradient-ai-platform
- digitalocean-gradient-knowledge-base
- fastapi
- gemini-deep-research-interactions-api
- gemini-live-api
- leaflet.js
- python
- unkey
- websockets
- wikimedia