Inspiration
I've accumulated thousands of photos over years of travel — Japan, Argentina, Peru, Costa Rica — scattered across devices, cloud backups, and forgotten folders. Every time I wanted to find a specific memory, I'd scroll endlessly through flat grids of thumbnails. I realized the way we browse photos hasn't evolved in decades: it's still just chronological lists. I wanted something that felt like actually being inside my memories — floating through them in 3D space, speaking naturally to find moments I'd half-forgotten. The Elasticsearch Agent Builder hackathon was the perfect catalyst: I could build an intelligent agent that truly understands my photo collection through semantic search, visual embeddings, and multi-step reasoning — not just keyword matching, but real comprehension of what's in each image.
What it does
Reminiscence Vault is a voice-controlled 3D photo visualization experience powered by an Elasticsearch Agent Builder agent. I speak naturally — "show me sunsets from Japan" or "find photos where I'm eating street food" — and the agent retrieves semantically relevant results using hybrid search (kNN vector similarity and semantic text matching, fused via RRF), rearranges photos in immersive 3D layouts, and narrates what it finds. Photos float in a cosmic starfield environment across 11 different spatial layouts (helix, vortex, spiral film roll, tunnel, and more). The agent dynamically selects tools — Elasticsearch Search for hybrid retrieval, ES|QL for analytical queries like "how many photos do I have from each country?", and Workflows for multi-step operations like filtering by mood and then switching the music to match the atmosphere. Hand gesture recognition via MediaPipe lets me rotate, zoom, and select photos with six different gestures. Every image is auto-described by Gemini Flash, mood-classified, and embedded with Vertex AI multimodal embeddings — all stored and searchable in Elasticsearch with semantic_text fields that handle text embedding automatically at index time.
How I built it
The architecture has three layers:
Elasticsearch Serverless is the backbone. Each image is stored with a dense_vector field (1408-dim from Vertex AI multimodalembedding@001) for visual similarity, semantic_text fields for description and location (auto-embedded by ES Serverless — no manual text embedding needed), and binary thumbnails. Search uses the retriever API with an rrf retriever combining kNN vector search with semantic text matching — so searching "cherry blossoms" finds photos both visually similar to cherry blossoms AND described as containing them.
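Here's roughly what that mapping and hybrid retriever call look like with the Python client. This is a sketch, not my exact code: field names other than the index name are illustrative, and the query vector is assumed to come from the same Vertex AI model.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://<serverless-endpoint>", api_key="<api-key>")

# dense_vector holds the 1408-dim Vertex AI image embedding; semantic_text
# fields are embedded automatically by Serverless at index time.
es.indices.create(
    index="reminiscence-vault-images",
    mappings={
        "properties": {
            "image_embedding": {
                "type": "dense_vector", "dims": 1408, "similarity": "cosine"
            },
            "description": {"type": "semantic_text"},
            "location": {"type": "semantic_text"},
            "thumbnail": {"type": "binary"},
            "taken_at": {"type": "date"},
        }
    },
)

# Hybrid retrieval: an rrf retriever fusing kNN over the image embedding
# with a semantic match on the description field.
def hybrid_search(query_text: str, query_vector: list[float], k: int = 20):
    return es.search(
        index="reminiscence-vault-images",
        retriever={
            "rrf": {
                "retrievers": [
                    {
                        "knn": {
                            "field": "image_embedding",
                            "query_vector": query_vector,  # text embedded by multimodalembedding@001
                            "k": k,
                            "num_candidates": 100,
                        }
                    },
                    {
                        "standard": {
                            "query": {"semantic": {"field": "description", "query": query_text}}
                        }
                    },
                ]
            }
        },
        size=k,
    )
```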
Elastic Agent Builder orchestrates the intelligence. I configured a multi-step agent with three tool types: Search tools that query my reminiscence-vault-images index using hybrid retrieval, ES|QL tools for analytical queries across the collection (aggregations by country, date ranges, mood distributions), and Workflows that chain operations — like finding all photos from a location, classifying the dominant mood, and triggering the corresponding ambient music. The agent uses reasoning to select which tool fits each query, handles ambiguous requests by asking follow-up questions, and maintains conversational context across turns.
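As a concrete example, a question like "how many photos do I have from each country?" resolves to an ES|QL query along these lines. The agent composes the actual query itself, and the country field name is an assumption:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://<serverless-endpoint>", api_key="<api-key>")

resp = es.esql.query(query="""
    FROM reminiscence-vault-images
    | STATS photo_count = COUNT(*) BY country
    | SORT photo_count DESC
""")

# ES|QL responses come back as columns + rows of values.
cols = [c["name"] for c in resp["columns"]]
for row in resp["values"]:
    record = dict(zip(cols, row))
    print(record["country"], record["photo_count"])
```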
The frontend is React + Three.js (react-three-fiber) with Zustand state management. The voice pipeline streams audio over a WebSocket — real-time ASR, agent reasoning with tool calls hitting Elasticsearch, and TTS responses streamed back as audio. A smart image filtering pipeline processes HEIC and JPG files through five stages: EXIF metadata filtering, screenshot/burst detection, Gemini Flash content classification, visual deduplication (cosine similarity > 0.95), and location inference from chronological proximity.
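The deduplication stage is the easiest to show in isolation. A simplified sketch of the idea (greedy keep-first against already-kept embeddings), assuming a dict of per-file embedding vectors rather than my actual pipeline classes:

```python
import numpy as np

def deduplicate(embeddings: dict[str, np.ndarray], threshold: float = 0.95) -> list[str]:
    """Keep a photo only if no already-kept photo is a near-duplicate."""
    kept_paths: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for path, vec in embeddings.items():
        unit = vec / np.linalg.norm(vec)  # normalize so dot product = cosine similarity
        if kept_vecs and float(np.max(np.stack(kept_vecs) @ unit)) > threshold:
            continue  # near-duplicate of something already kept
        kept_paths.append(path)
        kept_vecs.append(unit)
    return kept_paths
```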
Challenges I ran into
Elasticsearch Serverless quirks were my biggest hurdle. There are no number_of_shards or number_of_replicas settings — Serverless manages those automatically, and my initial index creation kept failing until I removed them. Bulk indexing with semantic_text fields requires refresh="wait_for" with a 300-second timeout because the ML model needs to load on first use. I burned hours debugging timeout errors before discovering this.
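For reference, this is roughly the bulk-indexing setup that finally worked for me (endpoint, key, and the example document are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

# The generous timeout matters: the first semantic_text bulk triggers the
# ML model load on the Serverless side and can take minutes.
es = Elasticsearch(
    "https://<serverless-endpoint>",
    api_key="<api-key>",
    request_timeout=300,
)

documents = [
    {
        "id": "img-0001",
        "description": "Cherry blossoms over a canal in Kyoto at dusk",
        "location": "Kyoto, Japan",
        # plus image_embedding, thumbnail, taken_at in the real pipeline
    },
]

actions = (
    {"_index": "reminiscence-vault-images", "_id": doc["id"], "_source": doc}
    for doc in documents
)

# refresh="wait_for" waits until the documents (and their semantic_text
# inference) are searchable instead of failing fast.
helpers.bulk(es, actions, refresh="wait_for")
```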
Hybrid search tuning was tricky. Pure kNN would miss photos with great descriptions but average visual similarity, while pure semantic search missed visually stunning photos with sparse descriptions. The RRF retriever was the breakthrough — it fuses both ranked lists without needing to tune weights, and the results immediately felt right.
HEIC image support seemed simple but had cascading complexity. My iPhone's 2,923 HEIC photos were being completely ignored. After adding pillow-heif, I realized most of those files were WhatsApp forwards, screenshots, receipts, and duplicates. I built a multi-stage filter pipeline — free EXIF checks first, then Gemini Flash classification, then embedding-based dedup — to go from 3,681 raw files to a curated, high-quality collection.
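The HEIC part itself turned out to be nearly a one-liner once I found pillow-heif: registering the opener makes Pillow treat .HEIC like any other format. A minimal sketch, with a hypothetical thumbnail helper:

```python
from pathlib import Path

from PIL import Image
from pillow_heif import register_heif_opener

# After this call, Image.open() handles .HEIC/.HEIF transparently, so the
# rest of the filter pipeline can treat them exactly like JPGs.
register_heif_opener()

def load_thumbnail(path: Path, max_side: int = 1024) -> Image.Image:
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # resizes in place, preserves aspect ratio
    return img.convert("RGB")
```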
Voice agent latency required careful architecture. Streaming PCM audio at 16kHz while simultaneously receiving agent responses and rendering 3D animations demanded tight coordination between WebSocket frames, audio buffers, and React state updates.
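On the server side this boils down to draining binary PCM frames off the socket and handing fixed-size chunks to the ASR stream without blocking anything else. A minimal sketch using the websockets library, assuming 16-bit mono PCM at 16 kHz and a stand-in ASR hand-off:

```python
import asyncio
import websockets

SAMPLE_RATE = 16_000
CHUNK_BYTES = SAMPLE_RATE * 2 // 2  # ~0.5 s of 16-bit mono PCM

async def push_to_asr(chunk: bytes) -> None:
    # Stand-in for the real streaming ASR hand-off.
    print(f"ASR chunk: {len(chunk)} bytes")

async def handle_client(ws):
    """Buffer binary PCM frames from the browser and emit fixed-size chunks."""
    buffer = bytearray()
    async for message in ws:
        if not isinstance(message, bytes):
            continue  # ignore JSON control messages in this sketch
        buffer.extend(message)
        while len(buffer) >= CHUNK_BYTES:
            chunk, buffer = bytes(buffer[:CHUNK_BYTES]), buffer[CHUNK_BYTES:]
            await push_to_asr(chunk)

async def main():
    async with websockets.serve(handle_client, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```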
Accomplishments that I'm proud of
The moment I first spoke "take me to Japan" and watched hundreds of photos rearrange into a spiral while the agent narrated my trip — that gave me chills. The semantic_text field type in Elasticsearch is genuinely magical: I just index plain text descriptions and ES handles embedding automatically, making every image instantly searchable with natural language. The five-stage image filtering pipeline is something I'm particularly proud of — it takes thousands of raw camera roll images and intelligently curates them down to only real, meaningful photographs, with smart location inference that fills in GPS gaps by analyzing chronological proximity on the same day. The whole system feels like a living photo album that understands what's in every image.
What I learned
Elasticsearch's semantic_text is a game-changer. Not having to manage text embedding infrastructure separately — just declare the field type and ES handles it — dramatically simplified my architecture. Combined with dense_vector for image embeddings and RRF for fusion, the search quality is remarkable.
Agent Builder's tool selection is more nuanced than I expected. Giving the agent both Search and ES|QL tools and letting it decide which to use for each query produced surprisingly intelligent behavior — it naturally uses Search for "find beach photos" but switches to ES|QL for "which country has the most photos."
Multimodal embeddings unlock cross-modal search. Because Vertex AI's multimodalembedding@001 places images and text in the same 1408-dimensional vector space, a text query like "golden hour" finds photos with warm sunset lighting even if the description never mentions those words. This was the key insight that made the whole system feel intelligent.
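For anyone curious, getting both sides of that shared space from the Vertex AI SDK looks roughly like this (project, region, and file path are placeholders):

```python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="my-gcp-project", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# Image and text embeddings land in the same 1408-dim space, so a text query
# vector can be compared directly against stored image vectors.
image_vec = model.get_embeddings(
    image=Image.load_from_file("photos/kyoto_sunset.jpg"),
    dimension=1408,
).image_embedding

text_vec = model.get_embeddings(
    contextual_text="golden hour",
    dimension=1408,
).text_embedding
```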
What's next for Reminiscence Vault
I want to add multi-agent collaboration — a curator agent that automatically organizes photos into story arcs ("your Japan trip, day by day"), a style agent that detects photographic patterns across the collection, and a memory agent that surfaces "on this day" flashbacks. I'm also exploring real-time collaborative exploration where multiple people can navigate the same 3D space together via voice, sharing and rediscovering memories as a group. On the Elasticsearch side, I want to leverage time-series analysis to detect patterns in when and where I take photos, and geo-aware clustering to automatically build travel narratives. The ultimate vision is a personal AI that knows my visual history as well as I do — and helps me remember what I've forgotten.