Memorly - AI-Powered Personal Memory Assistant
Inspiration
In an era where we capture thousands of photos and videos on our phones, finding specific memories becomes increasingly difficult. Traditional photo apps rely on manual tagging, album organization, or simple date-based browsing. We envisioned a smarter solution: What if you could simply ask your device "Show me Christmas celebrations" or "When did I visit Chicago with Sarah?" and get intelligent, conversational answers?
Memorly was born from the frustration of scrolling through endless camera rolls trying to find that one moment. We wanted to build a personal memory assistant that understands context, recognizes faces, and retrieves memories as naturally as recalling them from your own mind.
What it does
Memorly is an AI-powered personal memory assistant that transforms your photos, videos, and notes into an intelligent, searchable knowledge base. Here's what makes it special:
- Multimodal Search: Search across images, videos, and text using natural language queries
- Semantic Understanding: Find memories by describing what you're looking for, not just exact keywords
- Face Recognition: Automatically identifies and clusters people across your media
- Conversational Retrieval: Ask questions and get AI-generated summaries of your memories
- Location & Time Awareness: Filter memories by places, dates, and detected objects
- Vector-Based Similarity: Uses advanced embedding models to understand visual and semantic similarity
Example queries:
- "Show me beach vacations from last summer"
- "Find photos with John in New York"
- "Tell me about my Christmas celebrations"
- "When did I go to that Italian restaurant?"
How we built it
Memorly is built as a microservices architecture with specialized services handling different aspects of the pipeline:
Core Technology Stack
- Vector Database: Milvus for storing and searching high-dimensional embeddings
- Embeddings: Google's Gemini API and multimodal embedding models
- LLM: Gemini 2.5 Flash for conversational response generation
- Face Recognition: Custom face extraction and clustering service
- Computer Vision: Automated feature extraction for objects, scenes, and content
- Backend: FastAPI microservices with async processing
- Database: MongoDB for metadata and face embeddings
- Containerization: Docker Compose for service orchestration
Architecture Pipeline
Ingestion Pipeline (Images & Videos):
- Feature Extraction → Detects objects, scenes, content descriptions, and tags
- Face Extraction → Identifies faces and creates face embeddings
- Face Clustering → Groups similar faces and creates person profiles
- Embedding Generation → Creates vector embeddings for semantic search
- Video Segmentation → Splits videos into semantic scenes with timestamps
- Upsert Service → Stores embeddings and metadata in Milvus
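The ordering above can be sketched as a small async orchestration, consistent with the async FastAPI backend described below. The step functions are stubs standing in for HTTP calls to the individual services, and the concurrent fan-out is an assumption for illustration, not Memorly's exact wiring:

```python
import asyncio

# Stubs standing in for async HTTP calls to the pipeline services.
async def extract_features(media):
    return {"tags": ["outdoor"], "objects": ["tree"]}

async def extract_faces(media):
    return [{"embedding": [0.1] * 4}]

async def embed(media):
    return [0.2] * 4

async def ingest(media):
    # Independent steps fan out concurrently; the upsert step
    # waits for all of them before writing to the vector store.
    features, faces, vector = await asyncio.gather(
        extract_features(media), extract_faces(media), embed(media))
    return {"media": media, **features, "faces": faces, "embedding": vector}

record = asyncio.run(ingest("photo_001.jpg"))
```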
Search & Retrieval Pipeline:
- Query Processing → Extracts filters (people, locations, objects, tags)
- Embedding Generation → Converts query to vector representation
- Vector Search → Performs COSINE similarity search in Milvus
- LLM Response Generation → Generates conversational answers using retrieved context
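The retrieval steps can be illustrated with a minimal, self-contained sketch. An in-memory list and a toy two-dimensional embedding stand in for Milvus and the real embedding model; the metadata fields mirror the filters named above:

```python
import math

# Toy corpus: each memory has an embedding plus filterable metadata.
MEMORIES = [
    {"id": 1, "embedding": [1.0, 0.0], "people": ["John"], "location": "New York"},
    {"id": 2, "embedding": [0.9, 0.1], "people": ["Sarah"], "location": "Chicago"},
    {"id": 3, "embedding": [0.0, 1.0], "people": ["John"], "location": "New York"},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, person=None, top_k=2):
    # 1. Metadata filtering (the filters extracted by query processing)
    candidates = [m for m in MEMORIES if person is None or person in m["people"]]
    # 2. COSINE similarity ranking (what Milvus performs server-side)
    candidates.sort(key=lambda m: cosine(query_vec, m["embedding"]), reverse=True)
    return candidates[:top_k]

hits = search([1.0, 0.0], person="John")
# hits are then formatted as context for the LLM response step
```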
Key Services (10+ microservices)
- Gateway Service (Port 9000): Unified API for all operations
- Extract Features Service (Port 8001): Computer vision for content analysis
- Face Extraction Service (Port 8002): Face detection and embedding
- Embed Service (Port 8003): Text and image embedding generation
- Upsert Service (Port 8004): Vector database management
- Video Segmentation Service (Port 8005): Scene detection and keyframe extraction
- Query Processing Service (Port 8006): Natural language query parsing
- Search Service (Port 8007): Vector similarity search with filtering
- LLM Response Service (Port 8008): Conversational response generation
Challenges we ran into
1. Gemini API Response Format
The biggest challenge was implementing streaming responses from the Gemini API. We initially assumed it would use Server-Sent Events (SSE) format with incremental chunks, but discovered it returns complete JSON arrays instead. This required debugging through multiple layers:
- API URL was constructed at class definition time, using default model instead of environment variable
- Response parsing needed to collect full payload before JSON parsing
- Had to handle Gemini's content safety filters blocking family photos
Solution: Created dynamic API URL construction, full response buffering, and graceful fallback messages for blocked content.
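The buffering fix can be illustrated with a short sketch: instead of parsing each network chunk as it arrives, collect the full payload and parse once. The response shape below (candidates, content, parts) follows Gemini's JSON structure, but the chunk contents are invented for illustration:

```python
import json

def parse_gemini_stream(chunks):
    """Gemini's streaming endpoint returns a complete JSON array,
    not SSE events, so buffer everything before parsing."""
    payload = "".join(chunks)        # full response buffering
    responses = json.loads(payload)  # one parse over the whole array
    texts = []
    for item in responses:
        candidates = item.get("candidates", [])
        if not candidates:
            # graceful fallback when safety filters block the content
            texts.append("[content unavailable]")
            continue
        for part in candidates[0]["content"]["parts"]:
            texts.append(part.get("text", ""))
    return "".join(texts)

# The JSON array may be split across chunks at arbitrary byte boundaries:
chunks = ['[{"candidates": [{"content": {"parts": [{"te',
          'xt": "Hello"}]}}]}, {"candidates": []}]']
result = parse_gemini_stream(chunks)
```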
2. Milvus Metadata Retrieval
Search results were returning null/unknown values for all metadata fields despite successful vector searches.
Root Cause: Milvus returns data in nested hit["entity"] structure, and array fields (people, tags, objects) came back as protobuf RepeatedScalarContainer objects that couldn't be JSON serialized.
Solution: Updated field access patterns and added dual-mode handling for both JSON strings and protobuf arrays:
import json  # needed for the JSON-string case
people = entity.get("people", [])
people_list = list(people) if not isinstance(people, str) else json.loads(people)
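A small helper generalizes this dual-mode handling to every array field (people, tags, objects). RepeatedScalarContainer behaves like a Python sequence, so list() covers it, while fields stored as JSON strings need a parse; the sample entity here is illustrative:

```python
import json

def as_list(value):
    """Normalize a Milvus array field that may arrive either as a
    JSON-encoded string or as a list-like protobuf container."""
    if value is None:
        return []
    if isinstance(value, str):
        return json.loads(value)
    return list(value)  # covers list, tuple, RepeatedScalarContainer

entity = {"people": '["John", "Sarah"]', "tags": ["beach", "summer"], "objects": None}
cleaned = {field: as_list(entity.get(field)) for field in ("people", "tags", "objects")}
```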
3. Face Clustering at Scale
Clustering faces across thousands of images required efficient similarity computation and deduplication strategies. Initial approaches were too slow and memory-intensive.
Solution: Implemented MongoDB-based face storage with incremental clustering and configurable similarity thresholds (COSINE distance < 0.4).
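Incremental clustering with the cosine-distance threshold can be sketched as a greedy assign-to-nearest-centroid pass. This is a simplified stand-in for the real service, which persists clusters in MongoDB:

```python
import math

THRESHOLD = 0.4  # cosine distance; smaller = stricter matching

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def assign(face_embedding, clusters):
    """Add a new face to the nearest existing cluster if its cosine
    distance to the centroid is below THRESHOLD, else start a new one."""
    best_idx, best_dist = None, THRESHOLD
    for idx, members in enumerate(clusters):
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        dist = cosine_distance(face_embedding, centroid)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    if best_idx is None:
        clusters.append([face_embedding])
    else:
        clusters[best_idx].append(face_embedding)
    return clusters

clusters = []
for emb in ([1.0, 0.0], [0.95, 0.05], [0.0, 1.0]):
    assign(emb, clusters)
# Two clusters: the first two faces merge, the third is a new person
```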
4. Video Processing Complexity
Videos required scene segmentation, keyframe extraction, and embedding fusion between visual and textual representations.
Solution: Built a dedicated video segmentation service using PySceneDetect for semantic scene boundaries, then fused visual embeddings (60% weight) with text embeddings (40% weight) for richer semantic representation.
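The 60/40 fusion amounts to a weighted sum of the two L2-normalized embedding vectors, renormalized so the result still behaves well under COSINE similarity. A minimal sketch with toy vectors:

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def fuse(visual, text, visual_weight=0.6, text_weight=0.4):
    """Fuse a scene's visual embedding with its text embedding
    using the 60/40 weighting, then renormalize to unit length."""
    v = l2_normalize(visual)
    t = l2_normalize(text)
    fused = [visual_weight * a + text_weight * b for a, b in zip(v, t)]
    return l2_normalize(fused)

fused = fuse([2.0, 0.0], [0.0, 1.0])
# → 0.6 * [1, 0] + 0.4 * [0, 1] = [0.6, 0.4], then unit-normalized
```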
5. Docker Volume Mount Caching
During development, code changes weren't being picked up by containers despite volume mounts and --reload flags.
Solution: Implemented proper container rebuild workflows: docker-compose rm -f → docker-compose build --no-cache → docker-compose up -d
Accomplishments that we're proud of
✅ Complete RAG Pipeline: Built a full Retrieval-Augmented Generation system from scratch with multimodal support
✅ Production-Ready Architecture: Microservices design with health checks, error handling, and graceful degradation
✅ Real-Time Streaming: Implemented SSE-based streaming for conversational responses with metadata-first approach
✅ Intelligent Filtering: Combined vector similarity search with metadata filtering (people, locations, tags, objects)
✅ Face Recognition System: Automatic face detection, embedding generation, and clustering without manual labeling
✅ Video Understanding: Semantic scene segmentation with timestamp-aware retrieval
✅ Developer Experience: Created comprehensive test scripts, population utilities, and health monitoring
✅ Scalability: Designed to handle thousands of media files with efficient vector indexing
What we learned
Technical Insights
- Vector databases like Milvus are incredibly powerful for semantic search but require careful schema design and metric selection (COSINE vs IP)
- Embedding fusion strategies can dramatically improve retrieval quality for multimodal content
- LLM safety filters can be overly conservative for personal content, requiring graceful fallback handling
- Microservices architecture provides flexibility but requires robust service discovery and health monitoring
- Streaming responses improve perceived performance and enable real-time user feedback
AI/ML Learnings
- Multimodal embeddings capture richer semantic meaning than text-only or vision-only approaches
- Face clustering requires tuning similarity thresholds based on your dataset characteristics
- Query processing benefits from extracting structured filters (entities, locations) before embedding generation
- Context formatting for LLMs matters - we learned to filter out embeddings and technical metadata to reduce token costs
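Stripping embeddings and technical metadata before building the LLM context is essentially a one-line filter over each retrieved hit; the field names below are illustrative, not Memorly's exact schema:

```python
# Keep only human-readable fields when formatting retrieved hits
# into the LLM prompt; embeddings and scores just burn tokens.
DROP_FIELDS = {"embedding", "vector", "distance", "internal_id"}

def to_context(hit):
    return {k: v for k, v in hit.items() if k not in DROP_FIELDS}

hit = {"description": "Christmas dinner", "people": ["Sarah"],
       "embedding": [0.1] * 768, "distance": 0.12}
context = to_context(hit)
```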
Development Best Practices
- Health checks at every service level enable faster debugging in distributed systems
- Volume mounts + hot reload accelerate development but can have caching pitfalls
- Structured logging with JSON format makes troubleshooting async operations much easier
- Test utilities (like our populate script) are essential for iterating on search quality
What's next for Memorly - AI-Powered Personal Memory Assistant
Short-term Roadmap
🔜 Mobile App Integration: iOS/Android apps for on-device photo capture and sync
🔜 Real-time Notifications: "You took a photo at this location 1 year ago today"
🔜 Multi-user Support: Family memory sharing with privacy controls
🔜 Advanced Filters: Date ranges, weather conditions, detected emotions
🔜 Memory Collections: Auto-generated albums based on semantic clustering
Long-term Vision
🚀 Offline-First Architecture: On-device embeddings with periodic cloud sync
🚀 Voice Interface: "Hey Memorly, when did I last see grandma?"
🚀 Memory Timeline: Interactive visualization of life events and patterns
🚀 Cross-Platform Sync: Desktop, web, and mobile with end-to-end encryption
🚀 Smart Insights: "You've visited 15 cities this year" or "You take more photos on weekends"
🚀 Integration Ecosystem: Import from Google Photos, iCloud, social media platforms
🚀 Collaborative Memories: Merge media from multiple people at the same event
🚀 Advanced AI Features:
- Emotion detection in photos
- Activity recognition in videos
- Audio transcription for voice memos
- Object permanence ("Where did I last see my keys?")
Research Directions
📊 Improved Embedding Models: Fine-tune models on personal photo datasets
📊 Incremental Learning: Update face clusters without full reprocessing
📊 Privacy-Preserving Search: Homomorphic encryption for cloud-based retrieval
📊 Federated Learning: Learn from aggregated usage patterns while preserving privacy
Memorly represents the future of personal memory management - moving beyond manual organization to intelligent, conversational retrieval. Our goal is to make finding and reliving memories as natural as remembering them yourself. 🧠✨
Built With
- ai/ml
- backblaze-b2
- computer-vision
- docker
- face-recognition
- fastapi
- google-gemini-api
- llm
- microservices
- milvus
- mongodb
- multimodal-embeddings
- natural-language-processing
- next.js
- opencv
- pyscenedetect
- python
- rag
- semantic-search
- sse-streaming
- vector-database
