Memorly - AI-Powered Personal Memory Assistant
Inspiration
In an era where we capture thousands of photos and videos on our phones, finding specific memories becomes increasingly difficult. Traditional photo apps rely on manual tagging, album organization, or simple date-based browsing. We envisioned a smarter solution: What if you could simply ask your device "Show me Christmas celebrations" or "When did I visit Chicago with Sarah?" and get intelligent, conversational answers?
Memorly was born from the frustration of scrolling through endless camera rolls trying to find that one moment. We wanted to build a personal memory assistant that understands context, recognizes faces, and retrieves memories as naturally as recalling them from your own mind.
What it does
Memorly is an AI-powered personal memory assistant that transforms your photos, videos, and notes into an intelligent, searchable knowledge base. Here's what makes it special:
- Multimodal Search: Search across images, videos, and text using natural language queries
- Semantic Understanding: Find memories by describing what you're looking for, not just exact keywords
- Face Recognition: Automatically identifies and clusters people across your media
- Conversational Retrieval: Ask questions and get AI-generated summaries of your memories
- Location & Time Awareness: Filter memories by places, dates, and detected objects
- Vector-Based Similarity: Uses advanced embedding models to understand visual and semantic similarity
Example queries:
- "Show me beach vacations from last summer"
- "Find photos with John in New York"
- "Tell me about my Christmas celebrations"
- "When did I go to that Italian restaurant?"
How we built it
Memorly is built as a microservices architecture with specialized services handling different aspects of the pipeline:
Core Technology Stack
- Vector Database: Milvus for storing and searching high-dimensional embeddings
- Embeddings: Google's Gemini API and multimodal embedding models
- LLM: Gemini 2.5 Flash for conversational response generation
- Face Recognition: Custom face extraction and clustering service
- Computer Vision: Automated feature extraction for objects, scenes, and content
- Backend: FastAPI microservices with async processing
- Database: MongoDB for metadata and face embeddings
- Containerization: Docker Compose for service orchestration
Architecture Pipeline
Ingestion Pipeline (Images & Videos):
- Feature Extraction → Detects objects, scenes, content descriptions, and tags
- Face Extraction → Identifies faces and creates face embeddings
- Face Clustering → Groups similar faces and creates person profiles
- Embedding Generation → Creates vector embeddings for semantic search
- Video Segmentation → Splits videos into semantic scenes with timestamps
- Upsert Service → Stores embeddings and metadata in Milvus
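The ordering above can be sketched as a small async orchestration, consistent with the async FastAPI backend described below. The step functions are stubs standing in for HTTP calls to the individual services, and the concurrent fan-out is an assumption for illustration, not Memorly's exact wiring:

```python
import asyncio

# Stubs standing in for async HTTP calls to the pipeline services.
async def extract_features(media):
    return {"tags": ["outdoor"], "objects": ["tree"]}

async def extract_faces(media):
    return [{"embedding": [0.1] * 4}]

async def embed(media):
    return [0.2] * 4

async def ingest(media):
    # Independent steps fan out concurrently; the upsert step
    # waits for all of them before writing to the vector store.
    features, faces, vector = await asyncio.gather(
        extract_features(media), extract_faces(media), embed(media))
    return {"media": media, **features, "faces": faces, "embedding": vector}

record = asyncio.run(ingest("photo_001.jpg"))
```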
Search & Retrieval Pipeline:
- Query Processing → Extracts filters (people, locations, objects, tags)
- Embedding Generation → Converts query to vector representation
- Vector Search → Performs COSINE similarity search in Milvus
- LLM Response Generation → Generates conversational answers using retrieved context
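The retrieval steps can be illustrated with a minimal, self-contained sketch. An in-memory list and a toy two-dimensional embedding stand in for Milvus and the real embedding model; the metadata fields mirror the filters named above:

```python
import math

# Toy corpus: each memory has an embedding plus filterable metadata.
MEMORIES = [
    {"id": 1, "embedding": [1.0, 0.0], "people": ["John"], "location": "New York"},
    {"id": 2, "embedding": [0.9, 0.1], "people": ["Sarah"], "location": "Chicago"},
    {"id": 3, "embedding": [0.0, 1.0], "people": ["John"], "location": "New York"},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, person=None, top_k=2):
    # 1. Metadata filtering (the filters extracted by query processing)
    candidates = [m for m in MEMORIES if person is None or person in m["people"]]
    # 2. COSINE similarity ranking (what Milvus performs server-side)
    candidates.sort(key=lambda m: cosine(query_vec, m["embedding"]), reverse=True)
    return candidates[:top_k]

hits = search([1.0, 0.0], person="John")
# hits are then formatted as context for the LLM response step
```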
Key Services (10+ microservices)
- Gateway Service (Port 9000): Unified API for all operations
- Extract Features Service (Port 8001): Computer vision for content analysis
- Face Extraction Service (Port 8002): Face detection and embedding
- Embed Service (Port 8003): Text and image embedding generation
- Upsert Service (Port 8004): Vector database management
- Video Segmentation Service (Port 8005): Scene detection and keyframe extraction
- Query Processing Service (Port 8006): Natural language query parsing
- Search Service (Port 8007): Vector similarity search with filtering
- LLM Response Service (Port 8008): Conversational response generation
Challenges we ran into
1. Gemini API Response Format
The biggest challenge was implementing streaming responses from the Gemini API. We initially assumed it would use Server-Sent Events (SSE) format with incremental chunks, but discovered it returns complete JSON arrays instead. This required debugging through multiple layers:
- API URL was constructed at class definition time, using default model instead of environment variable
- Response parsing needed to collect full payload before JSON parsing
- Had to handle Gemini's content safety filters blocking family photos
Solution: Created dynamic API URL construction, full response buffering, and graceful fallback messages for blocked content.
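The buffering fix can be illustrated with a short sketch: instead of parsing each network chunk as it arrives, collect the full payload and parse once. The response shape below (candidates, content, parts) follows Gemini's JSON structure, but the chunk contents are invented for illustration:

```python
import json

def parse_gemini_stream(chunks):
    """Gemini's streaming endpoint returns a complete JSON array,
    not SSE events, so buffer everything before parsing."""
    payload = "".join(chunks)        # full response buffering
    responses = json.loads(payload)  # one parse over the whole array
    texts = []
    for item in responses:
        candidates = item.get("candidates", [])
        if not candidates:
            # graceful fallback when safety filters block the content
            texts.append("[content unavailable]")
            continue
        for part in candidates[0]["content"]["parts"]:
            texts.append(part.get("text", ""))
    return "".join(texts)

# The JSON array may be split across chunks at arbitrary byte boundaries:
chunks = ['[{"candidates": [{"content": {"parts": [{"te',
          'xt": "Hello"}]}}]}, {"candidates": []}]']
result = parse_gemini_stream(chunks)
```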
2. Milvus Metadata Retrieval
Search results were returning null/unknown values for all metadata fields despite successful vector searches.
Root Cause: Milvus returns data in nested hit["entity"] structure, and array fields (people, tags, objects) came back as protobuf RepeatedScalarContainer objects that couldn't be JSON serialized.
Solution: Updated field access patterns and added dual-mode handling for both JSON strings and protobuf arrays:
import json  # needed for the JSON-string case
people = entity.get("people", [])
people_list = list(people) if not isinstance(people, str) else json.loads(people)
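A small helper generalizes this dual-mode handling to every array field (people, tags, objects). RepeatedScalarContainer behaves like a Python sequence, so list() covers it, while fields stored as JSON strings need a parse; the sample entity here is illustrative:

```python
import json

def as_list(value):
    """Normalize a Milvus array field that may arrive either as a
    JSON-encoded string or as a list-like protobuf container."""
    if value is None:
        return []
    if isinstance(value, str):
        return json.loads(value)
    return list(value)  # covers list, tuple, RepeatedScalarContainer

entity = {"people": '["John", "Sarah"]', "tags": ["beach", "summer"], "objects": None}
cleaned = {field: as_list(entity.get(field)) for field in ("people", "tags", "objects")}
```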
3. Face Clustering at Scale
Clustering faces across thousands of images required efficient similarity computation and deduplication strategies. Initial approaches were too slow and memory-intensive.
Solution: Implemented MongoDB-based face storage with incremental clustering and configurable similarity thresholds (COSINE distance < 0.4).
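Incremental clustering with the cosine-distance threshold can be sketched as a greedy assign-to-nearest-centroid pass. This is a simplified stand-in for the real service, which persists clusters in MongoDB:

```python
import math

THRESHOLD = 0.4  # cosine distance; smaller = stricter matching

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def assign(face_embedding, clusters):
    """Add a new face to the nearest existing cluster if its cosine
    distance to the centroid is below THRESHOLD, else start a new one."""
    best_idx, best_dist = None, THRESHOLD
    for idx, members in enumerate(clusters):
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        dist = cosine_distance(face_embedding, centroid)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    if best_idx is None:
        clusters.append([face_embedding])
    else:
        clusters[best_idx].append(face_embedding)
    return clusters

clusters = []
for emb in ([1.0, 0.0], [0.95, 0.05], [0.0, 1.0]):
    assign(emb, clusters)
# Two clusters: the first two faces merge, the third is a new person
```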
4. Video Processing Complexity
Videos required scene segmentation, keyframe extraction, and embedding fusion between visual and textual representations.
Solution: Built a dedicated video segmentation service using PySceneDetect for semantic scene boundaries, then fused visual embeddings (60% weight) with text embeddings (40% weight) for richer semantic representation.
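The 60/40 fusion amounts to a weighted sum of the two L2-normalized embedding vectors, renormalized so the result still behaves well under COSINE similarity. A minimal sketch with toy vectors:

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def fuse(visual, text, visual_weight=0.6, text_weight=0.4):
    """Fuse a scene's visual embedding with its text embedding
    using the 60/40 weighting, then renormalize to unit length."""
    v = l2_normalize(visual)
    t = l2_normalize(text)
    fused = [visual_weight * a + text_weight * b for a, b in zip(v, t)]
    return l2_normalize(fused)

fused = fuse([2.0, 0.0], [0.0, 1.0])
# → 0.6 * [1, 0] + 0.4 * [0, 1] = [0.6, 0.4], then unit-normalized
```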
5. Docker Volume Mount Caching
During development, code changes weren't being picked up by containers despite volume mounts and --reload flags.
Solution: Implemented proper container rebuild workflows: docker-compose rm -f → docker-compose build --no-cache → docker-compose up -d
Accomplishments that we're proud of
✅ Complete RAG Pipeline: Built a full Retrieval-Augmented Generation system from scratch with multimodal support
✅ Production-Ready Architecture: Microservices design with health checks, error handling, and graceful degradation
✅ Real-Time Streaming: Implemented SSE-based streaming for conversational responses with metadata-first approach
✅ Intelligent Filtering: Combined vector similarity search with metadata filtering (people, locations, tags, objects)
✅ Face Recognition System: Automatic face detection, embedding generation, and clustering without manual labeling
✅ Video Understanding: Semantic scene segmentation with timestamp-aware retrieval
✅ Developer Experience: Created comprehensive test scripts, population utilities, and health monitoring
✅ Scalability: Designed to handle thousands of media files with efficient vector indexing
What we learned
Technical Insights
- Vector databases like Milvus are incredibly powerful for semantic search but require careful schema design and metric selection (COSINE vs IP)
- Embedding fusion strategies can dramatically improve retrieval quality for multimodal content
- LLM safety filters can be overly conservative for personal content, requiring graceful fallback handling
- Microservices architecture provides flexibility but requires robust service discovery and health monitoring
- Streaming responses improve perceived performance and enable real-time user feedback
AI/ML Learnings
- Multimodal embeddings capture richer semantic meaning than text-only or vision-only approaches
- Face clustering requires tuning similarity thresholds based on your dataset characteristics
- Query processing benefits from extracting structured filters (entities, locations) before embedding generation
- Context formatting for LLMs matters - we learned to filter out embeddings and technical metadata to reduce token costs
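Stripping embeddings and technical metadata before building the LLM context is essentially a one-line filter over each retrieved hit; the field names below are illustrative, not Memorly's exact schema:

```python
# Keep only human-readable fields when formatting retrieved hits
# into the LLM prompt; embeddings and scores just burn tokens.
DROP_FIELDS = {"embedding", "vector", "distance", "internal_id"}

def to_context(hit):
    return {k: v for k, v in hit.items() if k not in DROP_FIELDS}

hit = {"description": "Christmas dinner", "people": ["Sarah"],
       "embedding": [0.1] * 768, "distance": 0.12}
context = to_context(hit)
```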
Development Best Practices
- Health checks at every service level enable faster debugging in distributed systems
- Volume mounts + hot reload accelerate development but can have caching pitfalls
- Structured logging with JSON format makes troubleshooting async operations much easier
- Test utilities (like our populate script) are essential for iterating on search quality
What's next for Memorly - AI-Powered Personal Memory Assistant
Short-term Roadmap
🔜 Mobile App Integration: iOS/Android apps for on-device photo capture and sync
🔜 Real-time Notifications: "You took a photo at this location 1 year ago today"
🔜 Multi-user Support: Family memory sharing with privacy controls
🔜 Advanced Filters: Date ranges, weather conditions, detected emotions
🔜 Memory Collections: Auto-generated albums based on semantic clustering
Long-term Vision
🚀 Offline-First Architecture: On-device embeddings with periodic cloud sync
🚀 Voice Interface: "Hey Memorly, when did I last see grandma?"
🚀 Memory Timeline: Interactive visualization of life events and patterns
🚀 Cross-Platform Sync: Desktop, web, and mobile with end-to-end encryption
🚀 Smart Insights: "You've visited 15 cities this year" or "You take more photos on weekends"
🚀 Integration Ecosystem: Import from Google Photos, iCloud, social media platforms
🚀 Collaborative Memories: Merge media from multiple people at the same event
🚀 Advanced AI Features:
- Emotion detection in photos
- Activity recognition in videos
- Audio transcription for voice memos
- Object permanence ("Where did I last see my keys?")
Research Directions
📊 Improved Embedding Models: Fine-tune models on personal photo datasets
📊 Incremental Learning: Update face clusters without full reprocessing
📊 Privacy-Preserving Search: Homomorphic encryption for cloud-based retrieval
📊 Federated Learning: Learn from aggregated usage patterns while preserving privacy
Memorly represents the future of personal memory management - moving beyond manual organization to intelligent, conversational retrieval. Our goal is to make finding and reliving memories as natural as remembering them yourself. 🧠✨
Built With
- ai/ml
- backblaze-b2
- computer-vision
- docker
- face-recognition
- fastapi
- google-gemini-api
- llm
- microservices
- milvus
- mongodb
- multimodal-embeddings
- natural-language-processing
- next.js
- opencv
- pyscenedetect
- python
- rag
- semantic-search
- sse-streaming
- vector-database
