Inspiration
Every important moment in our lives eventually fades. Photos sit in a gallery app, never revisited. Voice notes expire. The stories behind the most meaningful experiences (a speech at the Presidential Villa, a graduation walk, a father's last road trip) get reduced to thumbnails.
I wanted to build something that doesn't just store memories. Something that narrates them.
The inspiration came from watching Nike's campaign videos (bold, graphic, emotionally precise short films) that make you feel something in 40 seconds. I thought: what if AI could do that for ordinary people? Not for athletes or celebrities, but for a young developer who gave a speech that changed his life. For anyone with a memory worth telling. That idea became Mémoire.
What it does
Mémoire turns personal memories into animated graphic novel films. A user uploads photos and an optional voice recording about a memory. Mémoire's multi-agent AI pipeline:
- Analyses the uploaded media to understand what matters most: not just what's there, but why it matters. Which photo is the emotional peak? What recurring symbols appear? What phrases does the person naturally use when they tell this story?
- Writes a 5-beat cinematic script in the style of a short film, following the arc: anticipation → building → climax → reflection → resolution
- Illustrates each beat as a bold comic panel using Gemini 2.5 Flash Image, with the user's actual photos as visual reference so the character looks like them, in their real environment
- Narrates each panel in a warm voice using Google Cloud TTS, paced naturally with SSML
- Animates every panel into an 8-second video clip using Veo 3 image-to-video
- Plays it back as a 40-second personal graphic novel film the user can watch and share
The entire pipeline streams in real time: users watch their panels draw themselves in as they are created.
How I built it
Multi-Agent Backend (FastAPI)
The backend is built around three specialised agents:
MediaAnalyst Agent runs first. Using Gemini 2.0 Flash with vision, it analyses all uploaded photos collectively, not one at a time. It identifies recurring elements across photos, pinpoints the emotional peak moment, extracts symbolic details (a handshake with an authority figure means recognition), and listens to voice recordings to extract the key phrases the person naturally uses to tell their story. The output is a MemoryBrief: a structured creative brief that guides every subsequent step.
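To make the shape of this concrete, here is a minimal sketch of the collective-analysis call, assuming the google-genai SDK pointed at Vertex AI. The MemoryBrief fields, project ID, and prompt wording are illustrative, not the exact production schema:

```python
# Illustrative sketch only: field names, project ID, and prompt are assumptions.
from dataclasses import dataclass, field

from google import genai
from google.genai import types


@dataclass
class MemoryBrief:
    emotional_peak: str                                   # which moment carries the climax
    recurring_symbols: list[str] = field(default_factory=list)
    key_phrases: list[str] = field(default_factory=list)  # from the voice note
    visual_identity: str = ""                             # elements that must persist across panels


def analyse_memory(photo_bytes: list[bytes]) -> str:
    """Send ALL photos in one request so the model reasons across them collectively."""
    client = genai.Client(vertexai=True, project="my-project", location="us-central1")
    parts = [types.Part.from_bytes(data=b, mime_type="image/jpeg") for b in photo_bytes]
    parts.append(types.Part.from_text(text=(
        "Analyse these photos collectively. Identify the emotional peak, "
        "recurring symbols, and the visual elements that define this memory."
    )))
    response = client.models.generate_content(model="gemini-2.0-flash", contents=parts)
    return response.text  # parsing this into a MemoryBrief is omitted here
```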
PersonalityAgent runs in parallel, building a voice profile and visual style guide from the uploaded media.
ComicAgent orchestrates the full generation pipeline:
- Calls Gemini 2.0 Flash to write the 5-beat script using the MemoryBrief as creative direction
- Generates Panel 1 first (sequential) with user photos as direct visual reference in Gemini 2.5 Flash Image
- Re-analyses Panel 1 with Gemini Vision to extract a "consistency anchor", a precise character description prepended to all subsequent panel prompts
- Generates Panels 2–5 concurrently (staggered) using the consistency anchor
- Fires Veo 3 predictLongRunning per panel and polls via fetchPredictOperation
- Fires Google Cloud TTS per panel concurrently with image generation
- Streams every event to the frontend via SSE as it happens
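The streaming endpoint itself is simple in shape. Here is a minimal sketch of the SSE pattern with FastAPI; the event names mirror the real ones above, but the generator body is a stand-in for the actual pipeline:

```python
# Minimal SSE-over-POST sketch; the sleep is a stand-in for real panel generation.
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def generate_events():
    for i in range(1, 6):
        await asyncio.sleep(1)  # stand-in for image/TTS/Veo work per panel
        yield f"event: panel_ready\ndata: {json.dumps({'panel': i})}\n\n"
    yield "event: comic_done\ndata: {}\n\n"


@app.post("/generate")  # POST because the request carries an upload payload
async def generate():
    return StreamingResponse(generate_events(), media_type="text/event-stream")
```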
Frontend (Next.js)
The frontend consumes the SSE stream using fetch() + ReadableStream (not EventSource, which only supports GET; the generation endpoint is POST). Each SSE event updates UI state: panels draw themselves in as panel_ready events arrive, the "Animating..." spinner appears on animating events, and the "Watch Your Story" button appears when comic_done fires.
The ComicPlayer component uses two <video> elements in an A/B crossfade pattern for seamless panel transitions, with narration audio playing simultaneously.
Google Cloud Stack
- Vertex AI — Gemini 2.0 Flash, Gemini 2.5 Flash Image, Veo 3
- Google Cloud TTS — Neural2 voices with SSML (see the sketch after this list)
- Cloud Storage — all generated images, audio, and video clips
- Cloud Run — backend deployment
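For the TTS item above, a minimal Neural2 + SSML sketch with google-cloud-texttospeech; the voice name, SSML, and output path are illustrative, not necessarily what Mémoire uses:

```python
# Illustrative Neural2 + SSML call; voice name and pacing are assumptions.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
ssml = (
    "<speak>The hall went quiet.<break time='400ms'/>"
    "Then the applause started.</speak>"
)
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(ssml=ssml),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Neural2-D"
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)
with open("panel_1.mp3", "wb") as f:
    f.write(response.audio_content)
```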
Challenges I ran into
Veo 3's undocumented polling pattern. Veo's predictLongRunning endpoint returns UUID-format operation names that look like standard long-running operations, but cannot be polled via GET /operations/{id} (returns 400: "must be a Long"). The correct method is POST :fetchPredictOperation with the full operation name in the request body. This took significant debugging to discover; it is not clearly documented.
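A sketch of the working pattern with google-auth + requests; the project, region, and model ID are placeholders, and the key detail is the POST to :fetchPredictOperation with the operation name in the body:

```python
# Sketch of the Veo polling fix; PROJECT/LOCATION/MODEL are placeholders.
import time

import google.auth
import google.auth.transport.requests
import requests

PROJECT, LOCATION = "my-project", "us-central1"
MODEL = (
    f"projects/{PROJECT}/locations/{LOCATION}"
    "/publishers/google/models/veo-3.0-generate-preview"
)
BASE = f"https://{LOCATION}-aiplatform.googleapis.com/v1/{MODEL}"


def auth_headers() -> dict:
    creds, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    creds.refresh(google.auth.transport.requests.Request())
    return {"Authorization": f"Bearer {creds.token}"}


def poll_veo(operation_name: str, interval: float = 10.0) -> dict:
    """Poll via POST :fetchPredictOperation; GET /operations/{id} returns 400."""
    while True:
        resp = requests.post(
            f"{BASE}:fetchPredictOperation",
            headers=auth_headers(),
            json={"operationName": operation_name},
        )
        resp.raise_for_status()
        op = resp.json()
        if op.get("done"):
            return op
        time.sleep(interval)
```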
POST endpoints cannot use EventSource. All generation endpoints need a request body (POST), but EventSource only handles GET. This caused completely silent failures: no errors, no events, no output. The fix was rewriting all generation streams to use fetch() + ReadableStream with manual SSE line parsing.
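The production fix lives in TypeScript, but the same manual SSE parsing over a POST stream can be sketched in Python with httpx, which is also a handy way to test the endpoint from the command line; the URL and payload here are placeholders:

```python
# Sketch of manual SSE parsing over a POST body; URL and payload are placeholders.
import httpx


def consume_sse(url: str, payload: dict) -> None:
    with httpx.Client(timeout=None) as client:
        with client.stream("POST", url, json=payload) as resp:
            event = None
            for line in resp.iter_lines():
                if line.startswith("event:"):
                    event = line.removeprefix("event:").strip()
                elif line.startswith("data:"):
                    data = line.removeprefix("data:").strip()
                    print(event or "message", data)
                elif line == "":  # a blank line terminates one SSE event
                    event = None


consume_sse("http://localhost:8000/generate", {"memory_id": "demo"})
```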
Image consistency across 5 panels. Getting the same character to appear recognisably across all panels required a two-step approach: first, passing user photos directly as visual context to Gemini 2.5 Flash Image; second, analysing Panel 1's output with Gemini Vision to extract a "consistency anchor" description prepended to all subsequent prompts. Without both steps, characters drifted significantly from panel to panel.
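A sketch of the anchor-extraction step, again assuming the google-genai SDK on Vertex AI; the prompt wording is illustrative:

```python
# Illustrative consistency-anchor extraction; project ID and prompt are assumptions.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")


def extract_anchor(panel1_png: bytes) -> str:
    """Re-analyse Panel 1 to pin down the character for Panels 2-5."""
    resp = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            types.Part.from_bytes(data=panel1_png, mime_type="image/png"),
            types.Part.from_text(text=(
                "Describe this character precisely: face shape, hair, skin tone, "
                "clothing, accessories. One paragraph, suitable for re-drawing them."
            )),
        ],
    )
    return resp.text


# Every later panel prompt then leads with the anchor, e.g.:
# prompt = f"{anchor}\n\nPanel 3: {beat_description}"
```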
Gemini 2.5 Flash Image rate limits. Running image generation concurrently caused consistent 429 errors after the first two panels. The solution was reducing to semaphore=1 (sequential generation) with 15/30/60s exponential backoff and a 3-second stagger between panel task creation.
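In sketch form, where the image-model wrapper and the 429 exception are hypothetical stand-ins for the real SDK calls:

```python
# Rate-limit handling sketch: one permit, 15/30/60s backoff, 3s creation stagger.
import asyncio


class RateLimitError(Exception):
    """Stand-in for the SDK's 429 error."""


async def call_image_model(prompt: str) -> bytes:
    """Hypothetical wrapper around the Gemini 2.5 Flash Image call."""
    raise NotImplementedError


semaphore = asyncio.Semaphore(1)  # semaphore=1: effectively sequential generation
BACKOFFS = [15, 30, 60]           # seconds of backoff after each 429


async def generate_panel(prompt: str) -> bytes:
    async with semaphore:
        for attempt, delay in enumerate([0, *BACKOFFS]):
            if delay:
                await asyncio.sleep(delay)
            try:
                return await call_image_model(prompt)
            except RateLimitError:
                if attempt == len(BACKOFFS):
                    raise  # out of retries


async def generate_all(prompts: list[str]) -> list[bytes]:
    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(generate_panel(prompt)))
        await asyncio.sleep(3)  # 3-second stagger between panel task creation
    return await asyncio.gather(*tasks)
```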
GCS uniform bucket access. The standard blob.make_public() call fails with uniform bucket-level access enabled (which is the recommended GCS security setting). The fix is granting allUsers:objectViewer at the bucket level once, after which all objects are automatically public without per-object ACL calls.
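The one-time, bucket-level grant looks like this with google-cloud-storage; the bucket name is taken from the bug described below:

```python
# One-time grant replacing blob.make_public() under uniform bucket-level access.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("memoire-bucket-gemini")

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {"allUsers"}}
)
bucket.set_iam_policy(policy)  # every object is now publicly readable
```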
The _is_local_mode() bug. A single wrong environment variable name in gcs_service.py caused the service to always run in local mode — even with GCS fully configured. os.environ.get("memoire-bucket-gemini") was reading the literal bucket name as a variable name instead of os.environ.get("GCS_BUCKET_NAME"). This caused both image and TTS services to fail silently for hours.
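The diff was one line:

```python
# The one-line bug in gcs_service.py: the bucket NAME was used as the variable
# name, so the lookup always returned None and the service fell back to local mode.
import os

# Before -- always None:
bucket = os.environ.get("memoire-bucket-gemini")

# After -- reads the variable that actually holds the bucket name:
bucket = os.environ.get("GCS_BUCKET_NAME")  # e.g. "memoire-bucket-gemini"
```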
Accomplishments that I am proud of
Getting the same character to appear consistently across all 5 panels. This was the hardest visual problem: AI image generation notoriously produces a different-looking person in every frame. I solved it with a two-step approach: first passing the user's actual uploaded photos directly to Gemini 2.5 Flash Image as visual reference, then re-analysing Panel 1's output with Gemini Vision to extract a "consistency anchor", a precise character description prepended to every subsequent panel prompt. The result is a comic where the same person appears recognisably across all five panels. That's never trivial with generative AI.

Building a system that understands what matters, not just what's there. The MediaAnalyst agent doesn't describe photos; it analyses them collectively to extract semantic importance. Which photo is the emotional peak? What recurring symbols define this memory's visual identity? What phrases does the person naturally use when they tell this story out loud? That distinction between describing media and understanding it is what makes Mémoire's output feel personal rather than generic.
Shipping a complete multimodal pipeline in one hackathon. Text. Images. Audio. Video. All interleaved. All from a single memory upload. All streaming live to a cinematic frontend. All running on Google Cloud. I'm proud that it works — and that the output is something worth watching.
What I learned
Semantic importance extraction is the biggest quality lever. Having the MediaAnalyst agent analyse what matters most across all photos together, rather than describing each photo individually, was the single biggest improvement in output quality. Once the system understood that "the 3MTT Summit banner appears in every photo and must be preserved," the banner appeared correctly in every generated panel.
Multi-modal AI pipelines require careful sequencing. Running everything concurrently feels efficient but creates quality problems (inconsistent characters, rate limits, missing dependencies). The sequential Panel 1 to consistency anchor to concurrent Panels 2–5 pattern was the right architecture.
The emotional arc matters as much as the visual accuracy. A technically accurate illustration of a memory isn't enough; the story needs an arc. The comic script's five-beat structure (anticipation → building → climax → reflection → resolution) is what makes the output feel like a film rather than a slideshow.
What's next for Mémoire
Built With
- fastapi
- google-cloud
- google-cloud-run
- nextjs
- python
- typescript
- vertex-ai
