Inspiration
Everyone has thousands of photos and videos on their phone: people you care about, pets being ridiculous, birthday candles, street scenes from trips half-forgotten. They sit there gathering digital dust.
We asked: what if Gemini 3 could understand your entire gallery, then turn your real life into an anime music video?
The thesis is simple: the world needs an extremely personalized creative assistant on your phone. Human content at the core, human-directed, AI-assisted. Your memories are the raw material. Gemini is the creative engine. You steer.
What it does
Upload your photo/video gallery. VibeCut's agentic pipeline:
- Understands your media in one multimodal call, reasoning across your entire gallery to recognize pets, people, places, and moods (Gemini 3 Flash, 1M long context)
- Creates characters as anime character sheets faithful to your real subjects at 2K resolution (Gemini 3 Pro Image)
- Builds a story with you through agentic conversation with personalized questions and options, never templates (Gemini 3 Flash, native tool calling)
- Draws a 4-panel manga with dialogue, camera directions, and cross-panel character consistency via dual-anchor technique (Gemini 3 Pro Image + Pro reasoning)
- Animates each panel into a 4-second video clip with subtle motion (Veo 3.1)
- Writes and performs a song with lyrics shaped by the story, self-reviewed by Gemini for quality (Gemini 3 Pro + ElevenLabs vocals)
- Composes everything into a 16s vertical music video with karaoke captions
Output: a 1080x1920 music video with original vocals, from phone photos to finished content.
How we built it
Agentic-First UX
The core design decision: zero hardcoded conditionals for creative content. The entire UX is driven by Gemini's agentic tool calls.
There is no `if (character.type === 'pet') return ['Cozy day', 'Adventure']`. Gemini sees the full session context (which subjects have been shown, loved, or skipped; which characters are saved; what story is in progress) and decides what to do next via native function calling. The system prompt is rebuilt every turn with the latest state.
The agent has composable skills it can invoke: analyze_gallery, show_card, create_character, ask_story_question, confirm_story, create_manga. Each skill is a focused, testable unit. Gemini orchestrates them based on context, not a fixed script. Every user gets a different experience because the AI is reasoning about their content.
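A minimal sketch of that loop, assuming the google-genai Python SDK; the skill stubs, session shape, and prompt wording are illustrative stand-ins, not the production code:

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY in the environment

# The six agent-facing skills; real bodies call the pipeline, these are stubs.
SKILLS = {
    "analyze_gallery":    lambda session, **kw: {"widget": "gallery_summary", **kw},
    "show_card":          lambda session, **kw: {"widget": "subject_card", **kw},
    "create_character":   lambda session, **kw: {"widget": "character_sheet", **kw},
    "ask_story_question": lambda session, **kw: {"widget": "story_question", **kw},
    "confirm_story":      lambda session, **kw: {"widget": "story_confirm", **kw},
    "create_manga":       lambda session, **kw: {"widget": "manga", **kw},
}

declarations = [
    types.FunctionDeclaration(name=name, description=f"Invoke the {name} skill.")
    for name in SKILLS
]

def agent_turn(session: dict, user_event: str) -> dict:
    # The system prompt is rebuilt from the full session state on every turn,
    # so Gemini always reasons over current reactions, characters, and story.
    system = (
        f"Shown subjects: {session['shown']}\n"
        f"User reactions: {session['reactions']}\n"
        f"Saved characters: {session['characters']}\n"
        f"Story progress: {session['story']}"
    )
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=user_event,
        config=types.GenerateContentConfig(
            system_instruction=system,
            tools=[types.Tool(function_declarations=declarations)],
        ),
    )
    call = response.function_calls[0]  # Gemini decides which skill runs next
    return SKILLS[call.name](session, **dict(call.args or {}))
```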
Following Google's A2UI pattern, each agent turn renders one rich in-place widget (message + images + reaction buttons + action buttons) inside the chat. No modal dialogs, no page navigations, just a conversational flow of cards you tap through. Mobile-first.
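Schematically, the per-turn widget payload might look like this (field names are hypothetical, not the actual A2UI schema):

```python
from dataclasses import dataclass, field

@dataclass
class TurnWidget:
    """The single rich in-place card rendered for one agent turn."""
    message: str                                        # conversational text
    images: list[str] = field(default_factory=list)     # e.g. character-sheet URLs
    reactions: list[str] = field(default_factory=list)  # e.g. ["love", "skip"]
    actions: list[str] = field(default_factory=list)    # e.g. ["Save character"]
```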
Four Gemini 3 Models in One Pipeline
| Model | Gemini 3 Capability | Role |
|---|---|---|
| `gemini-3-flash-preview` | Multimodal understanding, 1M context, low-latency reasoning | Analyzes mixed photo/video galleries in one call. Powers the agentic conversation loop with native tool calling. |
| `gemini-3-pro-preview` | Deep reasoning, self-evaluation | Writes lyrics with a self-review quality gate. Generates character appearance descriptions for manga consistency. Creates per-section musical styles. Verifies final output visually. |
| `gemini-3-pro-image-preview` | Native image generation, interleaved output | Character sheets (full-body + portrait at 2K). 4-panel manga with dialogue and camera directions. |
| `veo-3.1-fast-generate-preview` | Video generation from image input | Minimal-motion animation (breathing, blinking, environmental sway) from manga keyframes. 4s per panel. |
For dialogue mode, character voices use Qwen3-TTS, with word-level timestamps from Qwen3-ForcedAligner (~30ms precision). Music mode uses ElevenLabs for vocals, with Gemini-generated per-section styles.
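To illustrate how word-level timestamps become karaoke captions, here is a small sketch that groups aligner output into SRT blocks FFmpeg can burn in; the `(word, start_s, end_s)` tuple format and the four-words-per-line grouping are assumptions, not VibeCut's actual logic:

```python
def to_srt_time(t: float) -> str:
    # Seconds -> "HH:MM:SS,mmm" as required by SRT.
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02}:{int(m):02}:{int(s):02},{int((s % 1) * 1000):03}"

def words_to_srt(words: list[tuple[str, float, float]], per_line: int = 4) -> str:
    # Group word-level timestamps into caption lines spanning the group's time range.
    blocks = []
    for i in range(0, len(words), per_line):
        group = words[i:i + per_line]
        start, end = group[0][1], group[-1][2]
        text = " ".join(w for w, _, _ in group)
        blocks.append(
            f"{len(blocks) + 1}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```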
Dual-Anchor Character Consistency
Before generating manga panels, Gemini Pro reasons about each character sheet and produces a text description: "orange tabby with green eyes, blue collar with bell charm, short fluffy fur with darker stripes." This description is embedded alongside the visual reference in every panel prompt. Two anchors (visual + textual) that prevent the image model from drifting, even in complex action scenes.
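A sketch of how the two anchors might be combined in a single panel request, assuming the google-genai SDK; the prompt wording and function name are illustrative:

```python
from google import genai
from google.genai import types

client = genai.Client()

def render_panel(sheet_png: bytes, text_anchor: str, panel_script: str):
    # Anchor 1: the character-sheet reference image.
    # Anchor 2: Gemini Pro's text description of the same character.
    prompt = (
        "Draw one manga panel. Keep the character EXACTLY consistent with the "
        f"reference image AND this description: {text_anchor}\n"
        f"Panel script (dialogue + camera direction): {panel_script}"
    )
    return client.models.generate_content(
        model="gemini-3-pro-image-preview",
        contents=[types.Part.from_bytes(data=sheet_png, mime_type="image/png"), prompt],
    )
```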
Architecture
YOUR GALLERY (photos + videos)
|
v
+=========================================+
| AGENTIC UX LOOP (Gemini 3 Flash) |
| Session context rebuilt every turn. |
| 6 composable skills via tool calling. |
| A2UI widgets: cards, buttons, reactions.|
| |
| User taps ──> Gemini reasons ──+ |
| ^ | |
| +──────── next widget <────────+ |
+=========================================+
| | |
v v v
+-----------+ +-----------+ +---------------------------+
|1. UNDER- | |2. CHAR- | |3. STORY + MANGA |
| STAND | | ACTER | | |
| Flash | | Pro Image | | Pro: character description|
| 1M-context| | 2K sheets | | (dual-anchor text) |
| multimodal| | from refs | | Pro Image: 4-panel manga |
+-----------+ +-----------+ | dialogue + camera dirs |
+---------------------------+
|
+-----------+-----------+
| |
v v
+-------------------+ +-------------------+
| 4a. ANIMATE | | 4b. MUSIC |
| Veo 3.1 | | Pro: lyrics + |
| 4 x 4s clips | | self-review |
| minimal motion | | ElevenLabs: song |
+-------------------+ | with vocals |
| +-------------------+
| |
+-----------+-----------+
|
v
+---------------------------+
| 5. COMPOSE + VERIFY |
| FFmpeg: concat + captions |
| -> 1080x1920 h264+aac |
| Pro: visual verification |
+---------------------------+
|
v
MUSIC VIDEO
16s vertical, karaoke captions
Key insight: The agentic loop wraps everything. Gemini Flash doesn't just run the pipeline linearly; it decides when to invoke each skill based on the user's choices, reactions, and session state. The user steers through A2UI widgets, and Gemini reasons about what to do next.
Backend: Python/FastAPI with 12 composable skills (~10K lines). No GPU required. Deployed live at vibecut.whatif.art.
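For flavor, one plausible shape of the compose step (step 5 in the diagram above); the actual filenames and filter flags VibeCut uses may differ:

```python
import subprocess

# Concat list for FFmpeg's concat demuxer: the four 4s Veo clips in order.
with open("clips.txt", "w") as f:
    f.write("\n".join(f"file 'clip_{i}.mp4'" for i in range(4)))

subprocess.run([
    "ffmpeg", "-y",
    "-f", "concat", "-safe", "0", "-i", "clips.txt",  # 4 x 4s panels -> 16s video
    "-i", "song.mp3",                                  # ElevenLabs vocal track
    "-vf", "subtitles=captions.srt,scale=1080:1920",   # burn karaoke captions, vertical
    "-map", "0:v", "-map", "1:a", "-shortest",
    "-c:v", "libx264", "-c:a", "aac",
    "final.mp4",
], check=True)
```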
Challenges we ran into
- Character drift in manga panels. Action scenes caused the image model to hallucinate different characters. Fixed with dual-anchor: Gemini Pro reasons about the character sheet and produces a text description that anchors every panel prompt alongside the visual reference.
- Agentic UX state management. With Gemini deciding every next step, the system prompt must faithfully represent the full session state. We rebuild context every turn: gallery analysis, shown subjects, user reactions, saved characters, story progress. Context engineering turned out to be real engineering.
- Model selection within the family. Flash for speed, Pro for quality, Pro Image for generation. Using Pro for lyrics instead of Flash made a huge difference, and the self-review gate (Gemini rates lyrics on storytelling/singability/energy, regenerates if < 7; sketched at the end of this section) pushed quality further.
- Mixed media reasoning. Photos go inline, but videos must go through Gemini's Files API (upload, poll, pass the URI). We needed filename-based media labels (`[Media X: filename]`) so Gemini could accurately cross-reference indices across 30+ mixed items in one 1M-context call.
- Production resilience. Veo generation takes ~50s per clip, creating minutes of silence that killed SSE connections through Cloudflare Tunnel. Fixed with per-clip progress events and keepalive heartbeats (see the sketch below).
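The heartbeat fix, sketched with FastAPI's StreamingResponse; the queue wiring and 10s interval are illustrative:

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
progress_queue: asyncio.Queue = asyncio.Queue()  # render workers push per-clip events here

async def sse_with_heartbeat(queue: asyncio.Queue, interval: float = 10.0):
    while True:
        try:
            event = await asyncio.wait_for(queue.get(), timeout=interval)
            yield f"data: {event}\n\n"  # real progress event, e.g. "clip 2/4 done"
            if event == "done":
                return
        except asyncio.TimeoutError:
            # SSE comment line: ignored by clients, but keeps the tunnel alive.
            yield ": keepalive\n\n"

@app.get("/progress")
async def progress():
    return StreamingResponse(sse_with_heartbeat(progress_queue),
                             media_type="text/event-stream")
```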
22 bugs documented and fixed, each one a lesson in building production AI pipelines where every API call is non-deterministic.
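A sketch of the self-review gate mentioned above, assuming the google-genai SDK; the prompts, JSON shape, and retry cap are illustrative:

```python
import json
from google import genai
from google.genai import types

client = genai.Client()

def lyrics_with_review(story: str, max_tries: int = 3) -> str:
    lyrics = ""
    for _ in range(max_tries):
        lyrics = client.models.generate_content(
            model="gemini-3-pro-preview",
            contents=f"Write 16-second song lyrics for this story:\n{story}",
        ).text
        # Second Pro call rates the lyrics; structured JSON output keeps parsing simple.
        review = client.models.generate_content(
            model="gemini-3-pro-preview",
            contents=("Rate these lyrics 1-10 on storytelling, singability, and "
                      f'energy. Reply as JSON: {{"score": <overall 1-10>}}\n\n{lyrics}'),
            config=types.GenerateContentConfig(response_mime_type="application/json"),
        ).text
        if json.loads(review).get("score", 0) >= 7:  # quality gate
            break  # good enough; otherwise regenerate
    return lyrics
```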
Accomplishments that we're proud of
- Agentic-first architecture. The AI decides the creative flow from the first interaction. Zero hand-wired conditionals. Every user gets a genuinely different experience because Gemini reasons about their content.
- Full Gemini 3 model spectrum. Flash (the eyes), Pro (the brain), Pro Image (the hands), Veo 3.1 (the animator). Each model chosen for its strength, all working in concert.
- Composable skill architecture. 12 independent skills that the agent orchestrates via tool calls. Each skill is testable in isolation, and the agentic layer composes them into creative workflows that feel seamless.
- Dual-anchor character consistency. Visual reference + Gemini Pro text description keeps characters faithful across all panels. We haven't seen this technique documented elsewhere.
- Production-grade at vibecut.whatif.art. A live web app with streaming progress, per-session isolation, and zero GPU requirements.
What we learned
- Agentic UX beats scripted UX. Letting Gemini decide the creative flow produces more engaging, personalized interactions than any state machine. The key is rebuilding context faithfully every turn.
- Composable skills + native tool calling = powerful orchestration. Small, focused skills that Gemini invokes via function calling create emergent creative workflows without complex routing logic.
- Multimodal reasoning across a full gallery works. Gemini 3 Flash reasons across 30+ mixed photos and videos in a single 1M-context call, accurately grouping subjects and generating structured analysis.
- Self-critique is cheap and effective. Having Gemini Pro rate its own output and regenerate costs one extra API call but noticeably improves quality.
- Redundant anchoring across modalities. When you need consistency from a generative model, give it the same information in multiple forms (image + text). The model rarely drifts when both anchors agree.
What's next for VibeCut
- Mobile readiness: make the key features fully ready as a standalone app, and connect with more personal assets on the user's device
- End-to-end web UI flow in a single uninterrupted session
- Video keyframe extraction using Gemini to pick the best frame from uploaded videos as character reference
- Longer-form content, extending from 16s clips to full 60-90s music videos with scene transitions
- Multi-character ensemble stories with 3+ characters interacting across panels
- Community gallery to share generated music videos and remix others' stories with your own characters
Built With
- amazon-ec2
- cloudflare
- elevenlabs
- fastapi
- gemini
- gemini-3-flash
- gemini-3-pro
- gemini-3-pro-image
- python
- qwen3-asr
- qwen3-tts

