How Gemini "Watches" Video
Step 1: Frame Sampling
The model doesn't watch every frame (that would be prohibitively expensive). Instead:
Original Video: 30 fps × 60 seconds = 1,800 frames
↓
Sampled Frames: ~1-2 frames per second = 60-120 frames
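Google does not publish its full selection logic (the Gemini API documentation describes a default of about 1 frame per second), but plausible strategies include: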
- Regular interval sampling (e.g., every 0.5-1 second)
- Keyframe detection (scene changes, significant motion)
- Adaptive sampling (more frames during action, fewer during static scenes)
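Regular interval sampling, the simplest of these strategies, can be sketched in a few lines. This is only an illustration of the arithmetic above, not Gemini's actual sampler; the function name `sample_frame_indices` and its parameters are hypothetical.

```python
import math


def sample_frame_indices(total_frames: int, native_fps: float, sample_fps: float) -> list[int]:
    """Pick frame indices at a fixed rate of `sample_fps` from a video
    recorded at `native_fps` (regular interval sampling)."""
    if total_frames <= 0 or native_fps <= 0 or sample_fps <= 0:
        raise ValueError("all arguments must be positive")
    step = native_fps / sample_fps           # native frames between two samples
    count = math.ceil(total_frames / step)   # how many samples fit in the video
    return [int(i * step) for i in range(count)]


# The example from the text: 30 fps x 60 seconds = 1,800 frames.
print(len(sample_frame_indices(1800, 30, 1)))  # 60
print(len(sample_frame_indices(1800, 30, 2)))  # 120
```

Keyframe detection and adaptive sampling would replace the fixed `step` with a decision based on frame content.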
Step 2: Visual Tokenization
Each sampled frame is converted into tokens (like words for images):
Frame 1 → [tok_001] [tok_002] [tok_003] ... [tok_256]
Frame 2 → [tok_257] [tok_258] [tok_259] ... [tok_512]
Frame 3 → [tok_513] [tok_514] [tok_515] ... [tok_768]
...
Each frame might become ~256-512 tokens, likely via a Vision Transformer (ViT)-style encoder that:
- Splits the image into patches (e.g., 16×16 pixel squares)
- Converts each patch into an embedding vector
- Passes these embeddings to the model as the "visual tokens"
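The patch-splitting step can be illustrated with a toy example. This is a minimal sketch assuming a grayscale frame stored as a list of rows; `split_into_patches` is a hypothetical helper, and a real ViT would follow it with a learned linear projection of each flattened patch, which is omitted here.

```python
def split_into_patches(image, patch_size=16):
    """Split a 2D grayscale image (list of rows) into flattened square patches,
    scanning left to right, top to bottom."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 64x64 "frame" whose pixel value encodes its position.
frame = [[row * 64 + col for col in range(64)] for row in range(64)]
patches = split_into_patches(frame)
print(len(patches), len(patches[0]))  # 16 patches of 256 values each
```

By the same arithmetic, a 256×256 input with 16×16 patches would yield 256 patches, which is one way a per-frame count in the quoted range could arise; the real input resolution and patch size are not stated in the source.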
Step 3: Temporal Position Encoding
The model needs to know the order of frames. This is typically done through positional embeddings:
Frame 1 tokens + [Position: t=0.0s]
Frame 2 tokens + [Position: t=0.5s]
Frame 3 tokens + [Position: t=1.0s]
...
This is similar to how text models know word order, but extended to time.
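One common way to encode position is the sinusoidal scheme from the original Transformer paper, applied here to timestamps instead of word positions. Whether Gemini uses sinusoidal, learned, rotary, or textual timestamps is not public, so treat this purely as a sketch; `time_encoding` and `add_time` are hypothetical names.

```python
import math


def time_encoding(t_seconds: float, dim: int = 8, max_period: float = 10000.0) -> list[float]:
    """Sinusoidal encoding of a timestamp: interleaved sin/cos pairs at
    geometrically decreasing frequencies."""
    if dim % 2:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(dim // 2):
        freq = 1.0 / (max_period ** (2 * i / dim))
        encoding.append(math.sin(t_seconds * freq))
        encoding.append(math.cos(t_seconds * freq))
    return encoding


def add_time(token: list[float], t_seconds: float) -> list[float]:
    """Add the timestamp encoding element-wise to one token embedding."""
    encoding = time_encoding(t_seconds, dim=len(token))
    return [x + e for x, e in zip(token, encoding)]


# The same visual token at two timestamps ends up with different values,
# which is what lets later layers tell frames apart by time.
token = [0.2, -0.1, 0.5, 0.3]
print(add_time(token, 0.0) != add_time(token, 0.5))  # True
```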
Step 4: Transformer Attention (The Magic)
The self-attention mechanism is what enables understanding across frames:
┌─────────────────────────────────────────────────────────────┐
│ ATTENTION MECHANISM │
├─────────────────────────────────────────────────────────────┤
│ │
│ Frame 1 Frame 2 Frame 3 Frame 4 │
│ [person [person [person [person │
│ standing] walking] running] jumping] │
│ │ │ │ │ │
│ └──────────────┴──────────────┴──────────────┘ │
│ │ │
│ ▼ │
│ "Person accelerates from standing │
│ to walking to running to jumping" │
│ │
└─────────────────────────────────────────────────────────────┘
Attention allows every token to "look at" every other token, meaning:
- Frame 3 can compare itself to Frame 1 (detect changes)
- The model sees patterns across time (motion)
- Relationships emerge (cause → effect)
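The "every token looks at every other token" idea is scaled dot-product self-attention. The sketch below is a deliberately stripped-down, single-head version with no learned projections (queries, keys, and values are all the raw tokens), so it shows the mechanism rather than anything specific to Gemini.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with Q = K = V = tokens.
    Returns the mixed outputs and the attention weights."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all tokens.
        outputs.append([sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)])
    return outputs, weights


# Tokens 0 and 2 are identical (say, the same region in two frames),
# so token 0 attends to token 2 more than to the dissimilar token 1.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
print(weights[0][2] > weights[0][1])  # True
```

Because the tokens from all frames sit in one sequence, the same computation that relates words in a sentence relates patches across time.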
Step 5: Multimodal Fusion
Text prompt + Video tokens are processed together:
┌─────────────────────────────────────────────────────────────┐
│ │
│ YOUR PROMPT VIDEO TOKENS │
│ "Describe the [Frame1][Frame2][Frame3]... │
│ people and │
│ activities" │
│ │ │ │
│ └──────────┬───────────┘ │
│ ▼ │
│ TRANSFORMER MODEL │
│ │ │
│ ▼ │
│ GENERATED RESPONSE │
│ "A woman in a video is interacting │
│ with holographic displays..." │
│ │
└─────────────────────────────────────────────────────────────┘
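In practice, "processed together" means the prompt and the video end up in one token sequence that the transformer attends over. The sketch below uses strings as stand-ins for embedding vectors; the `Token` type, `build_sequence`, and the video-then-text ordering are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    modality: str          # "video" or "text"
    frame: Optional[int]   # frame index for video tokens, None for text
    value: str             # placeholder for an embedding vector


def build_sequence(prompt_words, frames):
    """Concatenate per-frame video tokens and prompt tokens into one sequence."""
    sequence = [
        Token("video", index, tok)
        for index, frame in enumerate(frames)
        for tok in frame
    ]
    sequence += [Token("text", None, word) for word in prompt_words]
    return sequence


frames = [["f1_a", "f1_b"], ["f2_a", "f2_b"]]
prompt = "Describe the people and activities".split()
sequence = build_sequence(prompt, frames)
print(len(sequence))  # 4 video tokens + 5 text tokens = 9
```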
What Enables Each Capability
| Capability | How It Works |
|---|---|
| Visual content | ViT-style encoder turns each frame's patches into embedding tokens |
| Motion detection | Attention compares same regions across frames |
| Scene transitions | Attention picks up large visual changes between consecutive frames |
| Temporal order | Positional embeddings encode time sequence |
| "First X, then Y" | Attention + temporal encoding = understanding of event order (and, by inference, cause and effect) |
Concrete Numbers (Gemini 2.5 Flash)
| Aspect | Approximate Value |
|---|---|
| Context window | 1 million tokens |
| Tokens per frame | ~256-512 |
| Max frames analyzed | ~2,000-4,000 frames |
| Max video length | ~1 hour (depending on sampling) |
| Sampling rate | ~1-2 fps typically |
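The figures in this table follow from simple division, which the sketch below makes explicit. It counts only frame tokens and ignores audio, the prompt, and the response; `reserved_tokens` is a hypothetical knob for setting some of the window aside for those.

```python
def max_frames(context_tokens: int, tokens_per_frame: int, reserved_tokens: int = 0) -> int:
    """How many whole frames fit in the context window."""
    return (context_tokens - reserved_tokens) // tokens_per_frame


def max_video_seconds(context_tokens: int, tokens_per_frame: int,
                      sample_fps: float, reserved_tokens: int = 0) -> float:
    """Longest video (in seconds) that fits at a given sampling rate."""
    return max_frames(context_tokens, tokens_per_frame, reserved_tokens) / sample_fps


CONTEXT = 1_000_000
print(max_frames(CONTEXT, 256))                    # 3906
print(max_frames(CONTEXT, 512))                    # 1953
print(max_video_seconds(CONTEXT, 256, 1.0) / 60)   # about 65 minutes
print(max_video_seconds(CONTEXT, 512, 2.0) / 60)   # about 16 minutes
```

The "~1 hour" figure therefore corresponds to the cheap end of both ranges (about 256 tokens per frame at 1 fps); at 512 tokens per frame and 2 fps the same window covers roughly a quarter of that.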
Key Insight
The model doesn't "watch" video like humans do (continuously). Instead, it:
- Sees a strategically sampled set of frames
- Encodes each frame as tokens with timestamps
- Compares all frames simultaneously via attention
- Reasons about the relationships
It's more like seeing the whole video "at once" as a collection of moments, rather than experiencing it sequentially like we do.
=== Test: Video analysis and cataloging ===
=== Interpreter ===
Analyzing your video...
This may take a moment depending on video length.
=== Interpreter ===
Video Analysis Complete
YouTube Ready
Title:
Holographic Presentation in a Futuristic City - Global Virtual Meeting
Description:
2026-01-13
The video showcases a holographic projection of a woman interacting with a digital globe and various virtual interfaces, set against a stunning backdrop of a futuristic city at dusk. Subsequently, several individuals appear in separate holographic screens, suggesting a global virtual meeting or presentation, with a stylized human figure evolving within the central digital globe. Stylized, non-standard text like 'CCOOWCVTV' and 'U090 9 0518' is visible throughout the scene.
Featured:
• female_1: Central holographic figure, initially full-body, then close-up, then represented as a stylized red human figure. When visible, she wears a white top with a red sash or trim, and has dark, short hair.
• female_2: Appears in the top-left holographic screen. She has dark, shoulder-length hair, wears a light-colored top (possibly white or light blue), and is shown gesturing with glowing effects around her hands.
• female_3: Appears in the top-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, and has a slight smile.
• female_4: Appears in the bottom-left holographic screen. She has dark, shoulder-length hair, wears a white top, and is looking down at a child next to her.
• child_1: Appears in the bottom-left holographic screen, next to female_4. This child has light-colored hair and is wearing a red top.
• male_1: A blurry figure in the background of the bottom-left holographic screen with female_4 and child_1. Appears to be an adult male, possibly with a beard, wearing a light-colored top.
• female_5: Appears in the bottom-center holographic screen. She has dark, shoulder-length hair, wears a blue top, and is gesturing with her hands.
• female_6: Appears in the bottom-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, has a slight smile, and is gesturing.
Location:
• Futuristic Cityscape at Dusk/Dawn
Google Sheets Data
Column headers (paste in row 1 if new sheet):
Date,Title,People,Locations,Activities,Summary
Data row (paste in next empty row):
2025-12-13,Holographic Presentation in a Futuristic City - Global Virtual Meeting,"female_1: Central holographic figure, initially full-body, then close-up, then represented as a stylized red human figure. When visible, she wears a white top with a red sash or trim, and has dark, short hair.; female_2: Appears in the top-left holographic screen. She has dark, shoulder-length hair, wears a light-colored top (possibly white or light blue), and is shown gesturing with glowing effects around her hands.; female_3: Appears in the top-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, and has a slight smile.; female_4: Appears in the bottom-left holographic screen. She has dark, shoulder-length hair, wears a white top, and is looking down at a child next to her.; child_1: Appears in the bottom-left holographic screen, next to female_4. This child has light-colored hair and is wearing a red top.; male_1: A blurry figure in the background of the bottom-left holographic screen with female_4 and child_1. Appears to be an adult male, possibly with a beard, wearing a light-colored top.; female_5: Appears in the bottom-center holographic screen. She has dark, shoulder-length hair, wears a blue top, and is gesturing with her hands.; female_6: Appears in the bottom-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, has a slight smile, and is gesturing.",Futuristic Cityscape at Dusk/Dawn,"A central female figure (holographic projection) interacts with a digital globe and other holographic elements.; Multiple individuals are displayed in separate holographic screens, seemingly participating in a virtual meeting or presentation.; The central digital globe displays a stylized human figure (initially an organ, then a full body).","The video showcases a holographic projection of a woman interacting with a digital globe and various virtual interfaces, set against a stunning backdrop of a futuristic city at dusk. Subsequently, several individuals appear in separate holographic screens, suggesting a global virtual meeting or presentation, with a stylized human figure evolving within the central digital globe. Stylized, non-standard text like 'CCOOWCVTV' and 'U090 9 0518' is visible throughout the scene."
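A row like this has to be quoted correctly, because the descriptions contain commas. The sketch below shows one way such a row could be produced with Python's standard `csv` module; it is not the project's actual code, and the `to_csv_row` helper, the record layout, and the shortened field values are assumptions for illustration.

```python
import csv
import io

HEADERS = ["Date", "Title", "People", "Locations", "Activities", "Summary"]


def to_csv_row(record: dict) -> str:
    """Serialize one analysis record as a single CSV line.
    Multi-valued fields are joined with '; ' as in the row above."""
    people = "; ".join(f"{pid}: {desc}" for pid, desc in record["people"].items())
    fields = [
        record["date"],
        record["title"],
        people,
        "; ".join(record["locations"]),
        "; ".join(record["activities"]),
        record["summary"],
    ]
    buffer = io.StringIO()
    # csv.writer quotes any field containing a comma or a quote character.
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")


record = {
    "date": "2025-12-13",
    "title": "Holographic Presentation in a Futuristic City - Global Virtual Meeting",
    "people": {"child_1": "Appears in the bottom-left holographic screen, next to female_4."},
    "locations": ["Futuristic Cityscape at Dusk/Dawn"],
    "activities": ["Multiple individuals are displayed in separate holographic screens."],
    "summary": "A shortened placeholder summary.",
}
print(",".join(HEADERS))
print(to_csv_row(record))
```

Reading the line back with `csv.reader` yields exactly six fields, which is a quick way to check that a pasted row will land in the right columns.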
Detailed Breakdown
People Identified
| ID | Description |
|---|---|
| female_1 | Central holographic figure, initially full-body, then close-up, then represented as a stylized red human figure. When visible, she wears a white top with a red sash or trim, and has dark, short hair. |
| female_2 | Appears in the top-left holographic screen. She has dark, shoulder-length hair, wears a light-colored top (possibly white or light blue), and is shown gesturing with glowing effects around her hands. |
| female_3 | Appears in the top-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, and has a slight smile. |
| female_4 | Appears in the bottom-left holographic screen. She has dark, shoulder-length hair, wears a white top, and is looking down at a child next to her. |
| child_1 | Appears in the bottom-left holographic screen, next to female_4. This child has light-colored hair and is wearing a red top. |
| male_1 | A blurry figure in the background of the bottom-left holographic screen with female_4 and child_1. Appears to be an adult male, possibly with a beard, wearing a light-colored top. |
| female_5 | Appears in the bottom-center holographic screen. She has dark, shoulder-length hair, wears a blue top, and is gesturing with her hands. |
| female_6 | Appears in the bottom-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, has a slight smile, and is gesturing. |
Locations
| ID | Name | Description |
|---|---|---|
| location_1 | Futuristic Cityscape at Dusk/Dawn | An elevated view of a sprawling modern city with numerous skyscrapers, set against a sky transitioning between day and night. Traditional-style roofs are visible in the immediate foreground. |
Activities
- A central female figure (holographic projection) interacts with a digital globe and other holographic elements.
- Multiple individuals are displayed in separate holographic screens, seemingly participating in a virtual meeting or presentation.
- The central digital globe displays a stylized human figure (initially an organ, then a full body).
=== Interpreter ===
✓ All assertions passed!