
How Gemini "Watches" Video

Step 1: Frame Sampling

The model doesn't watch every frame; processing all of them would be prohibitively expensive. Instead:

Original Video: 30 fps × 60 seconds = 1,800 frames
                            ↓
Sampled Frames: ~1-2 frames per second = 60-120 frames

The system selects a subset of frames, likely using some combination of:

  • Regular interval sampling (e.g., every 0.5-1 second)
  • Keyframe detection (scene changes, significant motion)
  • Adaptive sampling (more frames during action, fewer during static scenes)
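The simplest of these strategies, regular interval sampling, can be sketched in a few lines. This is only an illustration of the arithmetic above, not Gemini's actual sampler; the function name and its parameters are made up for this example.

```python
def sample_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Return the indices of frames picked at a regular interval (hypothetical helper)."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never sample faster than the source frame rate.
    target_fps = min(target_fps, native_fps)
    duration_s = total_frames / native_fps
    n_samples = int(duration_s * target_fps)
    step = native_fps / target_fps  # source frames between consecutive samples
    return [min(int(i * step), total_frames - 1) for i in range(n_samples)]


# 30 fps x 60 seconds = 1,800 frames, sampled at 1 and 2 frames per second.
indices_1fps = sample_frame_indices(1800, 30, 1)
indices_2fps = sample_frame_indices(1800, 30, 2)
print(len(indices_1fps), len(indices_2fps))  # 60 120
print(indices_1fps[:3])  # [0, 30, 60]
```

Keyframe detection or adaptive sampling would replace the fixed `step` with a decision based on how much consecutive frames differ.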

Step 2: Visual Tokenization

Each sampled frame is converted into tokens (like words for images):

Frame 1  →  [tok_001] [tok_002] [tok_003] ... [tok_256]
Frame 2  →  [tok_257] [tok_258] [tok_259] ... [tok_512]
Frame 3  →  [tok_513] [tok_514] [tok_515] ... [tok_768]
    ...

Each frame might become ~256-512 tokens. A Vision Transformer (ViT) style encoder typically does this in three steps:

  1. Splits the image into patches (e.g., 16×16 pixel squares)
  2. Converts each patch into an embedding vector
  3. Passes those embeddings on as the "visual tokens"
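To make the patch step concrete, here is a minimal pure-Python sketch. It assumes a 256×256 grayscale frame (an illustrative size, not a documented Gemini input resolution) and stops at the flattened patches; a real ViT would then multiply each patch by a learned projection matrix to get the embedding.

```python
def patchify(image: list[list[float]], patch: int = 16) -> list[list[float]]:
    """Split an H x W grayscale image into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one tile row by row into a single vector.
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


frame = [[0.0] * 256 for _ in range(256)]  # assumed 256 x 256 frame
tokens = patchify(frame)
print(len(tokens), len(tokens[0]))  # 256 256
```

With these assumed sizes, one frame yields 16 × 16 = 256 patches, which lines up with the lower end of the ~256-512 estimate.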

Step 3: Temporal Position Encoding

The model needs to know the order of frames. This is done through positional embeddings:

Frame 1 tokens + [Position: t=0.0s]
Frame 2 tokens + [Position: t=0.5s]
Frame 3 tokens + [Position: t=1.0s]
...

This is similar to how text models know word order, but extended to time.
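One well-known way to encode position is the sinusoidal scheme from the original Transformer paper, applied here to a timestamp in seconds. Whether Gemini uses this scheme, learned embeddings, or something else is not stated above, so treat this purely as an illustration; `temporal_encoding` and `add_time` are hypothetical names.

```python
import math


def temporal_encoding(t_seconds: float, dim: int = 8, base: float = 10000.0) -> list[float]:
    """Sinusoidal encoding of a timestamp; dim must be even."""
    if dim % 2:
        raise ValueError("dim must be even")
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        enc.append(math.sin(t_seconds * freq))
        enc.append(math.cos(t_seconds * freq))
    return enc


def add_time(frame_tokens: list[list[float]], t_seconds: float) -> list[list[float]]:
    """Add the same time encoding to every token of one frame."""
    enc = temporal_encoding(t_seconds, dim=len(frame_tokens[0]))
    return [[x + e for x, e in zip(tok, enc)] for tok in frame_tokens]


frame_1 = [[0.0, 0.0, 0.0, 0.0]]  # one toy 4-dimensional token
print(add_time(frame_1, 0.0))     # [[0.0, 1.0, 0.0, 1.0]]
print(add_time(frame_1, 0.5) != add_time(frame_1, 1.0))  # True
```

Because each timestamp maps to a distinct vector, two otherwise identical frames become distinguishable by when they occur.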


Step 4: Transformer Attention (The Magic)

The self-attention mechanism is what enables understanding across frames:

┌─────────────────────────────────────────────────────────────┐
│                    ATTENTION MECHANISM                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Frame 1        Frame 2        Frame 3        Frame 4     │
│   [person        [person        [person        [person     │
│    standing]      walking]       running]       jumping]   │
│       │              │              │              │        │
│       └──────────────┴──────────────┴──────────────┘        │
│                          │                                  │
│                          ▼                                  │
│            "Person accelerates from standing                │
│             to walking to running to jumping"               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Attention allows every token to "look at" every other token, meaning:

  • Frame 3 can compare itself to Frame 1 (detect changes)
  • The model sees patterns across time (motion)
  • Relationships emerge (cause → effect)
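The "every token looks at every other token" idea is scaled dot-product attention. The sketch below is a deliberately stripped-down version: a single head, with queries, keys, and values all equal to the input tokens and no learned projections, which a real transformer would have.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: list[list[float]]):
    """Single-head attention with Q = K = V = tokens (simplifying assumption)."""
    d = len(tokens[0])
    outputs, all_weights = [], []
    for q in tokens:
        # Similarity of this token to every token, including itself.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Output is a weighted mix of all tokens.
        out = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "frame" vectors: frames 1 and 2 are similar, frame 3 differs.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(frames)
print([round(w, 2) for w in weights[0]])
```

In the printed row, frame 1 puts more weight on the similar frame 2 than on the dissimilar frame 3, which is the mechanism behind "comparing" frames across time.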

Step 5: Multimodal Fusion

Text prompt + Video tokens are processed together:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   YOUR PROMPT          VIDEO TOKENS                         │
│   "Describe the        [Frame1][Frame2][Frame3]...         │
│    people and                                               │
│    activities"                                              │
│        │                      │                             │
│        └──────────┬───────────┘                             │
│                   ▼                                         │
│           TRANSFORMER MODEL                                 │
│                   │                                         │
│                   ▼                                         │
│         GENERATED RESPONSE                                  │
│   "A woman in a video is interacting                        │
│    with holographic displays..."                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
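Conceptually, "processed together" means the prompt and the video end up in one token sequence that the transformer attends over. The sketch below shows only that bookkeeping. The `Token` class, the string payloads standing in for embeddings, and the choice to place video tokens before the text are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    modality: str                  # "video" or "text"
    payload: str                   # placeholder standing in for an embedding
    time_s: Optional[float] = None  # timestamp, only set for video tokens


def build_sequence(prompt: str, frames: list[tuple[float, list[str]]]) -> list[Token]:
    """Concatenate video tokens (in time order) and prompt tokens into one sequence."""
    sequence: list[Token] = []
    for time_s, frame_tokens in sorted(frames, key=lambda f: f[0]):
        sequence.extend(Token("video", tok, time_s) for tok in frame_tokens)
    # Whitespace splitting is a stand-in for a real text tokenizer.
    sequence.extend(Token("text", word) for word in prompt.split())
    return sequence


seq = build_sequence(
    "Describe the people and activities",
    [(0.0, ["tok_001", "tok_002"]), (0.5, ["tok_257", "tok_258"])],
)
print(len(seq))                    # 9
print([t.modality for t in seq])
```

Once everything lives in one sequence, the attention step from Step 4 lets prompt tokens attend to video tokens and vice versa.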

What Enables Each Capability

| Capability | How It Works |
|---|---|
| Visual content | ViT tokenizes each frame into patch embeddings |
| Motion detection | Attention compares the same regions across frames |
| Scene transitions | Large visual changes between consecutive frames stand out |
| Temporal order | Positional embeddings encode the time sequence |
| "First X, then Y" | Attention combined with temporal encoding lets the model order events |

Concrete Numbers (Gemini 2.5 Flash)

| Aspect | Approximate Value |
|---|---|
| Context window | 1 million tokens |
| Tokens per frame | ~256-512 |
| Max frames analyzed | ~2,000-4,000 frames |
| Max video length | ~1 hour (depending on sampling) |
| Sampling rate | ~1-2 fps typically |
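These figures follow from simple division, which the sketch below reproduces. The helper names are made up for this example, and the calculation ignores tokens used by the prompt, audio, and the response unless you pass `reserved_tokens`.

```python
def max_frames(context_tokens: int, tokens_per_frame: int, reserved_tokens: int = 0) -> int:
    """How many frames fit in the context window (hypothetical helper)."""
    return max(context_tokens - reserved_tokens, 0) // tokens_per_frame


def max_minutes(frames: int, fps: float) -> float:
    """Video length in minutes covered by that many sampled frames."""
    return frames / fps / 60


CONTEXT = 1_000_000
print(max_frames(CONTEXT, 256))  # 3906
print(max_frames(CONTEXT, 512))  # 1953
print(round(max_minutes(max_frames(CONTEXT, 256), fps=1), 1))  # 65.1
print(round(max_minutes(max_frames(CONTEXT, 512), fps=2), 1))  # 16.3
```

So the "~1 hour" figure corresponds to the favorable end of the ranges (about 256 tokens per frame at 1 fps); heavier frames or denser sampling shorten the maximum length considerably.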

Key Insight

The model doesn't "watch" video like humans do (continuously). Instead, it:

  1. Sees a strategically sampled set of frames
  2. Encodes each frame as tokens with timestamps
  3. Compares all frames simultaneously via attention
  4. Reasons about the relationships

It's more like seeing the whole video "at once" as a collection of moments, rather than experiencing it sequentially like we do.


=== Interpreter ===
Test: Video analysis and cataloging

=== Interpreter ===
Analyzing your video...

This may take a moment depending on video length.

=== Interpreter ===

Video Analysis Complete

YouTube Ready

Title:

Holographic Presentation in a Futuristic City - Global Virtual Meeting

Description:

2026-01-13

The video showcases a holographic projection of a woman interacting with a digital globe and various virtual interfaces, set against a stunning backdrop of a futuristic city at dusk. Subsequently, several individuals appear in separate holographic screens, suggesting a global virtual meeting or presentation, with a stylized human figure evolving within the central digital globe. Stylized, non-standard text like 'CCOOWCVTV' and 'U090 9 0518' is visible throughout the scene.

Featured:
• female_1: Central holographic figure, initially full-body, then close-up, then represented as a stylized red human figure. When visible, she wears a white top with a red sash or trim, and has dark, short hair.
• female_2: Appears in the top-left holographic screen. She has dark, shoulder-length hair, wears a light-colored top (possibly white or light blue), and is shown gesturing with glowing effects around her hands.
• female_3: Appears in the top-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, and has a slight smile.
• female_4: Appears in the bottom-left holographic screen. She has dark, shoulder-length hair, wears a white top, and is looking down at a child next to her.
• child_1: Appears in the bottom-left holographic screen, next to female_4. This child has light-colored hair and is wearing a red top.
• male_1: A blurry figure in the background of the bottom-left holographic screen with female_4 and child_1. Appears to be an adult male, possibly with a beard, wearing a light-colored top.
• female_5: Appears in the bottom-center holographic screen. She has dark, shoulder-length hair, wears a blue top, and is gesturing with her hands.
• female_6: Appears in the bottom-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, has a slight smile, and is gesturing.

Location:
• Futuristic Cityscape at Dusk/Dawn

Google Sheets Data

Column headers (paste in row 1 if new sheet):

Date,Title,People,Locations,Activities,Summary

Data row (paste in next empty row):

2025-12-13,Holographic Presentation in a Futuristic City - Global Virtual Meeting,"female_1: Central holographic figure, initially full-body, then close-up, then represented as a stylized red human figure. When visible, she wears a white top with a red sash or trim, and has dark, short hair.; female_2: Appears in the top-left holographic screen. She has dark, shoulder-length hair, wears a light-colored top (possibly white or light blue), and is shown gesturing with glowing effects around her hands.; female_3: Appears in the top-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, and has a slight smile.; female_4: Appears in the bottom-left holographic screen. She has dark, shoulder-length hair, wears a white top, and is looking down at a child next to her.; child_1: Appears in the bottom-left holographic screen, next to female_4. This child has light-colored hair and is wearing a red top.; male_1: A blurry figure in the background of the bottom-left holographic screen with female_4 and child_1. Appears to be an adult male, possibly with a beard, wearing a light-colored top.; female_5: Appears in the bottom-center holographic screen. She has dark, shoulder-length hair, wears a blue top, and is gesturing with her hands.; female_6: Appears in the bottom-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, has a slight smile, and is gesturing.",Futuristic Cityscape at Dusk/Dawn,"A central female figure (holographic projection) interacts with a digital globe and other holographic elements.; Multiple individuals are displayed in separate holographic screens, seemingly participating in a virtual meeting or presentation.; The central digital globe displays a stylized human figure (initially an organ, then a full body).","The video showcases a holographic projection of a woman interacting with a digital globe and various virtual interfaces, set against a stunning backdrop of a futuristic city at dusk. Subsequently, several individuals appear in separate holographic screens, suggesting a global virtual meeting or presentation, with a stylized human figure evolving within the central digital globe. Stylized, non-standard text like 'CCOOWCVTV' and 'U090 9 0518' is visible throughout the scene."

Detailed Breakdown

People Identified

| ID | Description |
|---|---|
| female_1 | Central holographic figure, initially full-body, then close-up, then represented as a stylized red human figure. When visible, she wears a white top with a red sash or trim, and has dark, short hair. |
| female_2 | Appears in the top-left holographic screen. She has dark, shoulder-length hair, wears a light-colored top (possibly white or light blue), and is shown gesturing with glowing effects around her hands. |
| female_3 | Appears in the top-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, and has a slight smile. |
| female_4 | Appears in the bottom-left holographic screen. She has dark, shoulder-length hair, wears a white top, and is looking down at a child next to her. |
| child_1 | Appears in the bottom-left holographic screen, next to female_4. This child has light-colored hair and is wearing a red top. |
| male_1 | A blurry figure in the background of the bottom-left holographic screen with female_4 and child_1. Appears to be an adult male, possibly with a beard, wearing a light-colored top. |
| female_5 | Appears in the bottom-center holographic screen. She has dark, shoulder-length hair, wears a blue top, and is gesturing with her hands. |
| female_6 | Appears in the bottom-right holographic screen. She has dark, shoulder-length hair, wears a white collared top, has a slight smile, and is gesturing. |

Locations

| location_1 | Futuristic Cityscape at Dusk/Dawn | An elevated view of a sprawling modern city with numerous skyscrapers, set against a sky transitioning between day and night. Traditional-style roofs are visible in the immediate foreground. |

Activities

  • A central female figure (holographic projection) interacts with a digital globe and other holographic elements.
  • Multiple individuals are displayed in separate holographic screens, seemingly participating in a virtual meeting or presentation.
  • The central digital globe displays a stylized human figure (initially an organ, then a full body).

=== Interpreter ===

✓ All assertions passed!
