Halftime
Inspiration
Advertising in streaming is broken. You're watching a tense courtroom scene in Suits, fully immersed in Harvey Specter's closing argument, and suddenly—a jarring 30-second spot for car insurance rips you out of the moment. The ad has nothing to do with what you're watching, who you are, or what you might actually want.
We asked ourselves: What if ads weren't interruptions, but natural extensions of the content?
What if, instead of cutting to a separate commercial, a character in the show you're watching naturally picks up a product that's relevant to your interests? What if the "ad" was actually just the show continuing, but with a Coca-Cola can appearing on the table, or someone putting on Beats headphones before a big scene?
That's Halftime—an end-to-end AI pipeline that analyzes video content, understands context, identifies the perfect moment, and generates seamless product placements that feel like they were always meant to be there.
What it does
Halftime is a complete AI-powered product placement platform with three core components:
1. Intelligent Ad Placement Engine
Our multi-pass analysis system finds the perfect moment to insert a product:
- Pass 1 (Transcript Analysis): Grok analyzes subtitles to find natural dialogue gaps, scene transitions, and contextually relevant moments
- Pass 2 (Visual Verification): Grok Vision examines actual video frames at candidate timestamps, selecting placements where products could realistically appear
2. AI Video Generation Pipeline
Once we identify where to place an ad:
- Extract the target segment with precise timestamps
- Generate AI video using WaveSpeed's Wan 2.5 model that naturally integrates the product
- Seamlessly stitch the AI-generated content back into the original video
- Output a complete video where the "ad" is indistinguishable from the original content
3. Dual-Platform Experience
XVideos (Viewer App): A Netflix-style streaming interface where users watch shows with embedded AI placements. OMDB integration provides rich metadata, and our custom player handles seamless playback.
Advertiser Dashboard: Brands can:
- Search for and onboard their company
- Define target demographics and interests
- Set content exclusions (e.g., no alcohol ads in kids' content)
- Monitor real-time analytics: impressions, engagement, conversions
How we built it
Architecture Overview
Frontend Layer (Next.js 15)
├── XVideos (Viewer App)
├── Dashboard (Advertiser Portal)
└── Shared Components (Supabase, OMDB, etc)
│
▼
Backend Layer (FastAPI)
├── Auth & Onboarding
├── Analytics Engine
└── Video Processing Pipeline
│
├── AI: Grok 4.1, WaveSpeed
├── Video: FFmpeg (HW Accel), MoviePy
└── Storage: Cloudflare R2, Supabase
The Multi-Pass AI Pipeline
Step 1: Transcript Parsing
# Parse SRT/VTT subtitles and detect dialogue gaps
gaps = transcript_parser.find_gaps(subtitle_path, min_gap=2.0)
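Under the hood, gap detection is just subtitle-timing arithmetic. A minimal sketch of what `find_gaps` could look like for SRT input (helper names illustrative, not our exact API):

```python
# Parse SRT cue timings and return (start, end) pairs where the pause
# between consecutive cues exceeds min_gap seconds.
import re

SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

def to_seconds(ts: str) -> float:
    h, m, s, ms = SRT_TIME.match(ts).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def find_gaps(srt_text: str, min_gap: float = 2.0):
    """Return (gap_start, gap_end) pairs between consecutive subtitle cues."""
    cues = re.findall(
        r"(\d{2}:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[,.]\d{3})",
        srt_text,
    )
    times = [(to_seconds(a), to_seconds(b)) for a, b in cues]
    gaps = []
    for (_, prev_end), (next_start, _) in zip(times, times[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps
```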
Step 2: Candidate Selection (Grok 4.1)
# Grok analyzes transcript context to find 5 best placement candidates
candidates = grok_client.find_candidate_placements(
transcript=parsed_transcript,
product_info={"product": "Beats Headphones", "category": "audio"},
user_interests=["music", "tech", "fashion"]
)
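Because the xAI API is OpenAI-compatible, the Pass-1 request is a plain chat-completions payload. A sketch of how it might be assembled (prompt wording and helper name illustrative, not our production prompt):

```python
# Build a chat-completions payload asking Grok for the 5 best insertion
# points, as structured JSON, at low temperature for stable output.
import json

def build_candidate_request(transcript: str, product_info: dict, user_interests: list):
    system = (
        "You place products in video. Given a transcript with timestamps, "
        'return JSON: {"candidates": [{"timestamp": float, "reason": str}]} '
        "with the 5 best insertion points."
    )
    user = json.dumps({
        "transcript": transcript,
        "product": product_info,
        "viewer_interests": user_interests,
    })
    return {
        "model": "grok-4-1-fast",
        "temperature": 0.1,  # low temperature keeps the JSON reliable
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
```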
Step 3: Visual Verification (Grok Vision)
# Extract frames at each candidate timestamp
frames = frame_extractor.extract_frames_at_timestamps(video_path, candidates)
# Grok Vision analyzes all frames in a single multi-image request
best_placement = grok_client.select_best_placement_from_frames(
frames=frames,
product_info=product_info
)
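Frame extraction is one ffmpeg invocation per candidate timestamp. A sketch of the command we shell out to (helper name illustrative):

```python
# Grab a single frame at `timestamp` seconds as a JPEG.
def frame_command(video_path: str, timestamp: float, out_path: str):
    return [
        "ffmpeg", "-y",
        "-ss", f"{timestamp:.3f}",   # seek before -i: fast keyframe seek
        "-i", video_path,
        "-frames:v", "1",            # exactly one frame
        "-q:v", "2",                 # high JPEG quality
        out_path,
    ]
```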
Step 4: AI Video Generation (WaveSpeed)
# Extract the segment we want to modify
segment = extract_segment(video_path, start=buffer_start, end=buffer_end)
# Generate AI video with product placement
prompt = "The video continues smoothly. A person naturally picks up or interacts with Beats Headphones. Keep the original visual style."
ai_video = wavespeed_client.generate_video(
video_url=upload_to_temp_hosting(segment),
prompt=prompt,
duration=5
)
Step 5: Seamless Insertion (FFmpeg)
# Hardware-accelerated video stitching
insert_segment(
original_path=video_path,
ai_clip_path=ai_video_path,
cut_start=buffer_start,
cut_end=buffer_end,
output_path=output_path
)
# Uses VideoToolbox for 10x faster encoding
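The timestamp reset is the crux of a clean splice. A sketch of how the single filter_complex string could be assembled (function name hypothetical; input 0 is the original video, input 1 the AI clip):

```python
# Trim the original into before/after pieces, reset every piece's
# timestamps with setpts/asetpts, then concat with the AI clip in between.
def build_insert_filter(cut_start: float, cut_end: float) -> str:
    return (
        f"[0:v]trim=end={cut_start},setpts=PTS-STARTPTS[v0];"
        f"[0:a]atrim=end={cut_start},asetpts=PTS-STARTPTS[a0];"
        f"[0:v]trim=start={cut_end},setpts=PTS-STARTPTS[v2];"
        f"[0:a]atrim=start={cut_end},asetpts=PTS-STARTPTS[a2];"
        f"[1:v]setpts=PTS-STARTPTS[v1];[1:a]asetpts=PTS-STARTPTS[a1];"
        "[v0][a0][v1][a1][v2][a2]concat=n=3:v=1:a=1[outv][outa]"
    )
```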
Key Technical Decisions
Why Grok for placement analysis?
- grok-4-1-fast handles transcript analysis with reliable JSON output at low temperature
- grok-2-vision-latest can analyze multiple frames in a single request, comparing 5+ candidate timestamps efficiently
- The combination gives us context-aware placement that considers both dialogue and visual appropriateness
Why WaveSpeed (Wan 2.5)?
- The only video-to-video model we found that handles copyrighted content without filtering
- Generates 5-10 second clips that maintain visual continuity
- Outputs compatible codecs for seamless stitching
Why FFmpeg with Hardware Acceleration?
- VideoToolbox (macOS) / NVENC (NVIDIA GPUs) cuts encoding time by roughly 10x
- Single filter_complex command ensures proper timestamp alignment
- Auto-detection of source video specs (resolution, fps, audio channels) for perfect matching
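The auto-detection step boils down to parsing ffprobe output. A sketch, assuming ffprobe is run with `-print_format json -show_streams` (function name illustrative; field names follow ffprobe's JSON output):

```python
# Extract the parameters the AI clip must be conformed to from the
# original video's ffprobe JSON.
import json

def parse_specs(ffprobe_json: str) -> dict:
    streams = json.loads(ffprobe_json)["streams"]
    video = next(s for s in streams if s["codec_type"] == "video")
    audio = next(s for s in streams if s["codec_type"] == "audio")
    num, den = video["r_frame_rate"].split("/")
    return {
        "width": video["width"],
        "height": video["height"],
        "fps": int(num) / int(den),
        "sar": video.get("sample_aspect_ratio", "1:1"),
        "audio_channels": audio["channels"],
        "sample_rate": int(audio["sample_rate"]),
    }
```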
Challenges we ran into
1. The "Morphing Character" Problem
Problem: When we prompted WaveSpeed to "have SpongeBob hold Lay's chips," the AI would literally morph SpongeBob INTO a bag of chips.
Solution: We refined our prompt engineering to explicitly separate the product appearance from character interaction.
2. Audio/Video Desynchronization
Problem: After inserting AI clips, the video would freeze while the audio kept playing, then speed up to catch back up.
Solution: We discovered the issue was timestamp discontinuity at segment boundaries. Fixed by using setpts=PTS-STARTPTS to reset timestamps for each segment before concatenation.
3. Codec Mismatch Nightmares
Problem: Original videos were HEVC 1920x1080 5.1 audio. AI clips came back as H264 1262x720 stereo. FFmpeg concat failed with "parameters do not match."
Solution: Built dynamic detection that probes the original video and applies matching transforms to AI clips.
4. The Upload Reliability Saga
Problem: Temporary file hosting services (file.io, 0x0.st) would randomly fail, breaking our WaveSpeed pipeline.
Solution: Implemented cascading fallback across 5 hosting providers with retry logic.
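The fallback logic itself is small. A sketch with the hosting providers abstracted as injected callables (names and structure illustrative):

```python
# Try each upload host in order, with per-host retries; raise only when
# every provider has failed.
def upload_with_fallback(path: str, uploaders: list, retries: int = 2) -> str:
    """uploaders: list of callables (path) -> public URL, tried in order."""
    last_error = None
    for upload in uploaders:
        for _ in range(retries):
            try:
                return upload(path)
            except Exception as err:  # any host failure: move on
                last_error = err
    raise RuntimeError(f"all upload hosts failed: {last_error}")
```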
5. Same Scene Selected for Different Products
Problem: Grok kept selecting the same "men in Italian suits" scene for both Beats headphones AND Tide detergent—clearly inappropriate for laundry detergent.
Solution: Enhanced the Grok Vision prompt to reason about product-scene fit, rejecting transition shots and requiring realistic product placement.
Accomplishments that we're proud of
End-to-End Pipeline
From raw video + subtitles to finished product-placed content in a single command:
python main.py suits_input.json --process --ai --multipass
Sub-5-Minute Processing
A 43-minute episode processes in under 5 minutes thanks to hardware-accelerated encoding, parallel API calls, and stream copying for unchanged segments.
Invisible Ads
Our generated placements are genuinely hard to spot. When we showed test videos to friends, they couldn't identify which scenes were AI-generated.
Production-Ready Infrastructure
- Cloudflare R2 for global video delivery (no egress fees!)
- Supabase for auth that just works
- Real-time analytics tracking impressions, clicks, conversions
Dual-Platform Experience
Both a consumer-facing streaming app AND an advertiser dashboard, each with distinct UX appropriate for their audience.
What we learned
Grok is Incredibly Versatile
We used a multi-model strategy that played to each model's strengths:
- grok-4-1-fast: Fast, reliable JSON for transcript analysis and candidate generation
- grok-2-vision-latest: Multi-image analysis for visual verification
The two-pass architecture (transcript to visual) consistently outperformed single-pass approaches.
Prompt Engineering is Everything for Video AI
WaveSpeed's Wan 2.5 is powerful but literal. We went through 15+ prompt iterations before finding prompts that produced natural placements instead of floating products or morphed characters.
FFmpeg is Harder Than It Looks
We thought video stitching would be the easy part. It wasn't. Always reset timestamps, match ALL parameters (resolution, fps, SAR, audio channels), and use hardware acceleration.
Context > Demographics
Traditional ad targeting focuses on who's watching. We found that what's happening in the scene matters more. A tech product placed during a tech-related dialogue scene performs better than one shown to a "tech enthusiast" during an unrelated moment.
What's next for Halftime
Platform Integration Layer
XVideos is our proof-of-concept. The real vision is Halftime as an invisible layer that sits on top of existing streaming platforms via API.
Advertiser Feedback Loop
We want to implement reinforcement learning where brands upvote/downvote generated placements, and the system learns their brand voice and visual preferences over time.
Live Content Support
Real-time transcript analysis for live streams with sub-second placement decisions.
Personalized Placements
Different viewers watching the same content could see different products—User A sees Gatorade, User B sees Starbucks, same scene.
Built With
- cloudflare
- fastapi
- ffmpeg
- grok
- hls.js
- moviepy
- next.js
- pydantic
- python
- react
- supabase
- tailwindcss
- typescript
- wavespeed

