Halftime
Inspiration
Advertising in streaming is broken. You're watching a tense courtroom scene in Suits, fully immersed in Harvey Specter's closing argument, and suddenly—a jarring 30-second spot for car insurance rips you out of the moment. The ad has nothing to do with what you're watching, who you are, or what you might actually want.
We asked ourselves: What if ads weren't interruptions, but natural extensions of the content?
What if, instead of cutting to a separate commercial, a character in the show you're watching naturally picks up a product that's relevant to your interests? What if the "ad" was actually just the show continuing, but with a Coca-Cola can appearing on the table, or someone putting on Beats headphones before a big scene?
That's Halftime—an end-to-end AI pipeline that analyzes video content, understands context, identifies the perfect moment, and generates seamless product placements that feel like they were always meant to be there.
What it does
Halftime is a complete AI-powered product placement platform with three core components:
1. Intelligent Ad Placement Engine
Our multi-pass analysis system finds the perfect moment to insert a product:
- Pass 1 (Transcript Analysis): Grok analyzes subtitles to find natural dialogue gaps, scene transitions, and contextually relevant moments
- Pass 2 (Visual Verification): Grok Vision examines actual video frames at candidate timestamps, selecting placements where products could realistically appear
2. AI Video Generation Pipeline
Once we identify where to place an ad:
- Extract the target segment with precise timestamps
- Generate AI video using WaveSpeed's Wan 2.5 model that naturally integrates the product
- Seamlessly stitch the AI-generated content back into the original video
- Output a complete video where the "ad" is indistinguishable from the original content
3. Dual-Platform Experience
XVideos (Viewer App): A Netflix-style streaming interface where users watch shows with embedded AI placements. OMDB integration provides rich metadata, and our custom player handles seamless playback.
Advertiser Dashboard: Brands can:
- Search for and onboard their company
- Define target demographics and interests
- Set content exclusions (e.g., no alcohol ads in kids' content)
- Monitor real-time analytics: impressions, engagement, conversions
How we built it
Architecture Overview
Frontend Layer (Next.js 15)
├── XVideos (Viewer App)
├── Dashboard (Advertiser Portal)
└── Shared Components (Supabase, OMDB, etc)
│
▼
Backend Layer (FastAPI)
├── Auth & Onboarding
├── Analytics Engine
└── Video Processing Pipeline
│
├── AI: Grok 4.1, WaveSpeed
├── Video: FFmpeg (HW Accel), MoviePy
└── Storage: Cloudflare R2, Supabase
The Multi-Pass AI Pipeline
Step 1: Transcript Parsing
# Parse SRT/VTT subtitles and detect dialogue gaps
gaps = transcript_parser.find_gaps(subtitle_path, min_gap=2.0)
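Under the hood, gap detection is just subtitle-timing arithmetic. A minimal sketch of what `find_gaps` could look like for SRT input (helper names illustrative, not our exact API):

```python
# Parse SRT cue timings and return (start, end) pairs where the pause
# between consecutive cues exceeds min_gap seconds.
import re

SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

def to_seconds(ts: str) -> float:
    h, m, s, ms = SRT_TIME.match(ts).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def find_gaps(srt_text: str, min_gap: float = 2.0):
    """Return (gap_start, gap_end) pairs between consecutive subtitle cues."""
    cues = re.findall(
        r"(\d{2}:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[,.]\d{3})",
        srt_text,
    )
    times = [(to_seconds(a), to_seconds(b)) for a, b in cues]
    gaps = []
    for (_, prev_end), (next_start, _) in zip(times, times[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps
```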
Step 2: Candidate Selection (Grok 4.1)
# Grok analyzes transcript context to find 5 best placement candidates
candidates = grok_client.find_candidate_placements(
transcript=parsed_transcript,
product_info={"product": "Beats Headphones", "category": "audio"},
user_interests=["music", "tech", "fashion"]
)
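Because the xAI API is OpenAI-compatible, the Pass-1 request is a plain chat-completions payload. A sketch of how it might be assembled (prompt wording and helper name illustrative, not our production prompt):

```python
# Build a chat-completions payload asking Grok for the 5 best insertion
# points, as structured JSON, at low temperature for stable output.
import json

def build_candidate_request(transcript: str, product_info: dict, user_interests: list):
    system = (
        "You place products in video. Given a transcript with timestamps, "
        'return JSON: {"candidates": [{"timestamp": float, "reason": str}]} '
        "with the 5 best insertion points."
    )
    user = json.dumps({
        "transcript": transcript,
        "product": product_info,
        "viewer_interests": user_interests,
    })
    return {
        "model": "grok-4-1-fast",
        "temperature": 0.1,  # low temperature keeps the JSON reliable
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
```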
Step 3: Visual Verification (Grok Vision)
# Extract frames at each candidate timestamp
frames = frame_extractor.extract_frames_at_timestamps(video_path, candidates)
# Grok Vision analyzes all frames in a single multi-image request
best_placement = grok_client.select_best_placement_from_frames(
frames=frames,
product_info=product_info
)
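Frame extraction is one ffmpeg invocation per candidate timestamp. A sketch of the command we shell out to (helper name illustrative):

```python
# Grab a single frame at `timestamp` seconds as a JPEG.
def frame_command(video_path: str, timestamp: float, out_path: str):
    return [
        "ffmpeg", "-y",
        "-ss", f"{timestamp:.3f}",   # seek before -i: fast keyframe seek
        "-i", video_path,
        "-frames:v", "1",            # exactly one frame
        "-q:v", "2",                 # high JPEG quality
        out_path,
    ]
```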
Step 4: AI Video Generation (WaveSpeed)
# Extract the segment we want to modify
segment = extract_segment(video_path, start=buffer_start, end=buffer_end)
# Generate AI video with product placement
prompt = "The video continues smoothly. A person naturally picks up or interacts with Beats Headphones. Keep the original visual style."
ai_video = wavespeed_client.generate_video(
video_url=upload_to_temp_hosting(segment),
prompt=prompt,
duration=5
)
Step 5: Seamless Insertion (FFmpeg)
# Hardware-accelerated video stitching
insert_segment(
original_path=video_path,
ai_clip_path=ai_video_path,
cut_start=buffer_start,
cut_end=buffer_end,
output_path=output_path
)
# Uses VideoToolbox for 10x faster encoding
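The timestamp reset is the crux of a clean splice. A sketch of how the single filter_complex string could be assembled (function name hypothetical; input 0 is the original video, input 1 the AI clip):

```python
# Trim the original into before/after pieces, reset every piece's
# timestamps with setpts/asetpts, then concat with the AI clip in between.
def build_insert_filter(cut_start: float, cut_end: float) -> str:
    return (
        f"[0:v]trim=end={cut_start},setpts=PTS-STARTPTS[v0];"
        f"[0:a]atrim=end={cut_start},asetpts=PTS-STARTPTS[a0];"
        f"[0:v]trim=start={cut_end},setpts=PTS-STARTPTS[v2];"
        f"[0:a]atrim=start={cut_end},asetpts=PTS-STARTPTS[a2];"
        f"[1:v]setpts=PTS-STARTPTS[v1];[1:a]asetpts=PTS-STARTPTS[a1];"
        "[v0][a0][v1][a1][v2][a2]concat=n=3:v=1:a=1[outv][outa]"
    )
```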
Key Technical Decisions
Why Grok for placement analysis?
- grok-4-1-fast handles transcript analysis with reliable JSON output at low temperature
- grok-2-vision-latest can analyze multiple frames in a single request, comparing 5+ candidate timestamps efficiently
- The combination gives us context-aware placement that considers both dialogue and visual appropriateness
Why WaveSpeed (Wan 2.5)?
- The only video-to-video model we found that handles copyrighted content without filtering
- Generates 5-10 second clips that maintain visual continuity
- Outputs compatible codecs for seamless stitching
Why FFmpeg with Hardware Acceleration?
- VideoToolbox (macOS) / NVENC (NVIDIA GPUs) cuts encoding time by roughly 10x
- Single filter_complex command ensures proper timestamp alignment
- Auto-detection of source video specs (resolution, fps, audio channels) for perfect matching
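The auto-detection step boils down to parsing ffprobe output. A sketch, assuming ffprobe is run with `-print_format json -show_streams` (function name illustrative; field names follow ffprobe's JSON output):

```python
# Extract the parameters the AI clip must be conformed to from the
# original video's ffprobe JSON.
import json

def parse_specs(ffprobe_json: str) -> dict:
    streams = json.loads(ffprobe_json)["streams"]
    video = next(s for s in streams if s["codec_type"] == "video")
    audio = next(s for s in streams if s["codec_type"] == "audio")
    num, den = video["r_frame_rate"].split("/")
    return {
        "width": video["width"],
        "height": video["height"],
        "fps": int(num) / int(den),
        "sar": video.get("sample_aspect_ratio", "1:1"),
        "audio_channels": audio["channels"],
        "sample_rate": int(audio["sample_rate"]),
    }
```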
Challenges we ran into
1. The "Morphing Character" Problem
Problem: When we prompted WaveSpeed to "have SpongeBob hold Lay's chips," the AI would literally morph SpongeBob INTO a bag of chips.
Solution: We refined our prompt engineering to explicitly separate the product appearance from character interaction.
2. Audio/Video Desynchronization
Problem: After inserting AI clips, the video would freeze while the audio kept playing, then speed up to catch back up.
Solution: We discovered the issue was timestamp discontinuity at segment boundaries. Fixed by using setpts=PTS-STARTPTS to reset timestamps for each segment before concatenation.
3. Codec Mismatch Nightmares
Problem: Original videos were HEVC 1920x1080 5.1 audio. AI clips came back as H264 1262x720 stereo. FFmpeg concat failed with "parameters do not match."
Solution: Built dynamic detection that probes the original video and applies matching transforms to AI clips.
4. The Upload Reliability Saga
Problem: Temporary file hosting services (file.io, 0x0.st) would randomly fail, breaking our WaveSpeed pipeline.
Solution: Implemented cascading fallback across 5 hosting providers with retry logic.
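The fallback logic itself is small. A sketch with the hosting providers abstracted as injected callables (names and structure illustrative):

```python
# Try each upload host in order, with per-host retries; raise only when
# every provider has failed.
def upload_with_fallback(path: str, uploaders: list, retries: int = 2) -> str:
    """uploaders: list of callables (path) -> public URL, tried in order."""
    last_error = None
    for upload in uploaders:
        for _ in range(retries):
            try:
                return upload(path)
            except Exception as err:  # any host failure: move on
                last_error = err
    raise RuntimeError(f"all upload hosts failed: {last_error}")
```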
5. Same Scene Selected for Different Products
Problem: Grok kept selecting the same "men in Italian suits" scene for both Beats headphones AND Tide detergent—clearly inappropriate for laundry detergent.
Solution: Enhanced the Grok Vision prompt to reason about product-scene fit, rejecting transition shots and requiring realistic product placement.
Accomplishments that we're proud of
End-to-End Pipeline
From raw video + subtitles to finished product-placed content in a single command:
python main.py suits_input.json --process --ai --multipass
Sub-5-Minute Processing
A 43-minute episode processes in under 5 minutes thanks to hardware-accelerated encoding, parallel API calls, and stream copying for unchanged segments.
Invisible Ads
Our generated placements are genuinely hard to spot. When we showed test videos to friends, they couldn't identify which scenes were AI-generated.
Production-Ready Infrastructure
- Cloudflare R2 for global video delivery (no egress fees!)
- Supabase for auth that just works
- Real-time analytics tracking impressions, clicks, conversions
Dual-Platform Experience
Both a consumer-facing streaming app AND an advertiser dashboard, each with distinct UX appropriate for their audience.
What we learned
Grok is Incredibly Versatile
We used a multi-model strategy that played to each model's strengths:
- grok-4-1-fast: Fast, reliable JSON for transcript analysis and candidate generation
- grok-2-vision-latest: Multi-image analysis for visual verification
The two-pass architecture (transcript to visual) consistently outperformed single-pass approaches.
Prompt Engineering is Everything for Video AI
WaveSpeed's Wan 2.5 is powerful but literal. We went through 15+ prompt iterations before finding prompts that produced natural placements instead of floating products or morphed characters.
FFmpeg is Harder Than It Looks
We thought video stitching would be the easy part. It wasn't. Always reset timestamps, match ALL parameters (resolution, fps, SAR, audio channels), and use hardware acceleration.
Context > Demographics
Traditional ad targeting focuses on who's watching. We found that what's happening in the scene matters more. A tech product placed during a tech-related dialogue scene performs better than one shown to a "tech enthusiast" during an unrelated moment.
What's next for Halftime
Platform Integration Layer
XVideos is our proof-of-concept. The real vision is Halftime as an invisible layer that sits on top of existing streaming platforms via API.
Advertiser Feedback Loop
We want to implement reinforcement learning where brands upvote/downvote generated placements, and the system learns their brand voice and visual preferences over time.
Live Content Support
Real-time transcript analysis for live streams with sub-second placement decisions.
Personalized Placements
Different viewers watching the same content could see different products—User A sees Gatorade, User B sees Starbucks, same scene.
Built With
- cloudflare
- fastapi
- ffmpeg
- grok
- hls.js
- moviepy
- next.js
- pydantic
- python
- react
- supabase
- tailwindcss
- typescript
- wavespeed

