Inspiration

Long-form video often arrives as a single, unstructured block of data. Today the industry relies on "dumb" segmentation, such as placing ad breaks at arbitrary 10-minute intervals, which frequently cuts mid-sentence or mid-action and lowers viewer completion rates by 15-30%. Manual segmentation is a bottleneck of its own, costing $50 to $100 in editor time per hour of content. We were inspired to build a system that goes beyond simple shot detection to understand the "breath" of a story, identifying the natural narrative boundaries where viewers are most receptive to transitions.

What it does

SegmentIQ is an AI-powered decision engine that transforms raw video perception into optimized editorial segments. It operates in three distinct modes:

  • Ad-Break Mode: Identifies 5-7 organic pause points (like timeouts or resolved plays) in sports or entertainment for seamless commercial placement.
  • News Mode: Automatically segments broadcasts into individual stories with accurate topic labels.
  • Structural Mode: Detects the functional skeleton of episodic content, including cold opens, act breaks, and credits.

The system provides a pro-editor review interface where users can verify boundaries using side-by-side "filmstrip" frames and audio waveform evidence before exporting to industry-standard JSON, XML, or EDL formats.
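
As an illustration of the export step, here is a minimal sketch of turning boundary timestamps into a JSON segment list with frame-based timecodes. The schema, the function names (`to_timecode`, `export_segments`), and the default of 25 fps non-drop-frame are assumptions made for this example, not SegmentIQ's actual export format.

```python
import json
from typing import List


def to_timecode(seconds: float, fps: int = 25) -> str:
    """Convert seconds to a non-drop-frame HH:MM:SS:FF timecode string."""
    total_frames = round(seconds * fps)
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours, rem = divmod(total_seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}:{frames:02d}"


def export_segments(boundaries_s: List[float], duration_s: float, fps: int = 25) -> str:
    """Turn boundary timestamps into consecutive segments and serialize them.

    The JSON layout here is hypothetical and only meant to show the idea.
    """
    edges = [0.0] + sorted(boundaries_s) + [duration_s]
    segments = [
        {"index": i + 1, "start": to_timecode(a, fps), "end": to_timecode(b, fps)}
        for i, (a, b) in enumerate(zip(edges, edges[1:]))
    ]
    return json.dumps({"fps": fps, "segments": segments}, indent=2)
```

Calling `export_segments([600.0, 1200.0], 1800.0)` would yield three segments covering a 30-minute video. An EDL or XML writer could reuse the same `to_timecode` helper.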

How we built it

We architected a 3-phase multimodal scoring pipeline:

  • Phase 1 (Ingestion): We used TwelveLabs Marengo-Embed 3.0 via AWS Bedrock for temporal visual embeddings and Pegasus 1.2 for semantic chaptering and ASR. Simultaneously, we extracted audio RMS and silence signals using librosa and ffmpeg.
  • Phase 2 (Scoring): We developed a weighted heuristic engine that fuses visual cosine distance, audio silence duration, and Pegasus semantic scores into a single "Boundary Confidence" map.
  • Phase 3 (Selection): A greedy, spacing-aware optimizer selects the top-K non-overlapping breaks based on the chosen workflow's constraints.

The frontend was built with Next.js 14 and Zustand to allow for zero-latency video seeking and "nudging," while the backend was deployed on Baseten using FastAPI.
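
To make Phases 2 and 3 concrete, here is a minimal sketch of a weighted fusion score followed by greedy, spacing-aware top-K selection. The weights, the silence cap, the `Candidate` fields, and the function names are all hypothetical choices for this example; the real engine's values and signal normalization may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    time_s: float       # candidate boundary timestamp, in seconds
    visual_dist: float  # cosine distance between adjacent visual embeddings
    silence_s: float    # duration of silence around the candidate, in seconds
    semantic: float     # semantic boundary score, assumed to be in [0, 1]


# Assumed values for illustration only.
WEIGHTS = {"visual": 0.4, "audio": 0.3, "semantic": 0.3}
MAX_SILENCE_S = 2.0  # silence longer than this earns no extra credit


def boundary_confidence(c: Candidate) -> float:
    """Fuse the three signals into one score in [0, 1]."""
    visual = min(max(c.visual_dist, 0.0), 1.0)
    audio = min(max(c.silence_s, 0.0), MAX_SILENCE_S) / MAX_SILENCE_S
    semantic = min(max(c.semantic, 0.0), 1.0)
    return (WEIGHTS["visual"] * visual
            + WEIGHTS["audio"] * audio
            + WEIGHTS["semantic"] * semantic)


def select_breaks(cands: List[Candidate], k: int, min_gap_s: float) -> List[Candidate]:
    """Pick up to k candidates, best score first, skipping any that fall
    within min_gap_s of an already chosen break."""
    ranked = sorted(cands, key=boundary_confidence, reverse=True)
    chosen: List[Candidate] = []
    for c in ranked:
        if len(chosen) == k:
            break
        if all(abs(c.time_s - p.time_s) >= min_gap_s for p in chosen):
            chosen.append(c)
    return sorted(chosen, key=lambda c: c.time_s)
```

In this sketch a workflow's constraints reduce to `k` and `min_gap_s`: Ad-Break Mode might ask for a handful of widely spaced breaks, while News Mode might allow many closely spaced ones. The greedy pass is simple to debug, but unlike a dynamic-programming optimizer it is not guaranteed to find the best overall set.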

Challenges we ran into

One of our primary challenges was balancing technical sophistication with the "Production Viability" required for a 48-72 hour build. We initially designed a 6-phase research pipeline built around complex Dynamic Programming (DP), but realized it was prone to "dependency hell" and debugging rabbit holes. We pivoted to a narrowed Minimum Viable Winning Product (MVWP), replacing the DP optimizer with a weighted heuristic that proved equally effective for the demo. In addition, managing state between a high-resolution video player and real-time editor controls forced us to move from Streamlit to a more robust Next.js architecture.

Accomplishments that we're proud of

  • The Decision Layer: We successfully built an intelligence layer that doesn't just "see" video but "decides" on it, filling a critical gap in the TwelveLabs ecosystem.
  • Multimodal Fusion: Our engine effectively synchronizes visual, auditory, and semantic signals to achieve high precision in boundary placement.
  • The Editor Interface: We created a "kickass" UI featuring frame-accurate filmstrips and audio waveform overlays that allow editors to trust the AI's decisions instantly.
  • Validation: The system maintains high performance, processing a 60-minute video in under 30 minutes while hitting our target F1 scores.
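
The write-up does not say how F1 was measured. One common way to score boundary detection, sketched below, is to count a predicted boundary as correct if it falls within a tolerance window of a reference boundary that has not already been matched. The function name `boundary_f1` and the 2-second default tolerance are assumptions for this example.

```python
from typing import List


def boundary_f1(predicted: List[float], reference: List[float],
                tolerance_s: float = 2.0) -> float:
    """F1 for boundary timestamps with one-to-one greedy matching.

    Each predicted boundary is matched to the closest still-unmatched
    reference boundary within tolerance_s seconds, if one exists.
    """
    unmatched = sorted(reference)
    true_positives = 0
    for p in sorted(predicted):
        in_range = [r for r in unmatched if abs(p - r) <= tolerance_s]
        if in_range:
            best = min(in_range, key=lambda r: abs(p - r))
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with predictions at 10 s, 61 s, and 200 s against reference boundaries at 12 s, 60 s, 120 s, and 300 s, two predictions match, giving a precision of 2/3, a recall of 1/2, and an F1 of 4/7.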

What we learned

We learned that "Perception" is only half the battle in AI video workflows. Models like Marengo and Pegasus are incredibly powerful at describing what is happening, but the real business value lies in the "Decision Intelligence": the logic that determines whether a moment is a good break. The project also reinforced for us the importance of grounding AI prompts in established theory, such as using Event Segmentation Theory (EST) to guide Pegasus in identifying narrative closure.

What's next for Narrative AI

The future of SegmentIQ involves deeper integration into the professional creative suite:

  • Generative Transitions: Using LTX Video to automatically generate 3-second visual bumpers or transitions based on Pegasus scene summaries.
  • NLE Integration: Direct plugins for Adobe Premiere and DaVinci Resolve so editors can pull SegmentIQ markers directly onto their timelines.
  • Real-Time Processing: Optimizing the pipeline to support live-stream segmentation for dynamic ad insertion (DAI).
  • Narrative Tension Calibration: Incorporating LLM-based emotional arc analysis to further refine ad placement during moments of "resolved tension".
