Be Concise: The Story Behind the Build

What Inspired Me

I've always believed that great storytelling shouldn't be gatekept by technical skill. Watching content creators spend 6–8 hours editing a 10-minute video (scrubbing timelines, hunting for the perfect cut, agonizing over pacing) felt like a problem worth solving.
The spark came from a simple question:
"What if an AI could watch your footage and think like a seasoned editor?"
Not just cut on silence. Not just transcribe. But reason about narrative arc, emotional energy, and audience engagement — the way a human producer would.
What I Learned

Building Be Concise taught me that video editing is fundamentally a language problem. Every cut is a sentence. Every chapter is a paragraph. The whole video is an argument.
This realization unlocked the multi-agent architecture: specialized AI agents, each owning one editorial dimension:
$$ \text{Final Video} = f(\text{SW} \oplus \text{VD} \oplus \text{MA} \oplus \text{MU} \oplus \text{VA}) $$
Where each agent $A_i$ contributes a scored selection over the segment space $S$:
$$ \text{score}(s_i) = \alpha \cdot \text{quality}(s_i) + \beta \cdot \text{narrative_fit}(s_i) + \gamma \cdot \text{energy}(s_i) $$
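As a rough illustration of the weighted score above, here is a minimal sketch. It assumes each signal is already normalized to [0, 1]; the `Segment` type, the `rank` helper, and the default weight values are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment record; all three signals are assumed to lie in [0, 1]."""
    segment_id: str
    quality: float
    narrative_fit: float
    energy: float


def score(seg: Segment, alpha: float = 0.5, beta: float = 0.3, gamma: float = 0.2) -> float:
    """Weighted sum: alpha * quality + beta * narrative_fit + gamma * energy."""
    return alpha * seg.quality + beta * seg.narrative_fit + gamma * seg.energy


def rank(segments: list[Segment]) -> list[Segment]:
    """Order segments from highest to lowest score using the default weights."""
    return sorted(segments, key=score, reverse=True)
```

Keeping the weights as explicit parameters makes it easy to tune how much narrative fit or energy matters without touching the agents that produce the signals.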
I also learned that AI confidence is not AI correctness — the hardest engineering work was building validation loops, feedback resolution, and graceful degradation when an agent produced something plausible but wrong.
How I Built It

The system is a microservices pipeline where each service owns exactly one concern:
Upload → Transformer → Audio Worker → Video Worker → Brain → Final Export
- Transformer produces the 480p proxy and fires off video insight chunking in parallel
- Audio Worker runs Gladia for speaker-diarized transcription, then AnchorMerge to align word-level timestamps
- Video Worker runs TwelveLabs scene indexing and Gemini gap analysis concurrently
- Brain (Python / LangGraph) orchestrates six AI agents across two phases: analysis, then production
- All coordination is queue-driven via Azure Storage Queues, with Azure Table Storage as the state backbone
The frontend is a React editor where users can review the AI's editorial decisions segment by segment — approve, reject, split, join, or give free-text feedback that re-triggers the agent pipeline.
The Challenges
Transcript Alignment Was Brutally Hard

Raw transcription outputs from different providers (Deepgram, Gladia, Gemini) have incompatible timestamp conventions: some in milliseconds, some in seconds, some drifting by hundreds of milliseconds. I built AnchorMerge, a sliding-window alignment algorithm that scores word-level anchors against sentence boundaries, with fallback interpolation for gaps.
The drift correction math:
$$ t_{\text{corrected}}(w) = t_{\text{prev}} + \frac{t_{\text{next}} - t_{\text{prev}}}{\text{span}} \cdot \Delta_w $$
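where $\text{span} = \max(\varepsilon,\ t_{\text{next}} - t_{\text{prev}})$, with a small $\varepsilon > 0$, prevents division by zero on back-to-back anchors. (A floor of exactly $0$ would still allow a zero denominator.)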
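One possible reading of the interpolation step, sketched in Python. This is not the AnchorMerge implementation: it assumes $\Delta_w$ is the word's offset from the previous anchor on the provider's raw (drifting) clock, that the guarded denominator is the raw distance between the two anchors, and that a small positive floor `eps` is what guards the division. The function and parameter names are hypothetical.

```python
def interpolate_word_time(
    raw_t: float,
    raw_prev: float,
    raw_next: float,
    t_prev: float,
    t_next: float,
    eps: float = 1e-6,
) -> float:
    """Map a raw word timestamp into the corrected interval [t_prev, t_next].

    raw_prev / raw_next: anchor times on the provider's raw clock (seconds).
    t_prev / t_next: the same anchors after alignment (seconds).
    """
    # Guard against back-to-back anchors: the floor must be strictly positive.
    raw_span = max(eps, raw_next - raw_prev)
    # Word offset from the previous anchor, clamped to stay inside the interval.
    delta_w = min(max(raw_t - raw_prev, 0.0), raw_span)
    # Linear interpolation between the corrected anchors.
    return t_prev + (t_next - t_prev) / raw_span * delta_w
```

For example, a word halfway between two raw anchors lands halfway between the corrected anchors, and a word sitting on coincident anchors simply takes the previous anchor's corrected time.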
AI Agents That Lie Confidently

The SegmentLabeler, which classifies every transcript block as GOLD / SILVER / BRONZE, would silently truncate JSON when the response exceeded token limits, producing valid-looking but corrupt output. The fix: double the output budget, add responseMimeType: json, and write a regex fallback that salvages partial arrays.
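A regex fallback of this kind might look like the sketch below. It is an illustration rather than the project's code: it assumes the model returns a JSON array of flat objects (no nested braces, no braces inside string values), and `parse_labels` and the field names in the usage note are hypothetical.

```python
import json
import re

# Matches flat JSON objects only (no nested braces); an assumption of this sketch.
_FLAT_OBJECT = re.compile(r"\{[^{}]*\}")


def parse_labels(raw: str) -> tuple[list[dict], bool]:
    """Return (items, was_salvaged).

    First try a strict parse. If that fails (for example because the response
    was cut off mid-array), keep every complete object that still parses.
    """
    try:
        data = json.loads(raw)
        if isinstance(data, list):
            return data, False
    except json.JSONDecodeError:
        pass

    items: list[dict] = []
    for match in _FLAT_OBJECT.finditer(raw):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # skip fragments that only look like objects
        if isinstance(obj, dict):
            items.append(obj)
    return items, True
```

Given a response cut off as `[{"id": 1, "label": "GOLD"}, {"id": 2, "lab`, the function keeps the first object and reports that the result was salvaged, so the caller can decide whether to retry for the missing blocks.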
Keeping Services Truly Independent

The Video Insight flow (Gemini scene analysis) had to run alongside ingestion without ever touching the brain trigger. One misrouted queue message would fire the AI pipeline prematurely. The solution was a dedicated video-insight-jobs queue with an explicit Type contract, and a strict if/else if chain in the Initiator that prevents any COMPLETED-style value from leaking into the wrong branch.
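The idea of an explicit Type contract with strict branching can be sketched as follows. The Initiator's real language and message schema are not given here, so the `MessageType` values, the `route` function, and the returned action names are all hypothetical.

```python
from enum import Enum


class MessageType(str, Enum):
    # Hypothetical values; the project's actual Type strings are not given in the text.
    INGESTION_COMPLETED = "INGESTION_COMPLETED"
    VIDEO_INSIGHT_COMPLETED = "VIDEO_INSIGHT_COMPLETED"


def route(message: dict) -> str:
    """Return the name of the action for a queue message; reject unknown types."""
    raw_type = message.get("Type")
    try:
        msg_type = MessageType(raw_type)
    except ValueError:
        # A bare "COMPLETED" or a missing Type never falls through to a default branch.
        raise ValueError(f"Unhandled message Type: {raw_type!r}") from None

    if msg_type is MessageType.INGESTION_COMPLETED:
        return "trigger_brain"
    elif msg_type is MessageType.VIDEO_INSIGHT_COMPLETED:
        return "store_video_insight"
    raise AssertionError(f"No branch for {msg_type}")
```

Matching on exact enum members rather than substrings such as "COMPLETED" is what keeps a video-insight message from ever reaching the brain trigger.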
The Export Was Broken in Ways That Were Hard to See

The final export pipeline had a cascade of silent failures: the wrong step name ("GENERATE_FINAL" vs "FINAL_VIDEO_GENERATION") meant the frontend could never confirm completion. The FINAL_VIDEO_READY signal was simply dropped by the Initiator. And the generated video URL was never included in the resources response, so the video existed in blob storage but was invisible to the UI. Each bug individually seemed small; together they made the export feel completely broken.
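One way to guard against this class of bug is to have the writer and the reader share a single step-name constant and to carry the output URL in the same record. The sketch below is illustrative only: which of the two step names the project settled on is not stated, so the value of `FINAL_STEP` is an assumption, and `mark_step_completed`, `build_resources`, and the response field names are hypothetical.

```python
from typing import Optional

# Single shared constant; the chosen value is an assumption for this sketch.
FINAL_STEP = "FINAL_VIDEO_GENERATION"


def mark_step_completed(state: dict, step: str, video_url: Optional[str] = None) -> None:
    """Writer side: record completion (and the output URL, if any) in job state."""
    state[step] = {"status": "COMPLETED", "url": video_url}


def build_resources(state: dict) -> dict:
    """Reader side: the response the frontend polls, including the final video URL."""
    final = state.get(FINAL_STEP, {})
    return {
        "exportCompleted": final.get("status") == "COMPLETED",
        "finalVideoUrl": final.get("url"),
    }
```

If the writer records the step under a different name, such as "GENERATE_FINAL", `build_resources` keeps reporting an incomplete export, which is the symptom described above.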
The Bigger Lesson

$$ \text{Good AI product} = \underbrace{\text{Great AI}}_{\text{the easy part}} + \underbrace{\text{Reliable plumbing}}_{\text{the hard part}} $$
The AI agents were — honestly — the fun part. The hard, unglamorous work was the contract between services: making sure every message had a handler, every status had a writer, every URL had a reader. That invisible infrastructure is what separates a demo from a product.