Omni-Operator V1

Inspiration

As engineers and creators at Operators Forge, we felt trapped in the "SaaS Tax" cycle. We were paying multiple monthly subscriptions for tools that handle editing, transcription, and social media scheduling—tools that are "black boxes" where you lose control over your data. Our inspiration was to build a Sovereign AI Factory: a local-first production line where the Operator owns the infrastructure, the memory, and the reasoning logic. We wanted to prove that with the power of Gemini 3 Flash Preview, one can run an entire media agency from a single local machine without relying on external SaaS platforms.

What it does

Omni-Operator V1 is an autonomous media factory that transforms raw video footage into a multi-platform content campaign.

  • Analyzes: It "watches" raw MP4 files using Gemini's native multimodality to identify viral hooks. It bypasses traditional transcription, understanding the scene's energy directly.
  • Writes: It generates unique, platform-optimized strategy and copy for TikTok, YouTube, and LinkedIn, validated via PydanticAI to ensure data integrity.
  • Manufactures: It physically executes sub-second precise cuts and performs vertical reframing (9:16) using an automated FFmpeg engine.
  • Remembers: It uses a local Qdrant Vector DB to store campaign data, allowing the system to learn and retrieve the creator's unique style for future missions.
  • Distributes: It uses the Model Context Protocol (MCP) to autonomously organize and manage the local file system.

How we built it

We architected a high-density "Sovereign Stack" designed for autonomy:

  • Cognitive Engine: Gemini 3 Flash Preview via the new google-genai SDK for high-speed multimodal reasoning.
  • Logic & Agency: PydanticAI for type-safe agentic orchestration and structured outputs.
  • Vector Memory: Qdrant running locally in Docker to manage brand experience and RAG capabilities.
  • Observability: Langfuse v2 for local tracing, debugging, and cost-per-mission analysis.
  • Media Engine: A custom Python service controlling FFmpeg and MoviePy to automate the rendering process.
  • Tactical Interface: A professional "Mission Control" dashboard built with Next.js 16 and Tailwind 4.

Challenges we ran into

The primary challenge was bridging the gap between "AI Reasoning" and "Technical Execution." We had to ensure that the timestamp markers identified by Gemini matched perfectly with the frame-accurate requirements of FFmpeg. Additionally, orchestrating an entire enterprise-grade stack (FastAPI, Qdrant, Langfuse, Postgres) within a local Docker environment while ensuring low latency in a Next.js frontend required a rigorous approach to network and resource management on a single machine.

Accomplishments that we're proud of

We successfully built a "Zero SaaS Tax" pipeline. We achieved total data sovereignty—your raw video and brand strategies never leave your controlled environment. We are particularly proud of the Native Multimodal Integration; by removing the need for separate speech-to-text or vision models, we've created a much faster and more cost-effective production line. The system is truly autonomous: from one raw upload to three formatted, described, and sorted video assets.

What we learned

Building this project proved that Gemini 3 Flash is a game-changer for Media-Ops. Its speed allows for real-time iteration, and its massive context window ensures that the agent maintains a coherent narrative across a long video, rather than seeing it in disconnected chunks. We also learned that the future of AI belongs to "Agents with Hands"—systems that don't just chat, but operate directly on file systems and infrastructure through protocols like MCP.

What's next for omni-operator-v1

The next phase is Agentic Quality Control (AQC), where Gemini will autonomously review its own rendered clips against the original mission intent to ensure perfect quality. We are also planning to integrate automated voice cloning and dubbing to allow creators to go global with a single click, and an auto-thumbnail generator that identifies the most visually striking frame from each cut.

Built With

  • docker
  • fastapi
  • ffmpeg
  • gemini-api-(gemini-3-flash-preview)
  • langfuse
  • model-context-protocol-(mcp)
  • moviepy
  • next.js-16
  • pydanticai
  • python-3.12
  • qdrant-vector-db
  • tailwind-css-4
  • typescript
  • uv
Share this project:

Updates

deleted deleted

deleted deleted posted an update

MISSION LOG: Sovereign Stack Fully Operational A quick update on why we chose Gemini 3 Flash Preview as our primary cognitive processor. Unlike traditional pipelines that rely on separate Whisper (STT) and Vision models, Omni-Operator uses Gemini’s native multimodality to 'see' and 'hear' the footage in a single pass. This allowed us to: Eliminate 40% of API latency. Achieve precise 'Temporal Grounding' for automated FFmpeg cuts. Reduce processing costs to near-zero ($0.05 - $0.10 per campaign). Engineering for efficiency is the core of Operators Forge.

Log in or sign up for Devpost to join the conversation.