Inspiration

Short-form video is the dominant content format today, but producing even a 15-second clip requires juggling research, scripting, asset generation, voiceover, and video assembly. We wanted to see if we could compress that entire workflow into a single natural-language prompt — type a topic, get a video. The Arize AI Observability track gave us the perfect excuse to make the pipeline not just autonomous, but self-improving through traced evaluations.

What it does

HyperFrames Agent is a multi-agent video production pipeline built on Google ADK. Given a topic (e.g. "create a 15-second video about the solar system"), it:

  1. Researches the topic using the model's training data
  2. Proposes a structured video concept
  3. Writes a narration script
  4. Plans scenes with visual descriptions
  5. Generates images via Google Imagen and narration via Cloud Text-to-Speech
  6. Reviews and refines all content
  7. Composes a HyperFrames HTML document with GSAP animations and renders it to MP4

After rendering, it evaluates its own output with an LLM-as-a-Judge and logs quality scores to Arize Phoenix — closing the loop between production and observability.
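
As a rough illustration of the scoring step, here is a minimal sketch that turns a judge reply into per-criterion scores. It assumes the judge is prompted to answer with a JSON object on a 0-to-1 scale; the helper names `parse_judge_scores` and `overall_score`, the reply format, and the scale are all illustrative assumptions, not the project's actual code. Logging the result to Phoenix is left out.

```python
import json

# The three criteria each pipeline run is scored on.
CRITERIA = ("correctness", "completeness", "relevance")


def parse_judge_scores(raw_reply: str) -> dict[str, float]:
    """Parse a judge reply (assumed to contain one JSON object) into scores."""
    # Judges often wrap JSON in prose, so slice to the outermost braces.
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("judge reply contains no JSON object")
    data = json.loads(raw_reply[start : end + 1])

    scores = {}
    for name in CRITERIA:
        if name not in data:
            raise ValueError(f"judge reply is missing '{name}'")
        # Clamp to the assumed 0..1 scale so one bad reply cannot skew averages.
        scores[name] = min(1.0, max(0.0, float(data[name])))
    return scores


def overall_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the three criteria (an assumed aggregation)."""
    return sum(scores[name] for name in CRITERIA) / len(CRITERIA)
```

Clamping and strict key checks keep a malformed judge reply from silently producing a misleading score.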

How we built it

  • Agent framework: Google ADK with a root orchestrator delegating to 7 specialist sub-agents (research, proposal, script, scene plan, assets, edit, compose)
  • Image generation: Google Imagen (imagen-3.0-generate-002) via the Vertex GenAI SDK
  • Audio: Google Cloud Text-to-Speech (en-US-Neural2-J)
  • Video rendering: HyperFrames v0.6.91 — generates HTML5 compositions with GSAP timelines and renders them via Chrome + FFmpeg
  • Observability: OpenInference auto-instrumentation for ADK and GenAI, sending traces to Arize Phoenix Cloud
  • Evaluations: LLM-as-a-Judge (using the same GenAI model that drives the agents) scores each pipeline run on correctness, completeness, and relevance; scores are logged back to Phoenix as span evaluations
  • Resilience: Exponential backoff with jitter on Vertex AI 429 rate limits, plus ADK's built-in retry config
  • MCP: Phoenix MCP server configured so agents can introspect their own traces at runtime
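
The tool-level half of the resilience story, exponential backoff with jitter, can be sketched as a decorator. This is a minimal sketch under stated assumptions: `RateLimitError` is a hypothetical stand-in for whatever exception the SDK raises on a 429, and the decorator name, defaults, and "full jitter" strategy are illustrative rather than the project's exact implementation.

```python
import functools
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the 429 error raised by the real SDK."""


def retry_on_rate_limit(max_attempts=5, base_delay=1.0, max_delay=30.0,
                        sleep=time.sleep):
    """Retry the wrapped call on RateLimitError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the error
                    # Ceiling doubles each attempt, capped at max_delay.
                    ceiling = min(max_delay, base_delay * 2 ** attempt)
                    # Full jitter: sleep a random time in [0, ceiling] so
                    # concurrent callers do not retry in lockstep.
                    sleep(random.uniform(0, ceiling))
        return wrapper
    return decorator
```

Passing `sleep` as a parameter is only there to make the decorator testable without real waiting.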

Every tool function is ToolContext-aware, routing all file outputs to session-scoped directories.
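
To make the routing idea concrete, here is a minimal sketch of a context-aware tool. `FakeToolContext` is a hypothetical stand-in that only carries a session id; the real ADK ToolContext exposes more than this, and the `session_dir` and `save_script` helpers and the directory layout are assumptions for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FakeToolContext:
    """Hypothetical stand-in for ADK's ToolContext (session id only)."""
    session_id: str


def session_dir(ctx: FakeToolContext, root: Path, kind: str) -> Path:
    """Return (and create) root/<session_id>/<kind> for this session."""
    sid = ctx.session_id
    # Reject ids that could escape the root directory.
    if not sid or any(bad in sid for bad in ("/", "\\", "..")):
        raise ValueError(f"unsafe session id: {sid!r}")
    path = root / sid / kind
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_script(text: str, tool_context: FakeToolContext, root: Path) -> str:
    """Example tool: write the narration script into the session's folder."""
    out = session_dir(tool_context, root, "scripts") / "narration.txt"
    out.write_text(text, encoding="utf-8")
    return str(out)
```

Because the path is derived from the context passed to each call rather than from process-wide state, two sessions running at once cannot overwrite each other's files.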

Challenges we ran into

  • HyperFrames HTML structure: The LLM could not reliably generate valid HyperFrames compositions; it would omit data-composition-id, break GSAP timeline syntax, or use incorrect attribute names. We solved this by moving HTML generation server-side: generate_composition_html takes structured scene data and produces output that is valid by construction.
  • Asset path resolution: HyperFrames' HTTP server is rooted at the project directory, so ../assets/ paths 404'd. We now copy assets into the project's own assets/ subdirectory and reference them with relative paths.
  • 429 rate limits: Vertex AI throttled our requests aggressively. We layered two retry mechanisms (a decorator on individual tool calls and ADK's retry_config on every agent) with exponential backoff, jitter, and invocation-ID-based resumption.
  • Session isolation: Early versions used a single env var for session paths, which broke under concurrent ADK sessions. Refactoring every tool to accept ToolContext fixed this.
  • ADK web vs CLI: The pipeline auto-progresses in CLI mode, but the ADK web UI required careful instruction design so agents wouldn't pause asking the user for confirmation at each step.
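
The asset path fix described above can be sketched in a few lines. This is a minimal sketch, assuming a flat assets/ folder under the project directory; the `stage_asset` helper name is hypothetical.

```python
import shutil
from pathlib import Path


def stage_asset(source: Path, project_dir: Path) -> str:
    """Copy an asset into project_dir/assets and return its relative path.

    The returned path is relative to project_dir, so it resolves under an
    HTTP server rooted there (unlike a ../assets/ path, which points
    outside the server root).
    """
    assets_dir = project_dir / "assets"
    assets_dir.mkdir(parents=True, exist_ok=True)
    target = assets_dir / source.name
    shutil.copy2(source, target)
    # as_posix() keeps forward slashes so the path is usable in HTML.
    return target.relative_to(project_dir).as_posix()
```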

Accomplishments that we're proud of

  • End-to-end autonomy: A single prompt produces a complete video — no human in the loop between topic and MP4.
  • Observability loop: The pipeline doesn't just emit traces; it reads them back to evaluate its own quality and inform future runs.
  • Resilient design: The 429 retry with invocation-ID resumption means long pipelines survive rate-limit storms without restarting.
  • Clean architecture: 7 specialist agents, each with a focused skill and tool set, orchestrated by a lightweight root agent. Adding or swapping agents is straightforward.
  • Arize track coverage: All six requirements met — code-owned runtime (ADK), OpenInference instrumentation, Phoenix Cloud traces, MCP introspection, LLM-as-a-Judge evaluations, and the bonus self-improvement via trace queries.

What we learned

  • Auto-instrumentation is magic: phoenix.otel.register(auto_instrument=True) instruments ADK, GenAI, and HTTP calls with zero manual span creation.
  • LLMs can't generate valid HyperFrames HTML reliably: The composition DSL (GSAP timelines, data-composition-id, scene attributes) is too structured for freeform generation. A server-side builder function with validated output is the right approach.
  • Retry is a systems problem, not just a code problem: Decorators catch individual failures, but ADK's retry_config + run_async with invocation_id handle agent-level failures that span multiple tool calls and sub-agent transfers.
  • Session isolation matters early: Refactoring from env-var-based to context-based session paths midway was more painful than doing it right upfront.
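
To show what a server-side builder looks like in principle, here is a minimal sketch that turns structured scene data into markup. Only the data-composition-id attribute and the use of GSAP timelines come from our pipeline; the `Scene` fields, the element layout, the fade timings, and the assumption that GSAP is already loaded by the host page are illustrative, and the real HyperFrames schema may differ.

```python
import html
import json
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Scene:
    image: str       # path relative to the project root, e.g. "assets/a.png"
    caption: str     # used as alt text
    start: float     # seconds from the beginning of the video
    duration: float  # seconds the scene stays visible


def generate_composition_html(composition_id: str, scenes: list[Scene]) -> str:
    """Build composition markup from validated scene data (sketch only)."""
    if not composition_id:
        raise ValueError("composition_id is required")
    for s in scenes:
        path = PurePosixPath(s.image)
        if path.is_absolute() or ".." in path.parts:
            raise ValueError(f"image must be a project-relative path: {s.image}")
        if s.start < 0 or s.duration <= 0:
            raise ValueError("scene timing must be non-negative and non-empty")

    parts, steps = [], []
    for i, s in enumerate(scenes):
        scene_id = f"scene-{i}"
        parts.append(
            f'  <section id="{scene_id}" class="scene" style="opacity:0">\n'
            f'    <img src="{html.escape(s.image, quote=True)}" '
            f'alt="{html.escape(s.caption, quote=True)}">\n'
            "  </section>"
        )
        # Only generated ids and numbers reach the script, never raw text.
        steps.append({"target": f"#{scene_id}", "start": s.start,
                      "end": s.start + s.duration})

    script = (
        "const tl = gsap.timeline();\n"
        f"for (const s of {json.dumps(steps)}) {{\n"
        "  tl.to(s.target, {opacity: 1, duration: 0.5}, s.start);\n"
        "  tl.to(s.target, {opacity: 0, duration: 0.5}, s.end);\n"
        "}"
    )
    body = "\n".join(parts)
    return (
        f'<div data-composition-id="{html.escape(composition_id, quote=True)}">\n'
        f"{body}\n"
        f"  <script>\n{script}\n  </script>\n"
        "</div>\n"
    )
```

The point of the design is that the model only supplies data (image paths, captions, timings), while every attribute name and every line of timeline code comes from a template that cannot drift.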

What's next for HyperFrames Agent

  • Multi-modal evaluation: Score outputs on visual quality (image coherence, animation smoothness) using a vision-language model judge.
  • A/B testing via Phoenix experiments: Run multiple prompt/parameter variants and compare trace-grounded evaluation scores.
  • Persistent learning: Store evaluation results in a dataset and fine-tune agent instructions based on what scores highest.
  • Voice customization: Support multiple TTS voices and languages.
  • Batch production: Queue multiple topics and produce a playlist of videos.
  • Web UI: A dashboard to view past runs, trace visualizations, and evaluation scores — all powered by Phoenix queries.
