Inspiration
Short-form video is the dominant content format today, but producing even a 15-second clip requires juggling research, scripting, asset generation, voiceover, and video assembly. We wanted to see if we could compress that entire workflow into a single natural-language prompt — type a topic, get a video. The Arize AI Observability track gave us the perfect excuse to make the pipeline not just autonomous, but self-improving through traced evaluations.
What it does
HyperFrames Agent is a multi-agent video production pipeline built on Google ADK. Given a topic (e.g. "create a 15-second video about the solar system"), it:
- Researches the topic using the model's training data
- Proposes a structured video concept
- Writes a narration script
- Plans scenes with visual descriptions
- Generates images via Google Imagen and narration via Cloud Text-to-Speech
- Reviews and refines all content
- Composes a HyperFrames HTML document with GSAP animations and renders it to MP4
After rendering, it evaluates its own output with an LLM-as-a-Judge and logs quality scores to Arize Phoenix — closing the loop between production and observability.
How we built it
- Agent framework: Google ADK with a root orchestrator delegating to 7 specialist sub-agents (research, proposal, script, scene plan, assets, edit, compose)
- Image generation: Google Imagen (
imagen-3.0-generate-002) via the Vertex GenAI SDK - Audio: Google Cloud Text-to-Speech (
en-US-Neural2-J) - Video rendering: HyperFrames v0.6.91 — generates HTML5 compositions with GSAP timelines and renders them via Chrome + FFmpeg
- Observability: OpenInference auto-instrumentation for ADK and GenAI, sending traces to Arize Phoenix Cloud
- Evaluations: LLM-as-a-Judge (same GenAI model) scores each pipeline run on correctness, completeness, and relevance; scores are logged back to Phoenix as span evaluations
- Resilience: Exponential backoff with jitter on Vertex AI 429 rate limits, plus ADK's built-in retry config
- MCP: Phoenix MCP server configured so agents can introspect their own traces at runtime
Every tool function is ToolContext-aware, routing all file outputs to session-scoped directories.
Challenges we ran into
- HyperFrames HTML structure: The LLM could never generate valid HyperFrames compositions — it would omit
data-composition-id, break GSAP timeline syntax, or use incorrect attribute names. We solved this by moving HTML generation server-side:generate_composition_htmltakes structured scene data and produces guaranteed-valid output. - Asset path resolution: HyperFrames' HTTP server is rooted at the project directory, so
../assets/paths 404'd. Assets must be copied into the project's ownassets/subdirectory with relative paths. - 429 rate limits: Vertex AI aggressively throttles. We layered two retry mechanisms — a decorator on individual tool calls and ADK's
retry_configon every agent — with exponential backoff, jitter, and invocation-ID-based resumption. - Session isolation: Early versions used a single env var for session paths, which broke under concurrent ADK sessions. Refactoring every tool to accept
ToolContextfixed this. - ADK web vs CLI: The pipeline auto-progresses in CLI mode, but the ADK web UI required careful instruction design so agents wouldn't pause asking the user for confirmation at each step.
Accomplishments that we're proud of
- End-to-end autonomy: A single prompt produces a complete video — no human in the loop between topic and MP4.
- Observability loop: The pipeline doesn't just emit traces; it reads them back to evaluate its own quality and inform future runs.
- Resilient design: The 429 retry with invocation-ID resumption means long pipelines survive rate-limit storms without restarting.
- Clean architecture: 7 specialist agents, each with a focused skill and tool set, orchestrated by a lightweight root agent. Adding or swapping agents is straightforward.
- Arize track coverage: All six requirements met — code-owned runtime (ADK), OpenInference instrumentation, Phoenix Cloud traces, MCP introspection, LLM-as-a-Judge evaluations, and the bonus self-improvement via trace queries.
What we learned
- Auto-instrumentation is magic:
phoenix.otel.register(auto_instrument=True)instruments ADK, GenAI, and HTTP calls with zero manual span creation. - LLMs can't generate valid HyperFrames HTML reliably: The composition DSL (GSAP timelines,
data-composition-id, scene attributes) is too structured for freeform generation. A server-side builder function with validated output is the right approach. - Retry is a systems problem, not just a code problem: Decorators catch individual failures, but ADK's
retry_config+run_asyncwithinvocation_idhandle agent-level failures that span multiple tool calls and sub-agent transfers. - Session isolation matters early: Refactoring from env-var-based to context-based session paths midway was more painful than doing it right upfront.
What's next for HyperFrames Agent
- Multi-modal evaluation: Score outputs on visual quality (image coherence, animation smoothness) using a vision-language model judge.
- A/B testing via Phoenix experiments: Run multiple prompt/parameter variants and compare trace-grounded evaluation scores.
- Persistent learning: Store evaluation results in a dataset and fine-tune agent instructions based on what scores highest.
- Voice customization: Support multiple TTS voices and languages.
- Batch production: Queue multiple topics and produce a playlist of videos.
- Web UI: A dashboard to view past runs, trace visualizations, and evaluation scores — all powered by Phoenix queries.

Log in or sign up for Devpost to join the conversation.