StudioFlow
Inspiration
Modern streaming platforms process thousands of high-fidelity media assets daily. Behind the scenes, these are massive event-driven pipelines—when they break, they fail silently, drop frames, or quietly exhaust memory. For Site Reliability Engineers (SREs), waking up at 3:00 AM to dig through traces to find why a 4K transcode job failed is an exhausting, manual process.
We challenged ourselves to answer: What if an AI could not just summarize logs, but actively reason over distributed traces, identify the root cause, and propose a fix—while keeping the human firmly in control?
This inspired StudioFlow, an exploration into autonomous SRE operations using Gemini 3.1 Pro and Dynatrace observability.
What it does
StudioFlow consists of two parts:
- The Substrate: A fully simulated, event-driven media production pipeline (Ingest → Transcode → Enrichment → Review → Publish) running as microservices on Google Cloud Run.
- The Agent: Built with the Google Agent Development Kit (ADK), this SRE-copilot agent utilizes the Dynatrace Remote MCP Server to act as its "eyes."
When an incident occurs (e.g., our encode service starts leaking memory under high concurrency), the StudioFlow Agent automatically fields the alert, queries Dynatrace for the failing service, correlates the incident with recent Git commits, and drafts a remediation plan (e.g., "Rollback commit and scale memory"). Critically, the agent halts execution at a Human Approval Gate, executing the final infrastructure changes only once an operator clicks 'Approve' in the Studio UI.
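The alert-to-approval flow described above can be sketched roughly as follows. The tool names and signatures here are illustrative, not the actual ADK interfaces:

```python
# Minimal sketch of StudioFlow's incident loop. The `telemetry`, `git`, and
# `gate` objects stand in for the real Dynatrace MCP, Git reader, and
# HumanApprovalGate tools; their methods are hypothetical.

def handle_alert(alert, telemetry, git, gate):
    """Diagnose an incident and propose a fix, pausing for human approval."""
    # 1. Pull traces for the failing service from the observability backend.
    traces = telemetry.query(service=alert["service"], window="30m")

    # 2. Correlate the failure window with recent commits.
    commits = git.recent_commits(service=alert["service"], limit=5)

    # 3. Draft a remediation plan from the collected evidence.
    plan = {
        "action": "rollback_and_scale",
        "commit": commits[0]["sha"],
        "memory": "2Gi",
        "evidence": [t["trace_id"] for t in traces if t.get("error")],
    }

    # 4. Block until an operator approves in the UI; never auto-execute.
    if gate.request_approval(plan):
        return plan  # approved: the caller executes the change
    return None      # rejected: no infrastructure change happens
```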
How we built it
We built StudioFlow to mirror a real-world enterprise stack:
- The Pipeline: We wrote microservices in Python (using FastAPI and real ffmpeg for transcoding). Services communicate asynchronously via Pub/Sub and persist state in Firestore and Cloud Storage. We built a sleek Next.js frontend to visualize the pipeline flow.
- The Observability: Instead of mocking data, we fully instrumented each microservice using the opentelemetry-sdk and OTLP exporters. Every trace, span, and span event flows directly into Dynatrace.
- The Brain: We used Gemini 3.1 Pro via Vertex AI as our core reasoning engine. We chose 3.1 Pro specifically for its high-level reasoning capabilities over complex JSON trace data.
- The Agent Framework: Using the open-source Google ADK, we equipped the agent with four main tools:
- The Dynatrace MCP Server (for telemetry and DQL queries).
- A Pipeline Control API.
- A Git history reader.
- A custom HumanApprovalGate tool.
Challenges we ran into
- Generating "Authentic" Failures: AI agents are only as impressive as the problems they solve. We couldn't just mock a failure; we had to engineer a deliberately fragile Transcode service. We scripted real bottlenecks (memory leaks during large 4K processes) so OpenTelemetry would capture genuine OOM (Out Of Memory) signals and cascading latency spikes.
- Implementing the Human Gate: A major constraint was keeping the operator in control. If an LLM decides to delete a service, it shouldn't just run. Implementing the HumanApprovalGate required creating a tool in the ADK that blocked via a Pub/Sub callback, pausing the agent's execution loop until the Next.js UI registered an operator's approval.
- Wrangling Traces via MCP: Ensuring the agent consistently filtered massive troves of trace data required precise prompting and constraining the timeframe bounds of the Dynatrace DQL queries so it wouldn't hit token limits.
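The blocking behavior of the approval gate can be modeled in a few lines. This is a simplified sketch, with a `threading.Event` standing in for the Pub/Sub callback the real tool waits on:

```python
import threading

class HumanApprovalGate:
    """Sketch of the approval gate: the agent's tool call blocks until a
    UI-side callback fires (modeled here with threading.Event; the actual
    implementation waits on a Pub/Sub subscription)."""

    def __init__(self, timeout_s: float = 900.0):
        self._decision = False
        self._event = threading.Event()
        self._timeout_s = timeout_s

    def on_operator_decision(self, approved: bool) -> None:
        # Invoked by the message callback when the UI registers a click.
        self._decision = approved
        self._event.set()

    def request_approval(self, plan: dict) -> bool:
        # Tool entry point: in the real system the plan is published to the
        # UI here; then the agent's execution loop pauses until a decision
        # arrives or the timeout elapses (timeout counts as rejection).
        approved = self._event.wait(timeout=self._timeout_s) and self._decision
        self._event.clear()
        return approved
```

An operator click in the UI would arrive on another thread, e.g. `gate.on_operator_decision(True)`, which unblocks the waiting tool call.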
Accomplishments that we're proud of
- Bridging standard OTel with Agentic AI: Watching the agent autonomously fetch a specific Dynatrace trace ID, read the span exception (oom_killed=true), and accurately diagnose the pipeline bottleneck felt like a glimpse into the future of operations.
- The Human Approval Loop: The agent interface doesn't feel like a standard chatbot; it feels like a professional incident timeline. The UI renders the agent's structured JSON decisions as clear "Action Proposed" cards that operators can confidently approve.
What we learned
- Model Context Protocol (MCP) is a game-changer. By utilizing the partner Dynatrace MCP server, we saved hours of writing custom API wrappers and authentication headers. We handed the server keys to the agent, and it immediately knew how to query our telemetry.
- Reasoning over Chatting: Gemini 3.1 Pro excels when it's given clear personas and operating rules ("Diagnose before acting," "Cite evidence"). We learned that prompting for deterministic SRE actions is wildly different from prompting for free-form text generation.
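The operating rules above can be captured as a system instruction. The wording here is illustrative, not the production prompt:

```python
# Illustrative system instruction for the SRE agent (not the actual prompt);
# it encodes the persona and operating rules quoted above.
SRE_SYSTEM_PROMPT = """\
You are an SRE copilot for an event-driven media pipeline.
Operating rules:
1. Diagnose before acting: query telemetry before proposing any change.
2. Cite evidence: every claim must reference a trace ID or span attribute.
3. Never execute infrastructure changes without explicit human approval.
4. Keep DQL queries bounded to the incident window to limit token usage.
Output remediation plans as structured JSON.
"""
```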
What's next for StudioFlow
In the real world, production is messy. We plan to expand StudioFlow's toolset to allow the agent to draft and send incident post-mortems via email, automatically draft Jira tickets for engineering, and simulate dynamic traffic rerouting during regional outages.
