Inspiration & Challenge

We recently built Storytopia, a fun, multi-agent AI application that helps make screen time less passive and more creative for children. A major problem we noticed is that kids spend too much time passively consuming content online, instead of being authors of their own ideas. Storytopia lets kids draw their favorite characters and watch them come to life in an interactive, 8-scene storybook – complete with questions that encourage critical thinking and learning. Parents can also guide the learning experience by personalizing lesson themes like kindness, values, and real-life skills. The app uses a coordinated system of 3 AI agents built with the Google Agents SDK, powered by Gemini and Imagen models via Vertex AI for text and image generation. After hosting our app on Google Cloud Run, we were very excited to see kids & parents genuinely enjoy creating their own drawings & stories.

But as the AI engineers behind Storytopia, we faced a new challenge in monitoring the performance of our agents. Traditional metrics told us the app was “healthy,” but they couldn't answer the questions we were actually curious about:

  • Are the AI agents in our app always producing high-quality & safe outputs across different kids’ drawings and stories?
  • Where do the AI agents struggle? Are there any specific drawings / themes that the LLMs find difficult to analyze or generate?

We wanted a way to understand whether Storytopia was truly delivering the experience we promised – not just whether APIs were responding or servers were up. That curiosity led us to Datadog LLM Observability.

What we built

We built a Datadog-powered observability layer into our existing app Storytopia using Datadog’s LLM Observability SDK to capture rich telemetry on our AI agents.

What makes our observability approach unique is that we go beyond standard LLM telemetry – such as cost, latency, errors, and request traces – to also stream custom AI Agent Evaluation Telemetry that reflects how our multi-agent system behaves in production. These 5 unique signals are: Creative Intent, Inappropriate Content, Lesson Alignment, Visual Consistency, and Story Narration Status. These observability signals are crucial as they help us measure agent quality, safety indicators, and creative consistency across different user drawings and story inputs. On top of these signals, we configured custom Datadog alert monitors that trigger when evaluation metrics degrade, with actionable case context to quickly investigate and respond to issues. We put all these metrics together in a custom Datadog dashboard for Storytopia, (View here) with multiple widgets to monitor LLM operational insights and agent-level performance evaluations. For more technical details on our implementation, see below!

Instrumentation and Tracing

We instrumented the Storytopia backend using Datadog’s LLM Observability SDK alongside Datadog’s Python tracing library (ddtrace) to establish end-to-end visibility across our FastAPI service. Every user request is traced as it flows through API handling, multi-agent orchestration, LLM calls, and tool executions, providing a complete view of application behavior.

Storytopia LLM Metrics Dashboard

We were able to capture & stream to our dashboard:

  • LLM request latency, token counts, and error rates
  • LLM call timing and cost signals (Computing average cost per story created)
  • End-to-end traces spanning our 3 AI agents (Visionizer, Quest Creator and Illustrator agents) with insights on prompt inputs, outputs, failure cases, etc.

Our Strategy: 5 Custom LLM Evaluation Signals for Smarter Agent Monitoring

As an innovative component, we designed 5 custom, externally computed LLM evaluation signals that capture how our AI agents perform beyond traditional system metrics. These signals are: Creative Intent, Inappropriate Content, Lesson Alignment, Visual Consistency, and Story Narration Status. These evaluations are computed by invoking an external evaluation system (AgentOps), which analyzes our agent outputs and returns structured scores with human-readable explanations. These evaluations are attached to active trace spans using Datadog’s LLM Observability SDK and streamed as first-class telemetry into Datadog.

Each evaluation is defined by:

  • A clear label describing what is being measured
  • A normalized metric value (typically a score between 0 and 1, or a binary 0/1 flag)
  • A pass/fail assessment derived from configurable thresholds
  • Contextual metadata tying the evaluation back to a specific agent, task, and user interaction

Rationale for Custom Evaluation Metrics

By correlating these 5 unique signals – Creative Intent, Inappropriate Content, Lesson Alignment, Visual Consistency, and Story Narration Status – with request traces, we can monitor silent quality and safety failures, understand where specific agents struggle in our Multi-Agent pipeline.

Custom Alert Monitors with Actionable Context

We built 7 custom Datadog alert monitors directly on top of Storytopia’s custom LLM evaluation signals to detect degradations in AI agent behavior in near real time. Each monitor includes actionable context, such as explicit descriptions of what signal degraded, why it matters to the user experience, and guided next steps (e.g., reviewing recent LLM traces, validating prompt changes, inspecting safety false positives, or identifying high-cost agent loops).

Criteria for Each of Our Detection Rules

Monitor Name Detection Criteria Purpose / Impact
Inappropriate Content Detection Spike Avg inappropriate_content_flag0.6 over 5 minutes Surfaces Visionizer safety failures, false positives, or prompt abuse early to protect child safety
Lesson Prompt Drift Detection Avg lesson_alignment_score < 0.6 over 5 minutes Ensures agents remain aligned with educational objectives, not just creative output
Character Visual Consistency Degradation Avg illustrator_consistency < 0.7 over 5 minutes Catches character identity drift across scenes to preserve narrative continuity
Story Voice (TTS) Agent Failure tts_status success rate < 50% Detects narration failures caused by TTS errors, quota exhaustion, or queue issues
Character Description Quality Regression Avg creative_intent_score < 0.5 over 5 minutes Flags low-fidelity character descriptions that reduce immersion and visual grounding
Average Cost Per Story Spike Computed cost per story > $1.80 (rolling 5 minutes) Prevents budget overruns from token bloat, agent loops, or unintended model upgrades
LLM Response Latency Spike Avg LLM span duration > 18s Catches performance regressions that degrade the real-time storytelling experience

Alert Monitors in Datadog

Together, this approach turns LLM observability from passive monitoring into an active control layer – enabling fast, precise interventions and continuous improvement of our Storytopia multi-agent AI system.

Challenges we ran into

Instrumenting a multi-agent generative AI system and attaching custom external evaluations to the right trace spans was challenging, especially when navigating the SDK and figuring out where in the workflow each evaluation should be emitted. We found Datadog’s GitHub resources + Jupyter notebook demos for the LLM Observability SDK super helpful! We taught ourselves how to submit external evaluations, attach them to active traces, and verify that the resulting telemetry appeared correctly across our Datadog dashboard and traces.

Looking Ahead

Storytopia is now running in production, and this observability strategy gives us the confidence to monitor how our AI agents behave in real user scenarios – not just in controlled tests. By detecing silent quality and safety issues early, we can continuously refine prompts, improve agent coordination, and raise the overall experience for kids and parents using our app! Try out Storytopia today - Open Storytopia Web App (Best viewed on Desktop or iPad as mobile layout not yet fully optimized).

🔗 View Storytopia's Datadog Dashboard
https://p.us5.datadoghq.com/sb/c7d3a3fb-e381-11f0-84dd-828a698ac737-4c196bb0c4c7eefb0add01f19bff8a5f

Built With

Share this project:

Updates