Inspiration
66% of organizations want AI that learns from feedback, yet few have a systematic pipeline to make that happen.
When LLM agents fail in production:
- Incidents get investigated, then forgotten
- The same failures repeat weeks later
- Eval suites don't grow from real-world failures
- Runbooks don't exist for LLM-specific failure modes
- Teams fight the same fires over and over
What it does
Incident-to-Insight Loop closes the feedback gap by automatically transforming every Datadog LLM trace failure into three actionable outputs:
- Eval Test Cases – Reproducible tests ready to add to CI/CD
- Guardrail Rules – Suggested rules to prevent recurrence
- Runbook Entries – Structured diagnosis and remediation steps
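The first of these outputs, the eval test case, is the canonical artifact the rest of the pipeline consumes. A hypothetical sketch of its shape as a Pydantic model; the field names are illustrative, not the project's actual schema:

```python
# Hypothetical sketch of a canonical eval test case; field names are
# illustrative, not the project's actual schema.
from pydantic import BaseModel


class EvalTestCase(BaseModel):
    """One reproducible test case derived from a failed LLM trace."""
    case_id: str                  # stable ID for dedup and lineage
    source_trace_ids: list[str]   # Datadog trace IDs this case came from
    failure_type: str             # e.g. "hallucination", "prompt_injection"
    severity: str                 # "critical" | "high" | "medium"
    input_prompt: str             # sanitized prompt that triggered the failure
    expected_behavior: str        # what a passing response must satisfy
    assertion: str                # machine-checkable criterion for CI/CD
    status: str = "pending"       # pending -> approved -> exported


case = EvalTestCase(
    case_id="etc-001",
    source_trace_ids=["trace-abc123"],
    failure_type="hallucination",
    severity="high",
    input_prompt="Summarize the attached contract.",
    expected_behavior="The summary only cites clauses present in the source document.",
    assertion="no unsupported claims relative to the provided context",
)
print(case.model_dump_json(indent=2))
```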
How we built it
- Python 3.11 monorepo with multiple small services
- FastAPI services (ingestion, extraction, deduplication, generators, approval workflow)
- Datadog LLM Observability as the production signal source
- Google Cloud Run as the stateless deployment target
- Firestore as the shared system of record for traces → patterns → suggestions
- Vertex AI (Gemini) for structured pattern extraction and content generation
- Vertex AI Embeddings for similarity-based deduplication
- OpenAPI contracts per service (stored in specs/*/contracts/)
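The embeddings item above is what powers deduplication. A minimal sketch of how similarity-based dedup could work, assuming the Vertex AI Python SDK's TextEmbeddingModel; the project ID, model name, and 0.9 threshold are illustrative:

```python
# Minimal sketch of similarity-based deduplication, assuming the Vertex AI
# Python SDK's TextEmbeddingModel; project, model name, and threshold are illustrative.
import numpy as np
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholder project
model = TextEmbeddingModel.from_pretrained("text-embedding-004")


def embed(text: str) -> np.ndarray:
    """Return the embedding vector for one pattern summary."""
    return np.array(model.get_embeddings([text])[0].values)


def is_duplicate(new_pattern: str, existing_patterns: list[str], threshold: float = 0.9) -> bool:
    """Treat a new failure pattern as a duplicate if it is close to any known one."""
    new_vec = embed(new_pattern)
    for existing in existing_patterns:
        vec = embed(existing)
        cosine = float(np.dot(new_vec, vec) / (np.linalg.norm(new_vec) * np.linalg.norm(vec)))
        if cosine >= threshold:
            return True
    return False
```

In the real service, embeddings for known patterns would be stored alongside them in Firestore rather than recomputed per comparison, which is part of keeping the pipeline cost-conscious.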
Detection Rules (Code-Based)
We implemented code-based detection rules in src/ingestion/datadog_client.py that classify LLM failures by analyzing trace attributes:
- guardrail_failure (Critical) – Tag contains "guardrail" + "fail"
- prompt_injection (Critical) – Tag contains "prompt_injection"
- runaway_loop (Critical) – Tag contains "runaway" + "loop"
- toxicity (High) – Tag contains "toxicity"
- hallucination (High) – Tag contains "hallucination"
- infrastructure_error (High) – HTTP status >= 500
- client_error (Medium) – HTTP status >= 400
- quality_degradation (Medium/High) – Quality score below threshold
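A minimal sketch of this classification logic; the trace attribute names are simplified for illustration, and the actual rules live in src/ingestion/datadog_client.py:

```python
# Sketch of the tag/status-based classification described above; trace
# attributes are passed in simplified form for illustration.
from typing import Optional


def classify_failure(tags: list[str], http_status: Optional[int] = None,
                     quality_score: Optional[float] = None,
                     quality_threshold: float = 0.7) -> Optional[tuple[str, str]]:
    """Return (failure_type, severity) for a trace, or None if it looks healthy."""
    joined = " ".join(tags).lower()

    # Critical: explicit safety / control-flow failures signalled via tags.
    if "guardrail" in joined and "fail" in joined:
        return ("guardrail_failure", "critical")
    if "prompt_injection" in joined:
        return ("prompt_injection", "critical")
    if "runaway" in joined and "loop" in joined:
        return ("runaway_loop", "critical")

    # High: content-quality and server-side failures.
    if "toxicity" in joined:
        return ("toxicity", "high")
    if "hallucination" in joined:
        return ("hallucination", "high")
    if http_status is not None and http_status >= 500:
        return ("infrastructure_error", "high")

    # Medium: client errors and degraded quality scores
    # (the real rules may escalate very low scores to high severity).
    if http_status is not None and http_status >= 400:
        return ("client_error", "medium")
    if quality_score is not None and quality_score < quality_threshold:
        return ("quality_degradation", "medium")

    return None
```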
Data Source: AgentErrorBench dataset traces ingested into Datadog for realistic LLM failure demonstrations.
Datadog App Builder: Interactive approval workflow dashboard for human-governed suggestion review.
Challenges we ran into
- Turning noisy real-world traces into consistent structured patterns without overfitting to one example
- Keeping the system cost-conscious (timeouts, batching, budgets, avoiding unnecessary LLM calls)
- Preserving an end-to-end evidence trail while sanitizing sensitive data
- Designing a workflow that is human-governed: suggestions must be reviewable and explicitly approved before export
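On the sanitization point, an illustrative redaction helper: the regexes here are examples rather than the pipeline's actual rules, and the intent is to mask sensitive values while leaving trace IDs and structure intact for the evidence trail.

```python
# Illustrative redaction helper; these patterns are examples, not exhaustive.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),                    # email addresses
    (re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b"), "<SECRET>"),  # API-key-like strings
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),                        # US SSN pattern
]


def sanitize(text: str) -> str:
    """Mask sensitive values while keeping the rest of the trace content intact."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


print(sanitize("Contact jane@example.com with token sk-1234567890abcdef1234"))
```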
Accomplishments that we're proud of
- An end-to-end pipeline from incident → pattern → deduped suggestion with lineage to source traces
- A working approval workflow API with atomic status transitions and export endpoints
- A modular architecture with separate services by concern, making the system easy to extend
- Clear, judge-friendly documentation and contracts to support local runs and future iteration
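As an example of the atomic status transitions mentioned above, a sketch using a Firestore transaction; the collection and field names are illustrative, not the service's actual schema:

```python
# Sketch of an atomic status transition for the approval workflow, assuming a
# Firestore "suggestions" collection; names are illustrative.
from google.cloud import firestore

db = firestore.Client()


@firestore.transactional
def transition_status(transaction, suggestion_ref, expected: str, new: str) -> None:
    """Move a suggestion between workflow states only if it is still in the expected one."""
    snapshot = suggestion_ref.get(transaction=transaction)
    current = snapshot.get("status")
    if current != expected:
        raise ValueError(f"Cannot move {snapshot.id} from {current!r}; expected {expected!r}")
    transaction.update(suggestion_ref, {
        "status": new,
        "updated_at": firestore.SERVER_TIMESTAMP,
    })


# Usage: approve a pending suggestion exactly once, even with concurrent reviewers.
ref = db.collection("suggestions").document("sg-001")
transition_status(db.transaction(), ref, expected="pending", new="approved")
```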
What we learned
- Observability data is incredibly powerful, but only if you convert it into repeatable, testable artifacts
- Structured LLM outputs become much more reliable when you combine schemas, validation, and good prompts
- Human-in-the-loop design isn't a slowdown; it's a safety feature that makes automation trustworthy
What's next for EvalForge
- Add guardrail + runbook generators to the local stack for true end-to-end demos
- Expand export formats (DeepEval/pytest adapters) while keeping a canonical framework-agnostic JSON source
- Improve dashboard automation: scheduled metrics publishing to Datadog for backlog visibility
- Add tenant isolation + retention policies for safer production use
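For the planned pytest adapter, one possible shape is to load the canonical JSON cases and parametrize a test per case; the file name, agent call, and assertion handling are placeholders, not existing project APIs:

```python
# Hypothetical pytest adapter over the canonical, framework-agnostic JSON cases.
import json
import pathlib

import pytest

# Path is illustrative; in practice this would be the exported canonical file.
CASES = json.loads(pathlib.Path("eval_cases.json").read_text())


def call_agent(prompt: str) -> str:
    """Placeholder for the agent under test; replace with a real client call."""
    return "stub response"


@pytest.mark.parametrize("case", CASES, ids=lambda c: c["case_id"])
def test_generated_case(case):
    response = call_agent(case["input_prompt"])
    # A real adapter would apply the case's machine-checkable assertion; here we
    # only check that the agent produced a non-empty response.
    assert response.strip(), case["assertion"]
```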