Inspiration

From booking flights to writing code, AI agents are increasingly part of consumers' daily lives. When they fail, however, they fail silently: you might get a wrong answer, a crashed run, or a soaring API bill, with no indication of why. Traditional logging tools were built for engineers, not for someone who just wants to know why their agent failed.

We wanted to build a tool that makes agent behavior observable to the average consumer. That idea became Tenor.

What it does

Tenor is a real-time observability platform for AI agents. It visualizes agent execution traces as a live graph, where every LLM call, tool invocation, planning step, and error appears as a node the moment it happens.

Key capabilities include:

Live Execution Graph: a React Flow-powered DAG that builds in real time as your agent runs, with nodes color-coded by step type and status

Step Inspector: click any node to see the full prompt, completion, tool arguments, token counts, cost, and error details

Run Explorer: a searchable, filterable table of all historical runs across your agent fleet

AI Optimization: a Claude-powered analysis engine that ingests historical run data, computes fleet-wide statistics, and returns a structured optimization report with a 0–100 score, model swap recommendations, and automation opportunities

How we built it

The stack is split cleanly between a Python backend and a TypeScript frontend.

Backend (FastAPI):

REST endpoints handle run and step creation using a universal data contract. Any agent that can POST JSON can integrate with Tenor.

A WebSocket manager streams step events to connected clients in real time as they're emitted

An analytics engine processes historical CSV data (1,000 runs, 7,388 steps) using pure Python, computing per-scenario stats, per-model cost breakdowns, and error hotspot detection

The optimization endpoint sends the pre-computed analytics summary to Claude via the Anthropic SDK and returns a structured JSON report, cached for 10 minutes
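
As a rough illustration of the pure-Python analytics pass described above, the sketch below aggregates cost per model and error rate per step type from CSV rows. The column names (`run_id`, `step_type`, `model`, `cost_usd`, `status`), the sample rows, and the `summarize_steps` helper are all assumptions made for this example; they are not Tenor's actual schema or code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names and sample rows, for illustration only.
SAMPLE_CSV = """run_id,step_type,model,cost_usd,status
r1,llm_call,model-a,0.010,ok
r1,tool_call,,0.000,error
r2,llm_call,model-b,0.002,ok
r2,llm_call,model-a,0.020,error
"""


def summarize_steps(csv_text):
    """Compute per-model cost totals and per-step-type error rates."""
    cost_by_model = defaultdict(float)
    totals = defaultdict(int)
    errors = defaultdict(int)

    for row in csv.DictReader(io.StringIO(csv_text)):
        # Only steps that name a model contribute to the cost breakdown.
        if row["model"]:
            cost_by_model[row["model"]] += float(row["cost_usd"])
        totals[row["step_type"]] += 1
        if row["status"] == "error":
            errors[row["step_type"]] += 1

    # Step types with a high error rate are candidate "hotspots".
    error_rate = {t: errors[t] / totals[t] for t in totals}
    return {
        "cost_by_model": {m: round(c, 6) for m, c in cost_by_model.items()},
        "error_rate_by_step_type": error_rate,
    }


summary = summarize_steps(SAMPLE_CSV)
```

A summary dictionary of this kind is small enough to pass to a language model in place of the raw rows, which is the general idea behind sending a pre-computed summary rather than the full dataset.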

Frontend (Next.js 14):

React Flow renders the live DAG with dagre for automatic hierarchical layout

TanStack Query manages data fetching and cache invalidation

A custom WebSocket hook subscribes per-run and patches new nodes into the graph state

shadcn/ui and Tailwind CSS handle the component library and styling, with full dark mode support

The data model is Postgres-compatible from day one, with an in-memory store used for the demo to keep the setup friction-free.
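
To make the idea of a shared step contract backed by an in-memory store more concrete, here is a minimal sketch. The `Step` fields, the `InMemoryStore` class, and the `step_from_json` helper are hypothetical names chosen for this example, not the actual Tenor data model; the flat, typed fields are simply one shape that would map naturally onto a relational table later.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Step:
    # Hypothetical fields; the real contract is not shown in this writeup.
    run_id: str
    step_id: str
    step_type: str  # e.g. "llm_call", "tool_call", "plan", "error"
    status: str
    parent_id: Optional[str] = None  # edge to the previous node in the graph
    tokens: int = 0
    cost_usd: float = 0.0
    error: Optional[str] = None


class InMemoryStore:
    """Keeps steps grouped by run; stands in for a database in a demo."""

    def __init__(self):
        self._steps = {}

    def add_step(self, step):
        self._steps.setdefault(step.run_id, []).append(step)

    def steps_for(self, run_id):
        # Return a copy so callers cannot mutate the store's internal list.
        return list(self._steps.get(run_id, []))


def step_from_json(payload):
    """Parse a JSON body into a Step; unknown keys raise TypeError."""
    return Step(**json.loads(payload))


store = InMemoryStore()
payload = (
    '{"run_id": "r1", "step_id": "s1", "step_type": "llm_call", '
    '"status": "ok", "tokens": 120, "cost_usd": 0.004}'
)
store.add_step(step_from_json(payload))
first = asdict(store.steps_for("r1")[0])
```

Because the store only exposes `add_step` and `steps_for`, a database-backed class with the same two methods could replace it without changing callers.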

Challenges we ran into

Structuring the Claude prompt for the optimization report took significant iteration. Getting Claude to return consistent, well-typed JSON across varied fleet compositions, without hallucinating metric values, required tight prompt engineering and explicit schema definitions in the system prompt.
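
One complementary safeguard, sketched here under assumptions, is to validate the model's JSON against the expected shape before returning it to the client. The field names (`score`, `model_swaps`, `automation_opportunities`) and the `validate_report` helper are illustrative guesses based on the report contents described earlier, not the project's actual schema.

```python
import json

# Hypothetical report shape: field name -> required Python type.
REPORT_SCHEMA = {
    "score": int,
    "model_swaps": list,
    "automation_opportunities": list,
}


def validate_report(raw):
    """Parse a JSON string and check it against REPORT_SCHEMA.

    Raises ValueError on malformed JSON, missing fields, wrong types,
    or a score outside the 0-100 range.
    """
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    if not isinstance(report, dict):
        raise ValueError("report must be a JSON object")

    for key, expected in REPORT_SCHEMA.items():
        if key not in report:
            raise ValueError(f"missing field: {key}")
        value = report[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {key}")

    if not 0 <= report["score"] <= 100:
        raise ValueError("score must be between 0 and 100")
    return report


good = validate_report(
    '{"score": 72, "model_swaps": [], "automation_opportunities": []}'
)
```

A failed validation could trigger a retry or a fallback response, so that a malformed completion never reaches the dashboard.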

Accomplishments that we're proud of

A real-time graph interface that builds node by node is viscerally satisfying.

The AI Optimization dashboard delivers insights that feel immediately actionable, not generic. Seeing a specific model swap suggestion with an estimated cost saving is the kind of output that makes the Claude integration feel worthwhile.

The universal data contract means Tenor isn't locked to any one agent framework; it's designed to observe anything.

What we learned

Building for real-time observability forces you to think carefully about data contracts upfront. The shape of a Step object touches every layer of the system: the simulator, the WebSocket events, the frontend graph, the inspector panel, the analytics engine, and the Claude prompt.

We also learned that Claude is remarkably good at synthesizing structured analytics into human-readable, prioritized recommendations, but only when you give it well-structured input.

What's next for Tenor

Persistent storage: swapping in the Postgres backend with multi-tenant support and user authentication

Alerting: threshold-based alerts when error rates spike, costs exceed budgets, or latency degrades

Comparative run diffing: select two runs of the same agent and see exactly where their execution paths diverged

Expanded AI analysis: moving from fleet-level optimization to per-run post-mortems, where Claude explains exactly what went wrong and how to fix it
