AgentScore: Cost Optimization for AI Agent Workflows

Inspiration

According to McKinsey, 88% of organizations use AI and 80% say efficiency is a top priority, but almost none are achieving it.

We've experienced this firsthand. Multi-agent AI systems make dozens of LLM calls per task, and developers have zero visibility into what's happening inside. They see the output and the bill, but they can't see the waste: duplicate calls, overpriced models, bloated prompts, all hidden in the workflow.

Existing tools like LangSmith and Helicone can tell you what you spent. They can't tell you what you should have spent. We built AgentScore to close that gap.

What it does

AgentScore captures every LLM call your agents make, sends the full trace to Gemini for analysis, and shows you exactly where you're overspending, with a specific fix for each issue.

It identifies three categories of waste we call "The Three Sins":

Redundant Calls: Agents asking semantically identical questions in different words. Exact text matching misses these entirely, so catching them takes semantic analysis by an LLM.

Model Overkill: Expensive models used for trivial tasks, like GPT-4o formatting bullet points when a lightweight model gives the same result.

Prompt Bloat: Calls stuffed with repeated or irrelevant context. We found prompts sending 500+ tokens when 50 would do.

Each finding comes with a confidence score (only findings scoring ≥0.7 are shown) and an actionable fix.

In our demo, a travel planning workflow costing $16.57 per run was cut to $4.83, eliminating about 71% of the cost. At 1,000 runs/day, that's over $300,000/month in savings from a single workflow.
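The arithmetic behind the projection can be checked with a short sketch. The per-run costs and run volume come from the demo; the 30-day month and the function name are assumptions made for the example.

```python
# Figures from the demo. The 30-day month is an assumption.
COST_BEFORE = 16.57    # dollars per run before optimization
COST_AFTER = 4.83      # dollars per run after applying the fixes
RUNS_PER_DAY = 1_000
DAYS_PER_MONTH = 30    # assumed


def projected_savings(before, after, runs_per_day, days):
    """Return (savings per run, fraction of cost saved, savings per month)."""
    per_run = before - after
    fraction = per_run / before
    monthly = per_run * runs_per_day * days
    return per_run, fraction, monthly


per_run, fraction, monthly = projected_savings(
    COST_BEFORE, COST_AFTER, RUNS_PER_DAY, DAYS_PER_MONTH
)
print(f"${per_run:.2f} saved per run ({fraction:.0%}), ${monthly:,.0f} per month")
# prints: $11.74 saved per run (71%), $352,200 per month
```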

How we built it

SDK (Python): A LangChain callback handler that captures every LLM call automatically. It uses Python's contextvars for thread-safe trace ID management. Integration takes two lines, and the existing agent code needs no changes.

Backend (FastAPI + Supabase): Receives events, stores them in PostgreSQL, and orchestrates Gemini analysis.

Analysis Engine (Gemini 3): The entire workflow trace goes to Gemini in a single API call. Gemini's million-token context window is essential: analyzing multiple LLM calls for semantic patterns requires seeing everything at once.
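A rough sketch of the single-call approach: serialize every captured call into one prompt so the model sees the whole trace at once. The field names (`call_id`, `started_at`, `model`, `prompt`), the chronological ordering, and the instruction wording are all assumptions for illustration, and the actual Gemini request is left out.

```python
import json


def build_analysis_prompt(calls):
    """Serialize a whole workflow trace into one prompt string."""
    # Chronological ordering is an assumption made for this sketch.
    ordered = sorted(calls, key=lambda c: c["started_at"])
    trace_json = json.dumps(ordered, indent=2)
    return (
        "You are auditing an AI agent workflow for wasted spend.\n"
        "Look for redundant calls, model overkill, and prompt bloat.\n"
        "Reference calls only by their call_id.\n\n"
        f"TRACE:\n{trace_json}"
    )


# Hypothetical trace with two calls, deliberately out of order.
calls = [
    {"call_id": "c2", "started_at": 2, "model": "big-model",
     "prompt": "What are the best hotels in Lisbon?"},
    {"call_id": "c1", "started_at": 1, "model": "big-model",
     "prompt": "Recommend top-rated Lisbon hotels."},
]
prompt = build_analysis_prompt(calls)
```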

Scoring & Pricing: Deterministic efficiency scoring (0-100). Cost calculations use our own pricing engine with rates for 15+ models.
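A minimal sketch of what deterministic pricing and scoring could look like. The model names and rates below are placeholders, not real prices, and the score formula (the share of spend that is not waste, scaled to 0-100) is an assumption, since the exact formula is not spelled out here.

```python
# Placeholder rates in dollars per 1M tokens. These are NOT real prices.
RATES = {
    "big-model": {"input": 5.00, "output": 15.00},
    "small-model": {"input": 0.10, "output": 0.40},
}


def call_cost(model, input_tokens, output_tokens):
    """Compute the cost of one call from the hardcoded rate table."""
    rate = RATES[model]  # unknown models raise KeyError instead of guessing
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


def efficiency_score(total_cost, wasted_cost):
    """Assumed formula: percentage of spend that was not waste (0-100)."""
    if total_cost <= 0:
        return 100
    wasted = min(max(wasted_cost, 0.0), total_cost)  # clamp to a valid range
    return round(100 * (1 - wasted / total_cost))
```

The same inputs always produce the same dollar amounts and score, which is the point of keeping this step out of the LLM.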

Frontend (React + TypeScript): A dashboard that lists workflows, with an overview page of all calls, efficiency scores, detailed findings with suggested fixes, a call trace that shows the findings for each call, and projected savings at scale.

Challenges we ran into

Rate limits: Google's free tier made testing expensive workflows difficult. We built retry logic with fallback models and created snapshot-based testing to minimize API calls during development.
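The retry-with-fallback pattern can be sketched as follows. `RateLimitError`, `call_with_fallback`, and the parameters are hypothetical names; a real client would catch whatever exception the provider's SDK raises on a rate limit.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the provider's rate-limit error."""


def call_with_fallback(call, models, retries_per_model=3, base_delay=1.0,
                       sleep=time.sleep):
    """Try each model in order, backing off exponentially on rate limits."""
    last_error = None
    for model in models:
        for attempt in range(retries_per_model):
            try:
                return call(model)
            except RateLimitError as err:
                last_error = err
                sleep(base_delay * 2 ** attempt)  # base, 2x base, 4x base, ...
    raise RuntimeError("all models exhausted") from last_error
```

Passing `sleep` as a parameter lets tests swap in a no-op, so the backoff path can be exercised without waiting.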

Gemini response consistency: Early testing showed inconsistent call ID references for the same patterns. We tuned the analysis prompt for deterministic output and added structured JSON formatting instructions.
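One way to guard against inconsistent call ID references is to validate the structured output before showing it. The sketch below assumes a JSON array of findings with hypothetical `category`, `call_ids`, and `confidence` fields. Dropping findings that cite unknown call IDs is an assumption for the example; the 0.7 threshold is the display cutoff mentioned earlier.

```python
import json

MIN_CONFIDENCE = 0.7  # display threshold from the writeup
CATEGORIES = {"redundant_call", "model_overkill", "prompt_bloat"}  # hypothetical labels


def parse_findings(raw_json, known_call_ids):
    """Keep only findings that are well-formed, grounded, and confident."""
    findings = json.loads(raw_json)
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    known = set(known_call_ids)
    kept = []
    for finding in findings:
        if not isinstance(finding, dict):
            continue
        if finding.get("category") not in CATEGORIES:
            continue
        call_ids = finding.get("call_ids", [])
        if not call_ids or not set(call_ids) <= known:
            continue  # cites a call that is not in the trace
        if float(finding.get("confidence", 0)) < MIN_CONFIDENCE:
            continue
        kept.append(finding)
    return kept
```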

Analysis prompt engineering: Getting Gemini to return the exact findings we needed took many iterations. We had to refine how we asked Gemini to identify waste, reference specific call IDs, assign confidence scores, and format actionable recommendations. Each revision brought the output closer to something developers could actually act on.

Accomplishments that we're proud of

We built a deterministic pricing engine covering 15+ models across OpenAI, Anthropic, and Google, so every dollar amount on the dashboard is calculated from hardcoded rates, not LLM guesses.

Getting consistent analysis results from Gemini took real work. We had to lock down temperature, enforce structured JSON output, and iterate on the prompt until findings referenced specific call IDs reliably across runs.

The full call trace lets developers click into any individual call to see exactly what was sent, what came back, what it cost, and why it was flagged.

Integration is two lines of code: import the handler and pass it to your model. From then on, every call is captured.

What we learned

Understanding how Gemini processes data matters. We had to learn how our workflow traces were being tokenized and which formatting choices affected the quality of the analysis: how to structure prompts versus responses, whether to order calls chronologically or by cost, and whether to include full responses or truncated ones.

Consistency requires determinism. Early results varied wildly between runs. We learned that temperature, structured output formatting, and explicit instructions for call ID references all needed to be locked down. The analysis engine only became reliable once we controlled every variable that could introduce randomness.

Show the fix, not just the problem. Early versions only flagged waste. Adding specific remediation steps, such as which model to switch to, where to add caching, and what context to trim, transformed it from a diagnostic tool into an optimization platform.