Inspiration
It started with a single screenshot: a simple "summarise this Google Sheet" request that returned 547,000 tokens of raw JSON, four times the GPT-4o context window, for a 10K-row table whose answer was a single integer. The agent never needed 99% of what came back. We realised this happens to every Composio-powered agent, on every turn, on every tool, and there's no abstraction to measure the damage, let alone fix it.
Composio is brilliant at executing tools. We wanted a layer that was equally serious about the bill, one that would tell you, in honest tiktoken-accurate numbers, exactly where each token went and what it would cost on each model. From that first sheet, Quava was the answer to a question we kept asking: "why is the model paying for fields it never reads?"
What it does
Quava sits between Composio and your LLM and shrinks every tool call along five compounding axes:
- Token attribution: per-call tiktoken counts for schema, arguments, raw result, and compressed result, attributable per tool, toolkit, and run.
- Effort routing: `low` / `medium` / `high` / `auto` modes; `auto` classifies ask difficulty (simple → deep) and picks compression depth + schema detail per call.
- Output compression: schema-aware pruning, flattening, list compaction, tool-specific normalizers, TOON encoding for tabular records, and caveman-style stopword pruning on long prose.
- Safe caching: Only explicitly approved read-only tools are cached, and cached results are isolated by tenant, user, and connected account. Write actions and auth flows are never cached.
The live agent path runs a real Claude tool-use loop with parallel Composio execution, a per-model price ledger, and a counterfactual USD figure that tells you exactly what the same run would have cost without Quava. The React dashboard renders the full trace with metrics computed from the run data.
How we built it
Stack: Python 3.11 + FastAPI + tiktoken on the backend, React 19 + Vite + Tailwind + base-ui on the frontend, Upstash Redis for distributed caching with an in-memory fallback for local dev.
The compression pipeline. Every tool result flows through the same engine:
raw payload
-> tool-specific normalize
-> field policy resolve (5-tier, asymmetric)
-> tabular | record | object dispatch
-> mode-specific prune/flatten (safe | balanced | low | aggressive)
-> caveman stopword pruning (aggressive only)
-> TOON encoding (when denser than JSON) (best for tabular)
-> emit token + cache events
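A rough sketch of that stage order is below. The function names, signatures, and mode handling are illustrative, not the real engine's API; several stages are reduced to stubs so the example stays self-contained.

```python
import json

DENIAL_LIST = {"avatar_url", "node_id"}  # static drop tier (illustrative subset)

def normalize(tool: str, payload):
    # Tool-specific normalization (e.g. Gmail) would live here; pass-through stub.
    return payload

def prune_fields(obj, mode: str):
    # Field-policy resolve + mode-specific pruning, collapsed into one recursive pass.
    if isinstance(obj, dict):
        return {k: prune_fields(v, mode) for k, v in obj.items() if k not in DENIAL_LIST}
    if isinstance(obj, list):
        return [prune_fields(v, mode) for v in obj]
    return obj

def encode(obj) -> str:
    # Compact JSON baseline; a TOON candidate would be rendered and token-compared here.
    return json.dumps(obj, separators=(",", ":"))

def compress(tool: str, payload, mode: str = "balanced") -> str:
    out = encode(prune_fields(normalize(tool, payload), mode))
    # emit token + cache events here (tiktoken counts of raw vs. compressed)
    return out

rows = [{"login": "ada", "followers": 120, "avatar_url": "https://example.com/a.png", "node_id": "MDQ6"}]
print(compress("GITHUB_LIST_USERS", rows))
```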
Asymmetric field policy. Tool APIs return lots of fields that are irrelevant to the user's goal, which bloats the JSON and dilutes the signal the model reasons over. Every field traverses five tiers in cheapest-first order. Only the static denial list can drop a field; every other tier keeps it.
- Required signal (explicit dot-path or substring) -> keep
- User ask mentions the field name (word-bounded, snake/camel/space variants) -> keep
- Optional model classifier promoted it -> keep
- Static denial list (`avatar_url`, `node_id`, …) -> drop
- Default -> keep
We use a small classifier (Llama-3.1-8B on Groq or HuggingFace) that can promote fields the user's ask actually needs. It runs with a hard latency budget of 275 ms and a version-keyed cache.
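A minimal sketch of the tier resolution is below. The function names (`resolve_field`, `mentioned_in_ask`) and the denial-list subset are illustrative; only the cheapest-first tier order, the keep/drop asymmetry, and the audit-reason labels are taken from the description above.

```python
import re

DENIAL_LIST = {"avatar_url", "node_id"}  # illustrative subset of the static denial list

def mentioned_in_ask(field: str, ask: str) -> bool:
    # Word-bounded match on snake / camel / space variants of the field name.
    variants = {field, field.replace("_", " "), re.sub(r"(?<!^)([A-Z])", r" \1", field).lower()}
    return any(re.search(rf"\b{re.escape(v)}\b", ask, re.IGNORECASE) for v in variants)

def resolve_field(path: str, ask: str, required: set[str], promoted: set[str]) -> tuple[bool, str]:
    name = path.split(".")[-1]
    if path in required:
        return True, "explicit"                # tier 1: required signal (dot-path)
    if any(path.startswith(r + ".") for r in required):
        return True, "explicit_descendant"     # tier 1: child of a required signal
    if mentioned_in_ask(name, ask):
        return True, "ask"                     # tier 2: user ask mentions the field
    if path in promoted:
        return True, "model"                   # tier 3: classifier promoted it
    if name in DENIAL_LIST:
        return False, "denial_list"            # tier 4: the only tier that can drop
    return True, "default"                     # tier 5: keep by default

print(resolve_field("assignee", "who is the assignee on each task?", set(), set()))
print(resolve_field("owner.avatar_url", "summarise the sheet", set(), set()))
```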
Quality gate. When auto mode runs with `required_signals`, we walk modes from aggressive -> low -> balanced -> safe and pick the cheapest mode whose compressed output still contains every signal as a substring or dot-path. The result is the most aggressive mode that is safe for this specific ask on this specific schema: the best signal at the lowest token count.
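The gate itself fits in a few lines. This sketch (the `pick_mode` name and the toy compressor are ours, not the real module) walks the modes cheapest-first and returns the first output that still contains every required signal:

```python
import json

MODES = ["aggressive", "low", "balanced", "safe"]  # cheapest -> most conservative

def pick_mode(payload, required_signals, compress):
    for mode in MODES:
        out = compress(payload, mode)
        if all(sig in out for sig in required_signals):  # substring / dot-path check
            return mode, out
    return "safe", compress(payload, "safe")             # nothing passed: fall back to safest

def toy_compress(payload, mode):
    # Toy stand-in: keeps the first N keys alphabetically, tighter in cheaper modes.
    budget = {"aggressive": 1, "low": 2, "balanced": 3, "safe": len(payload)}[mode]
    return json.dumps({k: payload[k] for k in sorted(payload)[:budget]}, separators=(",", ":"))

print(pick_mode({"assignee": "dana", "title": "Fix login", "url": "https://example.com"},
                ["assignee"], toy_compress))
```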
Counterfactual cost math. For each run we track Anthropic `usage.{input, output, cache_read, cache_write}` per iteration. Actual cost is the per-token Anthropic price applied to each bucket and divided by one million:
actual_usd = (p_in · t_in + p_cr · t_cr + p_cw · t_cw + p_out · t_out) / 1_000_000
The counterfactual rebills t_in as t_in + (t_raw − t_sent) at the same model — i.e. what we would have paid had every raw tool result hit the input ledger uncompressed. The delta is what the dashboard reports as saved.
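As a worked version of the two formulas (prices here are illustrative placeholders per million tokens, not a quoted Anthropic price card; the token counts reuse the 547K -> 5.7K sheet example):

```python
def run_cost(t_in, t_cr, t_cw, t_out, p_in, p_cr, p_cw, p_out):
    # actual_usd = (p_in·t_in + p_cr·t_cr + p_cw·t_cw + p_out·t_out) / 1_000_000
    return (p_in * t_in + p_cr * t_cr + p_cw * t_cw + p_out * t_out) / 1_000_000

def counterfactual_cost(t_in, t_cr, t_cw, t_out, t_raw, t_sent, prices):
    # Re-bill the input bucket as if every raw tool result had hit the ledger uncompressed.
    return run_cost(t_in + (t_raw - t_sent), t_cr, t_cw, t_out, *prices)

prices = (3.00, 0.30, 3.75, 15.00)  # illustrative $/Mtok: input, cache read, cache write, output
actual = run_cost(12_000, 40_000, 8_000, 2_500, *prices)
baseline = counterfactual_cost(12_000, 40_000, 8_000, 2_500, t_raw=547_000, t_sent=5_700, prices=prices)
print(f"actual=${actual:.4f}  baseline=${baseline:.4f}  saved=${baseline - actual:.4f}")
```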
Challenges we ran into
- Gmail response shape. Gmail results came back as deeply nested message objects with headers, MIME parts, body data, snippets, and flattened Composio fields. The first pass did not recurse far enough, and the next pass removed too much. We added a Gmail-specific normalizer that keeps Subject, From, Date, snippet, and readable body text while dropping raw MIME payloads.
- Compression can remove task-critical fields. Flattening `assignee` to a string saves tokens, but it can break questions like “who is assigned to what?” We added task profiles and explicit `(tool, task) -> required_fields` maps so the engine protects required fields before pruning.
- TOON is workload-dependent. Token-Oriented Object Notation is smaller than JSON for uniform tabular records, but it can be worse for irregular nested objects. The engine now renders both JSON and TOON, counts tokens for each, and sends the smaller representation.
- JSON whitespace. Default `json.dumps` adds spaces after commas and colons, which creates extra tokens at scale. On large tables this was a measurable cost, so we use compact separators, `separators=(",", ":")`, throughout the tool-result path (see the sketch after this list).
- Some token counts are approximate. Anthropic and Google do not publish public tokenizers for every model we use, so those counts fall back to `cl100k_base` and are marked as approximate in the UI.
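A small check of the last two points, assuming `tiktoken` is installed; the tokenizer fallback mirrors the behaviour described above, and the printed numbers depend on the payload:

```python
import json
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> tuple[int, bool]:
    try:
        enc = tiktoken.encoding_for_model(model)
        approximate = False
    except KeyError:                                  # e.g. Claude / Gemini model names
        enc = tiktoken.get_encoding("cl100k_base")    # fallback, flagged as approximate
        approximate = True
    return len(enc.encode(text)), approximate

rows = [{"login": f"user{i}", "followers": i * 10} for i in range(500)]
pretty = json.dumps(rows)                             # default: spaces after ',' and ':'
compact = json.dumps(rows, separators=(",", ":"))     # compact separators
print(count_tokens(pretty), count_tokens(compact))
```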
Accomplishments that we're proud of
- Dashboard metrics come from live runs. The UI does not use hardcoded constants or simulated delays; each metric is computed from the runtime modules with `tiktoken`.
- Large tool results compress while preserving task answers. The Google Sheet read went from 547K to 5.7K tokens, about a 99% reduction, and the agent still answered “what’s the average follower count?” using representative samples and statistics.
- Anthropic prompt caching is wired into the live agent. The agent emits `cache_control: {"type": "ephemeral"}` on the system prompt and final tool schema block. Follow-up iterations within 5 minutes use cached input pricing. The cost ledger tracks both cold and warm paths (a minimal sketch follows this list).
- Field policy includes an audit trail. Each keep/drop decision records a reason: `explicit`, `explicit_descendant`, `ask`, `model`, `denial_list`, or `default`, so the dashboard can show why a field survived compression.
- Cost savings are priced from the run data. Quava prices the compressed run and the uncompressed baseline on the same model, then reports the dollar delta.
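For the prompt-caching point, this is roughly how the `cache_control` markers attach in a Claude tool-use call. It is a minimal sketch of the pattern, not Quava's agent loop; the tool definition, prompt text, and model alias are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "GOOGLESHEETS_BATCH_GET",
    "description": "Read a range from a Google Sheet via Composio.",
    "input_schema": {"type": "object", "properties": {"spreadsheet_id": {"type": "string"}}},
    "cache_control": {"type": "ephemeral"},  # marks the end of the cached tool-schema block
}]

response = client.messages.create(
    model="claude-3-5-sonnet-latest",        # any Claude model that supports prompt caching
    max_tokens=1024,
    system=[{"type": "text", "text": "You are a cost-aware data agent.",
             "cache_control": {"type": "ephemeral"}}],  # cached system-prompt prefix
    tools=tools,
    messages=[{"role": "user", "content": "What's the average follower count?"}],
)
# Cold vs. warm paths show up in the usage block the cost ledger reads:
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```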
What we learned
- Compression needs a quality check. Dropping an assignee object may reduce tokens, but it breaks the task if the user needs the assignee. Aggression has to be gated by the fields the answer depends on.
- Model-assisted field selection should be one-way. The classifier can promote fields back into the payload, but it cannot remove fields. If it is slow or wrong, it cannot make compression more destructive.
- Provider caching compounds after the first turn. Tool-result compression reduces the payload once. Prompt caching reduces repeated prefix cost on later turns. They work best when the payload is compressed before it becomes part of the cached prefix.
- Upstream field selection is the cleanest savings path. Returning only the fields needed for the task avoids fetching irrelevant data, which is safer than trimming it after it arrives.
- Approximate counts should be labeled. Claude and Gemini token counts are surfaced as `approximate=True` instead of being presented as exact measurements.
What's next for Quava
- Integrate Composio meta-tools through an MCP proxy. Expose Composio’s discovery and execution meta-tools through MCP so agents can inspect available tools, schemas, and execution paths without custom integration code.
- More benchmarks for task-completion quality. Track whether the agent completes the task, not only how many tokens were removed.
- Adapters beyond Composio. The LangGraph adapter pattern is in `Quava/adapters/`; next targets are the OpenAI Agents SDK, CrewAI, and the Anthropic Managed Agents API.