Inspiration

Every team building with LLM APIs faces the same quiet inefficiency: they pick one model and use it for everything. The most capable one they can afford, or the cheapest one they can get away with. There's no middle ground - no system that says "this subtask needs a $0.15/million-token model and this one needs a $15/million-token model, and here's how to tell the difference."

The cost isn't just financial. In high-stakes domains like fraud detection or financial analysis, mismatching model to subtask in either direction - too cheap where precision matters, too expensive where it doesn't - produces worse outcomes at higher prices. That problem felt worth building against.


What it does

Tokenwise is an agentic task orchestration engine. You paste in any complex task. Tokenwise:

  1. Decomposes it into 3–7 subtasks using an orchestrator agent, estimating the complexity of each
  2. Routes each subtask to the cheapest model capable of handling it across a three-tier ladder (GPT-4o-mini / Claude Haiku → GPT-4o / Claude Sonnet → o1 / Claude Opus)
  3. Executes subtasks in dependency order, passing prior outputs as context where needed
  4. Validates each output using a lightweight validator agent with routing-hint-aware rubrics
  5. Escalates on failure - retry same model, then escalate one tier, then surface the error
  6. Composes all subtask outputs into a final coherent response
  7. Tracks every token, dollar, and escalation - surfacing actual cost vs. what all-Tier-3 would have cost

The dashboard streams every agent decision live: which subtask is running, on which model, at which tier, why it escalated if it did, and what it cost. Historical totals accumulate across every run.
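To make steps 2 and 5 concrete, here is a minimal sketch of the three-tier ladder and the retry-then-escalate policy in Python. The model identifier strings and function names are illustrative assumptions, not the exact ones in the codebase.

```python
# Illustrative tier ladder; the real model identifier strings may differ.
TIER_LADDER = {
    1: {"openai": "gpt-4o-mini", "anthropic": "claude-haiku"},
    2: {"openai": "gpt-4o", "anthropic": "claude-sonnet"},
    3: {"openai": "o1", "anthropic": "claude-opus"},
}
MAX_TIER = max(TIER_LADDER)

def next_attempt(tier: int, attempts_on_tier: int) -> tuple[int, str]:
    """Retry once on the same model, then escalate one tier, then surface the error."""
    if attempts_on_tier < 2:
        return tier, "retry"
    if tier < MAX_TIER:
        return tier + 1, "escalate"
    return tier, "surface_error"
```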


How we built it

Backend: FastAPI with asyncio for parallel DAG execution of independent subtasks. SQLite persists every run and subtask with full token counts, cost, escalation history, and Tier-3 baseline for savings comparison. SlowAPI enforces per-IP rate limiting and a daily spend guard. The orchestration pipeline is modular - orchestrator, tier router, executor, validator, escalation manager, and composer are each discrete components with defined interfaces.
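A minimal sketch of what "parallel DAG execution of independent subtasks" means in practice, assuming a hypothetical Subtask shape and a run_subtask helper (the real executor carries routing, validation, and cost tracking):

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Subtask:                      # hypothetical shape, not the project's actual model
    id: str
    prompt: str
    depends_on: list[str] = field(default_factory=list)

async def run_subtask(s: Subtask, context: dict[str, str]) -> str:
    ...                             # route -> call model -> validate (elided)

async def execute_dag(subtasks: list[Subtask]) -> dict[str, str]:
    results: dict[str, str] = {}
    remaining = {s.id: s for s in subtasks}
    while remaining:
        # Each wave is every remaining subtask whose dependencies are satisfied;
        # independent subtasks in the same wave run concurrently via asyncio.gather.
        wave = [s for s in remaining.values() if all(d in results for d in s.depends_on)]
        if not wave:
            raise RuntimeError("cycle or unsatisfiable dependency in subtask graph")
        outputs = await asyncio.gather(
            *(run_subtask(s, {d: results[d] for d in s.depends_on}) for s in wave)
        )
        for s, out in zip(wave, outputs):
            results[s.id] = out
            del remaining[s.id]
    return results
```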

Routing logic: The tier router assigns models based on subtask complexity (low/medium/high) and routing hint (general reasoning, structured output, instruction following, code generation, creative synthesis). Provider affinity is encoded per hint - structured output routes to Anthropic, code generation routes to OpenAI. Escalation memory tracks which hint+tier combinations have historically failed and auto-promotes starting tiers over time.
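Roughly, that decision looks like the following sketch - names, thresholds, and the default provider are assumptions; only the structure mirrors the description above:

```python
PROVIDER_AFFINITY = {            # per-hint affinity, as described above
    "structured_output": "anthropic",
    "code_generation": "openai",
}
COMPLEXITY_TO_TIER = {"low": 1, "medium": 2, "high": 3}

def route(hint: str, complexity: str,
          escalation_memory: dict[tuple[str, int], int]) -> tuple[str, int]:
    provider = PROVIDER_AFFINITY.get(hint, "openai")   # default is an assumption
    tier = COMPLEXITY_TO_TIER[complexity]
    # Escalation memory: if this hint+tier pairing has failed repeatedly this
    # session, start one tier higher instead of paying for a predictable escalation.
    if escalation_memory.get((hint, tier), 0) >= 2 and tier < 3:
        tier += 1
    return provider, tier
```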

Models: OpenAI (GPT-4o-mini, GPT-4o, o1) and Anthropic (Claude Haiku 3.5, Claude Sonnet 4, Claude Opus) via their respective APIs. Both adapters normalize to a single runner contract. o1-specific constraints (no system role, max_completion_tokens instead of max_tokens, no temperature) are handled inside the adapter.
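A sketch of what that shared runner contract could look like (hypothetical names; the real interface likely carries more fields, such as cost or finish reason):

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class RunnerResult:
    text: str
    input_tokens: int
    output_tokens: int

class ModelRunner(Protocol):
    """Both the OpenAI and Anthropic adapters normalize to this one async call."""
    async def run(self, model: str, system: str | None, prompt: str,
                  max_output_tokens: int) -> RunnerResult: ...
```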

Frontend: React + Vite with a WebSocket connection for live event streaming. The agent timeline, run economics strip, and historical totals all update in real time as events arrive. Composed output renders as markdown. The full app is served as static files from the FastAPI backend - single Railway service deployment.

Testing: 73 pytest tests covering the orchestrator, tier router, escalation manager, validator, cost tracker, history store, runtime coordinator, rate limiting, and API endpoints. All LLM calls are mocked.
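The mocking pattern, shown with a self-contained illustration rather than an actual test from the suite (the real tests target the project's own components):

```python
import asyncio
from unittest.mock import AsyncMock

def test_retry_uses_mocked_runner_not_the_network():
    runner = AsyncMock()
    # First attempt "fails", second succeeds - no real API call is ever made.
    runner.run.side_effect = [RuntimeError("validation failed"), "ok on retry"]

    async def attempt() -> str:
        try:
            return await runner.run(model="tier-1-model", prompt="...")
        except RuntimeError:
            return await runner.run(model="tier-1-model", prompt="...")

    assert asyncio.run(attempt()) == "ok on retry"
    assert runner.run.await_count == 2
```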


Challenges we ran into

o1 API constraints. OpenAI's o1 model rejects system role messages, max_tokens, and temperature - all three of which the runner was sending. The adapter needed o1-specific detection and parameter remapping before any Tier-3 OpenAI call would succeed.
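The remapping is roughly this shape - the function name and call site are hypothetical, but the constraint handling matches what the adapter has to do:

```python
def build_openai_kwargs(model: str, system: str | None, user: str,
                        max_tokens: int, temperature: float) -> dict:
    if model.startswith("o1"):
        # o1 rejects system-role messages, max_tokens, and temperature:
        # fold the system prompt into the user message and remap the token cap.
        content = f"{system}\n\n{user}" if system else user
        return {"model": model,
                "messages": [{"role": "user", "content": content}],
                "max_completion_tokens": max_tokens}
    messages = ([{"role": "system", "content": system}] if system else [])
    messages.append({"role": "user", "content": user})
    return {"model": model, "messages": messages,
            "max_tokens": max_tokens, "temperature": temperature}
```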

Railway WebSocket timeouts. Railway's proxy closes WebSocket connections after 60 seconds of inactivity. Complex tasks - particularly ones with Tier-3 subtasks - take longer than that. The fix was a keepalive ping task running alongside the event queue, plus client-side auto-reconnect that replays the backlog from the RunEventHub on reconnect.
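A sketch of the server-side half of that fix, assuming a run's events arrive on an asyncio.Queue (the RunEventHub replay-on-reconnect logic is omitted):

```python
import asyncio
from fastapi import WebSocket

async def stream_run_events(ws: WebSocket, events: "asyncio.Queue[dict]") -> None:
    async def keepalive() -> None:
        while True:
            await asyncio.sleep(20)            # well under Railway's ~60s idle cutoff
            await ws.send_json({"type": "ping"})

    ping_task = asyncio.create_task(keepalive())
    try:
        while True:
            event = await events.get()
            await ws.send_json(event)
            if event.get("type") == "run_complete":
                break
    finally:
        ping_task.cancel()
```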

Validator over-strictness. Early versions of the validator failed outputs on stylistic grounds - "narrative flow could be improved" - rather than genuine incompleteness. This caused unnecessary escalations that inflated cost and broke the savings math. The fix was explicit instruction in the validator prompt: only fail for truncation, placeholder text, or fundamental rubric failure. When in doubt, pass.
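The gist of that prompt change, paraphrased (the actual wording in the validator prompt differs):

```python
VALIDATOR_FAILURE_RULES = """
Fail the output ONLY if it is truncated mid-thought, contains placeholder text
(e.g. "TODO" or "[insert section here]"), or fundamentally fails the rubric for
this subtask. Do NOT fail for style, tone, or "could be improved" concerns.
When in doubt, pass.
"""
```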

JSON output format drift. The orchestrator occasionally returned output_format: "code" for technical subtasks - a value not in the enum. Post-parse normalization was added to rewrite invalid format values before schema validation, alongside stricter prompt guidance on when to use json vs markdown.
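The normalization itself is small - the helper name is hypothetical, and the accepted values follow from the json-vs-markdown guidance above:

```python
VALID_OUTPUT_FORMATS = {"json", "markdown"}

def normalize_output_format(value: str) -> str:
    # The orchestrator occasionally emits values like "code" for technical subtasks;
    # rewrite anything outside the enum to "markdown" before schema validation.
    return value if value in VALID_OUTPUT_FORMATS else "markdown"
```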

PORT expansion in Docker on Railway. Railway injects $PORT at runtime, but the variable wasn't expanding correctly when the container launched through uv run uvicorn. The fix was a shell entrypoint script (start.sh) that runs uvicorn directly from the venv with ${PORT:-8000}.


Accomplishments that we're proud of

  • 73 tests passing across every component of the system - router, validator, escalation, cost math, history, runtime, and API
  • 35% average cost savings across 21 live runs on genuinely complex tasks - not toy examples
  • 39% savings on fraud detection specifically, the highest-stakes domain we tested, where quality mattered most and escalations were real
  • The escalation memory system working correctly in production - the router actually gets smarter across a session
  • A fully deployed, single-service Railway application with WebSocket streaming, persistent history, rate limiting, and a daily spend guard - not a demo notebook

What we learned

The hardest design decision in the whole system wasn't the routing logic - it was the quality signal. Without a reliable way to know whether a subtask output is good enough, the router is just guessing. The validator agent is what makes the system defensible rather than fragile.

Routing hint → provider affinity matters more than expected. Once structured output tasks were explicitly routed to Anthropic and code generation to OpenAI, first-attempt pass rates improved meaningfully - fewer escalations, lower cost, same or better quality.

o1 is genuinely different from other OpenAI models at the API level, not just the capability level. Building a unified runner that handles both o1 and GPT-4o cleanly required more adapter logic than anticipated.

The savings narrative only works if the baseline is honest. Measuring savings against Tier-3 pricing for the actual tokens used - not an estimate - is what makes the numbers defensible in a pitch.
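In code terms, the baseline is the same token counts re-priced at Tier-3 rates, not a re-estimate of the task. Rates and field names in this sketch are placeholders:

```python
def savings_pct(subtasks: list[dict], tier3_in_per_mtok: float,
                tier3_out_per_mtok: float) -> float:
    actual = sum(s["cost_usd"] for s in subtasks)
    baseline = sum(
        s["input_tokens"] / 1_000_000 * tier3_in_per_mtok
        + s["output_tokens"] / 1_000_000 * tier3_out_per_mtok
        for s in subtasks
    )
    return (baseline - actual) / baseline * 100
```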


What's next for Tokenwise

Plugin SDK. The routing and escalation logic should be droppable into any existing LLM workflow without rebuilding the pipeline. A lightweight SDK that accepts a task graph and returns routed, validated outputs is the natural next step.

Fine-tuned routing classifier. The current router uses heuristics - complexity labels and routing hints assigned by the orchestrator. A classifier trained on historical pass/fail data per hint+tier combination would ground routing decisions in data rather than in prompt instructions.

Compliance mode. For regulated industries, sensitive subtasks need to stay on-premise or on approved providers. A routing flag that enforces provider constraints per task type is directly applicable to fintech and healthcare deployments.

Streaming output. Currently the composed response appears only after all subtasks complete. Streaming partial output as each subtask finishes would significantly improve perceived latency on long tasks.

Team dashboards. Per-user spend quotas, routing profiles shared across a team, and aggregate savings reporting across an organization's LLM usage - the SQLite schema already captures everything needed.

Built With

FastAPI, Python (asyncio), SQLite, SlowAPI, React, Vite, WebSockets, OpenAI API, Anthropic API, pytest, uv, Docker, Railway
