optimize.ai

Inspiration

We kept running into the same problem: you write a prompt that works great on GPT-5 or Claude Opus, but when you try to run it on a cheaper model to save money, the output quality tanks. We wanted to automate the process of making prompts work well on cheaper models so you get the quality you need without the cost you don't.

What it does

optimize.ai is a prompt optimization engine powered by Keywords AI. You paste in a prompt and a set of requirements, and a 5-agent pipeline takes over:

Profiler - analyzes your prompt and selects the best 4-6 models to test from a pool of 9 across OpenAI, Anthropic, Google, and DeepSeek
Benchmark - runs all selected models in parallel through the Keywords AI gateway
Evaluator - scores each model's output on correctness, completeness, and format, and checks which requirements passed or failed
Optimizer - rewrites your prompt using strategies like chain-of-thought, schema enforcement, or example injection so cheaper models pass the requirements they missed
Advisor - recommends the best model + prompt combo based on your priority (cost, speed, quality, or balanced)
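
As a rough illustration of the Advisor step, the sketch below ranks model + prompt combos with a weighted score. The `Candidate` shape, the `recommend` function, and the weights are all hypothetical, chosen for the example rather than taken from the project.

```typescript
// Hypothetical shape of one evaluated model + prompt combination.
interface Candidate {
  model: string;
  quality: number;    // evaluator score in the range 0..1
  costUsd: number;    // cost of one run
  latencyMs: number;  // end-to-end latency of one run
  passedAll: boolean; // true if every requirement passed
}

type Priority = "cost" | "speed" | "quality" | "balanced";

// Illustrative weights in the order [quality, cost, speed]; assumed values.
const WEIGHTS: Record<Priority, [number, number, number]> = {
  cost: [0.2, 0.7, 0.1],
  speed: [0.2, 0.1, 0.7],
  quality: [0.8, 0.1, 0.1],
  balanced: [0.4, 0.3, 0.3],
};

function recommend(candidates: Candidate[], priority: Priority): Candidate | undefined {
  // Prefer candidates that pass every requirement; fall back to all if none do.
  const passing = candidates.filter((c) => c.passedAll);
  const pool = passing.length > 0 ? passing : candidates;
  if (pool.length === 0) return undefined;

  const maxCost = Math.max(...pool.map((c) => c.costUsd));
  const maxLatency = Math.max(...pool.map((c) => c.latencyMs));
  const [wq, wc, ws] = WEIGHTS[priority];

  const score = (c: Candidate): number => {
    // Normalize so that cheaper and faster candidates score closer to 1.
    const cheapness = maxCost > 0 ? 1 - c.costUsd / maxCost : 1;
    const quickness = maxLatency > 0 ? 1 - c.latencyMs / maxLatency : 1;
    return wq * c.quality + wc * cheapness + ws * quickness;
  };

  // Keep the highest-scoring candidate; ties keep the earlier one.
  return pool.reduce((best, c) => (score(c) > score(best) ? c : best));
}
```

Filtering on `passedAll` before scoring keeps a cheap model that fails a requirement from winning on price alone.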

Every API call is routed through Keywords AI, so the entire pipeline is observable in the Keywords AI dashboard. You can see each agent's trace, token usage, and cost breakdown.

How we built it

The app is a Next.js 14 project deployed on Vercel. All LLM calls go through the Keywords AI unified gateway using the OpenAI SDK, which lets us hit 9 models across 4 providers with a single API key. The frontend consumes server-sent events, so you can watch each agent work in real time as the pipeline runs. Run history is persisted in Supabase. We use the @keywordsai/tracing SDK to instrument the pipeline so every agent shows up in Keywords AI's trace view.
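
To make the streaming part concrete, here is a minimal sketch of how one agent update could be encoded in the server-sent events wire format. The `AgentEvent` shape and the `toSseMessage` helper are hypothetical names for this example; only the `event:` / `data:` / blank-line framing comes from the SSE format itself.

```typescript
// Hypothetical event emitted as each agent makes progress.
interface AgentEvent {
  agent: "profiler" | "benchmark" | "evaluator" | "optimizer" | "advisor";
  status: "started" | "finished";
  detail?: string;
}

// Encode one event as a text/event-stream message:
// an "event:" line, a "data:" line, then a blank line that ends the message.
// JSON.stringify never emits raw newlines, so the payload fits on one data line.
function toSseMessage(event: AgentEvent): string {
  return `event: ${event.agent}\ndata: ${JSON.stringify(event)}\n\n`;
}
```

On the server, strings like these would be written to a response with the `text/event-stream` content type. In the browser, an `EventSource` can subscribe to each named event with `addEventListener`.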

Challenges

Getting the optimization loop to actually show measurable improvement was the hardest part. Our initial approach embedded the user's requirements directly into the benchmark prompt, which meant every model already had the requirements, leaving nothing for the optimizer to improve. We had to restructure so the initial benchmark uses the raw prompt, and only the optimizer's rewritten version includes the requirements.
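
The restructuring described above can be sketched as two small prompt builders. The function names and the exact layout of the requirements list are assumptions for illustration, not the project's actual code.

```typescript
// Baseline: the raw prompt only, so the benchmark measures what each model
// does without being told the requirements.
function buildBenchmarkPrompt(rawPrompt: string): string {
  return rawPrompt.trim();
}

// Optimized: the rewritten prompt followed by an explicit, numbered
// requirements list (hypothetical layout).
function buildOptimizedPrompt(rewrittenPrompt: string, requirements: string[]): string {
  const list = requirements.map((r, i) => `${i + 1}. ${r}`).join("\n");
  return `${rewrittenPrompt.trim()}\n\nRequirements:\n${list}`;
}
```

Keeping the requirements out of the baseline is what leaves a measurable gap between the first benchmark and the optimized run.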

We also ran into issues with some models returning Python code or markdown instead of JSON when asked for structured output. Since the Keywords AI gateway doesn't support response_format, we built a retry system with progressively stricter JSON-only prompts and a robust extraction layer.
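
A minimal sketch of what such a retry-and-extract layer might look like is shown below. The names `extractJson`, `completeAsJson`, and `STRICTNESS`, the prompt suffixes, and the injected `callModel` function are all hypothetical; the project's actual prompts and extraction rules are not shown in this writeup.

```typescript
// Three backticks, built at runtime so this sample stays easy to embed.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Try to pull a JSON value out of a reply that may wrap it in a markdown
// fence or surrounding prose. Returns undefined if nothing parses.
function extractJson(reply: string): unknown {
  const candidates: string[] = [reply.trim()];

  // Contents of the first fenced block, if there is one.
  const inner = reply.match(FENCED_BLOCK)?.[1];
  if (inner !== undefined) candidates.push(inner.trim());

  // Widest span from the first "{" to the last "}".
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start !== -1 && end > start) candidates.push(reply.slice(start, end + 1));

  for (const text of candidates) {
    try {
      return JSON.parse(text);
    } catch {
      // Not valid JSON; try the next candidate.
    }
  }
  return undefined;
}

// Illustrative suffixes, each stricter than the last.
const STRICTNESS: string[] = [
  "",
  "\n\nRespond with a single JSON object.",
  "\n\nRespond with ONLY a JSON object. No markdown, no code, no commentary.",
];

// callModel is injected so the sketch stays independent of any SDK.
async function completeAsJson(
  prompt: string,
  callModel: (prompt: string) => Promise<string>,
): Promise<unknown> {
  for (const suffix of STRICTNESS) {
    const parsed = extractJson(await callModel(prompt + suffix));
    if (parsed !== undefined) return parsed;
  }
  throw new Error("Model never returned parseable JSON");
}
```

Each retry costs an extra model call, so trying the cheap extraction step first keeps most replies to a single attempt.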

What we learned

Keywords AI's unified gateway made it surprisingly easy to test across providers. Swapping between GPT-5, Claude, Gemini, and DeepSeek is just changing a model string. The tracing SDK gave us full observability into the multi-agent pipeline, which was critical for debugging agent decisions during development.