Litmus — The Glassdoor for AI Agents
Inspiration
The AI agent ecosystem is exploding — there are hundreds of agents launching every week for customer support, coding, sales, and more. But choosing the right one feels like gambling. Marketing pages all claim "best-in-class accuracy" and "human-like conversations," yet there's no independent, standardized way to verify those claims. We asked ourselves: what if there was a Glassdoor for AI agents? A transparent, data-driven marketplace where every agent is put through rigorous, reproducible evaluations — not just self-reported metrics — so teams can make confident decisions before committing time and budget.
What it does
Litmus is an AI agent evaluation marketplace that lets users discover, benchmark, and compare AI agents using multiple assessment signals:
- Gemini-Powered Benchmarks — Run live LLM-as-judge evaluations across five dimensions (accuracy, coherence, helpfulness, hallucination detection, and completeness) with weighted composite scoring. Supports multiple benchmark types including customer support simulation, tool use, and code generation. (A minimal sketch of the composite scoring follows this list.)
- Real-Time Voice Evaluation — Actually call voice agents via Plivo, stream bidirectional audio through Gemini Live, and score the conversation on naturalness, helpfulness, latency, accuracy, and tone.
- Web Intelligence Monitoring — Automatically gathers changelogs, outage reports, pricing changes, and third-party reviews from across the web using You.com search, then summarizes them with Gemini for an always-current intelligence feed.
- Tool Verification — Agents claim integrations with Slack, GitHub, Jira, etc. Litmus verifies those claims through Composio, so users know what's real and what's marketing.
- Side-by-Side Comparison — Compare up to four agents on a radar chart with AI-powered recommendations tailored to your specific use case.
- Community Reviews — Real users rate agents with star ratings and use-case context, building a trust layer on top of the automated evaluations.
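To make the weighted composite concrete, here is a minimal TypeScript sketch of combining the five dimension scores named above. The weights and type names are illustrative stand-ins, not our production values, which were tuned per benchmark type.

```typescript
// Weighted composite over the five benchmark dimensions.
// These weights are illustrative; the real values were tuned per benchmark type.
type Dimension =
  | "accuracy"
  | "coherence"
  | "helpfulness"
  | "hallucination"
  | "completeness";

const WEIGHTS: Record<Dimension, number> = {
  accuracy: 0.3,
  coherence: 0.15,
  helpfulness: 0.25,
  hallucination: 0.15, // higher score = fewer hallucinations detected
  completeness: 0.15,
};

/** Combine per-dimension scores (0-10) into a single weighted composite. */
function compositeScore(scores: Record<Dimension, number>): number {
  return (Object.keys(WEIGHTS) as Dimension[]).reduce(
    (total, dim) => total + scores[dim] * WEIGHTS[dim],
    0,
  );
}
```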
How we built it
Litmus runs on three coordinated services:
Next.js 16 App (React 19, TypeScript, Tailwind CSS 4) — Handles the frontend, 14 API routes, auth (Google/GitHub OAuth via Supabase), and orchestrates all evaluation workflows. Long-running operations like benchmarking and intelligence gathering use async fire-and-forget patterns to return responses instantly (202 Accepted) while processing continues in the background.
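As a rough sketch of that pattern, a route handler can return 202 Accepted immediately and defer the heavy work with Next.js's `after()` (discussed again under "What we learned"). `runBenchmark`, the route path, and the request shape below are hypothetical stand-ins, not the actual Litmus code.

```typescript
// app/api/benchmarks/route.ts -- illustrative only.
import { NextResponse, after } from "next/server";

export async function POST(req: Request) {
  const { agentId } = await req.json();

  // after() runs once the response has been flushed, so the client gets its
  // 202 immediately while the benchmark keeps running server-side.
  after(async () => {
    try {
      await runBenchmark(agentId); // writes scores to Supabase when finished
    } catch (err) {
      console.error(`benchmark failed for agent ${agentId}`, err);
    }
  });

  return NextResponse.json({ status: "queued", agentId }, { status: 202 });
}

// Stand-in for the real evaluation pipeline.
async function runBenchmark(agentId: string): Promise<void> {
  // ... LLM-as-judge calls, score aggregation, database writes ...
}
```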
Fastify WebSocket Server — A standalone service that bridges Plivo voice calls with the Gemini Live API. It handles real-time bidirectional audio streaming and converts between audio formats (mu-law 8kHz from Plivo to PCM 16/24kHz for Gemini) on the fly. We separated this from Next.js to isolate the real-time audio complexity and allow independent scaling.
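For flavor, here is a minimal sketch of the Plivo-to-Gemini leg of that conversion: standard G.711 mu-law decoding followed by a naive duplicate-sample upsample from 8kHz to 16kHz. The function names are illustrative, and the real bridge also handles the reverse path (Gemini's 24kHz PCM back down to mu-law for Plivo) with proper buffering.

```typescript
// Decode one G.711 mu-law byte to a signed 16-bit PCM sample.
function muLawToPcm(mu: number): number {
  const BIAS = 0x84;
  mu = ~mu & 0xff;
  const sign = mu & 0x80;
  const exponent = (mu >> 4) & 0x07;
  const mantissa = mu & 0x0f;
  const sample = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return sign ? -sample : sample;
}

// Decode a mu-law frame from Plivo and upsample 8kHz -> 16kHz by sample
// duplication (a proper resampler would sound better; this keeps it short).
function plivoFrameToGeminiPcm(muLawFrame: Buffer): Buffer {
  const pcm = Buffer.alloc(muLawFrame.length * 4); // 2 bytes/sample, 2x samples
  for (let i = 0; i < muLawFrame.length; i++) {
    const s = muLawToPcm(muLawFrame[i]);
    pcm.writeInt16LE(s, i * 4); // original sample
    pcm.writeInt16LE(s, i * 4 + 2); // duplicate -> doubles the sample rate
  }
  return pcm;
}
```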
Supabase — Postgres database with six tables, Row-Level Security policies, full-text search via custom RPC functions, OAuth, Realtime subscriptions, and server-side score aggregation.
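A hypothetical example of invoking such a full-text search RPC from the app; `search_agents` and its argument are stand-ins for the project's actual function.

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
);

// Postgres-side, the RPC would wrap a to_tsvector / websearch_to_tsquery
// match; RLS policies still apply because the function runs as the caller.
const { data, error } = await supabase.rpc("search_agents", {
  query: "customer support voice",
});
if (error) throw error;
console.log(data);
```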
Key integrations include Gemini (LLM-as-judge for benchmarks, native audio model for voice, profiling and comparison generation), Plivo (outbound voice calls and audio streaming), You.com (web search for intelligence gathering), Composio (tool/integration verification), and Intercom (conversation metrics). Everything deploys to Render via a unified render.yaml configuration.
Challenges we ran into
- Real-time audio format conversion was one of the trickiest parts. Plivo sends mu-law encoded audio at 8kHz, but Gemini's native audio model expects PCM at 16–24kHz. Getting bidirectional conversion working reliably with low latency required careful buffer management and a dedicated WebSocket server.
- Scoring fairness — Designing a weighted scoring system that meaningfully compares very different types of agents (a coding assistant vs. a customer support bot) required extensive iteration on dimension weights and benchmark type design.
- Async orchestration — When a user submits a new agent, we kick off profile generation, intelligence gathering, and tool verification simultaneously as fire-and-forget operations. Making sure failures in one path don't cascade to others while still providing a responsive UI was an exercise in error isolation (see the first sketch after this list).
- Supabase type safety — The generated `Json` type in Supabase is a strict union that doesn't accept `Record<string, unknown>`. We had to develop a consistent casting pattern through `unknown` to keep TypeScript happy without losing safety elsewhere (see the second sketch after this list).
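The first sketch below illustrates the error-isolation idea from the async orchestration item: each fire-and-forget task gets its own catch handler, so no single failure can cascade. All function names are stand-ins for the real pipelines.

```typescript
// Each task gets its own catch, so one failure can't take down the others
// or the 202 response that has already gone back to the user.
function onAgentSubmitted(agentId: string): void {
  void generateProfile(agentId).catch((e) =>
    console.error("profile generation failed", e),
  );
  void gatherIntelligence(agentId).catch((e) =>
    console.error("intelligence gathering failed", e),
  );
  void verifyTools(agentId).catch((e) =>
    console.error("tool verification failed", e),
  );
}

// Stand-ins for the real async pipelines.
async function generateProfile(agentId: string): Promise<void> {}
async function gatherIntelligence(agentId: string): Promise<void> {}
async function verifyTools(agentId: string): Promise<void> {}
```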
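The second sketch shows a minimal version of the `unknown` casting pattern for Supabase's generated `Json` type; `BenchmarkResult` and the import path are illustrative.

```typescript
import type { Json } from "./database.types"; // generated by `supabase gen types`

interface BenchmarkResult {
  accuracy: number;
  coherence: number;
  notes: string;
}

// Json is a recursive union (string | number | boolean | null | arrays |
// objects of Json), so a structural interface isn't directly assignable.
// Casting through unknown confines the unsafety to this one boundary.
function toJson(result: BenchmarkResult): Json {
  return result as unknown as Json;
}
```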
Accomplishments that we're proud of
- End-to-end voice evaluation pipeline — A user enters a phone number, Litmus places a real call to the agent, streams the audio through Gemini Live for a natural conversation, and then evaluates the transcript on five dimensions. The entire flow works seamlessly across three services.
- Sub-second API responses for heavy operations — Benchmarks, intelligence gathering, and profile generation all return instantly to the user while processing asynchronously in the background, with real-time UI updates via Supabase Realtime (a subscription sketch follows this list).
- Genuine tool verification — Instead of trusting self-reported capabilities, we actually validate agent integrations through Composio, bringing a layer of accountability that doesn't exist anywhere else in the ecosystem.
- A coherent multi-signal evaluation model — Combining automated benchmarks, live voice testing, web intelligence, tool verification, and community reviews into a single composite score that's actually useful for decision-making.
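A hypothetical sketch of the Supabase Realtime subscription behind those live updates; the channel, table, and event filter are stand-ins for the real schema.

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
);

// Fires whenever the background job writes new scores, so the UI can
// re-render without polling.
supabase
  .channel("evaluation-updates")
  .on(
    "postgres_changes",
    { event: "UPDATE", schema: "public", table: "evaluations" },
    (payload) => {
      console.log("fresh scores", payload.new);
    },
  )
  .subscribe();
```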
What we learned
- Gemini's native audio model is remarkably capable for real-time voice interactions — using it directly rather than a speech-to-text-to-inference-to-TTS pipeline dramatically reduced latency and improved conversation quality.
- Separating concerns pays off — Running the WebSocket server as its own service made development, debugging, and deployment significantly easier than trying to embed real-time audio handling in Next.js.
- LLM-as-judge is powerful but needs structure — Giving Gemini clear rubrics, dimension definitions, and scoring scales produces consistent, reproducible evaluations. Without that structure, scores drift significantly between runs. (An illustrative judge call follows this list.)
- Next.js 16's `after()` callbacks are a game-changer for fire-and-forget patterns — they let you return a response immediately and continue processing without needing external job queues.
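To show the kind of structure that kept our judge scores stable, here is an illustrative call that pins the rubric, the scale, and a JSON response schema. The model name, prompt, and two-dimension schema are simplified stand-ins for the real five-dimension setup.

```typescript
import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Pinning the rubric, the scale, and a response schema keeps scores
// comparable across runs.
async function judgeTranscript(transcript: string) {
  const rubric = `Score the transcript on a 0-10 scale for each dimension.
accuracy: factual correctness of every claim the agent makes.
coherence: logical flow and consistency across turns.
Return only JSON.`;

  const res = await ai.models.generateContent({
    model: "gemini-2.5-flash", // stand-in model name
    contents: `${rubric}\n\nTRANSCRIPT:\n${transcript}`,
    config: {
      responseMimeType: "application/json",
      responseSchema: {
        type: Type.OBJECT,
        properties: {
          accuracy: { type: Type.NUMBER },
          coherence: { type: Type.NUMBER },
        },
        required: ["accuracy", "coherence"],
      },
    },
  });

  return JSON.parse(res.text ?? "{}") as { accuracy: number; coherence: number };
}
```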
What's next for Litmus
- Automated regression testing — Scheduled re-benchmarks that detect when an agent's quality changes over time, with alerts for score drops.
- Custom benchmark suites — Let teams define evaluation criteria specific to their domain (legal, healthcare, finance) and run those against any agent.
- Agent API testing — Beyond voice and text, directly evaluate agent APIs with structured input/output testing and latency profiling.
- Leaderboards and trends — Public rankings by category with historical score tracking, so users can see which agents are improving and which are declining.
- Embeddable trust badges — Verified Litmus scores that agent vendors can display on their own sites, similar to app store ratings.
Built With
- composio
- gemini
- intercom
- nextjs
- plivo
- react
- supabase
- typescript
- you.com