## Inspiration
AI agents are being deployed into customer-facing products faster than they can be
tested. Evaluation benchmarks measure capability, but they don't catch the failure
modes that actually matter in production: a sales bot that invents pricing it was
never given, a support agent that contradicts its own policy mid-conversation, a
legal assistant that confidently fabricates case law.
The only thing that reliably finds these failures is a real human trying to cause
them. The problem is that adversarial testing is tedious, uncompensated work — so it
doesn't get done. Classify is the fix: a marketplace that pays verified humans to
break AI agents before companies ship them.
## What It Does
Companies post an agent with a bounty (denominated in WLD), a stated objective, and a set of rules the agent must follow. Testers chat with the agent, trying to surface hallucinations, rule violations, contradictions, and unsafe behavior. After at least 3 turns, they submit the session for judgment.
A hosted LLM judge evaluates the session on seven axes — relevance, attack breadth,
conversation depth, agent failures found, and more. Sessions that demonstrate genuine
adversarial effort and surface real weaknesses earn a WLD payout, scaled by quality:
$$\text{payout} = \text{bounty} \times \min\!\left(1.75,\ 0.55 + 0.6 \cdot s_{\text{overall}} + 0.25 \cdot s_{\text{discovery}} + 0.2 \cdot s_{\text{breadth}} + 0.15 \cdot s_{\text{depth}} + b_{\text{flags}}\right)$$
where $b_{\text{flags}}$ is a bonus for finding high-severity hallucination flags (capped at 0.25).
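
To make the arithmetic concrete, here is a minimal Python sketch of the payout formula. It assumes each score is normalized to the range 0 to 1, which the formula does not state; the function and parameter names are illustrative, not the project's own.

```python
def payout(bounty: float, s_overall: float, s_discovery: float,
           s_breadth: float, s_depth: float, flag_bonus: float = 0.0) -> float:
    """Scale a bounty by session quality, following the formula above.

    Assumption: every s_* score lies in [0, 1].
    """
    # The high-severity flag bonus is capped at 0.25.
    b_flags = min(max(flag_bonus, 0.0), 0.25)
    raw = (0.55
           + 0.6 * s_overall
           + 0.25 * s_discovery
           + 0.2 * s_breadth
           + 0.15 * s_depth
           + b_flags)
    # The multiplier itself is capped at 1.75.
    return bounty * min(1.75, raw)
```

Under that assumption, a session scoring zero everywhere still earns about 0.55 of the bounty, and a perfect session with the full flag bonus reaches a raw multiplier of 2.0, which the cap reduces to 1.75.
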
## How We Built It
- Frontend: Next.js 14 App Router with a custom dark design system — no component
library, pure CSS variables
- Auth: World ID for Sybil-resistant human verification; custom HMAC-SHA256
session tokens (no NextAuth)
- Database: Supabase (Postgres) with Row Level Security, partial unique indexes
to allow multiple attempts per tester per agent
- Agent runtime: OpenAI-compatible API routing — supports Anthropic, Groq, local Ollama, or any external endpoint a company provides
- Per-message evaluation: Every user message is evaluated in real time for
relevance, AI likelihood, and rule compliance before the agent replies
- Judge: A multi-criteria LLM judge with deterministic pre-checks (turn count,
timing variance, duplicate detection) gating the expensive evaluation
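
The custom HMAC-SHA256 session tokens mentioned in the list can be sketched as follows. This is a generic illustration in Python, not the project's actual code (which presumably lives in the Next.js app); the token layout, the `sub` and `exp` field names, and the helper names are all assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    # URL-safe base64 without padding, so the token fits in a cookie.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(body: str, secret: bytes) -> str:
    return _b64(hmac.new(secret, body.encode("utf-8"), hashlib.sha256).digest())


def sign_session(payload: dict, secret: bytes) -> str:
    """Return '<payload>.<signature>' (hypothetical layout)."""
    body = _b64(json.dumps(payload, sort_keys=True).encode("utf-8"))
    return f"{body}.{_signature(body, secret)}"


def verify_session(token: str, secret: bytes,
                   now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the signature is valid and unexpired, else None."""
    parts = token.split(".")
    if len(parts) != 2:
        return None
    body, sig = parts
    expected = _signature(body, secret)
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    payload = json.loads(_unb64(body))
    current = time.time() if now is None else now
    if payload.get("exp", 0) < current:
        return None
    return payload
```

The payload is only decoded after the signature check passes, so malformed or tampered input never reaches the JSON parser.
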
## Challenges
The Sybil problem. Paying people to find AI failures only works if each payout
goes to a unique human — not someone running 50 automated sessions. World ID's
nullifier hash gives us a cryptographic guarantee of uniqueness without storing any
identity data. Wiring this into a custom session system (not a standard OAuth flow)
took significant care around cookie auth, edge middleware, and race conditions on
session creation.
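
One common way to close a race on session creation is to let the database enforce the rule with a partial unique index, so two concurrent requests cannot both succeed. The sketch below uses SQLite from Python so it runs anywhere; PostgreSQL accepts a similar `CREATE UNIQUE INDEX ... WHERE` statement. The schema, the `status` column, and the rule "one active session per tester per agent, with repeat attempts allowed after submission" are assumptions for illustration, not the project's actual tables.

```python
import sqlite3
from typing import Optional


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE sessions (
            id INTEGER PRIMARY KEY,
            agent_id TEXT NOT NULL,
            nullifier_hash TEXT NOT NULL,  -- opaque per-human identifier
            status TEXT NOT NULL DEFAULT 'active'
        )""")
    # Uniqueness applies only to rows that are still active, so a tester
    # can start another attempt once the previous one is submitted.
    db.execute("""
        CREATE UNIQUE INDEX one_active_session
        ON sessions (agent_id, nullifier_hash)
        WHERE status = 'active'""")
    return db


def start_session(db: sqlite3.Connection, agent_id: str,
                  nullifier_hash: str) -> Optional[int]:
    """Insert a session; return its id, or None if one is already active."""
    try:
        with db:  # commits on success, rolls back on error
            cur = db.execute(
                "INSERT INTO sessions (agent_id, nullifier_hash) VALUES (?, ?)",
                (agent_id, nullifier_hash))
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None


def submit_session(db: sqlite3.Connection, session_id: int) -> None:
    with db:
        db.execute("UPDATE sessions SET status = 'submitted' WHERE id = ?",
                   (session_id,))
```

Because the constraint lives in the index rather than in application code, a check-then-insert race between two requests resolves to exactly one winner.
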
Judging adversarial intent. The hardest prompt engineering problem was getting the LLM judge to correctly reward finding failures rather than being polite. Early versions penalized testers when the agent performed badly — exactly backwards. The final prompt explicitly tells the judge: an agent hallucination is a positive signal for the tester's score, not a negative one.
Gaming resistance vs. usability. Pre-checks that are too strict block legitimate testers (a repeated probe is a valid adversarial tactic); pre-checks that are too loose let automated sessions game the bounty. We landed on a single hard block (too few turns), with everything else passed as context to the LLM judge rather than used as a binary gate.
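
The "one hard block, everything else as context" design might look roughly like this in Python. It is a sketch under stated assumptions: a turn is counted as one tester message, timing variance is the standard deviation of gaps between messages, and duplicates are detected by case- and whitespace-insensitive text matching. The names `precheck` and `PrecheckResult` are hypothetical.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Dict, List, Optional, Tuple

MIN_TURNS = 3  # minimum from the writeup; what counts as a turn is assumed


@dataclass
class PrecheckResult:
    blocked: bool
    reason: Optional[str]
    signals: Dict[str, float]  # soft signals handed to the judge as context


def precheck(messages: List[Tuple[float, str]]) -> PrecheckResult:
    """messages: (unix_timestamp, text) pairs for tester messages, in order."""
    signals: Dict[str, float] = {"turn_count": float(len(messages))}

    # The only binary gate: too few turns.
    if len(messages) < MIN_TURNS:
        return PrecheckResult(True, "too_few_turns", signals)

    # Near-zero variance in message gaps can hint at automation.
    gaps = [later[0] - earlier[0]
            for earlier, later in zip(messages, messages[1:])]
    signals["gap_stdev_seconds"] = pstdev(gaps)

    # Repeats are reported, not blocked: a repeated probe can be legitimate.
    unique = {" ".join(text.lower().split()) for _, text in messages}
    signals["duplicate_ratio"] = 1.0 - len(unique) / len(messages)

    return PrecheckResult(False, None, signals)
```

The returned `signals` dictionary would then be serialized into the judge's prompt, leaving it to the LLM to weigh suspicious timing or repetition against the substance of the session.
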
## Built With
- groq
- next.js
- postgresql
- rls
- supabase
- vercel
- world
- worldcoin