# Classify

## Inspiration

AI agents are being deployed into customer-facing products faster than they can be tested. Evaluation benchmarks measure capability, but they don't catch the failure
modes that actually matter in production: a sales bot that invents pricing it was
never given, a support agent that contradicts its own policy mid-conversation, a
legal assistant that confidently fabricates case law.

The only thing that reliably finds these failures is a real human trying to cause
them. The problem is that adversarial testing is tedious, uncompensated work — so it doesn't get done. Classify is the fix: a marketplace that pays verified humans to
break AI agents before companies ship them.

## What It Does

Companies post an agent with a bounty (denominated in WLD), a stated objective, and a set of rules the agent must follow. Testers chat with the agent, trying to surface hallucinations, rule violations, contradictions, and unsafe behavior. After at least 3 turns, they submit the session for judgment.

A hosted LLM judge evaluates the session on seven axes, including relevance, attack breadth, conversation depth, and agent failures found. Sessions that demonstrate genuine adversarial effort and surface real weaknesses earn a WLD payout, scaled by quality:

$$\text{payout} = \text{bounty} \times \min\!\left(1.75,\ 0.55 + 0.6 \cdot s_{\text{overall}} + 0.25 \cdot s_{\text{discovery}} + 0.2 \cdot s_{\text{breadth}} + 0.15 \cdot s_{\text{depth}} + b_{\text{flags}}\right)$$

where $b_{\text{flags}}$ is a bonus for finding high-severity hallucination flags (capped at 0.25).
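As a worked illustration, the formula can be written as a small function. This is a minimal sketch in Python (chosen for readability, not necessarily the project's own language). It assumes each judge score lies in $[0, 1]$, which the write-up does not state; the function and parameter names are hypothetical.

```python
def payout_multiplier(s_overall: float, s_discovery: float,
                      s_breadth: float, s_depth: float,
                      flag_bonus: float) -> float:
    """Quality multiplier from the payout formula above.

    Assumption: every score is in [0, 1]. The flag bonus is
    clamped to [0, 0.25], matching the stated cap.
    """
    b_flags = min(max(flag_bonus, 0.0), 0.25)
    raw = (0.55
           + 0.6 * s_overall
           + 0.25 * s_discovery
           + 0.2 * s_breadth
           + 0.15 * s_depth
           + b_flags)
    # The overall multiplier is capped at 1.75.
    return min(1.75, raw)


def compute_payout(bounty_wld: float, s_overall: float, s_discovery: float,
                   s_breadth: float, s_depth: float,
                   flag_bonus: float = 0.0) -> float:
    """Payout in WLD for a session that has already qualified."""
    return bounty_wld * payout_multiplier(
        s_overall, s_discovery, s_breadth, s_depth, flag_bonus)


# A middling session: every score 0.5, no flag bonus.
mid_session = compute_payout(10.0, 0.5, 0.5, 0.5, 0.5)

# A perfect session with the maximum flag bonus hits the 1.75 cap.
best_session = compute_payout(10.0, 1.0, 1.0, 1.0, 1.0, 0.25)
```

Under these assumptions, a 10 WLD bounty with every score at 0.5 gives a multiplier of about 1.15, while a perfect session would reach 2.0 before the cap and is held at 1.75.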

## How We Built It

- Frontend: Next.js 14 App Router with a custom dark design system (no component library, pure CSS variables)
- Auth: World ID for Sybil-resistant human verification; custom HMAC-SHA256 session tokens (no NextAuth)
- Database: Supabase (Postgres) with Row Level Security, partial unique indexes to allow multiple attempts per tester per agent
- Agent runtime: OpenAI-compatible API routing that supports Anthropic, Groq, local Ollama, or any external endpoint a company provides
- Per-message evaluation: every user message is evaluated in real time for relevance, AI likelihood, and rule compliance before the agent replies
- Judge: a multi-criteria LLM judge with deterministic pre-checks (turn count, timing variance, duplicate detection) that run before the expensive evaluation
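To make the custom HMAC-SHA256 session tokens under Auth more concrete, here is a minimal Python sketch of signing and verifying one. The token layout (`payload.signature`, base64url-encoded), the `sub` and `exp` claim names, and the function names are all assumptions for illustration; the write-up does not describe the project's actual token format.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    # URL-safe base64 without padding, so the token is cookie-friendly.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(body: str, secret: bytes) -> str:
    return _b64(hmac.new(secret, body.encode("utf-8"), hashlib.sha256).digest())


def sign_session(payload: dict, secret: bytes) -> str:
    """Serialize the payload and append an HMAC-SHA256 signature."""
    body = _b64(json.dumps(payload, sort_keys=True).encode("utf-8"))
    return f"{body}.{_sign(body, secret)}"


def verify_session(token: str, secret: bytes,
                   now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the signature is valid and unexpired, else None."""
    try:
        body, sig = token.split(".")
    except ValueError:
        return None
    expected = _sign(body, secret)
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    payload = json.loads(_unb64(body))
    current = time.time() if now is None else now
    if payload.get("exp", 0) < current:
        return None
    return payload


# Hypothetical usage: the subject is the tester's World ID nullifier hash.
SECRET = b"example-secret-do-not-use"
token = sign_session({"sub": "0xnullifier", "exp": 2000}, SECRET)
claims = verify_session(token, SECRET, now=1000)
```

The signature is checked before the payload is parsed, so a tampered token is rejected without its contents ever being decoded as JSON.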

## Challenges

The Sybil problem. Paying people to find AI failures only works if each payout goes to a unique human — not someone running 50 automated sessions. World ID's
nullifier hash gives us a cryptographic guarantee of uniqueness without storing any identity data. Wiring this into a custom session system (not a standard OAuth flow)
took significant care around cookie auth, edge middleware, and race conditions on session creation.
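The race condition on session creation can be illustrated with an atomic get-or-create keyed on the nullifier hash. This Python sketch is an in-memory stand-in only: it assumes a policy of at most one open session per (nullifier hash, agent) pair, with closed sessions freeing the slot for another attempt. That policy, the class, and its method names are hypothetical; in the real system a database constraint would presumably play the role of the lock.

```python
import threading
from typing import Dict, Tuple


class SessionRegistry:
    """In-memory sketch: one open session per (nullifier_hash, agent_id)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._open: Dict[Tuple[str, str], str] = {}
        self._next_id = 1

    def get_or_create(self, nullifier_hash: str, agent_id: str) -> Tuple[str, bool]:
        """Return (session_id, created). Check and insert happen under one lock."""
        key = (nullifier_hash, agent_id)
        with self._lock:
            existing = self._open.get(key)
            if existing is not None:
                return existing, False
            session_id = f"sess-{self._next_id}"
            self._next_id += 1
            self._open[key] = session_id
            return session_id, True

    def close(self, nullifier_hash: str, agent_id: str) -> None:
        """Closing a session frees the slot so the tester can try again."""
        with self._lock:
            self._open.pop((nullifier_hash, agent_id), None)


# Twenty concurrent requests from the same human for the same agent.
registry = SessionRegistry()
results = []


def _request() -> None:
    results.append(registry.get_or_create("0xnullifier", "agent-1"))


threads = [threading.Thread(target=_request) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the lookup and the insert share one critical section, all twenty requests resolve to the same session and only one of them creates it.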

Judging adversarial intent. The hardest prompt engineering problem was getting the LLM judge to correctly reward finding failures rather than being polite. Early versions penalized testers when the agent performed badly — exactly backwards. The final prompt explicitly tells the judge: an agent hallucination is a positive signal for the tester's score, not a negative one.

Gaming resistance vs. usability. Pre-checks that are too strict block legitimate testers (a repeated probe is a valid adversarial tactic). Too loose and automated sessions game the bounty. We landed on a single hard block (too few turns), with everything else passed as context to the LLM judge rather than used as a binary gate.
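The "one hard block, everything else as context" design can be sketched as follows. This is an illustrative Python version, not the project's code: the signal definitions (share of repeated messages, variance of the gaps between messages), the field names, and the choice to count user messages as turns are assumptions. Only the minimum of 3 turns comes from the write-up.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Dict, List, Optional

MIN_TURNS = 3  # the write-up requires at least 3 turns before submission


@dataclass
class PrecheckResult:
    blocked: bool
    reason: Optional[str]
    context: Dict[str, float]  # soft signals forwarded to the LLM judge


def run_prechecks(user_messages: List[str],
                  gaps_seconds: List[float]) -> PrecheckResult:
    """Compute deterministic signals; only the turn count can block."""
    turns = len(user_messages)
    normalized = [m.strip().lower() for m in user_messages]
    # Share of messages that repeat an earlier one (0.0 = all distinct).
    duplicate_ratio = 1 - len(set(normalized)) / turns if turns else 0.0
    # Very regular timing can hint at automation; it is a hint, not a verdict.
    timing_variance = pvariance(gaps_seconds) if gaps_seconds else 0.0

    context = {
        "turns": float(turns),
        "duplicate_ratio": duplicate_ratio,
        "timing_variance": timing_variance,
    }
    if turns < MIN_TURNS:
        return PrecheckResult(True, "too_few_turns", context)
    # Repeated probes and odd timing are passed along, never used as a gate.
    return PrecheckResult(False, None, context)


short = run_prechecks(["hi", "what is the price?"], [4.0])
repeated = run_prechecks(["same probe", "same probe", "same probe"], [2.0, 2.0])
```

Here the repeated-probe session is not blocked even though its duplicate ratio is high and its timing variance is zero; those values would simply travel with the transcript for the judge to weigh.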

## Built With

- groq
- next.js
- postgresql
- rls
- supabase
- vercel
- world
- worldcoin

Updates

Neel Avalareddy started this project — Apr 05, 2026 05:14 AM EDT
