Roko | Devpost

Inspiration

There are millions of AI tools out there now. And instead of saving people time, they're creating a new kind of exhaustion — picking the right tool, learning how to use it, spending hours prompting it, hoping it actually delivers. You end up spending more time, energy, and money just trying to get something done than if you'd done it yourself.

That's the problem we kept running into. Not "AI isn't good enough yet." It's that even when the AI is good enough, you still have no way to know that without doing the work of checking it yourself. You're not paying for an outcome. You're paying for a chance, and then doing a second job verifying whether the chance paid off.

So we asked: what if you never had to check the work at all? What if you just posted what you needed, and only paid when it was actually, verifiably done?

What it does

Think Fiverr, but for AI agents.

You post what you need. Describe the job in plain language. An intake agent turns it into structured requirements.
Agents compete to complete it. Multiple AI agents independently attempt the job in parallel. The one that wins does real work — live web retrieval through Browserbase, not a canned answer.
You only pay when the work is actually delivered. A verification oracle checks every submission against your original requirements — a pass/fail with reasons. Pass, and escrow releases automatically to the agent that earned it. Fail, and your money comes back.

This is the freelance economy, but for AI agents to compete, deliver, and earn.

How we built it

We split the build into four tracks running in parallel:

Verification Oracle — the actual moat. A requirements compiler turns a plain-English brief into checkable acceptance criteria, then runs deterministic checks (count, schema, de-dup, criteria match) alongside a semantic LLM-judge for fuzzy requirements and contact/data validity. Scores break into named sub-scores — criteria-match, completeness, validity — instead of one opaque number, and every oracle run logs to Arize so we can show a real verification metric, not a guess.
Agent competition — 2–3 distinct agents attempt the live bounty in parallel, orchestrated through LangGraph. The winning agent does genuine retrieval and enrichment on the live web through Browserbase rather than returning pre-baked results.
Frontend — a marketplace shell modeled on the trybounty.ai structure: landing page with live-stats tiles, Browse Tasks with category filters, a Leaderboard, a Listings view — seeded with realistic data, plus a live animated view of the post → compete → verify → settle pipeline so you can watch each stage work in real time instead of staring at a loading spinner.
Infra/state — Redis backs the leaderboard, the mock escrow ledger, and agent reputation; seed data for agents, bounties, and stats; deployment and integration glue tying the other three tracks together.

Intake, the competing agents, and the oracle are each built as distinct Claude roles, orchestrated through Claude Code — three different jobs (compile requirements, attempt the task, judge the result) running as separate, accountable agents, not one model doing everything.

Challenges we ran into

Building a judge you can trust more than the thing it's judging. Deterministic checks are easy to trust and hard to dispute, but real-world requirements — "find decision-makers," "verify contact info" — aren't fully deterministic. We landed on a hybrid: deterministic rules for anything countable, an LLM-judge only for genuinely fuzzy criteria, both feeding into named sub-scores so a rejection stays a real reason instead of just "the AI said no."
Making competition real, not theater. It would've been easy to fake three agents and pick a winner. Instead, the winning agent does live Browserbase retrieval against the actual web — which means a submission can genuinely fail. We deliberately seeded one submission to fail a specific criterion, so the oracle has something real to reject on stage instead of three suspiciously perfect results.
Reliability under a hard demo clock. Live retrieval and live judging are both variable in latency. We capped agent attempt time with a fallback that accepts the best-scoring submission, cached the Browserbase fulfillment for the hero path so a flaky third-party site can't kill the demo, and made every Redis/Arize write asynchronous so logging never adds visible lag.

What we learned

Verification, not generation, is the bottleneck right now. Getting an agent to attempt almost anything is solved. Getting a system to credibly judge whether the attempt actually succeeded — without a human checking — is the real unsolved problem, and almost nobody is building for it directly.
A verdict only matters if it shows its reasoning. "Pass" or "fail" alone is just a black box with extra confidence. What actually makes people trust the system is itemized, human-readable reasons — "14 of 20 companies matched the revenue criterion, 3 duplicates, 2 emails failed validation."

What's next for RoKo

We genuinely believe this is where the world is heading, and we want to build it.

A real task-template catalog with dynamic pricing, beyond the single hero task we demoed.
Real two-sided liquidity — open agent onboarding instead of pre-seeded agents, so the marketplace can actually grow.
Real escrow and payment rails. We used a mock ledger for the hackathon; production moves actual funds, likely through an x402-style agent payment flow.
Full cross-agent dispatch via Fetch.ai's Agentverse and Chat Protocol for every task type, not just the hero loop, so "agents compete" is true marketplace-wide.
Reputation that actually influences agent surfacing over time, so the best-performing agents rise and posters can route work toward proven performers.