Inspiration
Robot foundation models (VLAs like π0, SmolVLA, the LeRobot/SO-100 ecosystem) are bottlenecked on one boring thing: labeled outcomes. There are millions of teleop/rollout episodes on the Hugging Face Hub, but almost none carry a label answering the basic question: did the robot actually succeed at the task? That success/failure signal is exactly what you need for reward models, eval suites, and filtering training data, and hand-labeling it is slow and expensive.
So we did what reCAPTCHA did for OCR and Duolingo did for translation: turn the labeling into a game. Show people a 5-second robot clip, ask "did it succeed?", reward the players who judge well, and publish the crowd consensus as a dataset.
What it does
Rate The Robot is a crowd-judging game that turns anyone with a browser into a robotics data labeler.
- Judge mode — watch a real robot clip ("grab the pen and put it in the holder"), hit YES / NO / skip. After you vote, the crowd's verdict is revealed.
- A/B mode — two robots attempt the same task; pick which did it better (pairwise preference data for RLHF-style reward models).
- Consensus engine — votes are aggregated with reliability-weighted Dawid–Skene; once a clip clears a confidence threshold it locks as ground truth and everyone who agreed gets credited retroactively.
- Earn & spend — correct judgments earn credits + streaks, redeemable in a commissary (gift cards / cash / charity, Tremendous-style).
- Anti-cheat — golden honeypot clips, per-player reliability EMAs, response-time floors, entropy/rate limits, and shadow-bans keep the data clean.
- Admin + data out — a footage dashboard (rotation stats, search, leaderboards, lock/retire), one-click Hugging Face dataset import, auto-pairing, and an export pipeline that emits classifier / reward / policy / "saleable" ML datasets or raw per-clip results and the full ballot log, in CSV/JSONL/JSON.
We loaded it with real footage from HuggingFaceVLA/community_dataset_v3 (SO-100/SO-101 community teleop) — pen-in-holder, pick-and-place, even "play chess."
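To make the consensus engine concrete, here is a minimal sketch of reliability-weighted aggregation with a lock threshold. It assumes each voter has a known reliability (the probability that their vote is correct) and sums log-odds under a uniform prior, which is a simplified one-coin reading of Dawid–Skene. The type names, the `aggregate` function, and the 0.95 threshold are illustrative assumptions, not the project's actual code.

```typescript
// Hypothetical shapes for illustration; not the project's schema.
interface Vote {
  voterId: string;
  success: boolean;    // true = "YES, the robot succeeded"
  reliability: number; // assumed probability this voter is correct, in (0, 1)
}

interface Verdict {
  pSuccess: number;      // posterior probability that the clip is a success
  locked: boolean;       // true once confidence clears the threshold
  label: boolean | null; // locked label, or null while the clip is still open
}

function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

// Sum each vote's log-odds weight; a reliability of 0.5 contributes nothing.
function aggregate(votes: Vote[], lockThreshold = 0.95): Verdict {
  let logOdds = 0; // uniform prior: success and failure equally likely
  for (const v of votes) {
    const r = clamp(v.reliability, 0.01, 0.99); // avoid infinite weights
    const weight = Math.log(r / (1 - r));
    logOdds += v.success ? weight : -weight;
  }
  const pSuccess = 1 / (1 + Math.exp(-logOdds));
  const confidence = Math.max(pSuccess, 1 - pSuccess);
  const locked = confidence >= lockThreshold;
  return { pSuccess, locked, label: locked ? pSuccess >= 0.5 : null };
}
```

Under these assumptions, three agreeing voters with reliability 0.9 are enough to lock a clip, while one YES and one NO from equally reliable voters leave it open at a posterior of 0.5.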
How we built it
- Monorepo: Turborepo + Bun. apps/{web,server,worker} + packages/{api,db,contract,auth,storage,env,config}.
- Web: SvelteKit on Svelte 5 (runes), a hand-built dark/amber design system, Mermaid architecture diagrams.
- API: Hono + oRPC, contract-first — a single Zod contract types the client and server end to end, no codegen drift.
- Data: Drizzle + Postgres (HASH-partitioned votes, keyset pagination, pg_trgm search), read-replica aware (dbRead), Redis for caching + async consensus (CQRS).
- Media: clips ingested via ffmpeg (full-episode transcode to streamable H.264 + thumbnail), stored in MinIO/S3, served directly from object storage (media never touches the API — CDN-ready).
- The math: reliability-weighted consensus with deferred open-clip grading (credit-once, race-safe), Dawid–Skene for offline reliability, Bradley–Terry for pairwise clip strength.
- Auth: Better Auth — frictionless guest play via signed device tokens, upgradeable to email accounts.
- Deploy: laptop-hosted, exposed on a custom domain via a Cloudflare Tunnel.
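As an illustration of the Bradley–Terry step in "The math", the sketch below fits clip strengths from A/B outcomes with the standard minorization-maximization update, p_i = W_i / Σ_j n_ij / (p_i + p_j). The `Comparison` type and `bradleyTerry` function are hypothetical names, and the sketch is unregularized: a clip that never wins ends up with strength 0, so a real pipeline would likely want a prior or pseudo-counts for sparse data.

```typescript
// Hypothetical record of one A/B vote; not the project's schema.
interface Comparison {
  winner: string; // clip id the player preferred
  loser: string;  // clip id the player rejected
}

// Fit Bradley–Terry strengths with the MM update, normalized to sum to 1.
function bradleyTerry(comparisons: Comparison[], iterations = 50): Record<string, number> {
  const wins: Record<string, number> = {};
  for (const c of comparisons) {
    wins[c.winner] = (wins[c.winner] ?? 0) + 1;
    wins[c.loser] = wins[c.loser] ?? 0;
  }
  const ids = Object.keys(wins);
  let strength: Record<string, number> = {};
  for (const id of ids) strength[id] = 1 / ids.length;

  for (let iter = 0; iter < iterations; iter++) {
    // denom[i] accumulates 1 / (p_i + p_j) over every comparison involving i
    const denom: Record<string, number> = {};
    for (const id of ids) denom[id] = 0;
    for (const c of comparisons) {
      const share = 1 / (strength[c.winner] + strength[c.loser]);
      denom[c.winner] += share;
      denom[c.loser] += share;
    }
    const next: Record<string, number> = {};
    let total = 0;
    for (const id of ids) {
      next[id] = wins[id] / denom[id];
      total += next[id];
    }
    for (const id of ids) next[id] /= total; // keep strengths on a fixed scale
    strength = next;
  }
  return strength;
}
```

With these definitions, the fitted probability that clip A beats clip B is strength[A] / (strength[A] + strength[B]).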
Challenges we ran into
- AV1 that decodes to black. Some HF clips are AV1 that libdav1d (and even Chrome) silently mis-decode — 8 good frames then a frozen black frame, zero reported errors. We added a post-transcode mean-luma guard that rejects black clips at ingest before they ever pollute the rotation.
- Scaling the dashboard to millions of clips. The original admin list did a COUNT(*) per keystroke and the leaderboard loaded the entire corpus into JS. We rebuilt it: keyset "load more", trigram instruction search, a SQL window-function top-N leaderboard, and cached estimated stats.
- Scoring people fairly. Open clips can't be graded at vote time — the crowd hasn't decided yet. We built deferred grading that back-credits every earlier voter the instant a clip becomes decided, computing each voter's agreement against consensus excluding themselves so nobody gets paid for a majority they alone formed.
- The long tail of "live on a laptop": cross-origin auth cookies, a drizzle-kit false "duplicate index" caused by stale compiled .js in the schema dir, Mermaid v11's ER theming, and Bun's --hot not rebuilding new oRPC routes.
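The leave-one-out idea behind deferred grading can be sketched in a few lines. This is a simplified illustration that uses a weighted majority rather than the full consensus model, and it leaves out the credit-once and race-safety concerns entirely. `Ballot` and `gradeLeaveOneOut` are hypothetical names.

```typescript
// Hypothetical ballot shape; weight stands in for the voter's reliability weight.
interface Ballot {
  voterId: string;
  success: boolean; // true = voted YES
  weight: number;   // assumed non-negative
}

// For each voter: did they agree with the weighted majority of everyone else?
// Returns null when the other voters are tied, so no credit is decided.
function gradeLeaveOneOut(ballots: Ballot[]): Record<string, boolean | null> {
  // Signed tally over all ballots: positive means the crowd leans YES.
  let total = 0;
  for (const b of ballots) total += b.success ? b.weight : -b.weight;

  const results: Record<string, boolean | null> = {};
  for (const b of ballots) {
    // Remove this voter's own contribution before judging them.
    const others = total - (b.success ? b.weight : -b.weight);
    if (Math.abs(others) < 1e-9) {
      results[b.voterId] = null;
    } else {
      results[b.voterId] = (others > 0) === b.success;
    }
  }
  return results;
}
```

The design point shows up with one heavy voter: if a YES ballot of weight 5 outvotes two NO ballots of weight 1, the clip's overall tally is YES, but the heavy voter is graded against the remaining NO majority and earns nothing for a verdict they alone produced.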
Accomplishments that we're proud of
- A real consensus + economy loop — not a mock. Seed a crowd, run the batch, and your own past votes get genuinely scored, the leaderboard fills, clips lock.
- End-to-end type safety from one Zod contract across three apps.
- Ingesting a real Hugging Face VLA dataset as an actual user would — and surviving the messy reality of community data (AV1, weird tasks, huge nested repos).
- A footage dashboard and export pipeline built to not fall over at millions of rows.
- Defense-in-depth anti-cheat baked into the scoring math, not bolted on.
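For readers unfamiliar with keyset pagination, the "load more" pattern behind the dashboard can be sketched in memory. The real version would run in Postgres; one common way to express it there is a row-value comparison such as `WHERE (created_at, id) < ($1, $2) ORDER BY created_at DESC, id DESC LIMIT $3`. The `Clip`, `Cursor`, and `loadMore` names below, and the choice of sort columns, are assumptions for illustration.

```typescript
// Hypothetical row and cursor shapes; the sort key is (createdAt DESC, id DESC).
interface Clip {
  id: number;
  createdAt: number; // e.g. epoch milliseconds
}

interface Cursor {
  createdAt: number;
  id: number;
}

interface Page {
  items: Clip[];
  next: Cursor | null; // null when there is nothing more to load
}

// True when the clip sorts strictly after the cursor in (createdAt DESC, id DESC) order.
function isAfterCursor(c: Clip, cur: Cursor): boolean {
  return c.createdAt < cur.createdAt || (c.createdAt === cur.createdAt && c.id < cur.id);
}

// In-memory stand-in for the indexed query: resume from the cursor, never from an offset.
function loadMore(clips: Clip[], cursor: Cursor | null, limit: number): Page {
  const sorted = clips.slice().sort((a, b) => b.createdAt - a.createdAt || b.id - a.id);
  const rest = cursor === null ? sorted : sorted.filter((c) => isAfterCursor(c, cursor));
  const items = rest.slice(0, limit);
  const last = items[items.length - 1];
  const next = rest.length > limit && last !== undefined
    ? { createdAt: last.createdAt, id: last.id }
    : null;
  return { items, next };
}
```

Because each page resumes from the last row seen instead of counting rows to skip, the cost of a page does not grow with how deep into the list the admin has scrolled, and the unique `id` tiebreaker keeps rows with equal timestamps from being skipped or repeated.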
What we learned
- Crowd labeling lives or dies on trust — reliability weighting + honeypots matter more than raw vote count.
- "Just decode the video" is a lie at scale; codec reality (AV1/libdav1d) will humble you, so validate outputs, don't trust the decoder.
- Keyset pagination + trigram indexes + windowed SQL turn "works in the demo" into "works at a million rows."
- Contract-first APIs (oRPC + Zod) erase a whole class of client/server bugs.
- Tunneling a laptop onto a real domain is mostly DNS patience, not magic.
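The "validate outputs" lesson can be sketched as a mean-luma check over frames sampled from the transcoded clip. This assumes the frames have already been decoded to 8-bit luma (Y) planes by some other step, which the sketch does not cover. The function names, the black level of 20 (limited-range video puts black at Y = 16), and the 50% cutoff are illustrative assumptions rather than the project's actual thresholds.

```typescript
// Average brightness of one frame's luma plane, on a 0..255 scale.
function meanLuma(yPlane: Uint8Array): number {
  if (yPlane.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < yPlane.length; i++) sum += yPlane[i];
  return sum / yPlane.length;
}

// Reject a clip when too many sampled frames are (near) black.
// Frames should be sampled across the whole clip, since a mis-decoded clip
// can start with a few good frames before freezing on black.
function looksBlack(frames: Uint8Array[], blackLevel = 20, maxBlackFraction = 0.5): boolean {
  if (frames.length === 0) return true; // nothing decoded at all: treat as bad
  let black = 0;
  for (const frame of frames) {
    if (meanLuma(frame) <= blackLevel) black++;
  }
  return black / frames.length > maxBlackFraction;
}
```

Checking the decoded pixels rather than the decoder's exit status is the point: a clip that decodes "successfully" to black reports no error, so only the output itself reveals the problem.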
What's next for Rate The Robot
- Active learning: stop serving clips at random — prioritize the ones where the crowd is most split or the model is least certain.
- Model-in-the-loop: pre-label with a VLA, let humans adjudicate disagreements (our target is roughly 10× throughput).
- Pairwise leaderboards & Elo for robots/policies, powered by the Bradley–Terry strengths we already compute.
- Real payouts (Tremendous) and mobile-first judging.
- More datasets beyond community_dataset_v3, with per-embodiment and per-task success dashboards.
- Always-on hosting so it doesn't depend on someone's laptop staying awake.
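The active-learning item could start from something as simple as ranking open clips by how split the crowd is. The sketch below scores each clip by the binary entropy of its smoothed YES rate and serves the most uncertain clips first. `ClipStats`, `uncertainty`, and `prioritize` are hypothetical names, and using raw vote counts instead of reliability-weighted posteriors or model confidence is a simplifying assumption.

```typescript
// Hypothetical per-clip vote tally.
interface ClipStats {
  clipId: string;
  yes: number;
  no: number;
}

// Entropy in bits of a yes/no distribution: 1 at p = 0.5, 0 at p = 0 or 1.
function binaryEntropy(p: number): number {
  if (p <= 0 || p >= 1) return 0;
  return -(p * Math.log(p) + (1 - p) * Math.log(1 - p)) / Math.LN2;
}

// Laplace smoothing (+1 / +2) makes unseen clips maximally uncertain.
function uncertainty(c: ClipStats): number {
  const p = (c.yes + 1) / (c.yes + c.no + 2);
  return binaryEntropy(p);
}

// Most contested clips first; the input array is left untouched.
function prioritize(clips: ClipStats[]): ClipStats[] {
  return clips.slice().sort((a, b) => uncertainty(b) - uncertainty(a));
}
```

A clip with a 5-to-5 split or no votes at all would be served ahead of one the crowd has already voted 10-to-0 on, which concentrates human attention where another vote changes the most.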