Inspiration

Robot foundation models (VLAs like π0, SmolVLA, the LeRobot/SO-100 ecosystem) are bottlenecked on one boring thing: labeled outcomes. There are millions of teleop/rollout episodes on the Hugging Face Hub, but almost none carry a label answering "did the robot actually succeed at the task?" That success/failure signal is exactly what you need for reward models, eval suites, and filtering training data, and hand-labeling it is slow and expensive.

So we did what reCAPTCHA did for OCR and Duolingo did for translation: turn the labeling into a game. Show people a 5-second robot clip, ask "did it succeed?", and publish the crowd consensus as a dataset.

What it does

Rate The Robot is a crowd-judging game that turns anyone with a browser into a robotics data labeler.

  • Judge mode — watch a real robot clip ("grab the pen and put it in the holder"), hit YES / NO / skip. After you vote, the crowd's verdict is revealed.
  • A/B mode — two robots attempt the same task; pick which did it better (pairwise preference data for RLHF-style reward models).
  • Consensus engine — votes are aggregated with reliability-weighted Dawid–Skene; once a clip clears a confidence threshold it locks as ground truth and everyone who agreed gets credited retroactively.
  • Earn & spend — correct judgments earn credits + streaks, redeemable in a commissary (gift cards / cash / charity, Tremendous-style).
  • Anti-cheat — golden honeypot clips, per-player reliability EMAs, response-time floors, entropy/rate limits, and shadow-bans keep the data clean.
  • Admin + data out — a footage dashboard (rotation stats, search, leaderboards, lock/retire), one-click Hugging Face dataset import, auto-pairing, and an export pipeline that emits classifier / reward / policy / "saleable" ML datasets or raw per-clip results and the full ballot log, in CSV/JSONL/JSON.
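
The consensus step in the list above can be sketched as a reliability-weighted log-odds vote. This is a minimal illustration and not the project's actual code: the `Vote` shape, the `posteriorYes` and `decide` names, and the 0.95 lock threshold are all assumptions, and each voter's reliability is reduced to a single accuracy number rather than a full Dawid–Skene confusion matrix.

```typescript
// Hypothetical vote shape: `reliability` is the voter's estimated accuracy.
interface Vote {
  voterId: string;
  answer: "yes" | "no";
  reliability: number;
}

// Clamp so a reliability of exactly 0 or 1 cannot produce infinite log-odds.
const clamp = (p: number): number => Math.min(0.99, Math.max(0.01, p));

// Posterior probability that the true label is "yes", assuming a uniform prior
// and independent voters who are each correct with probability `reliability`.
function posteriorYes(votes: Vote[]): number {
  let logOdds = 0;
  for (const v of votes) {
    const r = clamp(v.reliability);
    const weight = Math.log(r / (1 - r));
    logOdds += v.answer === "yes" ? weight : -weight;
  }
  return 1 / (1 + Math.exp(-logOdds));
}

type Verdict = "yes" | "no" | "open";

// Lock the clip only once the posterior clears the (assumed) threshold.
function decide(votes: Vote[], lockThreshold = 0.95): Verdict {
  const p = posteriorYes(votes);
  if (p >= lockThreshold) return "yes";
  if (p <= 1 - lockThreshold) return "no";
  return "open";
}
```

Under these assumptions a voter with reliability 0.5 contributes zero weight, so random clickers cannot move a clip toward locking no matter how many ballots they cast.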

We loaded it with real footage from HuggingFaceVLA/community_dataset_v3 (SO-100/SO-101 community teleop) — pen-in-holder, pick-and-place, even "play chess."

How we built it

  • Monorepo: Turborepo + Bun. apps/{web,server,worker} + packages/{api,db,contract,auth,storage,env,config}.
  • Web: SvelteKit on Svelte 5 (runes), a hand-built dark/amber design system, Mermaid architecture diagrams.
  • API: Hono + oRPC, contract-first — a single Zod contract types the client and server end to end, no codegen drift.
  • Data: Drizzle + Postgres (HASH-partitioned votes, keyset pagination, pg_trgm search), read-replica aware (dbRead), Redis for caching + async consensus (CQRS).
  • Media: clips ingested via ffmpeg (full-episode transcode to streamable H.264 + thumbnail), stored in MinIO/S3, served directly from object storage (media never touches the API — CDN-ready).
  • The math: reliability-weighted consensus with deferred open-clip grading (credit-once, race-safe), Dawid–Skene for offline reliability, Bradley–Terry for pairwise clip strength.
  • Auth: Better Auth — frictionless guest play via signed device tokens, upgradeable to email accounts.
  • Deploy: laptop-hosted, exposed on a custom domain via a Cloudflare Tunnel.
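
To make the Bradley–Terry item above concrete, here is a small sketch that fits pairwise strengths from A/B outcomes with the standard minorize-maximize (MM) update. The `Comparison` shape, the `fitBradleyTerry` name, the iteration count, and the small pseudo-win constant are assumptions made for illustration; the project's own fitting code may differ.

```typescript
// One pairwise A/B result: `winner` was judged better than `loser`.
interface Comparison {
  winner: string;
  loser: string;
}

// Fit Bradley–Terry strengths with the MM update
//   p_i <- wins_i / sum over i's comparisons of 1 / (p_i + p_opponent)
// and renormalize so the strengths sum to 1 after every sweep.
function fitBradleyTerry(
  comparisons: Comparison[],
  iterations = 100,
): Record<string, number> {
  // Collect the distinct clip ids that appear in any comparison.
  const ids: string[] = [];
  for (const c of comparisons) {
    if (ids.indexOf(c.winner) === -1) ids.push(c.winner);
    if (ids.indexOf(c.loser) === -1) ids.push(c.loser);
  }

  let strength: Record<string, number> = {};
  for (const id of ids) strength[id] = 1 / ids.length;

  for (let iter = 0; iter < iterations; iter++) {
    const wins: Record<string, number> = {};
    const denom: Record<string, number> = {};
    for (const id of ids) {
      wins[id] = 0;
      denom[id] = 0;
    }
    for (const c of comparisons) {
      const inv = 1 / (strength[c.winner] + strength[c.loser]);
      wins[c.winner] += 1;
      denom[c.winner] += inv;
      denom[c.loser] += inv;
    }

    const next: Record<string, number> = {};
    let total = 0;
    for (const id of ids) {
      // Assumed tiny pseudo-win so a clip with zero wins never collapses to exactly 0.
      next[id] = (wins[id] + 1e-3) / denom[id];
      total += next[id];
    }
    for (const id of ids) next[id] /= total;
    strength = next;
  }
  return strength;
}
```

With two clips where A wins three of four matchups, the fitted strengths land near 0.75 and 0.25, which matches the observed win rate.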

Challenges we ran into

  • AV1 that decodes to black. Some HF clips are AV1 that libdav1d (and even Chrome) silently mis-decode — 8 good frames then a frozen black frame, zero reported errors. We added a post-transcode mean-luma guard that rejects black clips at ingest before they ever pollute the rotation.
  • Scaling the dashboard to millions of clips. The original admin list did a COUNT(*) per keystroke and the leaderboard loaded the entire corpus into JS. We rebuilt it: keyset "load more", trigram instruction search, a SQL window-function top-N leaderboard, and cached estimated stats.
  • Scoring people fairly. Open clips can't be graded at vote time — the crowd hasn't decided yet. We built deferred grading that back-credits every earlier voter the instant a clip becomes decided, computing each voter's agreement against consensus excluding themselves so nobody gets paid for a majority they alone formed.
  • The long tail of "live on a laptop": cross-origin auth cookies, a drizzle-kit false "duplicate index" caused by stale compiled .js in the schema dir, Mermaid v11's ER theming, and Bun's --hot not rebuilding new oRPC routes.
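
The "excluding themselves" rule from the fair-scoring item above can be sketched as a leave-one-out grade. This is an illustrative simplification: the `Ballot` shape, the `gradeLeaveOneOut` name, and the use of a simple weighted majority are assumptions, and it presumes one ballot per voter per clip.

```typescript
// Hypothetical ballot shape; `weight` stands in for the voter's reliability weight.
interface Ballot {
  voterId: string;
  answer: "yes" | "no";
  weight: number;
}

// For each voter, recompute the weighted majority WITHOUT that voter's ballot
// and report whether the voter agrees with it. The grade is null when the
// remaining ballots are tied or empty, so a lone voter is never graded
// against a majority they alone formed.
function gradeLeaveOneOut(ballots: Ballot[]): Record<string, boolean | null> {
  let totalYes = 0;
  let totalNo = 0;
  for (const b of ballots) {
    if (b.answer === "yes") totalYes += b.weight;
    else totalNo += b.weight;
  }

  const grades: Record<string, boolean | null> = {};
  for (const b of ballots) {
    const yesOthers = totalYes - (b.answer === "yes" ? b.weight : 0);
    const noOthers = totalNo - (b.answer === "no" ? b.weight : 0);
    if (Math.abs(yesOthers - noOthers) < 1e-9) {
      grades[b.voterId] = null; // no usable consensus without this voter
    } else {
      const othersSay = yesOthers > noOthers ? "yes" : "no";
      grades[b.voterId] = b.answer === othersSay;
    }
  }
  return grades;
}
```

A heavily weighted voter who outvotes a single dissenter is graded against that dissenter alone, so weight by itself cannot buy a "correct" mark.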

Accomplishments that we're proud of

  • A real consensus + economy loop — not a mock. Seed a crowd, run the batch, and your own past votes get genuinely scored, the leaderboard fills, clips lock.
  • End-to-end type safety from one Zod contract across three apps.
  • Ingesting a real Hugging Face VLA dataset as an actual user would — and surviving the messy reality of community data (AV1, weird tasks, huge nested repos).
  • A footage dashboard and export pipeline built to not fall over at millions of rows.
  • Defense-in-depth anti-cheat baked into the scoring math, not bolted on.
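
For readers unfamiliar with the keyset "load more" pattern that the dashboard relies on, here is an in-memory sketch of the idea. The `ClipRow` and `Cursor` shapes and the `loadMore` name are hypothetical; in Postgres the same filter is typically expressed as a row comparison such as `WHERE (created_at, id) < ($1, $2) ORDER BY created_at DESC, id DESC LIMIT n`, which is an assumption about the query shape rather than the project's actual SQL.

```typescript
// Hypothetical row and cursor shapes; createdAt is epoch milliseconds.
interface ClipRow {
  id: number;
  createdAt: number;
}
interface Cursor {
  createdAt: number;
  id: number;
}

// True when `row` sorts strictly after `cursor` in (createdAt DESC, id DESC) order.
function isAfter(row: ClipRow, cursor: Cursor): boolean {
  return (
    row.createdAt < cursor.createdAt ||
    (row.createdAt === cursor.createdAt && row.id < cursor.id)
  );
}

// Return one page plus the cursor for the next page (null when exhausted).
// Unlike OFFSET, the cursor names the last row seen, so the cost of a page
// does not grow with how deep the user has scrolled.
function loadMore(
  rows: ClipRow[],
  cursor: Cursor | null,
  limit: number,
): { page: ClipRow[]; next: Cursor | null } {
  const ordered = rows
    .slice()
    .sort((a, b) => b.createdAt - a.createdAt || b.id - a.id);
  const remaining = cursor ? ordered.filter((r) => isAfter(r, cursor)) : ordered;
  const page = remaining.slice(0, limit);
  const last = page[page.length - 1];
  const next =
    remaining.length > limit && last
      ? { createdAt: last.createdAt, id: last.id }
      : null;
  return { page, next };
}
```

Including `id` as a tiebreaker matters: without it, rows that share a timestamp could be skipped or repeated across pages.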

What we learned

  • Crowd labeling lives or dies on trust — reliability weighting + honeypots matter more than raw vote count.
  • "Just decode the video" is a lie at scale; codec reality (AV1/libdav1d) will humble you, so validate the decoded output instead of trusting the decoder.
  • Keyset pagination + trigram indexes + windowed SQL turn "works in the demo" into "works at a million rows."
  • Contract-first APIs (oRPC + Zod) erase a whole class of client/server bugs.
  • Tunneling a laptop onto a real domain is mostly DNS patience, not magic.
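
The "validate outputs" lesson above boils down to a check like the following mean-luma guard. This is a sketch under stated assumptions: frames are taken to be already sampled from the transcoded clip as 8-bit luma (Y) planes, and the `meanLuma` and `looksBlack` names plus both thresholds are illustrative values, not the ones used at ingest.

```typescript
// Mean of an 8-bit luma (Y) plane; 0 is black, 255 is white.
function meanLuma(frame: Uint8Array): number {
  if (frame.length === 0) return 0;
  const sum = frame.reduce((acc, px) => acc + px, 0);
  return sum / frame.length;
}

// Reject a clip when too many of its sampled frames are near-black.
// `blackLuma` and `maxBlackFraction` are assumed defaults for illustration.
function looksBlack(
  frames: Uint8Array[],
  blackLuma = 20,
  maxBlackFraction = 0.5,
): boolean {
  if (frames.length === 0) return true; // nothing decoded at all: reject
  let black = 0;
  for (const frame of frames) {
    if (meanLuma(frame) < blackLuma) black++;
  }
  return black / frames.length > maxBlackFraction;
}
```

Sampling frames across the whole clip, rather than only the first few, is what catches the "8 good frames then black" failure, since the opening frames alone would pass.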

What's next for Rate The Robot

  • Active learning: stop serving clips at random — prioritize the ones where the crowd is most split or the model is least certain.
  • Model-in-the-loop: pre-label with a VLA and let humans adjudicate only the disagreements (we are targeting roughly 10× throughput).
  • Pairwise leaderboards & Elo for robots/policies, powered by the Bradley–Terry strengths we already compute.
  • Real payouts (Tremendous) and mobile-first judging.
  • More datasets beyond community_dataset_v3, with per-embodiment and per-task success dashboards.
  • Always-on hosting so it doesn't depend on someone's laptop staying awake.
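
One simple way to approach the active-learning item above is to rank clips by the entropy of their current vote split. This is a sketch of one possible heuristic, not a committed design: the `ClipTally` shape, the function names, and the Laplace smoothing are assumptions, and it ignores model uncertainty entirely.

```typescript
// Hypothetical per-clip vote counts.
interface ClipTally {
  clipId: string;
  yes: number;
  no: number;
}

// Binary entropy (in bits) of the Laplace-smoothed yes-rate.
// 1.0 means the crowd is evenly split; values near 0 mean near-unanimous.
function voteEntropy(yes: number, no: number): number {
  const p = (yes + 1) / (yes + no + 2);
  return -(p * Math.log2(p) + (1 - p) * Math.log2(1 - p));
}

// Serve the most contested clips first; the input array is not mutated.
function prioritize(clips: ClipTally[]): ClipTally[] {
  return clips
    .slice()
    .sort((a, b) => voteEntropy(b.yes, b.no) - voteEntropy(a.yes, a.no));
}
```

Because of the smoothing, a clip with no votes scores the same as an evenly split one, so fresh clips still get served alongside contested ones.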
