Behavior Review Sandbox

Hina Nakahira posted an update — Oct 12, 2025 04:59 PM EDT

@Judges!

We want to brag that we improved Qwen's performance score to match Gemini 2.5 Flash while cutting cost by 27%!

This is the result of less than one day of reinforcement learning, and we see a clear path for it to get much better over time.

Also shoutout to:

@Weave & Serverless RL team, we didn't even know what RL was. How the heck did we make this RL result happen?! It's crazy how easy and intuitive it was to monitor, evaluate, and train models. OpenPipe's docs were very beginner-friendly, which is why we were able to learn the basic concepts of RL in a day and implement the whole thing.

@Browserbase, we're surprised at how easily we can operate a browser in natural language with Stagehand. We didn't need an extra Playwright codebase!!!

@Daytona, you made this product possible. Without Daytona, we wouldn't have decided to execute code from strangers (hackathon attendees). We will definitely keep using Daytona daily for this behavior-driven review.

@Tavily, our agents can fetch all of the small but important latest updates to frameworks and libraries. We're amazed that it caught news we hadn't even noticed yet. It significantly improves the quality of review. Tavily is definitely an irreplaceable piece of this product.
