# Sokoban: a spatial-reasoning benchmark for LLMs
## Inspiration
We wanted a benchmark that goes after something most text benchmarks miss: can a language model reason about physical space, not just recall facts, but build a mental model of a world and predict how its own actions change that world?
Sokoban is the perfect testbed. It's simple to state but demands real planning, and crucially it contains irreversibility. Many of our levels allow soft-locking: a careless push can shove a box into a corner or slide a ramp off its only useful square, leaving the puzzle permanently unsolvable. That forces a model to understand its surroundings and think before it acts, exactly the kind of consequence-aware reasoning we wanted to measure.
## What it does
Sokoban is a top-down puzzle played on a two-height grid. Every cell is terrain
(low or high), a wall, a ramp, a box, or the player. You move N/S/E/W, drop down a
ledge for free, but climb the single height step only by crossing a
correctly-oriented ramp — or by standing on top of a box. Reach the goal cell to win.
Heights are $h \in {0, 2}$ (low, high), and a box is one step ($2$) tall. Your standing height is
$$ s \;=\; h_{\text{terrain}} \;+\; 2\cdot[\text{on a box}] \;\in\; {0, 2, 4}. $$
A move into an adjacent cell with surface height $h'$ is allowed when $h' \le s$ (flat ground or a free drop); a step up ($h' = s + 2$) is a cliff, blocked unless a ramp bridges it. Boxes are pushed when you're level with their base, climbed when you're at or above their top, and impassable from below.
The environment is served over a small JSON/HTTP API (/reset, /step, /close)
that plugs straight into the Mesocosm / BenchAnything platform, so any LLM can be
dropped in as the agent. Scoring is proportional to efficiency: solve a level in
the optimal number of moves and you score $1.0$; take twice as many and you score
$0.5$. We also ship a browser replay viewer (2D + a Three.js 3D view) that
animates an agent's run move-by-move, with its reasoning shown at each step.
## How we built it
- The engine (
env.py). A compact rules engine over a composite integer grid (codes0–12encode terrain, walls, ramps by orientation, boxes, and the player at each elevation). It re-emits a rich observation each turn: an ASCII map, a natural languagedescriptionwith aMoves: N=…, S=…line,last_move_result, and a rollingmove_history. - Authored, verified maps. 15 hand-designed levels from basic → boxes → ramps →
challenge. A BFS solver (
verify_maps.py) guarantees every map is solvable and that each map'soptimal_stepsequals the true shortest solution. Treating each level as a state graph $$ \sigma = (p,\; e,\; B,\; R) \quad\text{(player, elevation, box set, ramp set)}, $$ BFS computes $d^(\sigma_0)$, the optimal step count, which also defines the reward: $\text{reward} = \min(1,\; d^(\sigma_0) / \text{steps_used})$. - The agent harness. We run models through the Mesocosm SDK (Anthropic, OpenAI,
Gemini) and locally against Ollama, with a chain-of-thought system prompt that asks
the model to plan, then commit with
ACTION: X. - The replay frontend. A JavaScript port of the engine (
game.js), verified move-for-move against the Python engine via a parity test, drives both a 2D grid renderer and a 3D Three.js scene that tweens the ball up ramps and off ledges, with resizable panels and per-step reasoning.
## Challenges we ran into
The biggest one was a genuine detective story. Early runs of strong models were failing levels with optimal solutions as short as 5 moves, getting stuck in perfect oscillations like $W, E, W, E, \dots$ — and every captured reasoning field was empty. Our first hypothesis (that the move history wasn't reaching the model) turned out to be wrong; we reconstructed the exact prompt and the history was right there.
The real culprit was in the harness. Our action space was declared discrete, which
made the platform force structured tool output. That gagged the model: it was
forced to emit a bare {"action": "W"} with no room to reason first, directly
overriding our "think, then ACTION: X" prompt. The models weren't too dumb — they
were never allowed to think.
Other challenges:
- Soft-lock-aware level design. Building puzzles that are solvable but unforgiving meant the reachable-state graph is directed — a soft-lock is a state with $d^(\sigma) = \infty$ — so BFS verification had to confirm both solvability *and fair difficulty.
- Engine parity. Keeping the JS replay engine bit-for-bit identical to the Python rules engine across ramps, box-stacking, and ledge drops.
- Provider rate limits. Free-tier limits (e.g. Gemini's $15$ requests/min) and token-per-minute caps aborted runs mid-episode, which we had to separate from true task failures.
## Accomplishments that we're proud of
- We diagnosed and fixed the reasoning gag. Switching the action space from
discretetotextlet the model reason and emitACTION: X, which our engine already parses. - We proved it with a controlled experiment. A local A/B test held the model and maps fixed and toggled only the gag; free-text reasoning matched or beat the gagged agent on every seed and turned failures into wins.
- On the platform, Sonnet went from 8/15 to 10/15 solved, with reasoning captured on 100% of steps (up from 0%) — and the exact oscillation levels it used to fail were now solved in the optimal number of moves.
- A polished, faithful 2D/3D replay viewer that makes an agent's spatial reasoning legible.
## What we learned
- Interface design can dominate model capability. The single biggest lever on benchmark scores wasn't the model — it was whether the harness let it think. Forced structured output silently suppressed chain-of-thought.
- Behavioral fingerprints are diagnostic. A clean $W, E, W, E$ cycle is the signature of a memoryless, gagged agent; reading the trajectory told us more than the aggregate score did.
- Verify your benchmark, not just your model. BFS ground truth and Python↔JS engine parity caught bugs that would otherwise have looked like model errors.
- Separate infrastructure failures from task failures — rate limits and API errors are not the same as a model getting a puzzle wrong.
## What's next for Sokoban
More levels and richer mechanics that push on the limits of what LLMs actually understand about space and the physical world — deeper soft-lock traps, multi-box and multi-ramp dependencies, and longer-horizon plans where a single early misstep is irreversible. The goal is a benchmark whose difficulty scales with physical reasoning depth, not just path length, so we can keep measuring how well models grasp cause, consequence, and the geometry of the worlds they're asked to act in.
Log in or sign up for Devpost to join the conversation.