# Sokoban: a spatial-reasoning benchmark for LLMs

## Inspiration

We wanted a benchmark that goes after something most text benchmarks miss: can a language model reason about physical space, not just recall facts, but build a mental model of a world and predict how its own actions change that world?

Sokoban is the perfect testbed. It's simple to state but demands real planning, and crucially it contains irreversibility. Many of our levels allow soft-locking: a careless push can shove a box into a corner or slide a ramp off its only useful square, leaving the puzzle permanently unsolvable. That forces a model to understand its surroundings and think before it acts, exactly the kind of consequence-aware reasoning we wanted to measure.

## What it does

Sokoban is a top-down puzzle played on a two-height grid. Every cell is terrain (low or high), a wall, a ramp, a box, or the player. You move N/S/E/W, drop down a ledge for free, but climb the single height step only by crossing a correctly-oriented ramp — or by standing on top of a box. Reach the goal cell to win.

Heights are $h \in {0, 2}$ (low, high), and a box is one step ($2$) tall. Your standing height is

$$ s \;=\; h_{\text{terrain}} \;+\; 2\cdot[\text{on a box}] \;\in\; {0, 2, 4}. $$

A move into an adjacent cell with surface height $h'$ is allowed when $h' \le s$ (flat ground or a free drop); a step up ($h' = s + 2$) is a cliff, blocked unless a ramp bridges it. Boxes are pushed when you're level with their base, climbed when you're at or above their top, and impassable from below.

The environment is served over a small JSON/HTTP API (/reset, /step, /close) that plugs straight into the Mesocosm / BenchAnything platform, so any LLM can be dropped in as the agent. Scoring is proportional to efficiency: solve a level in the optimal number of moves and you score $1.0$; take twice as many and you score $0.5$. We also ship a browser replay viewer (2D + a Three.js 3D view) that animates an agent's run move-by-move, with its reasoning shown at each step.

## How we built it

  • The engine (env.py). A compact rules engine over a composite integer grid (codes 0–12 encode terrain, walls, ramps by orientation, boxes, and the player at each elevation). It re-emits a rich observation each turn: an ASCII map, a natural language description with a Moves: N=…, S=… line, last_move_result, and a rolling move_history.
  • Authored, verified maps. 15 hand-designed levels from basic → boxes → ramps → challenge. A BFS solver (verify_maps.py) guarantees every map is solvable and that each map's optimal_steps equals the true shortest solution. Treating each level as a state graph $$ \sigma = (p,\; e,\; B,\; R) \quad\text{(player, elevation, box set, ramp set)}, $$ BFS computes $d^(\sigma_0)$, the optimal step count, which also defines the reward: $\text{reward} = \min(1,\; d^(\sigma_0) / \text{steps_used})$.
  • The agent harness. We run models through the Mesocosm SDK (Anthropic, OpenAI, Gemini) and locally against Ollama, with a chain-of-thought system prompt that asks the model to plan, then commit with ACTION: X.
  • The replay frontend. A JavaScript port of the engine (game.js), verified move-for-move against the Python engine via a parity test, drives both a 2D grid renderer and a 3D Three.js scene that tweens the ball up ramps and off ledges, with resizable panels and per-step reasoning.

## Challenges we ran into

The biggest one was a genuine detective story. Early runs of strong models were failing levels with optimal solutions as short as 5 moves, getting stuck in perfect oscillations like $W, E, W, E, \dots$ — and every captured reasoning field was empty. Our first hypothesis (that the move history wasn't reaching the model) turned out to be wrong; we reconstructed the exact prompt and the history was right there.

The real culprit was in the harness. Our action space was declared discrete, which made the platform force structured tool output. That gagged the model: it was forced to emit a bare {"action": "W"} with no room to reason first, directly overriding our "think, then ACTION: X" prompt. The models weren't too dumb — they were never allowed to think.

Other challenges:

  • Soft-lock-aware level design. Building puzzles that are solvable but unforgiving meant the reachable-state graph is directed — a soft-lock is a state with $d^(\sigma) = \infty$ — so BFS verification had to confirm both solvability *and fair difficulty.
  • Engine parity. Keeping the JS replay engine bit-for-bit identical to the Python rules engine across ramps, box-stacking, and ledge drops.
  • Provider rate limits. Free-tier limits (e.g. Gemini's $15$ requests/min) and token-per-minute caps aborted runs mid-episode, which we had to separate from true task failures.

## Accomplishments that we're proud of

  • We diagnosed and fixed the reasoning gag. Switching the action space from discrete to text let the model reason and emit ACTION: X, which our engine already parses.
  • We proved it with a controlled experiment. A local A/B test held the model and maps fixed and toggled only the gag; free-text reasoning matched or beat the gagged agent on every seed and turned failures into wins.
  • On the platform, Sonnet went from 8/15 to 10/15 solved, with reasoning captured on 100% of steps (up from 0%) — and the exact oscillation levels it used to fail were now solved in the optimal number of moves.
  • A polished, faithful 2D/3D replay viewer that makes an agent's spatial reasoning legible.

## What we learned

  • Interface design can dominate model capability. The single biggest lever on benchmark scores wasn't the model — it was whether the harness let it think. Forced structured output silently suppressed chain-of-thought.
  • Behavioral fingerprints are diagnostic. A clean $W, E, W, E$ cycle is the signature of a memoryless, gagged agent; reading the trajectory told us more than the aggregate score did.
  • Verify your benchmark, not just your model. BFS ground truth and Python↔JS engine parity caught bugs that would otherwise have looked like model errors.
  • Separate infrastructure failures from task failures — rate limits and API errors are not the same as a model getting a puzzle wrong.

## What's next for Sokoban

More levels and richer mechanics that push on the limits of what LLMs actually understand about space and the physical world — deeper soft-lock traps, multi-box and multi-ramp dependencies, and longer-horizon plans where a single early misstep is irreversible. The goal is a benchmark whose difficulty scales with physical reasoning depth, not just path length, so we can keep measuring how well models grasp cause, consequence, and the geometry of the worlds they're asked to act in.

Share this project:

Updates