Inspiration

"Yet you, my creator, detest and spurn me, thy creature, to whom thou art bound by ties only dissoluble by the annihilation of one of us." -Mary Shelley, Frankenstein

What if you could have a Claude Code-style CLI without internet access or API tokens?

This is the question that leafcutter aims to answer: like the eponymous ants, which haul loads many times their own weight, our software multiplies the power of tiny, local models so they can pick up this heavy mantle, now free and open source! Claude has officially helped build its own successor.

What it does

leafcutter is the happy jumper that hops around your GitHub repositories and mends your open wounds. No more maggots in these parts, for leafcutter is a scab-mending CLI that solves errors, a bit like a not-so-distant cousin of Claude Code, except this cousin is radically anarchist and fixes everything for free! Thank you, Leafcutter! Thank you! 🍃🐜

🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟🦗🦟 jumping cricket noises

How we built it

  • llama-cpp-python runs GGUF models entirely in-process, no server needed (sketch 1 below)
  • GBNF grammars constrain the model's token sampling to valid JSON tool-call schemas, solving unreliable output at the source (sketch 1)
  • Two-pass inference: the first pass decides which tool to call, the second generates the structured arguments (sketch 2)
  • Sliding context window with extractive compression keeps the most relevant lines when the token budget fills up (sketch 3)
  • Every tool call surfaces a preview and waits for explicit user approval before touching anything (sketch 4)
  • CLI built on prompt_toolkit and rich for a clean REPL experience, no web UI required
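
Sketch 1: a minimal in-process, grammar-constrained completion. The model path, prompt, tool names, and the "path" argument are illustrative, not leafcutter's actual schema; the llama-cpp-python calls themselves (Llama, LlamaGrammar.from_string, the grammar= parameter) are the real API.

```python
from llama_cpp import Llama, LlamaGrammar

# GBNF grammar restricting output to one tool-call JSON shape.
# Tool names and the "path" argument are illustrative only.
TOOL_CALL_GBNF = r'''
root ::= "{\"tool\":\"" name "\",\"path\":\"" path "\"}"
name ::= "read_file" | "write_file" | "run_tests"
path ::= [a-zA-Z0-9_./-]+
'''

# The model runs fully in-process: no server, no network, no API token.
llm = Llama(model_path="./models/smollm-135m.gguf", n_ctx=2048, verbose=False)
grammar = LlamaGrammar.from_string(TOOL_CALL_GBNF)

out = llm(
    "You are a repair agent. Emit one tool call for: fix the bug in app.py\n",
    grammar=grammar,  # sampling can only ever produce strings the grammar accepts
    max_tokens=64,
)
print(out["choices"][0]["text"])  # e.g. {"tool":"read_file","path":"app.py"}
```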
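
Sketch 2: the two-pass split as two constrained calls, where pass one picks a tool name from a closed set and pass two fills in that tool's argument grammar. The prompts, tool set, and grammars are hypothetical stand-ins.

```python
from llama_cpp import LlamaGrammar

# Pass 1: which tool? This grammar admits only bare tool names.
NAME_GBNF = 'root ::= "read_file" | "write_file" | "run_tests"'

# Pass 2: one argument grammar per tool (only read_file shown here).
ARGS_GBNF = {
    "read_file": r'root ::= "{\"path\":\"" [a-zA-Z0-9_./-]+ "\"}"',
}

def two_pass_tool_call(llm, task: str) -> tuple[str, str]:
    """Return (tool_name, args_json) for a task; prompts are hypothetical."""
    name = llm(
        f"Task: {task}\nBest tool:",
        grammar=LlamaGrammar.from_string(NAME_GBNF),
        max_tokens=8,
    )["choices"][0]["text"]
    args = llm(
        f"Task: {task}\nArguments for {name} as JSON:",
        grammar=LlamaGrammar.from_string(ARGS_GBNF[name]),
        max_tokens=64,
    )["choices"][0]["text"]
    return name, args
```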
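
Sketch 3: extractive compression for a sliding window. The word-overlap relevance score and the ~4-characters-per-token estimate are crude stand-ins for whatever leafcutter actually uses.

```python
def compress_context(lines: list[str], query: str, budget_tokens: int) -> str:
    """Keep the highest-scoring lines (naive word overlap with the query)
    until a rough token budget is spent; preserve original line order."""
    query_words = set(query.lower().split())

    def score(line: str) -> int:
        return len(query_words & set(line.lower().split()))

    # Rank lines by relevance, then greedily admit them under the budget.
    ranked = sorted(range(len(lines)), key=lambda i: score(lines[i]), reverse=True)
    kept, spent = set(), 0
    for i in ranked:
        cost = max(1, len(lines[i]) // 4)  # ~4 chars per token heuristic
        if spent + cost > budget_tokens:
            continue
        kept.add(i)
        spent += cost

    return "\n".join(lines[i] for i in sorted(kept))
```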
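
Sketch 4: the approval gate, previewing a proposed write with rich and blocking on a prompt_toolkit confirmation. The file handling is simplified to a whole-file rewrite.

```python
from pathlib import Path

from prompt_toolkit.shortcuts import confirm
from rich.console import Console
from rich.syntax import Syntax

console = Console()

def apply_with_approval(path: str, new_text: str) -> bool:
    """Show the proposed file contents and only write on explicit approval."""
    console.rule(f"proposed change to {path}")
    console.print(Syntax(new_text, "python", line_numbers=True))
    if not confirm("Apply this change?"):
        console.print("[yellow]skipped[/yellow]")
        return False
    Path(path).write_text(new_text)
    console.print("[green]applied[/green]")
    return True
```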

Challenges we ran into

  • Small models collapse into hallucinated or malformed JSON without grammar constraints — getting GBNF coverage right without adding too much latency took significant tuning
  • Raspberry Pi 3B+ has 1GB of RAM, and much less once the OS takes its share, which makes context compression genuinely painful: extractive compression alone isn't enough, and we had to be aggressive about what stays in the window
  • A 135M model's context length is tiny, so the scaffolding has to do a lot more heavy lifting than it would with a larger model

Accomplishments that we're proud of

  • A 135M-parameter model on a $35 computer, offline, can read a file, spot a bug, write a fix, and ask before applying it. End to end, that works
  • Grammar-constrained function calling turns a near-toy-sized model into something that can participate in a real agentic loop
  • The user is always in control: nothing executes without a confirmation prompt

What we learned

  • Grammar constraints are underused — constraining sampling at the token level is cleaner and more reliable than building regex parsers around broken output
  • Context management is where local agents live or die; the interesting engineering is in that layer, not in the model itself
  • A 2048-token window requires a completely different scaffolding strategy than a 128k one

What's next for leafcutter

  • OpenAI-compatible REST endpoint so leafcutter can act as a drop-in local backend for tools that already speak that protocol
  • Multi-file context so the agent can reason across a whole repository, not just file by file
  • Smarter compression to replace extractive summarization
  • Single binary distribution that works out of the box on a fresh Raspberry Pi with no Python setup required

Built With

Python, llama-cpp-python, prompt_toolkit, rich, GGUF, Raspberry Pi
