Inspiration

I wanted to use local LLMs for coding but hit a wall: my MacBook has 8GB RAM and most capable coding models need way more context than I could fit. The biggest model I could run locally without my laptop shutting down was around 6-7B parameters, which are not really all that useful anyways. Cloud APIs solve the context problem but rack up costs fast when you're iterating. I figured there had to be a middle ground, what if a smart planner with massive context could break down tasks small enough for a local model to handle? And if this local model is reliable enough to understand complex instructions and write good code, the whole thing would cost Planner Model API costs + electricity. Since the smaller model runs for free locally, only costs electricity, does 80 percent of the work, and the planner model is an open-source model, this would cut down costs to the extent of making LLM code editing cheap and accessible to everyone.

What it does

Tracer is a terminal-based multi-agent coding system. You chat with Kimi or any model you choose (128K context) to plan your project using Socratic questioning, it asks why you want features, challenges assumptions, helps you think through edge cases. Once the scope is clear, Kimi decomposes the work into atomic tasks sized to fit Qwen3-Coder's (30 B) 4K context window. Qwen runs locally and generates the actual code. DeepSeek handles on-demand codebase analysis when you need to understand existing code.

The agents can execute shell commands (sandboxed via Anthropic's experimental runtime), read files, and automatically reflect on command outputs to continue the conversation. You can switch between local mode (free, slower) and API mode (fast, costs money) depending on whether you're exploring or shipping.

How we built it

  • Backends: Abstract LLMBackend class with implementations for local inference (llama-cpp-python) and Fireworks API. The local backend uses mmap with mlock=False to stream a 7.5GB quantized model from SSD on 8GB RAM.
  • Memory: Custom three-layer system, sliding window for recent conversation, task state for progress tracking, session facts for user preferences. Stateless APIs get context prepended to every call.
  • Orchestration: Task queue with priority sorting and dependency resolution. Tasks are validated against Qwen's context limit before queuing.
  • Sandbox: Integrated Anthropic's experimental Sandbox Runtime for safe command execution, wrapped with a Python approval layer.
  • UI: Textual TUI with two tabs (code output + task manager), real-time streaming, approval modals.

Challenges we ran into

Threading nearly killed us. Textual's async model doesn't play nice with background LLM inference. We had push_screen_wait calls from wrong threads, sync methods called without await, approval modals that never appeared. Took hours of "why is this silently failing" debugging. It was also really hard to combine memory streaming with aggressive quantization to run a 30 B parameter model on 8 GB m1 Mac.

Getting Kimi to output valid JSON for the agentic loop was frustrating. It kept adding explanation text around the JSON, or using action names that didn't match our enum. Had to write increasingly aggressive parsing with brace-counting and action name normalization. The execution command regex was breaking, so it took some time to make it stable and reliable.

Accomplishments that we're proud of

The reflection loop actually works. You ask "what functions are in this file?", Kimi outputs a grep command, it executes (after approval), and Kimi automatically analyzes the output and tells you what it found. No manual "now tell me what you see" prompting, or pasting the file into the chat, it can do that autonomously.

Running a 30B parameter model on a laptop with 8GB RAM. The mmap trick lets the OS page weights from SSD as needed. It's not fast, but it's free and it works. Just ask any LLM how feasible this is to get a sense of its non-triviality.

The hybrid mode switch. One flag toggles between local Qwen and API Qwen—same interface, same task queue, different cost/speed tradeoffs. Like for local mode, you can plan a project during night, confirm your plan and wake up to a completed project, with code in volatile memory waiting for your approval to be saved onto your main memory. If you want interactive experience, use the still low-cost API.

What we learned

  • Stateless APIs need explicit memory management. You can't assume the model remembers anything—you have to rebuild context every call and be aggressive about what to keep vs drop.
  • LLM output parsing is never as clean as the prompt suggests. Always write defensive parsers that handle markdown wrappers, extra text, and creative interpretations of your format spec.
  • Textual is powerful but the async/threading boundaries are sharp. Know exactly which thread you're on before calling anything.

What's next for Tracer

There are some bugs in code creation, I have to figure out a way to orchestrate precision edits in code and also add more bash commands to facilitate code writing in general. The exploration and code analysis system is very good and exceeded my expectations. I also want to add a way to comply with scrapping guidelines and include an option to scrape the relevant documentation and data from the web on-the-go. If there enough people love people it, I'll make it open-source for everyone to use under an Apache License.

Built With

Share this project:

Updates