Inspiration
As part of my research at the University of Michigan, I work on coding agents, especially memory systems for long-running agents and better context management. A typical experiment gives me a large batch of trajectories after I change some part of an agent or its tooling.
The usual benchmark summary only tells part of the story. It tells me how many runs were submitted, errored out, resolved, or stayed unresolved, but not why they failed. The obvious next step is to ask a strong coding model like Codex or Claude Code to analyze the trajectories directly. In practice, that works poorly when the runs are long, repetitive, and spread across a big batch. The signal gets buried in noise, context becomes hard to manage, and the analysis turns shallow or brittle.
I built TraceForge to solve that problem. Instead of forcing an LLM to read raw trajectories directly, TraceForge parses them ahead of time, builds a structured graph over the batch, clusters recurring failure patterns, and produces compact evidence packs plus reusable memory patches so the model can reason over the important parts instead of drowning in logs.
What it does
TraceForge is a Jac-native batch failure compiler for coding-agent trajectories.
It ingests a batch of trajectories, parses each run into structured steps, extracts files, errors, tests, and outcomes, groups recurring failure motifs, identifies likely critical steps, and generates compact evidence packs plus reusable memory patches.
The key idea is simple:
same failed runs, same outer model, better context.
On the same 100-trajectory batch, TraceForge reduced reported total token usage from 165,358 to 94,890, roughly a 43% reduction, by replacing raw-log analysis with structured evidence-pack analysis. That is not just a cost win. It is a context-management win: the model sees a more curated representation of the batch, so the analysis becomes easier, sharper, and more scalable.
A memory patch is reusable guidance extracted from repeated failures, such as an AGENTS.md rule or patch idea that helps future agents avoid the same mistake.
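To make that concrete, here is an illustrative memory patch in Python-dict form. Every field name here is my shorthand for this post, not TraceForge's actual schema:

```python
# Illustrative only: what a memory patch might carry.
# All field names are hypothetical shorthand, not the real schema.
memory_patch = {
    "motif": "edits applied to a stale in-context copy of a file",
    "evidence_runs": ["run_014", "run_037", "run_082"],  # hypothetical run IDs
    "agents_md_rule": (
        "Re-read a file immediately before editing it; never reuse "
        "file contents cached earlier in the conversation."
    ),
}
```

The point is that the guidance survives the batch: the same rule can be dropped into a future agent's AGENTS.md so the next hundred runs do not repeat the mistake.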
How I built it
TraceForge is a Jac-first system with a terminal-first operator workflow centered on the `traceforge` CLI.
At a high level, the system works in four stages, sketched in code right after this list:
- Ingest trajectories from a benchmark run or experiment batch
- Parse and structure each run into typed artifacts such as steps, files, tests, errors, and outcomes
- Build graph-backed analysis objects so repeated patterns can be analyzed across runs
- Generate evidence packs and memory patches so an LLM can reason over the batch more effectively than if it were reading raw logs directly
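As a rough sketch of what the parse stage produces, here is a simplified version of the typed artifacts in Python. The names and fields are illustrative, not TraceForge's exact types:

```python
from dataclasses import dataclass, field

# Minimal sketch of the typed artifacts a parsed run might carry.
# These are illustrative stand-ins, not TraceForge's real classes.

@dataclass
class Step:
    index: int          # position of the step in the trajectory
    action: str         # e.g. a shell command or a file edit
    observation: str    # what the environment returned

@dataclass
class ParsedRun:
    run_id: str
    outcome: str        # e.g. "resolved", "unresolved", "errored"
    steps: list[Step] = field(default_factory=list)
    files_touched: list[str] = field(default_factory=list)
    tests_run: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
```

Once a run is in this shape, the later stages can cluster and rank across the batch instead of re-reading raw logs.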
The implementation includes:
- a mini-SWE-agent-compatible parser for fields like `info`, `messages`, `trajectory_format`, and tool-call metadata (see the parser sketch after this list)
- graph-backed run and batch models for failure clustering and evidence extraction
- dual-pack generation (`raw` and `structured`) so the same run can be compared under the same outer model context
- provider-aware evaluation and a strict compare mode for more honest, reproducible comparisons
- a CLI-first workflow designed so an LLM agent can clone the repo, read the docs, and use TraceForge during its own root-cause analysis
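As a rough illustration of the first item, the load step for a mini-SWE-agent-style trajectory file looks something like the sketch below. I am assuming a JSON object with top-level `info`, `messages`, and `trajectory_format` fields, which are the fields the parser targets; the real implementation also recovers tool-call metadata and tolerates malformed runs:

```python
import json
from pathlib import Path

def parse_trajectory(path: Path) -> dict:
    """Load one mini-SWE-agent-style trajectory file.

    Sketch only: assumes a JSON object with top-level `info`,
    `messages`, and `trajectory_format` fields. The real parser
    handles tool-call metadata and runs with missing pieces.
    """
    data = json.loads(path.read_text())
    return {
        "format": data.get("trajectory_format"),  # schema/version marker
        "info": data.get("info", {}),             # run-level metadata and outcome
        "messages": data.get("messages", []),     # the step-by-step transcript
    }
```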
In practice, the tool is built more for AI-assisted investigation than for point-and-click human use. The intended workflow is simple: give an agent the GitHub repo, tell it to read the docs, and ask it to use TraceForge to analyze the root causes in a trajectory batch.
Why Jac mattered
Jac shaped the architecture of the project.
Trajectories are naturally graph-shaped. Runs connect to steps, steps connect to files, tests, and errors, and runs also connect to other similar runs. That made graph-native modeling a much better fit than treating everything as flat text.
Jac also made it natural to organize the system around graph traversal and structured outputs instead of a pile of loosely connected scripts. TraceForge is not just a parser or a dashboard. It is a compiler that turns messy trajectory logs into structured evidence the model can actually use.
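The graph makes the clustering step a traversal rather than a text-munging exercise. Expressed in plain Python for illustration (the normalization rules and field names are mine, not TraceForge's actual logic), the grouping idea is roughly:

```python
import re
from collections import defaultdict

def error_signature(message: str) -> str:
    """Collapse run-specific details so similar errors land in one bucket.
    These normalization rules are illustrative, not TraceForge's own."""
    sig = re.sub(r"/\S+", "<path>", message)  # mask file paths
    sig = re.sub(r"\d+", "<n>", sig)          # mask line numbers and counts
    return sig.strip()

def cluster_failures(runs: list[dict]) -> dict[str, list[str]]:
    """Group run IDs by normalized error signature (a toy failure motif)."""
    motifs: dict[str, list[str]] = defaultdict(list)
    for run in runs:
        for err in run.get("errors", []):
            motifs[error_signature(err)].append(run["run_id"])
    return dict(motifs)
```

In the graph-native version, the same idea falls out of walking run-to-error edges across the batch rather than looping over dicts.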
Challenges
The biggest challenge was moving from smoke-test fixtures to real-world trajectories. Real batches are messy. Some failures come from infrastructure. Some come from context limits. Some come from tool misuse. Some runs contain almost no useful trace at all. That makes it much harder to build a system that is both structured and robust.
Another challenge was keeping the comparison honest. It is easy to claim that a structured system is “better” if the comparison is unfair. I had to think carefully about how to compare raw-log analysis and structured-evidence analysis on the same trajectory batch without hiding behind vague qualitative claims.
A final challenge was deciding who the tool was really for. I realized quickly that the best user is often another coding agent. That meant the project needed to be simple to clone, simple to run, and well documented enough that a model like Codex or Claude Code could pick it up and use it directly during an investigation.
Examples of the workflow include:
```
traceforge analyze-batch --batch ...
traceforge run --batch ... --run ...
traceforge pack --batch ... --run ... --mode structured
traceforge compare --batch ... --run ... --strict-provider
```
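Because the intended operator is often another agent, the same workflow is easy to script. Here is a minimal sketch that shells out to the documented commands; the batch and run identifiers are placeholders:

```python
import subprocess

def traceforge(*args: str) -> str:
    """Run one documented `traceforge` command and return its stdout."""
    result = subprocess.run(
        ["traceforge", *args], capture_output=True, text=True, check=True
    )
    return result.stdout

# "my_batch" and "run_001" are placeholder identifiers.
summary = traceforge("analyze-batch", "--batch", "my_batch")
pack = traceforge("pack", "--batch", "my_batch",
                  "--run", "run_001", "--mode", "structured")
report = traceforge("compare", "--batch", "my_batch",
                    "--run", "run_001", "--strict-provider")
```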
What I learned
The biggest lesson was that parser fidelity matters more than prompt cleverness. If the trajectory parser is wrong, the downstream analysis becomes unreliable no matter how strong the model is.
I also learned that context management is not just a cost issue. It is an analysis-quality issue. Better-curated context makes the model more useful because it removes duplicated noise and lets the model reason over the parts of the batch that actually matter.
Finally, I learned that for this kind of tool, a terminal-first workflow is often better than a complex UI. Researchers and coding agents already live in the terminal, so the most useful version of the project is one that fits naturally into that environment.
Why it matters
TraceForge improves coding-agent analysis not by replacing the model, but by improving what the model sees.
Instead of asking an LLM to summarize a giant pile of repetitive logs, TraceForge turns a batch of trajectories into recurring failure motifs, likely critical steps, and reusable memory patches. That makes debugging coding agents faster, more scalable, and more rigorous.
The project is aimed at researchers and coding agents investigating large benchmark runs. The goal is not just to inspect trajectories. The goal is to turn repeated failures into reusable diagnostic memory.
Built with
- Jac
- Python
- Click
- Pydantic
- OpenAI API
- Anthropic-compatible provider interface
Try it out
- https://github.com/Dhravidk/TraceForge
- https://github.com/Dhravidk/TraceForge/blob/main/README.md
- https://github.com/Dhravidk/TraceForge/blob/main/docs/cli/quickstart.md
A good example prompt for an agent is:
Download this repo, read the docs, and use TraceForge to analyze the root cause of the issues in these trajectories so I know what to fix: https://github.com/Dhravidk/TraceForge