AutoForge

Screenshots: AutoForge Dashboard, HITL Dashboard, HITL Slack, NemoClaw Sanity Check

Inspiration

In February 2026, Anthropic ran 16 parallel Claude agents across about 2,000 sessions, and they shipped a 100,000-line C compiler that builds the Linux kernel, mostly autonomously.

The lesson: coordinated agent teams can deliver real software. We wanted to apply that idea to ML pipelines bounded by human-in-the-loop gates and typed contracts so every decision is inspectable. Six Nemotron specialists, one human reviewer, real metrics on real data.

What it does

Hand AutoForge a CSV and an objective ("predict Titanic passenger survival with F1 ≥ 0.70"). Six specialist Nemotron-driven agents take over:

Profiler — inspects the dataset, probes hardware, emits a typed DatasetProfile + TrainingEnvelope
Researcher — pulls live Tavily web search + arXiv, composes a StrategySpec with candidate architectures and citations
Data Preparer — picks an ordered preprocessing plan from 8 supported ops (impute, encode, scale, split…)
Trainer — writes design.md → human approves → writes model.py → AutoForge drops in a templated train.py → subprocess training with an Optuna HP search
Evaluator — latency probe + throughput; produces PASS/FAIL against the StrategySpec's threshold
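
The Evaluator's PASS/FAIL step can be illustrated with a small sketch. The field names on StrategySpec and EvalReport, the f1_score helper, and the evaluate function below are assumptions made for illustration, not AutoForge's actual contracts. The sketch only shows how a typed threshold keeps the verdict inspectable.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class StrategySpec:
    # Hypothetical fields; the real StrategySpec carries more (architectures, citations).
    metric: str
    threshold: float


@dataclass(frozen=True)
class EvalReport:
    # Hypothetical report shape for illustration.
    metric_value: float
    latency_ms: float
    throughput_rows_per_s: float
    verdict: str  # "PASS" or "FAIL"


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for labels in {0, 1}; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    predict: Callable[[object], int],
    rows: Sequence[object],
    y_true: Sequence[int],
    spec: StrategySpec,
) -> EvalReport:
    """Time row-by-row predictions, score them, and compare against the spec."""
    start = time.perf_counter()
    y_pred = [predict(row) for row in rows]
    elapsed = time.perf_counter() - start

    score = f1_score(y_true, y_pred)
    latency_ms = 1000.0 * elapsed / len(rows)
    throughput = len(rows) / elapsed if elapsed > 0 else float("inf")
    verdict = "PASS" if score >= spec.threshold else "FAIL"
    return EvalReport(score, latency_ms, throughput, verdict)
```

With a spec of StrategySpec(metric="f1", threshold=0.70), a model that scores below 0.70 yields a FAIL report that still records the measured score, latency, and throughput.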

A Streamlit dashboard streams every event live.
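
One way the live event stream could work is sketched below: each agent writes lifecycle events to SQLite, and a reader polls for rows newer than the last one it saw. The table layout, the lifecycle and execute method names, and the events_since helper are assumptions for illustration; only BaseAgent(ABC), emit_event(), and the STARTED / COMPLETED / ERROR events come from the project description.

```python
import sqlite3
from abc import ABC, abstractmethod
from contextlib import contextmanager


class BaseAgent(ABC):
    name = "base"

    def __init__(self, conn: sqlite3.Connection, run_id: str):
        self.conn = conn
        self.run_id = run_id
        # Hypothetical schema: one row per event, ordered by an autoincrement id.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "run_id TEXT, agent TEXT, kind TEXT, detail TEXT)"
        )
        conn.commit()

    def emit_event(self, kind: str, detail: str = "") -> None:
        """Persist one event so any reader of the database can see it."""
        self.conn.execute(
            "INSERT INTO events (run_id, agent, kind, detail) VALUES (?, ?, ?, ?)",
            (self.run_id, self.name, kind, detail),
        )
        self.conn.commit()

    @contextmanager
    def lifecycle(self):
        """Emit STARTED on entry, then COMPLETED or ERROR on exit."""
        self.emit_event("STARTED")
        try:
            yield self
        except Exception as exc:
            self.emit_event("ERROR", repr(exc))
            raise
        else:
            self.emit_event("COMPLETED")

    @abstractmethod
    def run(self) -> None:
        ...

    def execute(self) -> None:
        with self.lifecycle():
            self.run()


def events_since(conn: sqlite3.Connection, last_id: int):
    """Return events with id greater than last_id, oldest first."""
    return conn.execute(
        "SELECT id, agent, kind, detail FROM events WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
```

A dashboard loop could call events_since with the highest id it has rendered so far and append whatever comes back, which keeps the reader stateless apart from that one integer.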

Slack bot: We integrated the platform with Slack so users can respond to the pipeline even when they are away from the dashboard. Using Slack bot tokens and channel IDs, we built a communication layer between the user and the platform.
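
A minimal sketch of how replies might be read back from a channel follows. The bot parses CONFIRM, REJECT, and digit replies; treating a digit as a numbered option, and the parse_reply and first_decision names, are assumptions for illustration. The client argument is expected to behave like slack-sdk's WebClient, whose conversations_history method accepts channel and oldest and returns messages newest first.

```python
import re
from typing import Optional, Tuple

Decision = Tuple[str, Optional[int]]


def parse_reply(text: str) -> Optional[Decision]:
    """Map a Slack message to a decision, or None if it is not a valid reply."""
    token = text.strip().upper()
    if token == "CONFIRM":
        return ("confirm", None)
    if token == "REJECT":
        return ("reject", None)
    if re.fullmatch(r"[0-9]", token):
        # Assumption: a single digit selects a numbered option.
        return ("choice", int(token))
    return None


def first_decision(client, channel_id: str, oldest_ts: str) -> Optional[Decision]:
    """Poll the channel once and return the earliest valid reply after oldest_ts."""
    response = client.conversations_history(channel=channel_id, oldest=oldest_ts)
    # Slack lists newest messages first, so walk the list in reverse.
    for message in reversed(response["messages"]):
        decision = parse_reply(message.get("text", ""))
        if decision is not None:
            return decision
    return None
```

Taking the earliest valid reply is a design choice in this sketch: it means a later message cannot silently override a decision the pipeline may already have acted on.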

How we built it

LLMs: NVIDIA Nemotron via NIM — llama-3.3-nemotron-super-49b-v1.5 for the Coordinator/planning, nemotron-nano-9b-v2 for other agents
Agent base class: custom BaseAgent(ABC) with emit_event() (persisted to SQLite) and a context manager that fires STARTED / COMPLETED / ERROR per agent
Inter-agent contracts and structured outputs
Persistence: SQLite for runs, events, agent outputs, approvals
HITL: ApprovalQueue with a hybrid threading.Event + DB-poll wakeup so it works across processes
Slack bot: slack-sdk WebClient with conversations.history polling, parses CONFIRM/REJECT/digit replies, routes notifications to per-agent channels
Dashboard: Streamlit, custom agent_detail views, and a live activity timeline
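
The hybrid wakeup in the HITL entry can be sketched as follows: a waiter sleeps on a threading.Event for at most one poll interval, then re-reads the database, so an approval is seen whether it came from the same process (the event fires immediately) or from another one (the next poll picks it up). The table layout and the submit and wait_for method names are assumptions for illustration, not AutoForge's actual ApprovalQueue API.

```python
import sqlite3
import threading
import time
from typing import Optional


class ApprovalQueue:
    def __init__(self, db_path: str, poll_interval: float = 0.5):
        self.db_path = db_path
        self.poll_interval = poll_interval
        self._wakeup = threading.Event()
        self._execute(
            "CREATE TABLE IF NOT EXISTS approvals "
            "(request_id TEXT PRIMARY KEY, decision TEXT)"
        )

    def _execute(self, sql: str, params: tuple = ()):
        """Run one statement on a fresh connection and return the first row."""
        conn = sqlite3.connect(self.db_path)
        try:
            with conn:  # commits on success, rolls back on error
                return conn.execute(sql, params).fetchone()
        finally:
            conn.close()

    def submit(self, request_id: str, decision: str) -> None:
        """Record a decision, then wake any waiter in this process."""
        self._execute(
            "INSERT OR REPLACE INTO approvals (request_id, decision) VALUES (?, ?)",
            (request_id, decision),
        )
        self._wakeup.set()

    def wait_for(self, request_id: str, timeout: float) -> Optional[str]:
        """Block until a decision exists or the timeout passes."""
        deadline = time.monotonic() + timeout
        while True:
            row = self._execute(
                "SELECT decision FROM approvals WHERE request_id = ?",
                (request_id,),
            )
            if row is not None:
                return row[0]
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None
            # Wake early on an in-process submit, otherwise after one poll interval.
            self._wakeup.wait(min(self.poll_interval, remaining))
            self._wakeup.clear()
```

The database stays the source of truth in this sketch; the event only shortens the wait, so a missed or cleared signal costs at most one poll interval.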

Challenges we ran into

Integrating NemoClaw was our biggest technical challenge. NemoClaw’s isolation prevented our agents from sharing data with the dashboard directly, and its streaming proxy corrupted structured JSON responses. We fixed the streaming issue and proved the full pipeline runs through NemoClaw with API keys protected and network access restricted, though seamless live dashboard integration remains our next step.

Accomplishments that we're proud of

We’re proud to have a working product that performs reasonably well across different CSV datasets. We are also happy that we successfully installed NemoClaw and experimented with Nemotron.

What we learned

We learned that building autonomous agents requires more than chaining multiple LLM calls. Adding a human interface, security boundaries, and detailed output is much harder than it looks. Working on NVIDIA’s cloud instance also taught us that clear communication between team members is as important as the code itself.

What's next for AutoForge

In the future, we hope to finish integrating NemoClaw into our system to create an even better agentic experience and to generalize better to many types of data and objectives. Additionally, we hope to add model quantization as part of the optimizer agent to further improve the model.