Inspiration
- Generative AI is great at words, but brittle at actions.
- Judges (and users) need results they can trust: a system that reliably performs actions, not one that only talks about them.
- With GPT‑OSS models (e.g., gpt‑oss‑20b on Groq), we wanted to prove you can get both creativity and dependable execution in one streaming flow.
What it does
- Turns messy GPT‑OSS model output into reliable, actionable steps.
- Structures the stream into small, checkable pieces with one automatic repair if needed.
- Pauses the model mid‑stream to run real tools (e.g., search → booking), then resumes and finalizes the answer.
- Keeps runs predictable (seed, temperature) and saves artifacts (prompt, frames, result, metrics) for replay and audits.
- Includes a conformance suite (21 checks) that verifies ordering, repair/fallback, retry/timeout, idempotency, and schema conformance.
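To make the "one automatic repair, then fallback" idea concrete, here is a minimal sketch. It is not ToolForge's actual code: the `Frame` shape, the `noop` fallback object, and the bracket-closing repair heuristic are all assumptions chosen for illustration.

```typescript
// Hypothetical frame shape; the real schema is not shown in this write-up.
type Frame = { step: string; args: Record<string, unknown> };
type ParseOutcome = { frame: Frame; repaired: boolean; degraded: boolean };

// Assumed "minimal valid object" used when repair fails.
const FALLBACK: Frame = { step: "noop", args: {} };

function isFrame(value: unknown): value is Frame {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.step === "string" &&
    typeof v.args === "object" &&
    v.args !== null &&
    !Array.isArray(v.args)
  );
}

function tryParse(text: string): Frame | undefined {
  try {
    const parsed: unknown = JSON.parse(text);
    return isFrame(parsed) ? parsed : undefined;
  } catch {
    return undefined;
  }
}

// Naive repair: close an unterminated string, then close open braces/brackets.
function repairOnce(text: string): string {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  return text + (inString ? '"' : "") + closers.reverse().join("");
}

// Parse as-is, then try exactly one repair, then fall back and flag as degraded.
function parseFrame(text: string): ParseOutcome {
  const direct = tryParse(text);
  if (direct) return { frame: direct, repaired: false, degraded: false };
  const fixed = tryParse(repairOnce(text));
  if (fixed) return { frame: fixed, repaired: true, degraded: false };
  return { frame: FALLBACK, repaired: false, degraded: true };
}
```

Capping repair at a single attempt keeps latency bounded, and the `degraded` flag lets metrics record that a fallback object was substituted.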
How we built it
- Provider/runtime: Groq with GPT‑OSS (gpt‑oss‑20b; OpenAI‑compatible API).
- Transport: Server‑Sent Events (SSE) streaming with json.*, tool.*, result.*, error, and done events.
- Protocol:
  - Partial‑JSON frames with a one‑time repair attempt.
  - Soft fallback to a minimal valid object if repair fails, marked degraded in metrics.
  - Deterministic runs (temperature=0.2, seed=42) for reproducibility.
- Tools:
  - Deterministic mocks (places.search, bookings.create) backed by fixtures.
  - Idempotency support to de‑duplicate with Idempotency-Key.
- Developer UX:
  - TypeScript server and a compact terminal walkthrough.
  - Artifacts viewer that summarizes frames and metrics to keep logs judge‑friendly.
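As a rough illustration of the transport, the sketch below serializes events into the SSE wire format (`event:` and `data:` lines followed by a blank line). The event families come from the list above, but the concrete suffixes (`json.delta`, `tool.call`, and so on), the payload fields, and the `formatSse` helper are hypothetical names, not the project's real API.

```typescript
// Assumed event envelope; payload fields are illustrative only.
type StreamEvent = { event: string; data: Record<string, unknown> };

// Families named in the write-up: json.*, tool.*, result.*, error, done.
// The suffix after the dot is a hypothetical naming scheme.
const ALLOWED_EVENT = /^(json|tool|result)\.[a-z_]+$|^(error|done)$/;

function formatSse(e: StreamEvent): string {
  if (!ALLOWED_EVENT.test(e.event)) {
    throw new Error(`unknown event name: ${e.event}`);
  }
  // JSON.stringify escapes newlines inside strings, so one data: line suffices.
  return `event: ${e.event}\ndata: ${JSON.stringify(e.data)}\n\n`;
}

// Example stream: a partial-JSON frame, a tool call, its result, then done.
const wire = [
  formatSse({ event: "json.delta", data: { text: '{"step":"search"' } }),
  formatSse({ event: "tool.call", data: { name: "places.search", args: { q: "pizza" } } }),
  formatSse({ event: "result.final", data: { ok: true } }),
  formatSse({ event: "done", data: {} }),
].join("");
```

Rejecting unknown event names at the serializer is one simple way to keep the stream contract checkable on the client side.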
Challenges we ran into
- Making streaming JSON reliable without killing the “live” feeling.
- Coordinating mid‑stream tool execution with GPT‑OSS output (pause → run → resume) without race conditions.
- Handling timeouts and flaky tools gracefully while keeping the user experience responsive.
- Balancing full traceability with readable logs for judges (compact summaries vs. raw streams).
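The timeout and flaky-tool handling mentioned above could look roughly like this. It is a sketch under assumptions: `withTimeout`, `callWithRetry`, and the option names are invented for illustration, and the real server may use backoff or cancellation that this version omits.

```typescript
// Reject if the work does not settle within `ms` milliseconds.
// Note: this does not cancel the underlying work; it only stops waiting for it.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Call a tool, retrying on failure or timeout up to `retries` extra times.
async function callWithRetry<T>(
  tool: () => Promise<T>,
  opts: { timeoutMs: number; retries: number },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await withTimeout(tool(), opts.timeoutMs);
    } catch (err) {
      lastError = err; // remember the failure and try again if attempts remain
    }
  }
  throw lastError;
}
```

Because a timed-out call may still complete in the background, retries like this are only safe when the tool itself is idempotent.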
Accomplishments that we're proud of
- One‑command demo that judges can follow in under 3 minutes.
- Clear Problem → Solution narration baked directly into the walkthrough output.
- Conformance suite with 21/21 green checks (ordering, repair, fallback, retry, timeout, idempotency, schema).
- Deterministic, replayable runs on Groq + GPT‑OSS (gpt‑oss‑20b).
- Artifacts for every run: prompt, frames, final result, metrics — great for audits and debugging.
What we learned
- GPT‑OSS models can drive real workflows if you enforce a simple, robust streaming contract.
- The “one repair attempt, then soft fallback” pattern preserves user trust without derailing the flow.
- Idempotency and timeouts are not “nice‑to‑have” — they’re essential for safe tool execution.
- Judges prefer compact, human‑readable summaries to walls of logs — and it’s possible without losing detail.
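To show why idempotency matters for tool execution, here is a small sketch of de-duplicating calls by key. The `IdempotentExecutor` class, the `Booking` type, and the in-memory map are assumptions for illustration; the write-up only states that an `Idempotency-Key` is used.

```typescript
// Hypothetical result type for a bookings.create-style tool.
type Booking = { id: string; placeId: string };

class IdempotentExecutor<T> {
  // Maps an idempotency key to the (possibly still pending) result.
  private readonly results = new Map<string, Promise<T>>();

  run(key: string, action: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key);
    if (existing) return existing; // duplicate request: reuse the first result

    const pending = action();
    this.results.set(key, pending);
    // Forget failed attempts so a later retry with the same key can run again.
    pending.catch(() => this.results.delete(key));
    return pending;
  }
}
```

Storing the promise rather than the resolved value means two concurrent requests with the same key also share a single execution.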
What's next for ToolForge
- SDK polish (TypeScript first; explore Go later) and simple client helpers for React/Node.
- Optional offline/local agent profile (vLLM) with a small built‑in toolset (fs/sqlite/time) for air‑gapped demos.
- WebSocket transport and richer UIs for the artifacts viewer.
- More built‑in tools and provider profiles beyond GPT‑OSS, while keeping the same streaming guarantees.
Built With
- gpt-5
- gpt-oss
- groq
- node.js
- typescript
- windsurf
- zod
