Inspiration

  • Generative AI is great at words, but brittle at actions.
  • Judges (and users) need results they can trust: a system that reliably performs actions, not one that only talks about them.
  • With GPT‑OSS models (e.g., gpt‑oss‑20b on Groq), we wanted to prove you can get both creativity and dependable execution in one streaming flow.

What it does

  • Turns messy GPT‑OSS model output into reliable, actionable steps.
  • Structures the stream into small, checkable pieces, with one automatic repair attempt when a piece is malformed.
  • Pauses the model mid‑stream to run real tools (e.g., search → booking), then resumes and finalizes the answer.
  • Keeps runs predictable (seed, temperature) and saves artifacts (prompt, frames, result, metrics) for replay and audits.
  • Includes a conformance suite (21 checks) that verifies ordering, repair/fallback, retry/timeout, idempotency, and schema conformance.
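
The pause → run → resume step above can be sketched roughly as follows. This is an illustrative TypeScript sketch, not ToolForge's actual code: the ModelFrame and StreamEvent types, the runTurn and toSse helpers, the concrete event names (json.frame, tool.call, tool.result), and the synchronous tool signature are all assumptions made for the example. It relies on a generator pulling model frames lazily, so no further frame is consumed until the tool has returned.

```typescript
// Hypothetical shapes for what the model emits and what the client receives.
type ModelFrame =
  | { kind: "json"; data: unknown }
  | { kind: "tool_call"; name: string; args: unknown };

interface StreamEvent {
  event: string;
  data: unknown;
}

// Assumption: tools are synchronous, like deterministic fixture-backed mocks.
type Tool = (args: unknown) => unknown;

function* runTurn(
  frames: Iterable<ModelFrame>,
  tools: Record<string, Tool>,
): Generator<StreamEvent> {
  for (const frame of frames) {
    if (frame.kind === "json") {
      yield { event: "json.frame", data: frame.data };
      continue;
    }
    // "Pause": the loop does not pull the next model frame until the tool returns.
    yield { event: "tool.call", data: { name: frame.name, args: frame.args } };
    const tool = tools[frame.name];
    if (!tool) {
      yield { event: "error", data: { message: `unknown tool: ${frame.name}` } };
      return;
    }
    yield { event: "tool.result", data: { name: frame.name, output: tool(frame.args) } };
    // "Resume": falling through to the next iteration continues the model stream.
  }
  yield { event: "done", data: {} };
}

// Serialize one event in the Server-Sent Events wire format.
function toSse(e: StreamEvent): string {
  return `event: ${e.event}\ndata: ${JSON.stringify(e.data)}\n\n`;
}
```

Because the ordering is fixed by the generator's control flow, a check such as "tool.result always follows its tool.call" can be asserted directly on the emitted event list.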

How we built it

  • Provider/runtime: Groq with GPT‑OSS (gpt‑oss‑20b; OpenAI‑compatible API).
  • Transport: Server‑Sent Events (SSE) streaming with json.*, tool.*, result.*, error, done events.
  • Protocol:
    • Partial‑JSON frames with a one‑time repair attempt.
    • Soft fallback to a minimal valid object if repair fails, marked degraded in metrics.
    • Pinned sampling settings (temperature=0.2, seed=42) for reproducibility; with a nonzero temperature, repeatability depends on the seed being honored.
  • Tools:
    • Deterministic mocks (places.search, bookings.create) backed by fixtures.
    • Idempotency support: repeated calls with the same Idempotency-Key are de‑duplicated.
  • Developer UX:
    • TypeScript server and a compact terminal walkthrough.
    • Artifacts viewer that summarizes frames and metrics to keep logs judge‑friendly.
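
The "one repair attempt, then soft fallback" rule from the protocol can be sketched as below. This is a minimal sketch under stated assumptions: the writeup does not say how the repair works, so the repairJson heuristic (close an open string, drop a trailing comma, close unbalanced brackets) is a stand-in, and the FrameResult type and parseFrame name are hypothetical.

```typescript
// Hypothetical result shape; "degraded" marks the soft-fallback path for metrics.
interface FrameResult {
  status: "ok" | "repaired" | "degraded";
  value: unknown;
}

// Assumed repair heuristic: terminate an open string, drop a trailing comma,
// and append closers for any unbalanced { or [.
function repairJson(text: string): string {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (const ch of text) {
    if (escaped) { escaped = false; continue; }
    if (inString) {
      if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  let out = text;
  if (inString) out += '"';
  out = out.replace(/,\s*$/, "");
  return out + closers.reverse().join("");
}

function parseFrame(text: string): FrameResult {
  try {
    return { status: "ok", value: JSON.parse(text) };
  } catch {
    // Fall through to the single repair attempt.
  }
  try {
    return { status: "repaired", value: JSON.parse(repairJson(text)) };
  } catch {
    // Repair failed: fall through to the soft fallback.
  }
  // Minimal valid object, flagged so metrics can record the degraded run.
  return { status: "degraded", value: {} };
}
```

Capping repair at a single attempt keeps latency bounded: a frame is either usable after at most two parses or replaced by the fallback object, so the stream never stalls on bad output.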

Challenges we ran into

  • Making streaming JSON reliable without killing the “live” feeling.
  • Coordinating mid‑stream tool execution with GPT‑OSS output (pause → run → resume) without race conditions.
  • Handling timeouts and flaky tools gracefully while keeping the user experience responsive.
  • Balancing full traceability with readable logs for judges (compact summaries vs. raw streams).
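
One way to handle timeouts and flaky tools is a per-attempt deadline combined with a bounded number of retries. The sketch below is an assumption about how this could look, not the project's implementation: TimeoutError, withTimeout, callWithRetry, and the option names are hypothetical, and only the standard Promise and timer APIs are used.

```typescript
class TimeoutError extends Error {}

// Reject if `work` has not settled within `ms` milliseconds.
// Note: this stops waiting, but it does not cancel the underlying work.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Try `fn` up to `attempts` times, each attempt bounded by `timeoutMs`.
async function callWithRetry<T>(
  fn: () => Promise<T>,
  opts: { attempts: number; timeoutMs: number },
): Promise<T> {
  let lastError: unknown = new Error("no attempts were made");
  for (let i = 0; i < opts.attempts; i++) {
    try {
      return await withTimeout(fn(), opts.timeoutMs);
    } catch (err) {
      lastError = err; // Remember the failure and try again.
    }
  }
  throw lastError;
}
```

Since a timed-out call may still complete on the tool's side, a retry can run the same action twice; that is the case the Idempotency-Key de-duplication is meant to cover.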

Accomplishments that we're proud of

  • One‑command demo that judges can follow in under 3 minutes.
  • Clear Problem → Solution narration baked directly into the walkthrough output.
  • Conformance suite with 21/21 green checks (ordering, repair, fallback, retry, timeout, idempotency, schema).
  • Deterministic, replayable runs on Groq + GPT‑OSS (gpt‑oss‑20b).
  • Artifacts for every run: prompt, frames, final result, metrics — great for audits and debugging.

What we learned

  • GPT‑OSS models can drive real workflows if you enforce a simple, robust streaming contract.
  • The “one repair attempt, then soft fallback” pattern preserves user trust without derailing the flow.
  • Idempotency and timeouts are not “nice‑to‑have” — they’re essential for safe tool execution.
  • Judges prefer compact, human‑readable summaries to walls of logs — and it’s possible without losing detail.
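
To make the idempotency point concrete, here is a minimal in-memory sketch of de-duplicating a bookings.create call by Idempotency-Key. The BookingTool class, the Booking shape, and the id format are hypothetical and chosen only for illustration; a real tool would need persistent storage and key expiry.

```typescript
// Hypothetical booking record for the example.
interface Booking {
  id: string;
  placeId: string;
  time: string;
}

class BookingTool {
  // Maps each Idempotency-Key to the booking it originally produced.
  private byKey = new Map<string, Booking>();
  private nextId = 1;

  create(idempotencyKey: string, req: { placeId: string; time: string }): Booking {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) {
      // Replay of a request already handled: return the stored result, no new side effect.
      return existing;
    }
    const booking: Booking = { id: `bk_${this.nextId++}`, ...req };
    this.byKey.set(idempotencyKey, booking);
    return booking;
  }
}

// Usage: a retried call with the same key yields the same booking.
const bookings = new BookingTool();
const first = bookings.create("key-1", { placeId: "p1", time: "19:00" });
const retried = bookings.create("key-1", { placeId: "p1", time: "19:00" });
console.log(first.id === retried.id);
```

The design choice is that the key, not the request body, identifies the operation, so a retry after a timeout cannot create a second booking.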

What's next for ToolForge

  • SDK polish (TypeScript first; explore Go later) and simple client helpers for React/Node.
  • Optional offline/local agent profile (vLLM) with a small built‑in toolset (fs/sqlite/time) for air‑gapped demos.
  • WebSocket transport and richer UIs for the artifacts viewer.
  • More built‑in tools and provider profiles beyond GPT‑OSS, while keeping the same streaming guarantees.
