ToolForge

Inspiration

Generative AI is great at words, but brittle at actions.
Judges (and users) need results they can trust: a system that doesn’t just talk about actions but carries them out reliably.
With GPT‑OSS models (e.g., gpt‑oss‑20b on Groq), we wanted to prove you can get both creativity and dependable execution in one streaming flow.

What it does

Turns messy GPT‑OSS model output into reliable, actionable steps.
Structures the stream into small, checkable pieces with one automatic repair if needed.
Pauses the model mid‑stream to run real tools (e.g., search → booking), then resumes and finalizes the answer.
Keeps runs predictable (seed, temperature) and saves artifacts (prompt, frames, result, metrics) for replay and audits.
Includes a conformance suite (21 checks) that verifies ordering, repair/fallback, retry/timeout, idempotency, and schema conformance.
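The pause, run, resume flow described above can be sketched with an async iterator. This is a minimal illustration, not ToolForge's actual implementation: the `ModelChunk` type, the `runTurn` function, and the concrete event names (`json.delta`, `tool.call`, `tool.result`) are assumptions made for the example, and feeding the tool result back into the model is omitted.

```typescript
// Hypothetical chunk shape produced by the model stream.
type ModelChunk =
  | { kind: "text"; text: string }
  | { kind: "tool_request"; name: string; args: unknown };

type ToolFn = (args: unknown) => Promise<unknown>;

// Consume chunks in order. While a tool call is awaited, the loop does not
// pull the next chunk, so the stream is effectively paused until the tool
// settles, then resumes where it left off.
async function runTurn(
  chunks: AsyncIterable<ModelChunk>,
  tools: Record<string, ToolFn>,
  emit: (event: string, data: unknown) => void,
): Promise<void> {
  for await (const chunk of chunks) {
    if (chunk.kind === "text") {
      emit("json.delta", chunk.text);
      continue;
    }
    const tool = tools[chunk.name];
    if (!tool) {
      emit("error", { message: `unknown tool: ${chunk.name}` });
      continue;
    }
    emit("tool.call", { name: chunk.name });
    const result = await tool(chunk.args); // pause point
    emit("tool.result", { name: chunk.name, result });
  }
  emit("done", null);
}
```

Because each tool call is awaited inside the loop, events are emitted strictly in order, which is one simple way to avoid races between model output and tool results.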

How we built it

Provider/runtime: Groq with GPT‑OSS (gpt‑oss‑20b; OpenAI‑compatible API).
Transport: Server‑Sent Events (SSE) streaming with json.*, tool.*, result.*, error, done events.
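As an illustration of the transport, one event on such a stream could be framed as below. This is a sketch of standard SSE framing only; the `StreamEvent` type, the helper names, and the example event name `tool.call` are assumptions, not ToolForge's documented wire format.

```typescript
// Hypothetical event shape; real ToolForge event names and payloads may differ.
type StreamEvent = {
  event: string; // e.g. "json.delta", "tool.call", "error", "done"
  data: unknown; // JSON-serializable payload
};

// Serialize one event with standard SSE framing:
// an "event:" line, a "data:" line, then a blank line as terminator.
function formatSseEvent(e: StreamEvent): string {
  // JSON.stringify emits no raw newlines, so a single data: line is enough.
  return `event: ${e.event}\ndata: ${JSON.stringify(e.data)}\n\n`;
}

// Parse a single frame back into an event (inverse of formatSseEvent).
function parseSseEvent(frame: string): StreamEvent {
  let event = "message"; // SSE default when no event: line is present
  const dataLines: string[] = [];
  for (const line of frame.split("\n")) {
    if (line.startsWith("event: ")) event = line.slice(7);
    else if (line.startsWith("data: ")) dataLines.push(line.slice(6));
  }
  return { event, data: JSON.parse(dataLines.join("\n")) };
}
```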
Protocol:
- Partial‑JSON frames with a one‑time repair attempt.
- Soft fallback to a minimal valid object if repair fails, marked degraded in metrics.
- Pinned sampling parameters (temperature=0.2, seed=42) so runs are reproducible.
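The "one repair attempt, then soft fallback" rule from the list above could look roughly like this. It is a sketch under stated assumptions: `repairOnce` uses a simple bracket-closing heuristic chosen for the example, and the `FrameResult` shape and function names are hypothetical rather than taken from the ToolForge codebase.

```typescript
type FrameResult = {
  value: Record<string, unknown>;
  repaired: boolean; // the single repair attempt was needed and succeeded
  degraded: boolean; // repair failed; the minimal fallback object was used
};

// Assumed heuristic: close an unterminated string, then close any open
// objects/arrays in reverse order. A real repair step may do more.
function repairOnce(partial: string): string {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (let i = 0; i < partial.length; i++) {
    const ch = partial[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  return partial + (inString ? '"' : "") + closers.reverse().join("");
}

// Return the parsed value only if it is a plain JSON object.
function parseObject(text: string): Record<string, unknown> | null {
  try {
    const v: unknown = JSON.parse(text);
    return typeof v === "object" && v !== null && !Array.isArray(v)
      ? (v as Record<string, unknown>)
      : null;
  } catch {
    return null;
  }
}

// Parse as-is, then try exactly one repair, then fall back.
function parseFrame(raw: string, fallback: Record<string, unknown>): FrameResult {
  const direct = parseObject(raw);
  if (direct) return { value: direct, repaired: false, degraded: false };
  const fixed = parseObject(repairOnce(raw));
  if (fixed) return { value: fixed, repaired: true, degraded: false };
  return { value: fallback, repaired: false, degraded: true };
}
```

Capping repair at one attempt keeps latency bounded, and the `degraded` flag lets metrics record that a fallback happened instead of hiding it.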
Tools:
- Deterministic mocks (places.search, bookings.create) backed by fixtures.
- Idempotency support: repeated calls carrying the same Idempotency-Key are de‑duplicated.
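To make the idempotency behaviour concrete, here is a minimal sketch of a mock booking tool that de-duplicates by key. The `Booking` shape, the in-memory `Map`, and the `createBooking` signature are assumptions for illustration; the actual bookings.create mock may work differently.

```typescript
type Booking = { bookingId: string; placeId: string };

// In-memory store keyed by Idempotency-Key (an assumption; a real server
// would likely persist entries and expire old keys).
const bookingsByKey = new Map<string, Booking>();
let nextBookingNumber = 1;

// Hypothetical stand-in for the bookings.create mock tool.
function createBooking(idempotencyKey: string, placeId: string): Booking {
  const cached = bookingsByKey.get(idempotencyKey);
  if (cached) return cached; // replay: same result, no second side effect
  const booking: Booking = { bookingId: `bk_${nextBookingNumber++}`, placeId };
  bookingsByKey.set(idempotencyKey, booking);
  return booking;
}
```

With this in place, a retried call that reuses its key returns the original booking instead of creating a second one.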
Developer UX:
- TypeScript server and a compact terminal walkthrough.
- Artifacts viewer that summarizes frames and metrics to keep logs judge‑friendly.

Challenges we ran into

Making streaming JSON reliable without killing the “live” feeling.
Coordinating mid‑stream tool execution with GPT‑OSS output (pause → run → resume) without race conditions.
Handling timeouts and flaky tools gracefully while keeping the user experience responsive.
Balancing full traceability with readable logs for judges (compact summaries vs. raw streams).
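The timeout and flaky-tool handling mentioned above can be approximated with a bounded retry wrapper. This is a sketch, assuming a fixed per-attempt timeout and no backoff; `withTimeout` and `callWithRetry` are hypothetical helper names, not ToolForge APIs.

```typescript
// Reject if `work` does not settle within `ms` milliseconds.
// Note: the underlying call is not cancelled, only abandoned.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Try the tool up to `attempts` times, each attempt bounded by the timeout.
// Rethrows the last error if every attempt fails.
async function callWithRetry<T>(
  tool: () => Promise<T>,
  attempts: number,
  timeoutMs: number,
): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let i = 0; i < attempts; i++) {
    try {
      return await withTimeout(tool(), timeoutMs);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Since a timed-out attempt may still complete in the background, retries like this are only safe when the tool is idempotent.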

Accomplishments that we’re proud of

One‑command demo that judges can follow in under 3 minutes.
Clear Problem → Solution narration baked directly into the walkthrough output.
Conformance suite with 21/21 green checks (ordering, repair, fallback, retry, timeout, idempotency, schema).
Deterministic, replayable runs on Groq + GPT‑OSS (gpt‑oss‑20b).
Artifacts for every run (prompt, frames, final result, metrics), useful for audits and debugging.

What we learned

GPT‑OSS models can drive real workflows if you enforce a simple, robust streaming contract.
The “one repair attempt, then soft fallback” pattern preserves user trust without derailing the flow.
Idempotency and timeouts are not “nice‑to‑have”; they’re essential for safe tool execution.
Judges prefer compact, human‑readable summaries to walls of logs, and it’s possible to provide them without losing detail.

What’s next for ToolForge

SDK polish (TypeScript first; explore Go later) and simple client helpers for React/Node.
Optional offline/local agent profile (vLLM) with a small built‑in toolset (fs/sqlite/time) for air‑gapped demos.
WebSocket transport and richer UIs for the artifacts viewer.
More built‑in tools and provider profiles beyond GPT‑OSS, while keeping the same streaming guarantees.
