Inspiration
Every engineering team has felt it: the 2 AM release that breaks production because nobody caught the risky migration, the open PR that wasn't reviewed, the test suite that was secretly failing for days. Release decisions are made on gut feel, under pressure, with incomplete information.
ShipClaw exists to change that. It gives teams a deterministic, explainable release readiness score before a single line hits production.
What it does
ShipClaw is a fully autonomous release readiness agent. Point it at any GitHub repository and it:
- Scores release readiness across 6 categories — test coverage, CI status, open PRs, migration risk, dependency freshness, and documentation completeness — producing a composite 0–100 score with a GO / HOLD / BLOCK verdict.
- Explains every risk in plain English using NVIDIA Nemotron (mistralai/mistral-nemotron via the NIM API). The model reasons only over the evidence it is given, is instructed not to invent findings, and cites the specific commits or files that triggered each concern.
- Fetches external evidence via Exa search to cross-reference known CVEs, deprecation notices, and ecosystem-wide issues against the repo's dependency tree.
- Gates on approval — releases scored BLOCK require explicit human approval before proceeding, surfaced as an accessible role=alert dialog.
- Remembers — a JSON memory store tracks historical run results so the agent can flag regressions ("last week this repo was at 78, today it's 41").
- Audits — every agent decision is written to an append-only audit log with timestamps and reasoning traces.
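To make the scoring and regression steps above concrete, here is a minimal sketch of a weighted composite with a verdict and a regression check. The category keys, the weights, the 80/50 thresholds, and the 10-point regression margin are all illustrative assumptions; the write-up does not state ShipClaw's actual values.

```typescript
// Hypothetical category keys and weights (percent, summing to 100).
type Category = "tests" | "ci" | "openPrs" | "migrations" | "dependencies" | "docs";
type Verdict = "GO" | "HOLD" | "BLOCK";

const WEIGHTS: Record<Category, number> = {
  tests: 25,
  ci: 25,
  openPrs: 10,
  migrations: 20,
  dependencies: 10,
  docs: 10,
};

// Combines per-category scores (each expected in 0..100) into one 0..100 score.
function compositeScore(scores: Record<Category, number>): number {
  let weighted = 0;
  for (const category of Object.keys(WEIGHTS) as Category[]) {
    const clamped = Math.min(100, Math.max(0, scores[category]));
    weighted += clamped * WEIGHTS[category];
  }
  return Math.round(weighted / 100);
}

// Assumed thresholds: 80 and above is GO, below 50 is BLOCK, otherwise HOLD.
function verdictFor(score: number): Verdict {
  if (score >= 80) return "GO";
  if (score < 50) return "BLOCK";
  return "HOLD";
}

// Flags a regression when the score dropped by at least `minDrop` points
// compared with the previous run read from the memory store.
function isRegression(previous: number, current: number, minDrop = 10): boolean {
  return previous - current >= minDrop;
}

const example = compositeScore({
  tests: 90, ci: 100, openPrs: 60, migrations: 40, dependencies: 70, docs: 55,
});
console.log(example, verdictFor(example)); // 74 HOLD
console.log(isRegression(78, 41)); // true
```

Because the function is pure and uses integer weights, the same inputs always produce the same score, which is the property the reproducibility claim depends on.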
How we built it
- Agent loop: 17-state finite state machine in TypeScript, running server-side on Node.js/Express. Each state is an isolated async function; transitions are deterministic given the scoring output.
- Scoring engine: Parallel GitHub API calls (Octokit) aggregated into a weighted rubric. Scores are reproducible — same repo, same commit, same score.
- NVIDIA Nemotron: Called via the OpenAI-compatible NIM endpoint. The prompt is carefully structured to prevent hallucination: the model receives only facts extracted from the repo (test counts, CI run results, PR titles, migration file names) and is instructed to explain, not invent.
- Exa integration: Optional external evidence pass that queries Exa for security advisories and deprecation notices matching the repo's dependency manifest.
- Frontend: React + Vite + Tailwind. The hex-grid loading overlay, live score panel, risk fingerprint table, and approval gate are all keyboard-navigable and screen-reader compatible.
- Deployment: Live at https://shipclaw.onrender.com — demo mode pre-fills a realistic repo payload so judges can run a full analysis without needing a GitHub token.
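The agent loop described in this section can be sketched as a small state machine. This is an illustration only: it uses five hypothetical states rather than the 17 in ShipClaw, and the state names, the `RunContext` shape, and the thresholds are assumptions, not the project's code.

```typescript
type Verdict = "GO" | "HOLD" | "BLOCK";
// Hypothetical subset of states; the real state list is not given in the write-up.
type State = "FETCH" | "SCORE" | "EXPLAIN" | "AWAIT_APPROVAL" | "DONE";

interface RunContext {
  score: number;      // composite score, assumed to be computed upstream
  verdict?: Verdict;
  approved: boolean;  // set by a human through the approval gate
  trail: State[];     // visited states, in order
}

// Each state is an isolated async function that returns the next state.
type Handler = (ctx: RunContext) => Promise<State>;

const handlers: Record<Exclude<State, "DONE">, Handler> = {
  FETCH: async () => "SCORE",
  SCORE: async (ctx) => {
    ctx.verdict = ctx.score >= 80 ? "GO" : ctx.score < 50 ? "BLOCK" : "HOLD";
    return "EXPLAIN";
  },
  EXPLAIN: async (ctx) => (ctx.verdict === "BLOCK" ? "AWAIT_APPROVAL" : "DONE"),
  AWAIT_APPROVAL: async (ctx) => (ctx.approved ? "DONE" : "AWAIT_APPROVAL"),
};

// Runs until DONE, or parks when a handler returns its own state,
// which here means the machine is waiting for external input.
async function run(start: State, ctx: RunContext): Promise<State> {
  let state = start;
  while (state !== "DONE") {
    ctx.trail.push(state);
    const next = await handlers[state](ctx);
    if (next === state) return state;
    state = next;
  }
  ctx.trail.push("DONE");
  return state;
}
```

In this sketch a BLOCK run parks in `AWAIT_APPROVAL`; once a reviewer sets `approved` to true, calling `run("AWAIT_APPROVAL", ctx)` again resumes the run and finishes in `DONE`.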
Challenges we ran into
- Nemotron prompt engineering: Getting the model to reason over structured evidence without drifting into generic advice required several iterations of system prompt design. The final prompt explicitly constrains the model to only cite evidence provided in the user turn.
- Deterministic scoring under async concurrency: Six scoring categories fetch data in parallel; reconciling partial failures (rate limits, 404s on deleted branches) required a robust fallback scoring layer.
- Approval gate UX: Making a blocking human-in-the-loop step feel natural in an otherwise autonomous flow — without it feeling like a modal trap — required careful focus management and ARIA live region design.
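One way to handle the partial failures mentioned above is to settle every category call before aggregating, as in the sketch below. The `Scorer` type, the `scoreAll` helper, and the neutral fallback value of 50 are assumptions made for illustration; the write-up does not describe ShipClaw's actual fallback policy.

```typescript
interface CategoryResult {
  category: string;
  score: number;
  degraded: boolean; // true when the fallback score was used
}

// A scorer resolves to a 0..100 score, or rejects
// (for example on a rate limit or a 404 for a deleted branch).
type Scorer = () => Promise<number>;

// Assumed policy: a failed category gets a neutral score and is marked degraded.
const FALLBACK_SCORE = 50;

async function scoreAll(scorers: Record<string, Scorer>): Promise<CategoryResult[]> {
  // Sort names so the output order never depends on which call finishes first.
  const names = Object.keys(scorers).sort();
  // allSettled waits for every call and never rejects, so one failure
  // cannot discard the results of the other categories.
  const settled = await Promise.allSettled(names.map((name) => scorers[name]()));
  return settled.map((outcome, i) =>
    outcome.status === "fulfilled"
      ? { category: names[i], score: outcome.value, degraded: false }
      : { category: names[i], score: FALLBACK_SCORE, degraded: true }
  );
}
```

Keeping the `degraded` flag alongside the score lets the report state that a category was estimated rather than measured, instead of silently treating the fallback as real data.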
Accomplishments that we're proud of
- A fully working autonomous agent loop, 34/34 tests passing, deployed and live
- Nemotron explanations that are genuinely useful — specific, evidence-grounded, and actionable
- A release score that correlates meaningfully with actual release risk in real repos we tested
What we learned
NVIDIA's NIM inference API makes it practical to run large reasoning models in a low-latency agentic loop. The OpenAI-compatible interface meant we could swap Nemotron in with minimal friction. The model's reasoning quality — its ability to weigh multiple risk signals and produce a coherent narrative — exceeded our expectations for a tool-use context.
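As a rough illustration of the evidence-only prompting approach, the sketch below builds an OpenAI-style chat completions request body and checks the model's citations against the evidence that was supplied. The model name comes from this write-up; the prompt wording, the `Evidence` shape, the bracketed-id citation format, and both helper names are hypothetical.

```typescript
interface Evidence {
  id: string;   // short identifier the model is asked to cite, e.g. "E1"
  fact: string; // a fact extracted from the repo
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

const MODEL = "mistralai/mistral-nemotron";

// Builds a request body in the OpenAI chat completions format.
// Sending it (endpoint URL, API key) is left out of this sketch.
function buildExplainRequest(evidence: Evidence[]) {
  const system =
    "You explain release risks. Use only the evidence items in the user message. " +
    "Cite every claim with its evidence id in square brackets. " +
    "If the evidence does not support a claim, do not make it.";
  const user = evidence.map((e) => `[${e.id}] ${e.fact}`).join("\n");
  const messages: ChatMessage[] = [
    { role: "system", content: system },
    { role: "user", content: user },
  ];
  return { model: MODEL, temperature: 0, messages };
}

// Returns ids cited in the model output that were never supplied as evidence.
function unknownCitations(output: string, evidence: Evidence[]): string[] {
  const known = new Set(evidence.map((e) => e.id));
  const cited = (output.match(/\[[A-Za-z0-9_-]+\]/g) ?? []).map((s) => s.slice(1, -1));
  return Array.from(new Set(cited)).filter((id) => !known.has(id));
}
```

A non-empty result from `unknownCitations` is a cheap signal that the model drifted beyond the provided evidence, so the response can be rejected or retried before it reaches the user.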
What's next for ShipClaw
- GitHub Actions integration so ShipClaw runs automatically on every release PR
- Slack/Teams notifications with the score and top blockers
- Multi-repo portfolio view for platform engineering teams managing dozens of services
- Fine-tuned Nemotron variant trained on historical release postmortems
Built With
- api
- elevenlabs
- exa
- mistralai/mistral-nemotron
- moviepy
- nemotron
- nim
- node.js
- nvidia
- python
- react
- search
- typescript