Inspiration
Agents today are stateless. Every session starts from zero: the same research, the same mistakes, the same cost. Memory systems self-evolve but only store facts; skills encode procedures but in a static format. Neither learns from experience. Can we bring together the best of both worlds?
What it does
Tsugi is an agent harness that automatically bootstraps battle-tested how-tos from real experience. It is a dual-agent system:
- Task Agent executes your request in an isolated sandbox. It researches using Google Search, analyzes URLs, runs shell commands, makes mistakes, and self-corrects until it succeeds.
- Skill Agent analyzes what the Task Agent learned and codifies it into a reusable skill—capturing both procedural knowledge (API quirks, validation rules, error patterns) and user preferences (your taxonomies, classification rules, domain constraints).
- The result: Run 1 starts fresh with trial and error, and the experience is codified on completion. Run 2 finds the skill, skips research, and executes directly. The skill library compounds over time: every successful task makes future tasks faster and cheaper.
How we built it
- Framework: Next.js 16 with React 19 and Vercel AI SDK 6
- LLM: Gemini 3 thinking
- Grounding: Custom wrapper around Gemini's native Google Search and URL analysis tools (we bypassed the limitation that native + custom tools can't coexist)
- Sandbox: Dual execution environments—local Node.js child_process for development, Vercel Sandbox microVMs for production isolation
- Storage: SQLite/Turso for conversations, Vercel Blob for skill content
- Observability: Braintrust integration for token counting and trace analysis
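The dual execution environments above can be sketched as two adapters behind one interface, so the agent loop is indifferent to where commands actually run. This is a minimal illustrative sketch, not Tsugi's actual code: the `ExecutionEnvironment` interface and `LocalSandbox` class names are assumptions, and the Vercel Sandbox adapter is only described in a comment.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

interface CommandResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

// Hypothetical shared interface: both the local dev adapter and a
// production Vercel Sandbox adapter would implement this shape.
interface ExecutionEnvironment {
  runCommand(cmd: string, args: string[]): Promise<CommandResult>;
}

// Development adapter backed by Node's child_process.
class LocalSandbox implements ExecutionEnvironment {
  async runCommand(cmd: string, args: string[]): Promise<CommandResult> {
    try {
      const { stdout, stderr } = await execFileAsync(cmd, args);
      return { stdout, stderr, exitCode: 0 };
    } catch (err: any) {
      // Promisified execFile rejects on non-zero exit; surface the details.
      return {
        stdout: err.stdout ?? "",
        stderr: err.stderr ?? "",
        exitCode: typeof err.code === "number" ? err.code : 1,
      };
    }
  }
}
```

In production, a `VercelSandbox` class (not shown) would satisfy the same interface by proxying each command into an isolated microVM, letting the rest of the harness stay unchanged.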
Challenges we ran into
- Native + Custom Tools Incompatibility: Gemini doesn't allow native grounding tools and custom tools in the same request. We built a wrapper that makes nested `generateText` calls with native tools enabled, giving us the best of both worlds.
- Sandbox Reconnection: Multi-turn tasks need persistent sandboxes, but Vercel sandboxes time out after 5 minutes. We implemented health checks, sandboxId tracking, and graceful fallback to new sandboxes when reconnection fails.
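The nested-call wrapper can be sketched roughly like this. A minimal sketch with assumed names: `GenerateFn` stands in for the Vercel AI SDK's `generateText` (injected so the pattern is self-contained), and `makeGroundedSearchTool` is a hypothetical helper, not Tsugi's actual API.

```typescript
// Gemini rejects requests that mix native grounding tools with custom
// function tools. The workaround: expose "search" to the outer agent as an
// ordinary *custom* tool whose execute() issues a nested model call where
// ONLY the native grounding tools are enabled.

type GenerateFn = (opts: {
  prompt: string;
  useNativeGrounding: boolean;
}) => Promise<{ text: string }>;

function makeGroundedSearchTool(generate: GenerateFn) {
  return {
    name: "google_search",
    description: "Search the web via Gemini's native grounding",
    async execute({ query }: { query: string }): Promise<string> {
      // Nested request: native grounding on, no custom tools, so the
      // incompatibility never arises within a single request.
      const result = await generate({ prompt: query, useNativeGrounding: true });
      return result.text;
    },
  };
}
```

The outer agent call then registers this alongside its other custom tools (shell, file I/O), while grounding quietly happens one level down.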
- Extracting Real Procedures, Not Ideal Ones: The Skill Agent kept inventing "better" approaches instead of documenting what actually worked. We refined the prompt to capture "ACTUAL working procedure using ONLY tools/methods actually used."
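The enforcement we converged on amounts to a hard constraint in the Skill Agent's system prompt. The constant below is an illustrative excerpt, not our verbatim prompt; only the "ACTUAL working procedure using ONLY tools/methods actually used" wording comes from the write-up above.

```typescript
// Hypothetical excerpt of the Skill Agent system prompt. The core rule is
// quoted from our prompt iteration; the surrounding lines are illustrative.
const SKILL_EXTRACTION_RULES = `
Document the ACTUAL working procedure using ONLY tools/methods actually used.
Do NOT propose alternative or "better" approaches that were not executed.
If a step failed before succeeding, record both the failure and the fix.
`.trim();
```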
- Multi-Step Token Counting: The agent route only captured the final step's token usage. We modified the API to accumulate usage across all step-finish events for accurate metrics.
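The accumulation fix reduces to summing per-step usage instead of reading only the last step. A minimal sketch under assumed names: `TokenUsage` and `accumulateUsage` are illustrative, and in the real route this would run inside the AI SDK's step-finish callback.

```typescript
// Per-step usage as reported on each step-finish event.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Sum usage across all steps of a multi-step agent run. Reading only the
// final step undercounts every earlier tool-calling round trip.
function accumulateUsage(steps: TokenUsage[]): TokenUsage {
  return steps.reduce(
    (total, step) => ({
      inputTokens: total.inputTokens + step.inputTokens,
      outputTokens: total.outputTokens + step.outputTokens,
    }),
    { inputTokens: 0, outputTokens: 0 },
  );
}
```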
Accomplishments that we're proud of
- Measurable 3-5x speedup from Run 1 to Run 2 on integration tasks
- Production-ready dual-agent architecture with clean separation between execution and learning
- Deep Gemini integration leveraging reasoning, KV caching, and native grounding
- Working demo suite: YouTube→Notion sync and personalized morning briefs
- Comparison mode UI with side-by-side Run 1 vs Run 2 visualization showing time/token savings
- Skills encode two knowledge types: procedural (how APIs actually work) AND preferences (how you want things done)
What we learned
- Skill quality requires enforcement: Agents will naturally try to "improve" procedures. Strict instructions to document what worked, not what should work, are essential.
- Preferences are underrated knowledge: Much of the friction in AI interactions comes from not knowing user-specific taxonomies and constraints—things no documentation can teach.
- Sandbox lifecycle is non-trivial: Persistent execution environments across conversation turns require careful state management, health checks, and graceful degradation.
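The reconnect-or-recreate flow for sandbox lifecycle can be sketched as below. All names are assumptions for illustration; `reconnect`, `healthCheck`, and `create` are injected stand-ins for the real Vercel Sandbox operations.

```typescript
interface Sandbox {
  id: string;
}

// Given a sandboxId persisted from a previous turn, try to reconnect and
// verify health; on any failure, degrade gracefully to a fresh sandbox.
async function getSandbox(
  savedId: string | null,
  reconnect: (id: string) => Promise<Sandbox | null>,
  healthCheck: (sb: Sandbox) => Promise<boolean>,
  create: () => Promise<Sandbox>,
): Promise<Sandbox> {
  if (savedId) {
    try {
      const sb = await reconnect(savedId);
      // Reuse the sandbox only if it reconnects AND still responds.
      if (sb && (await healthCheck(sb))) return sb;
    } catch {
      // Timed out or gone (e.g. past the 5-minute limit): fall through.
    }
  }
  return create();
}
```

The caller stores the returned `id` for the next turn, so a recreated sandbox transparently replaces a dead one in the conversation state.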
Built With
- braintrust
- claude
- gemini
- typescript
- vercel