Inspiration
Agents today are stateless. Every session starts from zero: the same research, the same mistakes, the same cost. Memory systems self-evolve but only store facts; skills encode procedures but in a static format. Neither learns from experience. Can we bring together the best of both worlds?
What it does
Tsugi is an agent harness that automatically bootstraps battle-tested how-tos from real experience. It is a dual-agent system:
- Task Agent executes your request in an isolated sandbox. It researches using Google Search, analyzes URLs, runs shell commands, makes mistakes, and self-corrects until it succeeds.
- Skill Agent analyzes what the Task Agent learned and codifies it into a reusable skill—capturing both procedural knowledge (API quirks, validation rules, error patterns) and user preferences (your taxonomies, classification rules, domain constraints).
- The result: Run 1 starts fresh with trial and error, and the experience is codified on completion. Run 2 finds the skill, skips research, and executes directly. The skill library compounds over time: every successful task makes future tasks faster and cheaper.
How we built it
- Framework: Next.js 16 with React 19 and Vercel AI SDK 6
- LLM: Gemini 3 thinking
- Grounding: Custom wrapper around Gemini's native Google Search and URL analysis tools (we bypassed the limitation that native + custom tools can't coexist)
- Sandbox: Dual execution environments—local Node.js child_process for development, Vercel Sandbox microVMs for production isolation
- Storage: SQLite/Turso for conversations, Vercel Blob for skill content
- Observability: Braintrust integration for token counting and trace analysis
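The dual execution environments above can be sketched as two adapters behind one interface, so the agent loop is indifferent to where commands actually run. This is a minimal illustrative sketch, not Tsugi's actual code: the `ExecutionEnvironment` interface and `LocalSandbox` class names are assumptions, and the Vercel Sandbox adapter is only described in a comment.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

interface CommandResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

// Hypothetical shared interface: both the local dev adapter and a
// production Vercel Sandbox adapter would implement this shape.
interface ExecutionEnvironment {
  runCommand(cmd: string, args: string[]): Promise<CommandResult>;
}

// Development adapter backed by Node's child_process.
class LocalSandbox implements ExecutionEnvironment {
  async runCommand(cmd: string, args: string[]): Promise<CommandResult> {
    try {
      const { stdout, stderr } = await execFileAsync(cmd, args);
      return { stdout, stderr, exitCode: 0 };
    } catch (err: any) {
      // Promisified execFile rejects on non-zero exit; surface the details.
      return {
        stdout: err.stdout ?? "",
        stderr: err.stderr ?? "",
        exitCode: typeof err.code === "number" ? err.code : 1,
      };
    }
  }
}
```

In production, a `VercelSandbox` class (not shown) would satisfy the same interface by proxying each command into an isolated microVM, letting the rest of the harness stay unchanged.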
Challenges we ran into
- Native + Custom Tools Incompatibility: Gemini doesn't allow native grounding tools and custom tools in the same request. We built a wrapper that makes nested `generateText` calls with native tools enabled, giving us the best of both worlds.
- Sandbox Reconnection: Multi-turn tasks need persistent sandboxes, but Vercel sandboxes time out after 5 minutes. We implemented health checks, sandboxId tracking, and graceful fallback to new sandboxes when reconnection fails.
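The nested-call wrapper can be sketched roughly like this. A minimal sketch with assumed names: `GenerateFn` stands in for the Vercel AI SDK's `generateText` (injected so the pattern is self-contained), and `makeGroundedSearchTool` is a hypothetical helper, not Tsugi's actual API.

```typescript
// Gemini rejects requests that mix native grounding tools with custom
// function tools. The workaround: expose "search" to the outer agent as an
// ordinary *custom* tool whose execute() issues a nested model call where
// ONLY the native grounding tools are enabled.

type GenerateFn = (opts: {
  prompt: string;
  useNativeGrounding: boolean;
}) => Promise<{ text: string }>;

function makeGroundedSearchTool(generate: GenerateFn) {
  return {
    name: "google_search",
    description: "Search the web via Gemini's native grounding",
    async execute({ query }: { query: string }): Promise<string> {
      // Nested request: native grounding on, no custom tools, so the
      // incompatibility never arises within a single request.
      const result = await generate({ prompt: query, useNativeGrounding: true });
      return result.text;
    },
  };
}
```

The outer agent call then registers this alongside its other custom tools (shell, file I/O), while grounding quietly happens one level down.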
- Extracting Real Procedures, Not Ideal Ones: The Skill Agent kept inventing "better" approaches instead of documenting what actually worked. We refined the prompt to capture "ACTUAL working procedure using ONLY tools/methods actually used."
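The enforcement we converged on amounts to a hard constraint in the Skill Agent's system prompt. The constant below is an illustrative excerpt, not our verbatim prompt; only the "ACTUAL working procedure using ONLY tools/methods actually used" wording comes from the write-up above.

```typescript
// Hypothetical excerpt of the Skill Agent system prompt. The core rule is
// quoted from our prompt iteration; the surrounding lines are illustrative.
const SKILL_EXTRACTION_RULES = `
Document the ACTUAL working procedure using ONLY tools/methods actually used.
Do NOT propose alternative or "better" approaches that were not executed.
If a step failed before succeeding, record both the failure and the fix.
`.trim();
```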
- Multi-Step Token Counting: The agent route only captured the final step's token usage. We modified the API to accumulate usage across all step-finish events for accurate metrics.
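The accumulation fix reduces to summing per-step usage instead of reading only the last step. A minimal sketch under assumed names: `TokenUsage` and `accumulateUsage` are illustrative, and in the real route this would run inside the AI SDK's step-finish callback.

```typescript
// Per-step usage as reported on each step-finish event.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Sum usage across all steps of a multi-step agent run. Reading only the
// final step undercounts every earlier tool-calling round trip.
function accumulateUsage(steps: TokenUsage[]): TokenUsage {
  return steps.reduce(
    (total, step) => ({
      inputTokens: total.inputTokens + step.inputTokens,
      outputTokens: total.outputTokens + step.outputTokens,
    }),
    { inputTokens: 0, outputTokens: 0 },
  );
}
```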
Accomplishments that we're proud of
- Measurable 3-5x speedup from Run 1 to Run 2 on integration tasks
- Production-ready dual-agent architecture with clean separation between execution and learning
- Deep Gemini integration leveraging reasoning, KV caching, and native grounding
- Working demo suite: YouTube→Notion sync and personalized morning briefs
- Comparison mode UI with side-by-side Run 1 vs Run 2 visualization showing time/token savings
- Skills encode two knowledge types: procedural (how APIs actually work) AND preferences (how you want things done)
What we learned
- Skill quality requires enforcement: Agents will naturally try to "improve" procedures. Strict instructions to document what worked, not what should work, are essential.
- Preferences are underrated knowledge: Much of the friction in AI interactions comes from not knowing user-specific taxonomies and constraints—things no documentation can teach.
- Sandbox lifecycle is non-trivial: Persistent execution environments across conversation turns require careful state management, health checks, and graceful degradation.
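The reconnect-or-recreate flow for sandbox lifecycle can be sketched as below. All names are assumptions for illustration; `reconnect`, `healthCheck`, and `create` are injected stand-ins for the real Vercel Sandbox operations.

```typescript
interface Sandbox {
  id: string;
}

// Given a sandboxId persisted from a previous turn, try to reconnect and
// verify health; on any failure, degrade gracefully to a fresh sandbox.
async function getSandbox(
  savedId: string | null,
  reconnect: (id: string) => Promise<Sandbox | null>,
  healthCheck: (sb: Sandbox) => Promise<boolean>,
  create: () => Promise<Sandbox>,
): Promise<Sandbox> {
  if (savedId) {
    try {
      const sb = await reconnect(savedId);
      // Reuse the sandbox only if it reconnects AND still responds.
      if (sb && (await healthCheck(sb))) return sb;
    } catch {
      // Timed out or gone (e.g. past the 5-minute limit): fall through.
    }
  }
  return create();
}
```

The caller stores the returned `id` for the next turn, so a recreated sandbox transparently replaces a dead one in the conversation state.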
Built With
- braintrust
- claude
- gemini
- typescript
- vercel