Inspiration
As AI agents move from experimental scripts to enterprise-grade employees, we noticed a terrifying gap: there is no "Undo" button for agent behavior. Today, an engineer can change a single word in a system prompt and accidentally break an entire agentic workflow, with no way to diff the change and no test suite to catch the regression. We built Canary to bring engineering rigor to the agent economy, providing the reliability layer needed to move agents from prototypes to enterprise-ready services.
What it does
Canary is a dedicated version control and CI/CD platform for AI agents. While Fetch.ai's Agentverse provides hosting and discovery, Canary provides the deployment gates and behavioral safeguards.
- Behavioral Versioning Engine: Unlike standard Git, which tracks code, Canary versions the "soul" of the agent: its prompts, tool configurations, and its performance. This creates an immutable audit trail for every iteration.
- Multi-Agent Orchestration: We deployed a suite of specialized agents on Agentverse using the uAgents framework:
  - Version Agent: snapshots agent configurations and assigns unique IDs.
  - Diff Agent: uses LLM-as-a-judge to compare behavioral outputs between versions, highlighting "silent regressions."
  - Deploy Agent: manages the state machine for canary rollouts (Staging → Canary → Prod).
  - Audit Agent: provides a full "flight recorder" replay of any agent session for compliance and debugging.
- Automated Eval Pipelines: Every push to Canary triggers a behavioral test suite. If accuracy or safety scores drop below a defined threshold, the Deploy Agent automatically halts the rollout and rolls back to the last known good (LKG) version.
- Chat Protocol Integration: Built natively on the Fetch.ai Chat Protocol, so developers can manage their entire deployment pipeline directly through the ASI:One interface.
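The Deploy Agent's staged rollout can be sketched as a small state machine. This is a minimal illustration, not the actual Canary code: the class names, stage names, and the 0.9 eval threshold are our assumptions.

```python
from enum import Enum


class Stage(Enum):
    STAGING = "staging"
    CANARY = "canary"
    PROD = "prod"
    ROLLED_BACK = "rolled_back"


class CanaryRollout:
    """Advance a release through Staging -> Canary -> Prod, halting and
    rolling back to the last known good version when an eval gate fails."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # minimum acceptable eval score
        self.stage = Stage.STAGING

    def report_eval(self, score: float) -> Stage:
        """Feed in the latest behavioral eval score; promote or roll back."""
        if score < self.threshold:
            # Safety gate failed: halt the rollout and revert traffic.
            self.stage = Stage.ROLLED_BACK
        elif self.stage is Stage.STAGING:
            self.stage = Stage.CANARY
        elif self.stage is Stage.CANARY:
            self.stage = Stage.PROD
        return self.stage
```

In this sketch, each eval result either promotes the release one stage or trips the gate; a real Deploy Agent would also shift traffic percentages between stages.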
How we built it
- python (Core Logic)
- uagents (Fetch.ai Framework)
- agentverse (Hosting & Discovery)
- claude-3-5-sonnet (The "Judge" for Evals & Reasoning)
- mongodb-atlas (Version & Audit Log Storage)
- git (Back-end Versioning)
- typescript (Scripting/Hooks)
- json (Schema Management)
Challenges we ran into
Quantifying "Behavioral Change": Defining what constitutes a "regression" in a non-deterministic system was tough. We overcame this by building a "Golden Test Set" evaluator that compares semantic intent rather than just string matching. State Machine Reliability: Managing a "Canary Deploy" (gradual rollout) across decentralized agents required careful orchestration to ensure that traffic shifted only when safety gates were cleared. Latency vs. Rigor: Running full evaluations on every push can be slow. We optimized our pipeline to run essential "smoke tests" first, providing immediate feedback to the developer.
Accomplishments that we're proud of
- Built Git-style version control for agents (immutable versions + history).
- Implemented CI/CD-style release gates (eval → canary → promote/rollback).
- Added automatic rollback to the last known good version when reliability drops.
- Created auditable deployment trails so every release decision is explainable.
- Unified this in one developer workflow (CLI + backend + dashboard) focused on agent reliability, not just shipping speed.
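The immutable-history-plus-rollback combination can be sketched as an append-only store that tracks the last known good entry. This is a minimal sketch under our own assumptions (in the real system the history lives in MongoDB Atlas, not an in-memory list, and `VersionStore` is a hypothetical name):

```python
import hashlib
import json


class VersionStore:
    """Append-only history of agent configs with last-known-good tracking."""

    def __init__(self):
        self.history = []      # append-only: entries are never mutated or removed
        self.lkg_index = None  # index of the last version that passed evals

    def push(self, config: dict) -> str:
        """Snapshot a config and return a content-derived version ID."""
        version_id = hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest()[:12]
        self.history.append({"id": version_id, "config": config})
        return version_id

    def mark_good(self, version_id: str) -> None:
        """Record that this version cleared the eval gates."""
        self.lkg_index = next(
            i for i, v in enumerate(self.history) if v["id"] == version_id
        )

    def rollback(self) -> dict:
        """Return the last-known-good config (used when an eval gate fails)."""
        if self.lkg_index is None:
            raise RuntimeError("no known-good version to roll back to")
        return self.history[self.lkg_index]["config"]
```

Deriving the version ID from a canonical JSON hash means identical configs always map to the same ID, which is what makes the audit trail tamper-evident.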
What we learned
- Prompt-as-Code is different: Traditional CI/CD focuses on logic; agent CI/CD must focus on intent. We learned how to use LLMs to evaluate other LLMs and create meaningful "behavioral diffs."
- The Power of the Chat Protocol: By building our UI entirely within the Chat Protocol, we realized that the future of DevOps isn't just dashboards; it's conversational.
- Infrastructure Synergy: We discovered that by sitting on top of Agentverse rather than competing with it, we could amplify the value of every other agent in the ecosystem by making them safer to deploy.
What's next for Canary
Next for Canary is turning a strong prototype into a production-ready reliability platform: complete the full canary promotion flow, add proper database migrations, and ship clear reliability analytics that prove impact (like fewer incidents and faster rollback recovery). From there, strengthen evals with deeper regression/adversarial testing and add enterprise controls like approvals, access control, and alerting. The goal is simple: make Canary the standard way teams safely ship and operate AI agents in production.
Built With
- agentverse
- anthropic
- fastapi
- fetch.ai
- javascript
- next.js
- python
- sqlite
- supabase
- tailwindcss
- typescript