About Triagent
The AI on-call engineer that doesn't quit when things break.
The next wave of software runs on AI agents — autonomous systems that read your infrastructure, reason about it, and act. Every team building these agents in 2026 hits the same wall: they're too fragile to put on-call. One provider brownout, one bad tool response, one runaway token budget, and the agent stalls or hallucinates. The first time it fails an SRE at 3am, it gets turned off.
Triagent is built to break that wall.
It's a resilient AI incident-response agent — point it at a Kubernetes cluster, and it triages alerts, investigates failing pods, reasons across multiple LLM providers, and ships a verdict with a root cause and a confidence score. Every LLM call routes through the TrueFoundry AI Gateway:
- When a provider browns out, the gateway reroutes.
- When a tool gets poisoned, the agent substitutes an alternate path.
- When the budget tightens, the routing policy shifts to cheaper models.
Across a 120-investigation chaos eval, naive baselines drop to 0% the moment anything fails. Triagent stays at 100%.
The bigger bet behind it: resilient agents are the only kind that ship to production. Every primitive Triagent demonstrates — brownout-aware fallback, tool quarantine, cross-provider ensemble verification, cost-aware routing — is something every production AI agent will eventually need. We built it now so that what's currently "the impressive demo" becomes the default floor for what production AI looks like in 2027.
What's next
The Kubernetes demo is the wedge, not the destination. The next twelve months are about three moves.
1. Generalize beyond Kubernetes
The investigation loop and the resilience layer are infrastructure-agnostic. The same primitives apply to incident response on cloud providers (AWS, GCP, Azure), serverless platforms, data pipelines, and CI/CD systems.
The roadmap: replace the current tool registry (kubectl, prometheus, loki) with a pluggable adapter pattern, and let teams point Triagent at whatever their infrastructure runs on.
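One way a pluggable adapter pattern could look, as a rough sketch rather than Triagent's actual design: each tool implements a small interface, and the investigation loop only talks to a registry. `ToolAdapter`, `AdapterRegistry`, and the two stub adapters below are hypothetical names, and the stubs return canned strings instead of calling real tools.

```python
from typing import Protocol


class ToolAdapter(Protocol):
    """Hypothetical interface every infrastructure tool would implement."""

    name: str

    def run(self, query: str) -> str: ...


class AdapterRegistry:
    """Maps tool names to adapters so the agent loop stays tool-agnostic."""

    def __init__(self) -> None:
        self._adapters: dict[str, ToolAdapter] = {}

    def register(self, adapter: ToolAdapter) -> None:
        if adapter.name in self._adapters:
            raise ValueError(f"adapter {adapter.name!r} already registered")
        self._adapters[adapter.name] = adapter

    def run(self, name: str, query: str) -> str:
        if name not in self._adapters:
            raise KeyError(f"no adapter registered for {name!r}")
        return self._adapters[name].run(query)

    def names(self) -> list[str]:
        return sorted(self._adapters)


class StubKubectlAdapter:
    """Stand-in for a Kubernetes adapter; does not shell out to kubectl."""

    name = "kubectl"

    def run(self, query: str) -> str:
        return f"[kubectl stub] {query}"


class StubCloudLogsAdapter:
    """Stand-in for a non-Kubernetes source, e.g. a cloud logging service."""

    name = "cloud-logs"

    def run(self, query: str) -> str:
        return f"[cloud-logs stub] {query}"


registry = AdapterRegistry()
registry.register(StubKubectlAdapter())
registry.register(StubCloudLogsAdapter())
print(registry.names())
print(registry.run("cloud-logs", "errors in last 5m"))
```

Under this shape, pointing the agent at a new platform means writing one adapter class and registering it, with no change to the investigation loop.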
2. Ship resilience as a standalone SDK
The brownout-aware fallback chain, MCP tool quarantine, ensemble verification, and cost-aware routing don't belong inside a single demo agent.
We're extracting them into an open-source SDK that any team building on TrueFoundry can drop in — turning "build resilient agents" from a custom engineering project into a five-line dependency.
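As an illustration of what one such primitive might look like as a standalone function, here is a small sketch of cross-provider ensemble verification. It assumes each provider returns a short root-cause label and that confidence is simply the fraction of providers that agree. The source does not say how Triagent computes its confidence score, so treat `Verdict`, `ensemble_verdict`, and the majority rule as assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    root_cause: str     # normalized label the majority agreed on
    confidence: float   # agreeing providers / total providers
    votes: int
    total: int


def ensemble_verdict(answers: dict[str, str]) -> Optional[Verdict]:
    """Return the strict-majority root cause, or None if providers split.

    `answers` maps a provider name to the root-cause label it proposed.
    Labels are compared case-insensitively with whitespace stripped.
    """
    if not answers:
        raise ValueError("need at least one provider answer")
    normalized = [label.strip().lower() for label in answers.values()]
    label, votes = Counter(normalized).most_common(1)[0]
    total = len(normalized)
    if votes * 2 <= total:
        # No strict majority: better to escalate than to guess.
        return None
    return Verdict(root_cause=label, confidence=votes / total,
                   votes=votes, total=total)


# Hypothetical provider outputs for one failing pod.
agreed = ensemble_verdict({
    "provider_a": "OOMKilled",
    "provider_b": "oomkilled ",
    "provider_c": "ImagePullBackOff",
})
split = ensemble_verdict({
    "provider_a": "OOMKilled",
    "provider_b": "ImagePullBackOff",
})
print(agreed)
print(split)
```

Returning `None` on a split vote keeps the caller honest: a disagreement between providers surfaces as "no verdict" instead of a low-quality guess.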
3. Move from observation to action
Triagent currently investigates and explains. The next step is autonomous remediation with guardrails: applying a manifest rollback, restarting a deployment, and paging the right human only when escalation is warranted.
This requires:
- Human-in-the-loop approval flows
- Change windows and rollback safety nets
- Audit trails for every action taken
The agent becomes a real on-call teammate, not just a smarter dashboard.
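The guardrails listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not a description of shipped behavior: `GuardedExecutor`, `AuditEntry`, and the `approve` hook are hypothetical, the "deployment" is an in-memory dict, and change-window checks are left out for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class AuditEntry:
    timestamp: str   # UTC, ISO 8601
    action: str
    approved: bool
    outcome: str


@dataclass
class GuardedExecutor:
    # Hypothetical human-in-the-loop hook: returns True if a person approves.
    approve: Callable[[str], bool]
    audit_log: list[AuditEntry] = field(default_factory=list)

    def execute(
        self,
        action: str,
        apply: Callable[[], None],
        rollback: Callable[[], None],
    ) -> bool:
        """Run `apply` only if approved; roll back and log on failure."""
        if not self.approve(action):
            self._record(action, False, "skipped: not approved")
            return False
        try:
            apply()
        except Exception as exc:
            rollback()
            self._record(action, True, f"rolled back: {exc}")
            return False
        self._record(action, True, "applied")
        return True

    def _record(self, action: str, approved: bool, outcome: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append(AuditEntry(stamp, action, approved, outcome))


# In-memory stand-in for a deployment; no cluster is touched.
state = {"replicas": 3}


def failing_restart() -> None:
    state["replicas"] = 0
    raise RuntimeError("health check failed")


def restore() -> None:
    state["replicas"] = 3


executor = GuardedExecutor(approve=lambda action: True)
ok = executor.execute("restart deployment", failing_restart, restore)
print(ok, state, executor.audit_log[-1].outcome)
```

Every path through `execute` writes an audit entry, including denied and rolled-back actions, so the log answers "what did the agent try, and who allowed it" after the fact.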
The end state: a production AI agent should be as reliable as a senior SRE on a calm Tuesday — and degrade gracefully when chaos hits, the same way a senior SRE does. Triagent today proves the resilience primitives. The vision is making those primitives the new floor.
Built With
- fastapi
- framer-motion
- google-gemini
- groq
- httpx
- k3d
- kubernetes
- langgraph
- matplotlib
- ollama
- openrouter
- prometheus
- pydantic
- python
- react
- tailwindcss
- three.js
- truefoundry
- typescript
- uvicorn
- vite
- zustand