GIF
Chaos Simulator With Triagent On The Move

About Triagent

The AI on-call engineer that doesn't quit when things break.

The next wave of software runs on AI agents — autonomous systems that read your infrastructure, reason about it, and act. Every team building these agents in 2026 hits the same wall: they're too fragile to put on-call. One provider brownout, one bad tool response, one runaway token budget, and the agent stalls or hallucinates. The first time it fails an SRE at 3am, it gets turned off.

Triagent is built to break that wall.

It's a resilient AI incident-response agent — point it at a Kubernetes cluster, and it triages alerts, investigates failing pods, reasons across multiple LLM providers, and ships a verdict with a root cause and a confidence score. Every LLM call routes through the TrueFoundry AI Gateway:

When a provider browns out — the gateway reroutes.
When a tool gets poisoned — the agent substitutes an alternate path.
When the budget tightens — the routing policy shifts to cheaper models.

Across a 120-investigation chaos eval, naive baselines drop to 0% the moment anything fails. Triagent stays at 100%.

The bigger bet behind it: resilient agents are the only kind that ship to production. Every primitive Triagent demonstrates — brownout-aware fallback, tool quarantine, cross-provider ensemble verification, cost-aware routing — is something every production AI agent will eventually need. We built it now so that what's currently "the impressive demo" becomes the default floor for what production AI looks like in 2027.

What's next

The Kubernetes demo is the wedge, not the destination. The next twelve months are about three moves.

1. Generalize beyond Kubernetes

The investigation loop and the resilience layer are infrastructure-agnostic. The same primitives apply to incident response on cloud providers (AWS, GCP, Azure), serverless platforms, data pipelines, and CI/CD systems.

The roadmap: swap the tool registry from kubectl, prometheus, loki to a pluggable adapter pattern, and let teams point Triagent at whatever their infrastructure runs on.

2. Ship resilience as a standalone SDK

The brownout-aware fallback chain, MCP tool quarantine, ensemble verification, and cost-aware routing don't belong inside a single demo agent.

We're extracting them into an open-source SDK that any team building on TrueFoundry can drop in — turning "build resilient agents" from a custom engineering project into a five-line dependency.

3. Move from observation to action

Triagent currently investigates and explains; the next step is autonomous remediation with guardrails — applying a manifest rollback, restarting a deployment, paging the right human only when escalation is warranted.

This requires:

Human-in-the-loop approval flows
Change windows and rollback safety nets
Audit trails for every action taken

The agent becomes a real on-call teammate, not just a smarter dashboard.

The end state: a production AI agent should be as reliable as a senior SRE on a calm Tuesday — and degrade gracefully when chaos hits, the same way a senior SRE does. Triagent today proves the resilience primitives. The vision is making those primitives the new floor.

Built With

fastapi
framer-motion
google-gemini
groq
httpx
k3d
kubernetes
langgraph
matplotlib
ollama
openrouter
prometheus
pydantic
python
react
tailwindcss
three.js
truefoundry
typescript
uvicorn
vite
zustand

Updates

Saugi Alkaf started this project — May 25, 2026 01:02 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.