Self-Healing SRE Agent

live_logs_monitoring
resolved_state_artifact
initial_state_dashboard

Inspiration

In the "Action Era" of AI, we wanted to move beyond simple chatbots. Every developer knows the pain of being on-call—waking up at 3 AM to fix a production outage. We asked a fundamental question: "Can Gemini 3 not just tell us what's wrong, but actually fix it reliably without waking us up?" We wanted to build an agent that acts as a true Site Reliability Engineer, capable of reasoning through complex system failures.

What it does

The Self-Healing SRE Agent is an autonomous system that manages the entire incident lifecycle:

Detects: Continuously monitors production logs for error spikes.
Diagnoses: Uses Gemini 3 Pro (Reasoning Level: HIGH) to analyze stack traces, understand context, and identify the root cause.
Governs: Generates a cryptographic "Thought Signature" of its reasoning. This prevents hallucinated fixes; the execution layer only runs commands that are signed by the diagnosis layer.
Pinpoints: autonomously runs git bisect to identify the specific commit that introduced the bug.
Heals: Decides whether to Revert the bad commit or apply a Hotfix, and executes the repair via Kubernetes.

How we built it

Gemini 3 Pro API: Utilized for high-level reasoning and root cause analysis.
Google Antigravity: Used to scaffold the agent's skills and orchestration loop.
Python (Backend): Handles the core logic (agent_loop.py), Git integration skills, and Log monitoring.
React & Vite (Frontend): Built a simulation dashboard to visualize the agent's "brain" and state changes in real-time.
Pydantic: Used for strict data modeling of the "Incident Artifacts" to ensure type safety.

Challenges we ran into

The biggest challenge was Trust. Giving an AI permission to modify production environments is risky. We struggled with how to prevent the agent from "hallucinating" a fix that could make things worse. We solved this by implementing Thought Signatures—a hash chain that links the identified problem to the proposed solution. If the signature doesn't match at the execution step, the agent halts immediately.

Accomplishments that we're proud of

We are proud of building a "Marathon Agent" that completes a full loop—from error detection to git-revert—without human intervention. Seeing the agent autonomously identify a bad commit and roll it back in the simulation was a huge "Aha!" moment for us. We successfully demonstrated that AI can be an active guardian of code reliability, not just a passive coding assistant.

What we learned

We learned that Governance is just as important as Intelligence. Building a smart agent is easy; building a safe agent requires strict state management and verification loops. We also learned how to leverage Gemini 3's "High Thinking Level" to solve logic puzzles that would stump standard models.

What's next for Self-Healing SRE Agent

Real Cluster Integration: Moving from mock K8s commands to actual kubectl execution on a live GKE cluster.
Preventative Healing: evolving the agent to predict crashes before they happen by analyzing latency trends.
Multi-Service Debugging: expanding the scope to trace errors across microservices using distributed tracing data.

Built With

gemini-3-pro
git
google-antigravity
kubernetes
python
react
vite

Updates

Harsh Fiske started this project — Jan 15, 2026 06:51 AM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.