Inspiration

This project started from a conversation with my cousin, an SRE. They were telling me about the dreaded "2 AM pager", waking up in the middle of the night just to restart a pod or revert a broken deployment.

I wondered if this was just an issue at their company or an industry-wide problem, so I started asking around on Reddit and DevOps communities. After talking to over 15 SREs, the consensus was clear: the biggest bottleneck in recovering from an incident is just the time it takes for a human to wake up, read the logs, and figure out what's going on. I built Praxis to fix that. It's an AI agent that wakes up at 2 AM, diagnoses the issue, writes the patch, and has a Merge Request ready before you even get out of bed.

How I built it

Praxis is set up as a multi-agent system to mimic a senior engineering team.

  • The Brain: I used Gemini 2.5 Pro to coordinate everything, and Gemini 2.5 Flash for the faster, smaller sub-tasks.
  • The Governance Board: I didn't want the AI pushing code blindly, so I set up a concurrent evaluation pipeline. Specialized AI "Critics" (focusing on Security, FinOps, and Architecture) review the patch. If the FinOps agent sees the patch will spike API costs, it vetoes the code and forces a rewrite.
  • The QA Agent: Once a patch is approved, another agent writes Playwright regression tests so the bug doesn't happen again.
  • Alert Tuning: I used Pinecone as a vector database for the agent's memory. I gave it a 2-hour TTL to prevent it from getting spammed by duplicate alerts.
  • Human-in-the-loop: I hooked it all up to Slack and GitLab. Praxis sends a summary and the patch to Slack. You just hit "Approve & Merge," and the webhook triggers the GitLab pipeline.

Challenges I ran into

Letting an AI write and commit real code is a bit nerve-wracking, and I definitely hit some walls:

  1. The "Split-Brain" Webhook Bug: I struggled a lot with state management between my local environment and the cloud webhooks. My Slack interactive buttons kept failing because the patch was in my local memory but Slack was pinging the cloud. I had to set up secure tunneling and fix the state management to get it working smoothly.
  2. Context Window Exhaustion: At first, I tried dumping thousands of lines of telemetry, logs, and Git history into the model all at once, which just led to hallucinations. I got around this by breaking the problem down into the multi-agent system, so each agent only gets the specific context it actually needs.

Accomplishments that I'm proud of

By automating the triage, root-cause analysis, and patching, Praxis basically shrinks that MTTR (Mean Time to Recovery) down to however long it takes a human to click "Approve."

I'm really proud of getting the multiple AI agents to actually debate each other on the Governance Board and consistently land on a safe, cost-effective fix without me having to hold their hands.

What I learned

I realized that AI in software engineering is less about replacing developers and more about shifting us from writing every line of code to reviewing it. On a technical level, I learned the hard way how complex Slack's interactive webhook payloads can be, and how to actually tune a Pinecone database for short-term memory.

What's next for Praxis

  • Auto-Rollbacks: Tying it into Datadog so Praxis can automatically revert Git commits if it sees a spike in 500 errors right after a deployment.
  • Runbook Ingestion: Letting teams upload their own custom PDF or Markdown runbooks so Praxis can follow specific internal compliance steps during an incident.

Built With

Share this project:

Updates