Inspiration

Modern Site Reliability Engineering (SRE) often involves "alert fatigue," where engineers are overwhelmed by false positives and manual diagnostic tasks. We were inspired to build AegisOps AI to move beyond passive monitoring and create an autonomous agent that doesn't just notify an engineer, but actively suggests or executes remediation steps in real time.

What it does

AegisOps AI is a full-stack, intelligent incident-response engine. It intercepts raw server logs, routes them through a high-performance LLM (Crusoe Cloud Nemotron-3-Nano) to analyze patterns and severity, and returns a structured JSON payload containing the incident type, root cause, and a specific bash command to resolve the issue. If the primary AI engine is unreachable, the system automatically falls back to a high-speed heuristic engine to ensure continuous protection.
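As a rough illustration of the heuristic fallback described above, the sketch below matches a log line against a few regular expressions and returns the same three fields the LLM path produces. The field names (`incidentType`, `rootCause`, `command`), the rules, and the read-only diagnostic commands are all assumptions made for this example, not the project's actual rule set.

```typescript
// Hypothetical payload shape, based on the three fields named in the text.
interface IncidentReport {
  incidentType: string;
  rootCause: string;
  command: string;
  source: "llm" | "heuristic";
}

type Rule = { pattern: RegExp; report: Omit<IncidentReport, "source"> };

// Illustrative rules only; patterns and commands are assumptions.
const RULES: Rule[] = [
  {
    pattern: /out of memory|oom-killer/i,
    report: {
      incidentType: "memory_exhaustion",
      rootCause: "A process exceeded available memory",
      command: "free -m",
    },
  },
  {
    pattern: /no space left on device/i,
    report: {
      incidentType: "disk_full",
      rootCause: "A filesystem has run out of free space",
      command: "df -h",
    },
  },
  {
    pattern: /connection refused/i,
    report: {
      incidentType: "service_unreachable",
      rootCause: "Nothing is listening on the target port",
      command: "ss -tlnp",
    },
  },
];

// Return the first matching rule, or an "unknown" report with no command.
function heuristicAnalyze(log: string): IncidentReport {
  for (const rule of RULES) {
    if (rule.pattern.test(log)) {
      return { ...rule.report, source: "heuristic" };
    }
  }
  return {
    incidentType: "unknown",
    rootCause: "No heuristic rule matched the log line",
    command: "",
    source: "heuristic",
  };
}
```

Tagging each report with a `source` field is one way to let downstream consumers tell which engine produced a diagnosis.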

How we built it

We chose each part of the stack with low latency in mind:

Backend: Hono (a high-performance web framework) deployed on Render.

Inference: Crusoe Cloud Managed Inference for running the Nemotron-3-Nano LLM.

Intelligence: Custom JSON-Schema prompting to ensure all AI outputs are machine-parseable by our automated remediation engine.

Architecture: A circuit-breaker pattern switches between the primary AI engine and a secondary heuristic fallback, so the system keeps responding when the primary engine is unreachable.
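The circuit-breaker item above could look something like the following minimal sketch. The name `withCircuitBreaker`, the failure threshold, and the cooldown are assumptions chosen for illustration; the project's real thresholds and structure are not stated in this write-up.

```typescript
type Analyzer<T> = (log: string) => Promise<T>;

interface BreakerOptions {
  maxFailures?: number; // consecutive failures before the circuit opens (assumed)
  cooldownMs?: number;  // how long to skip the primary once open (assumed)
  now?: () => number;   // injectable clock, useful for testing
}

// Wrap a primary analyzer so that repeated failures route traffic
// to the fallback until a cooldown period has passed.
function withCircuitBreaker<T>(
  primary: Analyzer<T>,
  fallback: Analyzer<T>,
  options: BreakerOptions = {},
): Analyzer<T> {
  const { maxFailures = 3, cooldownMs = 30_000, now = Date.now } = options;
  let failures = 0;
  let openedAt: number | null = null;

  return async (log: string): Promise<T> => {
    // Circuit open and still cooling down: do not call the primary at all.
    if (openedAt !== null && now() - openedAt < cooldownMs) {
      return fallback(log);
    }
    try {
      // Closed circuit, or a trial call after the cooldown has elapsed.
      const result = await primary(log);
      failures = 0;
      openedAt = null;
      return result;
    } catch {
      failures += 1;
      if (failures >= maxFailures) {
        openedAt = now(); // open (or re-open) the circuit
      }
      return fallback(log);
    }
  };
}
```

In this sketch every failed primary call is still answered by the fallback, so callers always get a result; the breaker only decides whether the primary is attempted first.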

Challenges we ran into

The most significant hurdle was maintaining the strict execution environment required for the automated remediation scripts. We encountered permission constraints in Linux containers while attempting to execute shell-based commands. We solved this by architecting a "Heuristic Fallback" layer that acts as a fail-safe, ensuring the system remains functional even if the primary inference pipeline experiences rate-limiting or network volatility.

Accomplishments that we're proud of

We are particularly proud of the "Chaos-Proof" fallback mechanism. Achieving a zero-downtime diagnostic pipeline that can switch between high-level LLM intelligence and low-level rule-based heuristics is a significant step toward true autonomous SRE.

What we learned

Building AegisOps AI deepened our understanding of the challenges in deploying LLMs for real-world infrastructure management. We learned the importance of "output sanitization": constraining the LLM's non-deterministic output into valid, safe, machine-parseable JSON before any suggested command is acted on. We also learned how critical graceful degradation is in AI-driven systems.
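One way to picture the output sanitization mentioned above is a small gate that parses the model's reply and rejects anything that is not well-formed or not on an allowlist. This is a sketch under stated assumptions: the field names, the `ALLOWED_BINARIES` set, and the metacharacter filter are hypothetical and would need tuning for a real deployment.

```typescript
// Hypothetical payload shape; field names are assumptions.
interface Remediation {
  incidentType: string;
  rootCause: string;
  command: string;
}

// Assumed allowlist of executables the remediation engine may run.
const ALLOWED_BINARIES = new Set(["systemctl", "journalctl", "df", "free"]);

// Return a validated payload, or null if the reply cannot be trusted.
function sanitizeModelOutput(raw: string): Remediation | null {
  // Models sometimes wrap JSON in prose, so keep only the outermost braces.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const { incidentType, rootCause, command } = parsed as Record<string, unknown>;
  if (
    typeof incidentType !== "string" ||
    typeof rootCause !== "string" ||
    typeof command !== "string"
  ) {
    return null;
  }

  // Reject shell metacharacters that could chain, redirect, or substitute commands.
  if (/[;&|`$<>\n\\]/.test(command)) return null;

  // Only permit commands whose first token is on the allowlist.
  const trimmed = command.trim();
  const binary = trimmed.split(/\s+/)[0] ?? "";
  if (!ALLOWED_BINARIES.has(binary)) return null;

  return { incidentType, rootCause, command: trimmed };
}
```

A `null` result is a natural trigger for the heuristic fallback, since an unparseable or unsafe reply is treated the same as an unavailable model.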

What's next for AegisOps AI

The next stage for AegisOps AI involves integrating direct infrastructure provider APIs (like AWS/GCP SDKs) so the system can move from suggesting bash commands to directly executing remediation workflows in isolated sandbox environments. We also aim to expand the model's training to handle multi-step, complex incident cascades.
