Inspiration

Modern cloud systems are highly interconnected, and failures in one service can quickly cascade across the stack. A database bottleneck, for example, can lead to API latency, payment failures, and business loss. Most dashboards only visualize the issue after it happens, but they do not actively reason about root cause or recover automatically.

I built NovaSRE to demonstrate how cloud reliability engineering can evolve from passive monitoring into autonomous, AI-assisted self-healing operations.

What it does

NovaSRE is an autonomous cloud self-healing system that simulates live infrastructure telemetry and incident response.

It continuously monitors:

  • API CPU utilization
  • database connection usage
  • database latency
  • payment service error rate
  • revenue throughput

The system detects both warning and critical states. When a critical database-driven incident occurs, NovaSRE:

  1. identifies the incident
  2. uses Amazon Nova through AWS Bedrock for root-cause reasoning
  3. determines affected services and business impact
  4. applies simulated mitigation automatically
  5. confirms recovery and logs stabilization

This creates a full detect → reason → heal → recover loop.

How I built it

I built NovaSRE using:

  • Python
  • Streamlit for the interactive dashboard
  • Pandas and NumPy for telemetry simulation and metric handling
  • NetworkX and Matplotlib for dependency graph visualization
  • Boto3 with AWS Bedrock
  • Amazon Nova Lite for structured AI-based incident reasoning

The project is organized into modular agents:

  • a detection agent for incident classification
  • a reasoning agent for Amazon Nova-based analysis
  • an execution agent for autonomous mitigation
  • a dependency agent for service impact visualization

A simulation module continuously generates healthy, degrading, critical, and recovering system behavior to create a realistic incident lifecycle.

Challenges I ran into

One challenge was designing the system so that it felt like a real autonomous workflow instead of just a dashboard with random numbers. I needed the metrics, incident thresholds, reasoning flow, mitigation behavior, and recovery logic to all connect meaningfully.

Another challenge was making the app demo-safe. Live AI reasoning can sometimes fail because of access, latency, or configuration issues, so I added a fallback reasoning path to ensure the project still demonstrates the full incident lifecycle reliably.

I also worked on improving the UI so the dependency graph, event logs, and incident flow tell a clear story during a short demo.

Accomplishments that I'm proud of

I’m proud that NovaSRE goes beyond basic monitoring and actually demonstrates autonomous infrastructure behavior.

Key accomplishments include:

  • building a live cloud reliability simulation
  • implementing warning and critical incident detection
  • integrating Amazon Nova for technical root-cause reasoning
  • simulating autonomous mitigation actions
  • confirming recovery through event logs and stabilized metrics
  • creating a full end-to-end SRE-inspired feedback loop

What I learned

Through this project, I learned how to structure an agentic workflow where each component has a distinct responsibility: detection, reasoning, execution, and recovery validation.

I also gained hands-on experience with:

  • Amazon Nova on AWS Bedrock
  • designing AI-assisted operational workflows
  • presenting system health, service dependencies, and business impact visually
  • making AI demos more reliable for real presentation scenarios

What's next for NovaSRE

Future improvements could include:

  • integration with real cloud telemetry sources like CloudWatch or Prometheus
  • multi-incident support
  • anomaly detection using machine learning
  • human approval workflows before mitigation
  • alerting integrations such as Slack or email
  • incident history analytics
  • richer business impact forecasting

NovaSRE is a step toward a future where cloud systems do not just report failures — they understand them, respond intelligently, and recover faster with AI assistance.

Built With

Share this project:

Updates