NovaSRE - Autonomous Cloud Self-Healing System

Live cloud health dashboard showing service status,database load trends, real-time monitoring across API,database,payment,revenue systems.
Live cloud health dashboard showing service status, database load trends,real-time monitoring across API,database,payment,revenue systems.
Autonomous event log capturing the full incident lifecycle,early warning detection,critical escalation,mitigation execution,system recovery
Amazon Nova-powered root cause analysis during a critical database saturation incident, followed by mitigation to stabilize system.

Inspiration

Modern cloud systems are highly interconnected, and failures in one service can quickly cascade across the stack. A database bottleneck, for example, can lead to API latency, payment failures, and business loss. Most dashboards only visualize the issue after it happens, but they do not actively reason about root cause or recover automatically.

I built NovaSRE to demonstrate how cloud reliability engineering can evolve from passive monitoring into autonomous, AI-assisted self-healing operations.

What it does

NovaSRE is an autonomous cloud self-healing system that simulates live infrastructure telemetry and incident response.

It continuously monitors:

API CPU utilization
database connection usage
database latency
payment service error rate
revenue throughput

The system detects both warning and critical states. When a critical database-driven incident occurs, NovaSRE:

identifies the incident
uses Amazon Nova through AWS Bedrock for root-cause reasoning
determines affected services and business impact
applies simulated mitigation automatically
confirms recovery and logs stabilization

This creates a full detect → reason → heal → recover loop.

How I built it

I built NovaSRE using:

Python
Streamlit for the interactive dashboard
Pandas and NumPy for telemetry simulation and metric handling
NetworkX and Matplotlib for dependency graph visualization
Boto3 with AWS Bedrock
Amazon Nova Lite for structured AI-based incident reasoning

The project is organized into modular agents:

a detection agent for incident classification
a reasoning agent for Amazon Nova-based analysis
an execution agent for autonomous mitigation
a dependency agent for service impact visualization

A simulation module continuously generates healthy, degrading, critical, and recovering system behavior to create a realistic incident lifecycle.

Challenges I ran into

One challenge was designing the system so that it felt like a real autonomous workflow instead of just a dashboard with random numbers. I needed the metrics, incident thresholds, reasoning flow, mitigation behavior, and recovery logic to all connect meaningfully.

Another challenge was making the app demo-safe. Live AI reasoning can sometimes fail because of access, latency, or configuration issues, so I added a fallback reasoning path to ensure the project still demonstrates the full incident lifecycle reliably.

I also worked on improving the UI so the dependency graph, event logs, and incident flow tell a clear story during a short demo.

Accomplishments that I'm proud of

I’m proud that NovaSRE goes beyond basic monitoring and actually demonstrates autonomous infrastructure behavior.

Key accomplishments include:

building a live cloud reliability simulation
implementing warning and critical incident detection
integrating Amazon Nova for technical root-cause reasoning
simulating autonomous mitigation actions
confirming recovery through event logs and stabilized metrics
creating a full end-to-end SRE-inspired feedback loop

What I learned

Through this project, I learned how to structure an agentic workflow where each component has a distinct responsibility: detection, reasoning, execution, and recovery validation.

I also gained hands-on experience with:

Amazon Nova on AWS Bedrock
designing AI-assisted operational workflows
presenting system health, service dependencies, and business impact visually
making AI demos more reliable for real presentation scenarios

What's next for NovaSRE

Future improvements could include:

integration with real cloud telemetry sources like CloudWatch or Prometheus
multi-incident support
anomaly detection using machine learning
human approval workflows before mitigation
alerting integrations such as Slack or email
incident history analytics
richer business impact forecasting

NovaSRE is a step toward a future where cloud systems do not just report failures — they understand them, respond intelligently, and recover faster with AI assistance.

Built With

bedrock
boto3
cli
matlab
network
nova
numpy
pandas
python
streamlit

Updates

Shefali wanjari started this project — Mar 15, 2026 01:16 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.