ConfigGuardian AI

Inspiration

In July 2024, a single CrowdStrike configuration file crashed 8.5 million Windows machines globally — $3 billion in losses in 72 hours. In October 2025, a DNS timeout configuration change brought down AWS DynamoDB and 1,000+ dependent services for 7 hours. In 2024, an Intel microcode configuration parameter caused permanent, irreversible CPU damage across millions of processors.

These weren't code bugs. They were configuration changes that passed every existing validation check — valid syntax, correct schema, green CI/CD pipeline — and still caused catastrophic production failures. We asked: why doesn't a tool exist that learns from every past disaster and warns you before you repeat it? ConfigGuardian is our answer.

What it does

ConfigGuardian is an AI-powered configuration change risk predictor built on Amazon Nova 2 Lite. A developer submits a config diff (before/after). Within 8 seconds, three specialized AI agents analyze it, the first two in parallel:

Pattern Recognition Agent — compares the change against a database of real-world infrastructure disasters, calculates multi-factor similarity scores, flags dangerous patterns
Impact Analysis Agent — maps the service dependency graph, predicts cascading failures, estimates financial impact and blast radius
Decision Agent — synthesizes both analyses and makes an autonomous BLOCK / REVIEW / WARN / APPROVE decision

If the risk score is 86 or higher, deployment is automatically blocked. The developer sees which historical disaster their change resembles, the technical reason it's dangerous, the predicted blast radius, and a safer alternative configuration that achieves their original goal with near-zero risk.
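As a rough illustration of the blocking rule, the sketch below maps a risk score to one of the four decisions. Only the BLOCK threshold of 86 comes from the description above. The 0 to 100 range, the REVIEW and WARN cut-offs, and the names `Decision` and `decide` are assumptions made for the example.

```python
from enum import Enum


class Decision(Enum):
    BLOCK = "BLOCK"
    REVIEW = "REVIEW"
    WARN = "WARN"
    APPROVE = "APPROVE"


# Only the BLOCK threshold (86) is stated in the write-up.
# The REVIEW and WARN cut-offs below are hypothetical placeholders.
BLOCK_THRESHOLD = 86
REVIEW_THRESHOLD = 60  # assumption
WARN_THRESHOLD = 30    # assumption


def decide(risk_score: int) -> Decision:
    """Map a risk score (assumed to lie in 0..100) to a deployment decision."""
    if not 0 <= risk_score <= 100:
        raise ValueError(f"risk score out of range: {risk_score}")
    if risk_score >= BLOCK_THRESHOLD:
        return Decision.BLOCK
    if risk_score >= REVIEW_THRESHOLD:
        return Decision.REVIEW
    if risk_score >= WARN_THRESHOLD:
        return Decision.WARN
    return Decision.APPROVE
```

With these placeholder cut-offs, a score of 86 is blocked while a score of 85 falls through to REVIEW.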

How we built it

AI Layer: Three Amazon Nova 2 Lite agents with specialized system prompts and structured JSON output schemas. The Pattern Recognition and Impact Analysis agents run in parallel; the Decision Agent runs once both have returned.

Backend: FastAPI (Python) handles agent orchestration, parallel execution, disaster database queries, and streams results to the frontend via Server-Sent Events (SSE).
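To make the streaming step concrete, here is a minimal sketch of how agent activity could be serialized as Server-Sent Events using only the standard library. The event names (`agent_started`, `agent_finished`), the payload shape, and the helper names are assumptions for illustration, not the project's actual wire format.

```python
import asyncio
import json
from typing import AsyncIterator


def sse_frame(event: str, data: dict) -> str:
    """Serialize one SSE frame: event name, single-line JSON data, blank-line terminator."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def agent_activity(agents: list) -> AsyncIterator[str]:
    """Yield a start and a finish frame for each agent name."""
    for name in agents:
        yield sse_frame("agent_started", {"agent": name})
        await asyncio.sleep(0)  # stand-in for the real model call
        yield sse_frame("agent_finished", {"agent": name})


async def collect(agents: list) -> list:
    """Drain the stream into a list (useful for testing)."""
    return [frame async for frame in agent_activity(agents)]


frames = asyncio.run(collect(["pattern", "impact"]))
```

In a FastAPI route, an async generator like `agent_activity` could be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`, and the browser would read it with `EventSource`.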

Disaster Database: Three real incidents fully catalogued with structured pattern signatures — CrowdStrike Falcon Sensor Update (July 2024), AWS DynamoDB DNS Outage (October 2025), Intel Vmin Shift Bug (2024). Each entry includes config type, parameter changed, change direction and magnitude, risk indicators, outcome data, cascade pattern, and safer alternatives.
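One possible shape for a database entry is sketched below, using the fields listed above. The class names, field names, and the relative-magnitude formula are assumptions, and no real incident data is filled in. The example values are the demo change from the testing instructions (connection_timeout: 30000 → 5000); the "database" config type is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeSignature:
    config_type: str
    parameter: str
    direction: str    # "increase", "decrease", or "unchanged"
    magnitude: float  # relative change: |after - before| / |before| (assumed definition)


@dataclass(frozen=True)
class DisasterEntry:
    """Hypothetical schema mirroring the fields named in the write-up."""
    incident: str
    signature: ChangeSignature
    risk_indicators: tuple = ()
    outcome: str = ""
    cascade_pattern: str = ""
    safer_alternatives: tuple = ()


def describe_change(config_type: str, parameter: str,
                    before: float, after: float) -> ChangeSignature:
    """Derive direction and relative magnitude from a before/after pair."""
    if before == 0:
        raise ValueError("relative magnitude is undefined when the old value is 0")
    if after > before:
        direction = "increase"
    elif after < before:
        direction = "decrease"
    else:
        direction = "unchanged"
    magnitude = abs(after - before) / abs(before)
    return ChangeSignature(config_type, parameter, direction, magnitude)


# Demo change from the testing instructions; config type is a placeholder.
demo = describe_change("database", "connection_timeout", 30000, 5000)
```

Here `demo.direction` is "decrease" and `demo.magnitude` is 25000 / 30000, roughly 0.83.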

Pattern Matching: Multi-factor similarity scoring on a 100-point scale: config type match (40 pts), parameter match (30 pts), change direction (20 pts), magnitude similarity (10 pts). A score above 85 triggers the CRITICAL flag.
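The weighting can be sketched as follows. The point values and the 85 threshold come from the description above; everything else is an assumption, in particular the exact-equality matching for the first three factors and the way partial credit is given for magnitude (closeness of two non-negative relative magnitudes). A real matcher working on structural similarity would likely be fuzzier than string equality.

```python
from typing import NamedTuple


class Signature(NamedTuple):
    config_type: str
    parameter: str
    direction: str
    magnitude: float  # non-negative relative change (assumed representation)


# Point values from the write-up; they sum to 100.
WEIGHTS = {"config_type": 40, "parameter": 30, "direction": 20, "magnitude": 10}
CRITICAL_THRESHOLD = 85


def similarity(change: Signature, incident: Signature) -> float:
    """Score how closely a proposed change resembles a past incident (0..100)."""
    score = 0.0
    if change.config_type == incident.config_type:
        score += WEIGHTS["config_type"]
    if change.parameter == incident.parameter:
        score += WEIGHTS["parameter"]
    if change.direction == incident.direction:
        score += WEIGHTS["direction"]
    # Assumed partial-credit rule: 1.0 when magnitudes are equal, falling to 0.0
    # as they diverge.
    largest = max(change.magnitude, incident.magnitude)
    closeness = 1.0 if largest == 0 else 1 - abs(change.magnitude - incident.magnitude) / largest
    score += WEIGHTS["magnitude"] * closeness
    return score


def is_critical(score: float) -> bool:
    return score > CRITICAL_THRESHOLD
```

Under this rule, a change that matches an incident on type, parameter, and direction already scores 90 and is flagged CRITICAL regardless of magnitude, while a mismatch on parameter alone caps the score at 70.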

Frontend: React 18 + Vite, shadcn/ui, Tailwind CSS, Recharts for visualizations, SSE for live agent activity stream.

Stack: Amazon Nova 2 Lite, FastAPI, React 18, Vite, shadcn/ui, Tailwind CSS, Recharts, Server-Sent Events, JSON disaster database.

Challenges we ran into

Building the disaster database accurately was harder than expected. Extracting structured pattern signatures from real incidents — exact parameter names, change directions, magnitudes, cascade mechanisms — required deep research across post-mortems, news articles, and engineering analyses. Surface-level descriptions weren't enough; we needed the technical specifics for the pattern matching to work.

Coordinating agents in FastAPI. Running the Pattern Recognition and Impact Analysis agents concurrently, while ensuring the Decision Agent always receives complete, valid outputs from both, required careful async handling and error recovery. We built retry logic for the edge cases where an agent's response arrived malformed.
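The coordination pattern described here can be sketched with `asyncio.gather` and a JSON-validating retry wrapper. The agent callables below are stubs standing in for the Amazon Nova 2 Lite calls; their names, payloads, the retry count, and the stub's verdict rule are all hypothetical.

```python
import asyncio
import json
from typing import Awaitable, Callable

AgentCall = Callable[[str], Awaitable[str]]


async def call_with_retry(agent: AgentCall, payload: str, attempts: int = 3) -> dict:
    """Call an agent and parse its reply as a JSON object, retrying on malformed output."""
    last_error = None
    for _ in range(attempts):
        raw = await agent(payload)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError("agent reply was not a JSON object")
    raise RuntimeError(f"agent failed after {attempts} attempts") from last_error


async def analyze(diff: str, pattern_agent: AgentCall,
                  impact_agent: AgentCall, decision_agent: AgentCall) -> dict:
    # The two analysis agents run concurrently; the decision step waits for both.
    pattern, impact = await asyncio.gather(
        call_with_retry(pattern_agent, diff),
        call_with_retry(impact_agent, diff),
    )
    decision_input = json.dumps({"diff": diff, "pattern": pattern, "impact": impact})
    return await call_with_retry(decision_agent, decision_input)


# --- Stub agents for demonstration only (not real model calls) ---
calls = {"pattern": 0}


async def flaky_pattern_agent(diff: str) -> str:
    calls["pattern"] += 1
    if calls["pattern"] == 1:
        return "{not valid json"  # simulate one malformed reply
    return json.dumps({"similarity": 90})


async def stub_impact_agent(diff: str) -> str:
    return json.dumps({"blast_radius": "placeholder"})


async def stub_decision_agent(payload: str) -> str:
    upstream = json.loads(payload)
    # Toy rule for the stub only; the real agent reasons over both analyses.
    verdict = "BLOCK" if upstream["pattern"]["similarity"] > 85 else "APPROVE"
    return json.dumps({"decision": verdict})


result = asyncio.run(analyze("connection_timeout: 30000 -> 5000",
                             flaky_pattern_agent, stub_impact_agent, stub_decision_agent))
```

The first malformed reply from the pattern stub is retried transparently, so the decision step only ever sees parsed JSON objects from both upstream agents.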

Making the explanation layer useful. Early versions returned a verdict with a risk score. Developer testing showed this felt like gatekeeping — unhelpful without context. Rebuilding the Decision Agent's output schema to include historical incident context, technical reasoning, and a safer alternative was a significant iteration.
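A sketch of what the richer output schema might look like, with validation on the backend side. The field names and the `parse_report` helper are assumptions; only the four decision values and the kinds of information included (incident context, technical reasoning, safer alternative) come from the text. The example payload uses placeholder strings and a made-up score rather than real model output.

```python
import json
from dataclasses import dataclass

ALLOWED_DECISIONS = {"BLOCK", "REVIEW", "WARN", "APPROVE"}


@dataclass(frozen=True)
class DecisionReport:
    """Hypothetical field names for the Decision Agent's structured output."""
    decision: str
    risk_score: int
    matched_incident: str
    technical_reasoning: str
    safer_alternative: str


def parse_report(raw: str) -> DecisionReport:
    """Parse and validate a JSON reply; raise ValueError on any schema problem."""
    try:
        data = json.loads(raw)
        report = DecisionReport(**data)
    except (json.JSONDecodeError, TypeError) as exc:
        raise ValueError(f"malformed decision report: {exc}") from exc
    if report.decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {report.decision!r}")
    return report


# Placeholder payload for illustration; not real model output.
example = json.dumps({
    "decision": "BLOCK",
    "risk_score": 90,
    "matched_incident": "<matched historical incident>",
    "technical_reasoning": "<why the change is dangerous>",
    "safer_alternative": "<suggested configuration>",
})
report = parse_report(example)
```

Rejecting replies with missing or unexpected fields keeps a verdict-only response, the pattern that early testers found unhelpful, from reaching the developer.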

Accomplishments that we're proud of

Built a structured disaster database from real post-mortems — a resource that didn't exist in this form publicly
Achieved consistent sub-8-second end-to-end analysis across three Amazon Nova 2 Lite agents, two of which run in parallel
Designed an explanation system that turns a block decision into a learning moment — developers understand not just that a change is dangerous, but exactly why and how to fix it safely
Built a pattern matching algorithm that operates on structural similarity, not keyword matching — it identifies dangerous patterns even when surface details differ
Full working demo with live agent activity streaming, risk score visualization, and historical incident comparison panel

What we learned

Agent specialization produces better outputs than single-agent approaches. Splitting responsibilities across three tightly scoped agents with clean input/output contracts gave us dramatically more consistent and parseable results from Amazon Nova 2 Lite.
The data layer is the real moat. Investing in disaster database quality had more impact on prediction accuracy than any amount of prompt tuning.
Amazon Nova 2 Lite handles structured reasoning tasks well. When given a bounded, specific task with a clear JSON output schema, the model returns consistent, parseable results fast enough to fit inside a developer's natural workflow.
Prevention tools need to teach, not just block. Every block decision is an opportunity to build developer intuition. Treating it as an educational moment rather than a gate changed how users experienced the product entirely.

What's next for ConfigGuardian

Expand the disaster database continuously — every new public post-mortem is a new entry
Company-specific pattern learning — ConfigGuardian builds institutional memory from your organization's own near-misses
IDE plugin for VS Code: catch dangerous config changes the moment a developer saves a file, before they even push
Extend to Kubernetes manifests, Terraform variables, and feature flag configurations
Canary deployment integration for REVIEW-level risks: auto-trigger a 1% rollout with automated rollback on anomaly detection

Built With

amazon-nova-2-lite, fastapi, python, react, vite, tailwindcss, shadcn-ui, recharts, server-sent-events, javascript, json

Submitter Type

Team of Students

Submitter Country of Residence

India

App Category

Agentic AI

Which Amazon Nova Model

Amazon Nova 2 Lite

Testing Instructions (judges only, not public)

1. Open the live demo at [your URL]
2. On the dashboard, click "Try Demo" → select "Database Timeout Reduction"
3. Pre-loaded config change (connection_timeout: 30000 → 5000) will populate automatically
4. Click "Analyze Change" — watch the live agent activity stream
5. After ~8 seconds, a CRITICAL BLOCK decision appears with the matched historical incident (CrowdStrike pattern)
6. Click "View Detailed Analysis" for full explanation and safer alternative
7. For an APPROVE scenario, select "Safe Memory Limit Increase" from demo scenarios

GitHub: https://github.com/vaibhavnagdeo18/configguardian-ai