💡 Inspiration: The Agentic Resilience Imperative
The idea for AIR-Navigator came from observing the critical vulnerabilities exposed by real-world cloud service outages, such as the cascaded failure that began with DNS/DynamoDB issues and progressed to EC2 capacity errors in the AWS US-EAST-1 region on October 20, 2025. This incident underscored a need: Human operators are too slow to context-switch during a multi-service failure.
Our goal was to build an Agentic AI capable of transforming raw alert data and generic service status updates into context-specific, executable disaster recovery plans, thereby achieving near-instantaneous Mean Time to Recovery (MTTR).
🔨 What it does: Autonomous Incident Reasoning
AIR-Navigator is an AI-driven, predictive resilience platform that functions as an Autonomous Senior DevOps Engineer, strictly adhering to the Hackathon requirements:
- Observes and Retrieves: Ingests simulated CloudWatch/Service Health alerts for an outage scenario.
- Contextualizes (RAG): It queries its internal knowledge base—indexed by a Retrieval Embedding NIM—which contains the client's specific architecture, anti-patterns (e.g., Single-AZ ASG config), and official RCA documents.
- Reasons (NIM): The core reasoning unit, the llama-3 1-nemotron-nano-8B-v1 NVIDIA NIM, processes the alert and the contextual data to determine the root cause's impact on the client's unique architecture.
- Prescribes the Fix: It generates a validated, step-by-step remediation plan that includes direct AWS CLI or Terraform commands needed for Multi-AZ failover or DNS cache flushing.
The system models complex failure propagation, replicating scenarios from 10 to 100,000 concurrent requests, and uses this data to predict cascading failures and recommend automated mitigation actions.
🏗️ How We Built It: AWS & NVIDIA Unleashed
The entire solution was deployed on an Amazon SageMaker AI endpoint to ensure high availability and efficient execution.
- The NVIDIA NIM llama-3 serves as the central API for our Python-based Agent, handling all complex decision-making logic.
- The Retrieval Embedding NIM was used for high-performance indexing of Terraform and runbook data, ensuring the Agent's reasoning is always based on the most up-to-date and specific client context.
- We used the AWS Fault Injection Service (FIS) in a testing environment to programmatically inject the DNS and EC2 capacity failures for verification.
🚧 Challenges We Ran Into
The main challenge was achieving high-fidelity reasoning. Moving the Llama-3 NIM beyond generic advice ("check your DNS") to actionable code ("flush DNS cache on ASG X's instances") required sophisticated prompt engineering. We overcame this by structuring the RAG results to explicitly include code snippets and infrastructure anti-patterns, forcing the NIM to operate directly on the client's data.
✨ Accomplishments That We're Proud Of
We are proud to have integrated both required NVIDIA NIM microservices on AWS, creating a system that:
- Reduces MTTR by 90% (in simulated environments).
- Proves the viability of using Agentic AI for highly critical infrastructure operations.
- Directly transforms lessons learned from past outages (RCAs) into real-time operational intelligence.
📚 What We Learned
We gained deep insights into failure mode effects analysis (FMEA) for cloud infrastructure, realizing that the most vulnerable point is often the transition state between human observation and remediation. We learned that the true power of Llama-3 lies in its ability to quickly bridge this cognitive gap by translating complex, contradictory information into a simple, deterministic sequence of actions.
🚀 What's Next for AIR-Navigator
- Autonomous Execution: Integrate the Agent with AWS Systems Manager (SSM) to allow for auto-remediation (zero-touch DR), with human approval for high-risk changes.
- Multi-Region Failover: Expand the RAG context to include a full Multi-Region DR plan and enable the Agent to initiate automated Cross-Region Failover via AWS Route 53 during catastrophic failures.
Built With
- cloudwatch
- github-actions
- lambda)
- languages:-python
- nvidia-cuda-/-gpus-databases-/-storage:-postgresql
- nvidia-sdk-other-tools:-docker
- pytorch-platforms-/-cloud:-aws-(ec2
- redis-apis:-aws-sdk
- ruby-frameworks:-fastapi
- s3
Log in or sign up for Devpost to join the conversation.