Chaos Engineering in AWS with Fault Injection Simulator

Inspiration

The inspiration came from a real incident: an unexpected outage caused by a failed Availability Zone. That single failure raised a question: “What if we could simulate these events before they bring systems down?” That’s when I came across AWS Fault Injection Simulator (FIS). It promised a safe, controlled way to inject faults into systems and improve their resilience, and I had to try it out.

What it does

The project demonstrates how to design and run fault injection experiments using AWS FIS. It simulates failures like:

EC2 instance termination

Network latency and packet loss

CPU stress

These simulations are monitored in real time using CloudWatch. Recovery is automated: Auto Scaling Groups replace failed instances, and Elastic Load Balancers route traffic away from unhealthy ones to keep the application available.
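
To make the failure list above concrete, here is a minimal sketch of what an experiment template covering those faults might look like, built as a plain Python dictionary. The region, account ID, role name, and the ChaosReady tag are placeholders we made up for illustration, not values from the project. The SSM document names are the AWS-owned FIS documents as we understand them; their parameter names (DurationSeconds, DelayMilliseconds, LossPercent) should be checked against the document schemas before use.

```python
import json

# Placeholders for illustration only (not taken from the project).
REGION = "us-east-1"
ROLE_ARN = "arn:aws:iam::123456789012:role/fis-experiment-role"
TARGET_TAG = {"ChaosReady": "true"}


def ssm_action(document, params, duration, target, start_after=None):
    """Build an aws:ssm:send-command action that runs an AWS-owned SSM document."""
    action = {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": f"arn:aws:ssm:{REGION}::document/{document}",
            # FIS expects the document parameters as a JSON-encoded string.
            "documentParameters": json.dumps(params),
            "duration": duration,  # ISO 8601 duration
        },
        "targets": {"Instances": target},
    }
    if start_after:
        action["startAfter"] = list(start_after)
    return action


def build_template():
    """Return a template dict: CPU stress, then latency, then packet loss, then one termination."""
    targets = {
        "all-tagged": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": TARGET_TAG,
            "selectionMode": "ALL",
        },
        "one-tagged": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": TARGET_TAG,
            "selectionMode": "COUNT(1)",
        },
    }
    actions = {
        "cpu-stress": ssm_action(
            "AWSFIS-Run-CPU-Stress", {"DurationSeconds": "120"}, "PT3M", "all-tagged"
        ),
        "network-latency": ssm_action(
            "AWSFIS-Run-Network-Latency",
            {"DurationSeconds": "120", "DelayMilliseconds": "200"},
            "PT3M",
            "all-tagged",
            start_after=["cpu-stress"],
        ),
        "packet-loss": ssm_action(
            "AWSFIS-Run-Network-Packet-Loss",
            {"DurationSeconds": "120", "LossPercent": "15"},
            "PT3M",
            "all-tagged",
            start_after=["network-latency"],
        ),
        "terminate-one": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "one-tagged"},
            "startAfter": ["packet-loss"],
        },
    }
    return {
        "description": "CPU stress, network faults, then one instance termination",
        "roleArn": ROLE_ARN,
        # "none" keeps the sketch short; a CloudWatch alarm is the safer choice.
        "stopConditions": [{"source": "none"}],
        "targets": targets,
        "actions": actions,
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

The printed JSON could be saved to a file and passed to the AWS CLI with aws fis create-experiment-template --cli-input-json file://template.json. The actions are chained with startAfter so that the two network faults never run on the same instance at the same time.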

How we built it

Environment Setup: Launched EC2 instances with consistent tags so experiments could target them, and set up CloudWatch for monitoring.

Experiment Creation: Created and ran fault injection experiments using the AWS Console, the AWS CLI, and CloudFormation.

Observability: Configured CloudWatch Alarms and SNS for alerts.

Automation: Used IAM roles, SSM documents, and rollback actions for safe, repeatable chaos experiments.
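
As an illustration of how the observability and automation steps can fit together, the sketch below builds the arguments for a CloudWatch alarm that notifies an SNS topic, plus the FIS stop condition that references the same alarm. It is a sketch under assumptions, not our exact setup: the alarm name, threshold, load balancer identifier, account ID, and topic ARN are all hypothetical. A stop condition halts the experiment when the alarm fires; it does not undo changes such as a terminated instance, which is where the Auto Scaling Group comes in.

```python
def alarm_arn(region, account_id, alarm_name):
    """Return the ARN CloudWatch assigns to an alarm with this name."""
    return f"arn:aws:cloudwatch:{region}:{account_id}:alarm:{alarm_name}"


def build_alarm(alarm_name, load_balancer, sns_topic_arn, threshold=50):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the targets behind the load balancer return more
    than `threshold` HTTP 5xx responses within one minute.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # No traffic means no data points; do not treat that as a failure.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def stop_condition(arn):
    """Build the FIS stopConditions entry that halts an experiment on this alarm."""
    return {"source": "aws:cloudwatch:alarm", "value": arn}


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    alarm = build_alarm(
        "chaos-5xx-guardrail",
        "app/demo-alb/0123456789abcdef",
        "arn:aws:sns:us-east-1:123456789012:chaos-alerts",
    )
    print(alarm)
    print(stop_condition(alarm_arn("us-east-1", "123456789012", alarm["AlarmName"])))
```

With boto3 installed and credentials configured, the alarm dictionary could be passed as boto3.client("cloudwatch").put_metric_alarm(**alarm), and the stop condition placed in the stopConditions list of an experiment template.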

Challenges we ran into

IAM Permissions: Setting up least-privilege execution roles was tricky.

Tagging Consistency: Required a well-thought-out tagging strategy to target resources effectively.

Fear of Failure: Initially hesitant to run destructive tests, even in test environments.

Cost Awareness: Needed to monitor AWS usage to keep experiment costs under control.
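
The IAM and tagging challenges are closely related, and a small sketch may help show why. The code below builds a trust policy that lets the FIS service assume the experiment role, and a permissions policy that allows instance termination only on instances carrying a specific tag. The ChaosReady tag is a hypothetical example, and the policy covers only the termination action; a real role would need further statements (for example, for SSM commands) depending on which actions the experiment uses.

```python
import json


def build_trust_policy():
    """Trust policy allowing the FIS service to assume the experiment role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_permissions_policy(tag_key, tag_value):
    """Permissions policy scoped to EC2 instances that carry the given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # FIS has to look up instances to resolve tag-based targets.
                "Effect": "Allow",
                "Action": ["ec2:DescribeInstances"],
                "Resource": "*",
            },
            {
                # Termination is allowed only on instances tagged for chaos testing.
                "Effect": "Allow",
                "Action": ["ec2:TerminateInstances"],
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
                },
            },
        ],
    }


if __name__ == "__main__":
    # "ChaosReady" is a hypothetical tag key chosen for this example.
    print(json.dumps(build_trust_policy(), indent=2))
    print(json.dumps(build_permissions_policy("ChaosReady", "true"), indent=2))
```

Using the same tag in the policy condition and in the experiment's target selector means an untagged instance cannot be terminated by the experiment role even if a template is misconfigured.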

Accomplishments that we're proud of

Successfully ran complex FIS experiments without causing uncontrolled outages.

Automated fault injection using Infrastructure as Code (CloudFormation).

Built a monitoring and recovery setup using native AWS tools.

Gained confidence in testing real failure scenarios in a controlled, repeatable way.

What we learned

Chaos Engineering is about trust, not destruction: it's a way to build confidence in systems.

AWS FIS is tightly integrated with other AWS services and requires precisely scoped IAM roles and policies.

Observability and rollback plans are non-negotiable when injecting faults.

Proactive testing helps uncover weak spots you didn’t even know existed.

What's next for Chaos Engineering in AWS with Fault Injection Simulator

Integrate with CI/CD pipelines to run FIS experiments post-deployment.

Explore multi-region chaos testing for high-availability apps.

Use custom SSM documents for more flexible and app-specific fault scenarios.

Build a dashboard to visualize resilience metrics over time.
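
For the CI/CD idea in particular, a post-deployment pipeline step could start an experiment and wait for it to finish. The sketch below is one possible shape for that step, not something we have built yet. The function name, polling interval, and timeout are our own choices; it takes any client object that offers the FIS start_experiment and get_experiment calls, such as the one returned by boto3.client("fis").

```python
import time

# Experiment states that mean the run has not finished yet.
IN_PROGRESS = {"pending", "initiating", "running", "stopping"}


def run_experiment(fis, template_id, poll_seconds=15, timeout_seconds=1800):
    """Start an experiment from a template and poll until it leaves the in-progress states.

    `fis` is any object exposing start_experiment and get_experiment with the
    same keyword arguments and response shape as the boto3 FIS client.
    Returns the final status string, for example "completed" or "failed".
    """
    started = fis.start_experiment(experimentTemplateId=template_id)
    experiment_id = started["experiment"]["id"]
    status = started["experiment"]["state"]["status"]

    deadline = time.monotonic() + timeout_seconds
    while status in IN_PROGRESS:
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"experiment {experiment_id} still {status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
        response = fis.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
    return status
```

In a pipeline, the step might call run_experiment(boto3.client("fis"), template_id) and exit with a non-zero code unless the result is "completed", so that a stopped or failed experiment blocks the release.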

Built With

amazon-web-services

Updates

Aashish Kumar started this project — Apr 15, 2025 04:29 AM EDT
