Smart Chaos : Leveraging Generative AI in Chaos Engineering

Inspiration

Modern enterprise distributed systems are inherently complex and challenging to manage. Numerous unknowns exist, and we use chaos engineering to uncover them. However, traditional chaos testing relies heavily on human modeling and may struggle with the complexity of current distributed systems. Looking for a solution, I began experimenting with LLMs, aiming to automate the entire chaos workflow with them.

What it does

Smart Chaos Apps use LLMs to automate the Chaos Engineering workflow:

1. Discovery Phase: Refer to the architecture diagram or service map to discover services using GenAI.
2. Dependency Analysis: Use GenAI to analyze system dependencies, identifying points of failure or vulnerabilities in the architecture.
3. Steady State Definition: Use GenAI to define the system's steady state.
4. Hypothesis Creation: Use GenAI to generate hypotheses.
5. Experiment Design: Use GenAI for scenario and experiment design and simulation.
6. Blast Radius Definition: Use GenAI to assess the potential blast radius of chaos experiments, considering dependencies and system topology.
7. Rumsfeld Matrix ("Known Knowns, Known Unknowns, and Unknown Unknowns"): Use GenAI for data analysis to categorize knowns and unknowns, which helps identify potential unknown unknowns.
8. Smart Chaos Chat Bot: Use a GenAI-based chat bot to improve the entire process (continuous improvement).
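
As a point of comparison for the GenAI-generated blast radius, here is a minimal Python sketch that derives a blast radius deterministically from a dependency map. The service names and the `DEPENDS_ON` map are hypothetical, and the app itself asks the LLM rather than running this traversal; a check like this could be used to sanity-test the LLM's answer.

```python
from collections import deque

# Hypothetical service map: each service lists the services it calls.
DEPENDS_ON = {
    "web": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": [],
    "inventory": [],
}


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its callers.
    callers = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    # Breadth-first search upstream from the failed service.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected
```

In this made-up map, a failure in `inventory` reaches `checkout`, `catalog`, and `web`, while a failure in `web` affects nothing upstream.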

How we built it

Users can input a link to the System Architecture Diagram or Service Map details.

Dependency Analysis: LLM analyzes the provided architecture diagram, identifying key system dependencies for potential chaos testing.

Steady State Definition: LLM examines the architecture diagram, performs Dependency Analysis, and generates the top 10 Steady State Definitions for chaos testing.

Hypothesis Generation: LLM utilizes the architecture diagram, Dependency Analysis, and Steady State Definitions to automatically generate hypotheses supporting chaos testing.

Experiment Design: LLM considers the architecture diagram, Steady State Definitions, and Hypothesis Generation to create Chaos testing test cases and an experiment list.

Rumsfeld Matrix - Known Unknown: LLM reviews the architecture diagram and lists the known unknowns from the Rumsfeld Matrix for chaos testing.

Test Case: Users can input one of the generated Test Cases.

Blast Radius: LLM refers to the Test Case and the architecture diagram, generating the Blast Radius for the specified Test Case during chaos testing.

Chat Bot: A chat bot provides additional on-demand assistance.
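
The steps above form a chain in which each LLM call reuses the outputs of earlier calls. The app itself was built with PartyRock, so the following is only a Python sketch of the same chaining idea, not the app's implementation. The `llm` callable, the stage names, and the prompt wording are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Hypothetical stages: (output name, prompt template). Each template may
# reference the diagram and the output of any earlier stage by name.
STAGES = [
    ("dependencies",
     "Architecture:\n{diagram}\nList the key system dependencies."),
    ("steady_states",
     "Architecture:\n{diagram}\nDependencies:\n{dependencies}\n"
     "Define the top 10 steady states."),
    ("hypotheses",
     "Architecture:\n{diagram}\nDependencies:\n{dependencies}\n"
     "Steady states:\n{steady_states}\nGenerate chaos hypotheses."),
    ("experiments",
     "Architecture:\n{diagram}\nSteady states:\n{steady_states}\n"
     "Hypotheses:\n{hypotheses}\nDesign chaos test cases."),
]


def run_workflow(diagram: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Run each stage in order, feeding earlier outputs into later prompts."""
    context = {"diagram": diagram}
    for name, template in STAGES:
        prompt = template.format(**context)  # unused keys are ignored
        context[name] = llm(prompt)          # store output for later stages
    return context
```

Passing the model as a plain callable keeps the sketch independent of any particular provider and makes it easy to test with a stub.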

Challenges we ran into

The main challenge was determining how much data to feed to the LLM. Although the LLM performed well, I decided to keep the input simple: users provide an architecture diagram or a runtime service map. Additional data such as observability data could offer detail beyond service discovery, but I opted for simplicity.

Accomplishments that we're proud of

I'm proud to have used LLMs to make Chaos workflows autonomous. With LLMs, we automated key areas in Chaos Engineering workflows:

Discovery Phase, Dependency Analysis, Steady State Definition, Hypothesis Creation, Experiment Design, Blast Radius Definition, Rumsfeld Matrix (Known Unknown), and creating chatbots for advanced assistance.

What we learned

I learned about foundation models, including LLMs such as Claude, Jurassic-2, Titan, Command, and Llama 2, as well as the image model Stable Diffusion XL. I conducted experiments on these and gained a good understanding of their strengths, weaknesses, and when to use them.

What's next for Smart Chaos : Leveraging Generative AI in Chaos Engineering

I want to implement this in Bedrock, automating the chaos engineering workflow and integrating it with deployment pipelines. Upon deploying new code, GenAI will refer to the updated service map, design the chaos process, and automate implementation, creating Autonomous Chaos workflows.

GenAI will automate the following areas: Discovery Phase, Dependency Analysis, Steady State Definition, Hypothesis Creation, Experiment Design, Blast Radius Definition, Rumsfeld Matrix (Known Unknown), and creating chatbots for advanced assistance.

Building the training dataset involves collecting telemetry data to establish normal behavior and logging failures and dependencies for model training.
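
To make "establishing normal behavior" concrete, here is a minimal Python sketch that summarizes telemetry as a per-metric mean and standard deviation and flags values that fall far outside it. This uses a simple z-score rule as a stand-in, not model training; the metric name and the threshold of 3 are assumptions.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize normal behavior as mean and standard deviation per metric."""
    return {
        metric: {"mean": mean(values), "stdev": stdev(values)}
        for metric, values in samples.items()
    }


def is_anomalous(baseline, metric, value, z_threshold=3.0):
    """Flag a value more than `z_threshold` standard deviations from the mean."""
    stats = baseline[metric]
    if stats["stdev"] == 0:
        # No variation was observed, so any different value is unusual.
        return value != stats["mean"]
    return abs(value - stats["mean"]) / stats["stdev"] > z_threshold


# Hypothetical telemetry collected during normal operation.
baseline = build_baseline({"latency_ms": [100, 102, 98, 101, 99]})
```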

Creating chaos experiment templates includes defining scenarios like API failure, instance termination, and latency induction, which GenAI combines into building blocks.
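
One possible shape for such templates is sketched below in Python. The class, field names, action names, and parameter values are all hypothetical; they only illustrate how the three scenarios named above could become reusable building blocks.

```python
from dataclasses import dataclass, field


@dataclass
class ChaosTemplate:
    """A reusable failure scenario that is not yet bound to a target."""
    name: str
    action: str
    parameters: dict = field(default_factory=dict)


# Hypothetical templates for the scenarios mentioned in the text.
TEMPLATES = [
    ChaosTemplate("api_failure", "inject_http_error",
                  {"status_code": 503, "error_rate": 0.5}),
    ChaosTemplate("instance_termination", "terminate_instance", {"count": 1}),
    ChaosTemplate("latency_induction", "add_latency", {"delay_ms": 300}),
]


def render(template, target):
    """Bind a template to a concrete target service."""
    return {
        "name": f"{template.name}:{target}",
        "action": template.action,
        "target": target,
        **template.parameters,
    }
```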

Generating chaos experiments involves using the CloudWatch Service Map to understand dependencies and leveraging GenAI to analyze telemetry data, detecting critical flows. Experiment combinations are auto-generated using templates to target critical flows.
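
The auto-generation step can be pictured as a cross product of templates and the services on each critical flow. The Python sketch below assumes the critical flows have already been identified (in the plan above, from the service map and telemetry analysis); the flow names, service names, and template names are made up.

```python
from itertools import product


def generate_experiments(template_names, critical_flows):
    """Pair every template with every service on every critical flow."""
    experiments = []
    for flow_name, services in critical_flows.items():
        for template, service in product(template_names, services):
            experiments.append(
                {"flow": flow_name, "template": template, "target": service}
            )
    return experiments


# Hypothetical inputs.
flows = {"checkout": ["cart", "payments"], "search": ["catalog"]}
candidates = generate_experiments(["latency_induction", "api_failure"], flows)
```

The number of combinations grows quickly, so in practice the list would likely need filtering or prioritization before anything is executed.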

Executing experiments via AWS FIS includes injecting failure scenarios into non-critical paths first and gradually increasing the blast radius to critical components.
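
The ordering logic described here can be sketched independently of AWS FIS. In the Python sketch below, `run_experiment` is a hypothetical callable that would wrap the actual experiment start (for example, through AWS FIS) and return whether the system recovered; the criticality scores are also assumptions.

```python
def plan_execution(experiments, criticality):
    """Order experiments so that less critical targets run first.

    Targets with no known score are treated as most critical and run last.
    """
    return sorted(
        experiments,
        key=lambda e: criticality.get(e["target"], float("inf")),
    )


def run_in_order(plan, run_experiment):
    """Run experiments one by one and stop at the first failed recovery."""
    completed = []
    for experiment in plan:
        if not run_experiment(experiment):
            break  # do not widen the blast radius after a bad result
        completed.append(experiment)
    return completed
```

Stopping at the first failed recovery keeps the blast radius from growing while a known problem is still open.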

Evaluating outcomes involves analyzing metrics, logs, and traces to ensure proper recovery, with results fed back into GenAI to enhance chaos targeting.
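
A simple form of this evaluation compares post-experiment metrics against steady-state bounds. The Python sketch below assumes each steady-state metric is expressed as a `(low, high)` range; the metric names and ranges are hypothetical, and a real evaluation would also draw on logs and traces.

```python
def evaluate_recovery(steady_state, observed):
    """Compare observed metrics against steady-state bounds.

    A metric that is missing or outside its range counts as a violation.
    """
    violations = {}
    for metric, (low, high) in steady_state.items():
        value = observed.get(metric)
        if value is None or not (low <= value <= high):
            violations[metric] = value
    return {"recovered": not violations, "violations": violations}


# Hypothetical steady-state definition.
STEADY_STATE = {"p99_latency_ms": (0, 500), "success_rate": (0.99, 1.0)}
```

The resulting dictionary is the kind of structured outcome that could be fed back into GenAI to refine later experiments.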

Expanding scope iteratively includes increasing the magnitude and duration of failures and adding more failure scenarios.
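
As an illustration of increasing magnitude and duration step by step, the Python sketch below builds a schedule for a latency experiment. The doubling factor, starting values, and caps are assumptions chosen for the example.

```python
def escalation_schedule(start_delay_ms, start_duration_s, rounds,
                        factor=2, max_delay_ms=5000, max_duration_s=600):
    """Build a list of rounds with growing delay and duration, up to caps."""
    schedule = []
    delay, duration = start_delay_ms, start_duration_s
    for round_number in range(1, rounds + 1):
        schedule.append(
            {"round": round_number, "delay_ms": delay, "duration_s": duration}
        )
        # Grow both values for the next round without exceeding the caps.
        delay = min(delay * factor, max_delay_ms)
        duration = min(duration * factor, max_duration_s)
    return schedule
```

The caps keep later rounds from escalating past limits that the team has agreed are safe.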

Built With

partyrock

Updates

Indika Wimalasuriya started this project — Mar 09, 2024 01:07 PM EST
