Failures happen in large distributed systems. We have seen recent worldwide outage of Facebook, Instagram, and WhatsApp. Fastly's outage happened mid this year and Google's outage at end of 2020. Failures happen!

But, can we identify these failures proactively? Chaos Engineering has been practiced widely to solve this problem. With Chef, we are automating the Chaos Engineering experiments that create turbulent attacks in your system to:

  1. Build confidence in your systems
  2. Uncover point of failures
  3. Detect failures proactively

What it does:

Our project - Automating for Failures, uses Chef Infra to automate the attacks on your system hosted on AWS, Google Cloud, or in your own infrastructure. These attacks generate several turbulent conditions in your system like:

  1. Rebooting servers
  2. Stoping services
  3. Killing running processes
  4. Blocking networks
  5. Simulating a failover in the database
  6. Using all of the server's CPU cores

How we built it:

Chaos Attacks Pipeline:

  1. Chef Workstation: Develop and test attacks
  2. Managed Chef server: Manage cookbooks, roles, policies, and nodes
  3. Chef nodes: Run chef-client to trigger attacks

Our target systems are:

Factsbook App: A web app that stores thousands of facts about space, ocean, music, humans, and cars.

  • Tech Stack for Factsbook: React, Python, Flask & MongoDB - hosted on AWS

Job Scheduling App: Job queueing app that is used by microservices to produce and consume jobs.

  • Tech Stack : Python, Flask, Redis & Docker - hosted on Github

Our DevOps Pipeline:

  • Git/Github: Source code management and version control
  • Jenkins: CI/DC pipeline
  • ELK Stack: Logging and Metrics
  • Slack: Build notifications
  • MongoDB Cloud Manager: Manage MongoDB database running on AWS EC2.

Challenges we ran into:

  • Onboarding hosts distributed on multi-cloud environments to our Chef pipeline.
  • Healing of system when attack executions are completed.

Accomplishments that we're proud of

  • Reducing Chaos engineering experiments time all of your system components from months to days.
  • Designing Chef cookbooks and recipes to generate various types of attacks.

What we learned

  • Chef Infra - Chef workstation, Chef server, and Chef client architecture.
  • Designing recipes/cookbooks to interact with multi-cloud environments.
  • Chef, Jenkins, and Slack pipeline.

What's next for Automating for Failures - Chaos Engineering with Chef

  • Automating attack's test pipeline using Chef kitchen and Chef Inspec
  • Create a dashboard for all summary of attacks and provide user-facing analytics
  • Add more attacks for other system components like - Webserver, Cloud Services, CI/CD pipelines.
Share this project: