An ongoing & curated collection of awesome software, frameworks and libraries, learning tutorials and videos, technical guidelines and best practices about Chaos Engineering. Thanks to our daily readers and contributors. The goal is to build a categorized community-driven collection of very well-known resources. Sharing, suggestions and contributions are always welcome!
- 0. Introduction
- 1. Principles of Chaos Engineering
- 2. Fault Injection
- 3. Observability
- 4. Incident Management Tool
- 5. Cost of SEVs
- 6. Chaos As A Service
- 7. Gamedays
- 8. Forums and Groups
- 9. References
- 10. License
- 11. Contributing
One of the earliest examples of chaos fault injection was disabling servers with Chaos Monkey, a tool created by Netflix. Chaos Monkey worked by randomly disabling production server instances to ensure that the system could handle such failure scenarios.
Similarly, for Kubernetes there are tools that delete pods, such as kube-monkey, Target's Pod-Reaper, and PowerfulSeal, made by Bloomberg.
For Docker, a tool called Pumba by A. Ledenev can target containers, and for Docker Swarm, docker-chaosmonkey can target services.
Other tools in this area include BBC's Chaos Lambda for terminating EC2 instances and GomJabbar for targeting private clouds.
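The random-termination idea shared by these tools can be sketched in a few lines. This is an illustrative sketch only, not the code of Chaos Monkey or any tool named above: `pick_victim`, `unleash_monkey`, and the `terminate` callback are hypothetical names, and the callback stands in for whatever API actually stops an instance, pod, or container.

```python
import random
from typing import Callable, Optional, Sequence


def pick_victim(instances: Sequence[str], rng: random.Random) -> Optional[str]:
    """Return one instance id chosen uniformly at random, or None if empty."""
    if not instances:
        return None
    return rng.choice(list(instances))


def unleash_monkey(
    instances: Sequence[str],
    terminate: Callable[[str], None],
    probability: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Terminate at most one random instance, with the given probability.

    `terminate` is a hypothetical callback; in a real setup it would call
    a cloud or orchestrator API.
    """
    rng = rng or random.Random()
    if rng.random() >= probability:
        return None  # this round is skipped
    victim = pick_victim(instances, rng)
    if victim is not None:
        terminate(victim)
    return victim
```

Passing the termination step in as a callback keeps the selection logic testable without touching real infrastructure.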
Chaos-related fault injection can also be done at a more application-specific level. Two such systems are ChaosMachine and TripleAgent, which target JVM-based applications.
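To give a feel for application-level injection, here is a minimal Python sketch that wraps a function so that calls are delayed and fail at random. It does not reflect how ChaosMachine or TripleAgent work internally (those operate on the JVM); the `chaos` decorator, `InjectedFault`, and their parameters are hypothetical names chosen for this example.

```python
import functools
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(RuntimeError):
    """Raised in place of a real failure when a fault is injected."""


def chaos(
    failure_rate: float = 0.0,
    latency_s: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Build a decorator that adds latency and random failures to a function."""
    rng = rng or random.Random()

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if latency_s > 0:
                time.sleep(latency_s)  # simulated slow dependency
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

A caller would decorate a dependency call, for example `@chaos(failure_rate=0.1, latency_s=0.2)`, and then check that retries, timeouts, and fallbacks behave as expected.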
The network layer is another fault injection vector with broad tool support. Many of these tools combine iptables with Traffic Control (tc) network emulation to inject network failures such as added latency or dropping a percentage of traffic.
Open-source examples include Netflix's Latency Monkey, Pumba, Blockade, Muxy, and Comcast.
Some closed-source alternatives are Gremlin and ChaosCat. All of these tools can be used either directly in a deployment environment or after some setup.
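As a rough illustration of the Traffic Control side of this, the sketch below builds `tc` netem command lines for latency and packet loss without executing them. The helper names `netem_command` and `netem_rollback` are hypothetical, the interface name is an assumption, and the iptables part is not covered. Running the resulting commands requires Linux and root privileges.

```python
from typing import List


def netem_command(interface: str, delay_ms: int = 0, loss_percent: float = 0.0) -> List[str]:
    """Build a `tc qdisc add ... netem` command for latency and/or packet loss."""
    if delay_ms < 0 or not 0.0 <= loss_percent <= 100.0:
        raise ValueError("delay must be >= 0 and loss must be within 0..100")
    if delay_ms == 0 and loss_percent == 0:
        raise ValueError("nothing to inject")
    cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem"]
    if delay_ms:
        cmd += ["delay", f"{delay_ms}ms"]
    if loss_percent:
        cmd += ["loss", f"{loss_percent:g}%"]
    return cmd


def netem_rollback(interface: str) -> List[str]:
    """Build the command that removes the netem qdisc again."""
    return ["tc", "qdisc", "del", "dev", interface, "root", "netem"]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    print(" ".join(netem_command("eth0", delay_ms=100, loss_percent=10)))
    print(" ".join(netem_rollback("eth0")))
```

Keeping command construction separate from execution makes it easy to review what would be run, and to always pair an injection with its rollback.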
For resource-based injection at the level of CPU, RAM, disk, and similar, several tools can help. Gremlin, for example, can execute several such attacks, and both ChaosCat and dedicated tools like cpu-troll can run CPU usage attacks.
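A CPU usage attack can be approximated with a busy loop that runs for a fixed time. The sketch below is a simplified assumption of the idea, not how Gremlin, ChaosCat, or cpu-troll implement it; `burn_cpu` is a hypothetical helper and it loads a single core only.

```python
import time


def burn_cpu(duration_s: float) -> int:
    """Spin in a tight loop for roughly `duration_s` seconds.

    Returns the number of loop iterations, which is only useful as a
    sanity check that the loop actually ran.
    """
    deadline = time.monotonic() + duration_s
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


if __name__ == "__main__":
    # Keep one core busy for two seconds.
    burn_cpu(2.0)
```

To load several cores, one such loop would have to run per core, for example in separate processes. The fixed deadline acts as a built-in limit on the blast radius.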
The Principles of Chaos Engineering describe a chaos experiment in terms of the following five points:
- Build a Hypothesis around Steady State Behavior
- Vary Real-world Events
- Run Experiments in Production
- Automate Experiments to Run Continuously
- Minimize Blast Radius
More details in the following link ;-)
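The steady-state hypothesis and the idea of limiting impact can be sketched as a tiny experiment runner. This is a simplified assumption about how such a loop might look, not an implementation taken from the Principles or from any tool in this list; `run_experiment`, `ExperimentResult`, and the three callbacks are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    ran: bool            # False if the system was unhealthy before injection
    steady_before: bool  # steady-state check before the fault
    steady_during: bool  # steady-state check while the fault is active


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check the hypothesis, inject a fault, re-check, and always roll back."""
    if not steady_state():
        # Never inject into a system that is already unhealthy.
        return ExperimentResult(ran=False, steady_before=False, steady_during=False)
    inject()
    try:
        steady_during = steady_state()
    finally:
        rollback()  # runs even if the check itself raises
    return ExperimentResult(ran=True, steady_before=True, steady_during=steady_during)


if __name__ == "__main__":
    # Toy system: healthy while at least two replicas are up.
    system = {"replicas": 3}
    result = run_experiment(
        steady_state=lambda: system["replicas"] >= 2,
        inject=lambda: system.update(replicas=system["replicas"] - 1),
        rollback=lambda: system.update(replicas=3),
    )
    print(result)
```

Running such a function on a schedule would correspond to the "automate experiments" principle, and the unconditional rollback is one small way to keep the blast radius contained.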
- The Simian Army - A suite of tools for keeping your cloud operating in top form.
- Chaos Monkey - A resiliency tool that helps applications tolerate random instance failures.
- Chaos Toolkit - A chaos engineering toolkit to help you build confidence in your software system.
- Chaos Toolkit Turbulence - This is an extension for Chaos Toolkit which adds support for Turbulence attacks.
- Monarch - This is a series of tools for Chaos Toolkit.
- Muxy - A chaos testing tool for simulating real-world distributed system failures.
- Chaos Blade - Chaosblade is an experimental tool that follows the principles of Chaos Engineering and is used to simulate common fault scenarios, helping to improve the fault tolerance and recoverability of systems.
- Cthulhu - Chaos Engineering tool that helps evaluate the resiliency of microservice systems by simulating various disaster scenarios against a target infrastructure in a data-driven manner.
- Namazu - Programmable fuzzy scheduler for testing distributed systems.
- Chaos Scimmia - Chaos Engineering for Redis.
- HavocLeopard - A set of simple chaos engineering apps that can be used to royally screw up your on-prem servers.
- Arcdata - Open source incident management and volunteer scheduling application for Red Cross Disaster Services.
- AWS Chaos Scripts - Collection of python scripts to run failure injection on AWS infrastructure.
- Toxiproxy - A TCP proxy to simulate network and system conditions for chaos and resiliency testing.
- Infection Monkey - Open source security tool for testing a data center's resiliency to perimeter breaches and internal server infection. The Monkey uses various methods to self propagate across a data center and reports success to a centralized Monkey Island server.
- ChaoSlingr - Introducing Security Chaos Engineering. ChaoSlingr focuses primarily on the experimentation on AWS Infrastructure to proactively instrument system security failure through experimentation.
- What is security chaos engineering and why is it important?
- Security Chaos Engineering: A new paradigm for cybersecurity
- Injecting chaos experiments into security log pipelines
- Purple testing and chaos engineering in security experimentation
- A new approach to security instrumentation
- ChaosCat - Chaos engineering for Pull Requests - Taking a not-even-good joke a bit too far.
- Byteman - A Swiss Army Knife for Byte Code Manipulation.
- Byte-Monkey - Bytecode-level fault injection for the JVM. It works by instrumenting application code on the fly to deliberately introduce faults like exceptions and latency.
- Perses - A project to cause (controlled) destruction to a JVM application.
- Wiremock - API mocking (Service Virtualization) which enables modeling real world faults and delays.
- MockLab - API mocking (Service Virtualization) as a service which enables modeling real world faults and delays.
- Flaw - Inject failures on api calls for local chaos engineering.
- Havoc - Havoc is a collection of dangerous code that wreaks havoc in .NET applications and the operating system for chaos engineering.
- Utilities for frontend chaos engineering - Utilities for frontend chaos engineering.
- CHAOS GOPHER - A collection of Unix-style tools in Go for chaos engineering or testing.
- Chaos Monkey for Spring Boot - Injects latencies, exceptions, and terminations into Spring Boot applications.
- React Chaos - Chaos Engineering for your React apps.
- Chaos QoaLa - ChaosQoaLa is a chaos engineering tool for injecting failure into JavaScript backed GraphQL end points.
- Chaos Reverse-engineering - Chaos engineering approach by Reverse-engineering.
- Royal Chaos - This repository contains the chaos engineering systems invented at KTH Royal Institute of Technology.
- Pumba - Chaos testing and network emulation for Docker containers (and clusters).
- Blockade - Docker-based utility for testing network failures and partitions in distributed applications.
- Chaos Engineering for Docker - Chaos Engineering for Docker.
- Chaos Engineering with Docker EE - Chaos Engineering with Docker EE.
- Chaos Util - Docker image with utilities for Chaos Engineering.
- Drax - DC/OS Resilience Automated Xenodiagnosis tool. It helps to test DC/OS deployments by applying a Chaos Monkey-inspired, proactive and invasive testing approach.
- Pod-Reaper - A rules based pod killing container. Pod-Reaper was designed to kill pods that meet specific conditions that can be used for Chaos testing in Kubernetes.
- Chaoskube - ChaosKube periodically kills random pods in your Kubernetes cluster.
- Litmus - Framework for Kubernetes environments that enables users to run test suites, capture logs, generate reports and perform chaos tests.
- Chaos Operator - Chaos engineering via kubernetes operator.
- Kube Entropy - A little chaos engineering application for kubernetes resilience testing.
- Chaos Coordinator - Chaos Coordinator is a set of tools that allow for chaos testing of the infrastructure used by Kubernetes clusters on Azure.
- kubernetes-chaos-lab - A brief guide to setting up your first chaos engineering lab on Kubernetes!
- Chaos Mesh - A Chaos Engineering Platform for Kubernetes.
- VMware Mangle - Orchestrating Chaos Engineering.
- Turbulence - Tool focused on BOSH environments capable of stressing VMs, manipulating network traffic, and more. It is very similar to Gremlin.
- Glooshot - Chaos engineering framework to help you Immunize your service mesh.
- kube-monkey - An implementation of Netflix's Chaos Monkey for Kubernetes clusters.
- Powerful Seal - PowerfulSeal adds chaos to your Kubernetes clusters, so that you can detect problems in your systems as early as possible. It kills targeted pods and takes VMs up and down.
- KubeInvaders - Gamified Chaos Engineering tool for Kubernetes clusters.
- Testing Amazon Aurora Using Fault Injection Queries - Testing Amazon Aurora Using Fault Injection Queries.
- Azure Fault Analysis Service
- Include controlled Chaos in Service Fabric clusters - Include controlled Chaos in Service Fabric clusters.
- Chaos Lambda - Randomly terminate ASG instances during business hours.
- GomJabbar - ChaosMonkey for your private cloud.
- Chaos Engineering on Google Cloud Platform - Chaos Engineering on Google Cloud Platform.
- Chaos SSM Documents - Collection of AWS SSM Documents to perform Chaos Engineering experiments.
- A Chaos Engineering Bootcamp - A Chaos Engineering Bootcamp.
- HW4 - Express servers were used to implement service topologies.
- Serverless Chaos Engineering Demo - This example demonstrates how to use Adrian Hornsby's Failure Injection Layer to perform chaos engineering experiments on a serverless environment.
- Chaos Engineering Demo - Simple project demonstrating chaos engineering with Chaos Monkey and Resilience4j.
- Chaos Engineering Demo - resilience4j + chaos toolkit + wiremock + chaos monkey for spring boot sample application.
- How to Create a Kubernetes Cluster on Ubuntu 16.04 with kubeadm and Weave Net
- Banjaxed - Open source incident management tool.
- Availability Calculator - Calculate how much downtime should be permitted in your SLA.
- Gremlin Inc. - Failure as a Service.
- Chaos Engineering Experiment Automation - Chaos Engineering Experiment Automation.
- Pystol.org - The cloud chaos engineering toolbox.
- Cyphon - Open source incident management and response platform.
- Controlled Chaos - An all-in-one application that allows teams to create, execute, and analyze chaos engineering experiments with no previous DevOps experience or additional infrastructure setup.
- Chaos Platform - Chaos Engineering Platform for Everyone.
- Chaos Hub - Chaos Hub stands on the shoulders of the Chaos Toolkit to provide a complete, user-friendly, platform to automate and collaborate on your Chaos Engineering and Resiliency efforts.
- steadybit - Chaos Engineering platform that helps to proactively reduce downtime and provides visibility into systems to detect issues.
- Target: What is a Gameday? - Chaos Gamedays experience by Target.
- Codecentric: Chaos Engineering Gamedays - Chaos Gamedays by Codecentric.
- New Relic: How to run a Gameday? - Chaos Gamedays experience by New Relic.
- Dius: Gamedays resources - Resources for getting started with GameDay and Chaos Engineering.
- Gremlin: Gamedays - Resources for getting started with GameDay and Chaos Engineering.
- Gremlin: Planning your own Chaos Day - Example of a Gameday with DynamoDB by Gremlin.
- Gremlin: How to run a Gameday? - Methodology for running Gamedays according to Gremlin.
- Gremlin DB: Breaking Dynamo DB - Example of a Gameday with DynamoDB by Gremlin.
- Gremlin: Introduction to Gameday - What a Gameday is according to Gremlin.
- Gremlin: Inside Gremlin 2019 Gremlin Gamedays Roadmap - Chaos Gamedays experience by Gremlin.
- Gremlin: What I learned running the Chaos Lab with Kafka - Example of a Gameday with Kafka by Gremlin.
- Chaos Toolkit: Chaos Engineering with Humans in the loop - Article about Chaos Gamedays.
- GoCardless: All fun and games until you start with Gamedays - Article about Chaos Gamedays.
- InfoQ: Gamedays - Achieving Resilience through Chaos Engineering - InfoQ Presentation with experiences about Chaos Gamedays.
- CNCF Chaos Engineering Working Group
- CNCF Chaos Engineering Working Group Slack: #chaosengineering (slack.cncf.io)
- CNCF Chaos Engineering Working Group GitHub
- Chaos Toolkit Slack Community
- https://github.com/chaoseng/wg-chaoseng/blob/master/WHITEPAPER.md
- https://docs.google.com/document/d/1BeeJZIyReCFNLJQrZjwA4KMlUJelxFFEv3IwED16lHE/edit?ts=5ace0eab#heading=h.ephtflhfpd1d
- https://github.com/dastergon/awesome-chaos-engineering
- https://techbeacon.com/app-dev-testing/chaos-engineering-testing-34-tools-tutorials
- https://www.techrepublic.com/article/chaos-engineering-a-cheat-sheet/
- https://medium.com/capital-one-tech/4-real-world-scenarios-that-read-like-chaos-engineering-experiments-8dbf40c5f247
- https://thenewstack.io/gremlins-tammy-butow-on-the-business-side-of-chaos-engineering/
- https://learnk8s.io/blog/kubernetes-chaos-engineering-lessons-learned
- https://gocardless.com/blog/game-days-at-gc/
- https://engineering.grab.com/chaos-engineering
- https://blog.newrelic.com/engineering/chaos-engineering-explained/
- https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering-3434422afb54
- https://www.usenix.org/system/files/osdi18-veeraraghavan.pdf
- https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
- https://people.ucsc.edu/~palvaro/fit-ldfi.pdf
- https://landing.google.com/sre/book.html
- http://the-cloud-book.com/
- https://www.infoq.com/minibooks/emag-chaos-engineering
- https://www.pagerduty.com/blog/failure-fridays-four-years/
- https://www.slideshare.net/zgrinch/monkeys-lemurs-and-locusts-oh-my
- https://www.cloudreach.com/fr/blog/training-cloud-operations-teams-met-office/
- https://softwareengineeringdaily.com/2018/02/02/chaos-engineering-with-kolton-andrus/
- https://blog.codeship.com/embracing-the-chaos-of-chaos-engineering/
- https://sharpend.io/chaos-monkey-for-fun-and-profit/
- https://queue.acm.org/detail.cfm?id=2353017
- https://dl.acm.org/citation.cfm?id=3177123.3158134
- https://dl.acm.org/citation.cfm?id=2723711
- https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/
- https://devops.com/netflix-the-simian-army-and-the-culture-of-freedom-and-responsibility/
- http://www.oreilly.com/webops-perf/free/chaos-engineering.csp
- http://www.oreilly.com/webops-perf/free/antifragile-systems-and-teams.csp
- http://shop.oreilly.com/product/0636920251897.do
- https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
- https://www.gremlin.com/blog/the-discipline-of-chaos-engineering/
- http://kth.diva-portal.org/smash/get/diva2:1366436/FULLTEXT01.pdf
- https://arxiv.org/pdf/1907.13039.pdf
- https://arxiv.org/abs/1404.3056
- https://arxiv.org/abs/1702.05843
- https://arxiv.org/abs/1702.05849
- https://arxiv.org/abs/1805.05246
- https://arxiv.org/abs/1812.10706
- https://medium.com/@bbideep/why-should-chaos-be-part-of-your-distributed-systems-engineering-5bcb21497660
- https://medium.com/@njones_18523/chaos-engineering-traps-e3486c526059
- https://medium.com/@adhorn/chaos-engineering-ab0cc9fbd12a
- https://medium.com/netflix-techblog/fit-failure-injection-testing-35d8e2a9bb2
- https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116
- https://medium.com/netflix-techblog/automated-failure-testing-86c1b8bc841f
- https://medium.com/netflix-techblog/chaos-engineering-upgraded-878d341f15fa
- https://medium.com/netflix-techblog/chap-chaos-automation-platform-53e6d528371f
- https://medium.com/@crochefolle/how-to-convince-your-boss-to-make-them-say-yes-to-chaos-engineering-796ba119bd7
- https://medium.com/chaosiq/cloud-native-and-chaos-engineering-20842ee2fa8a
- https://www.wired.com/story/netflix-ddos-attack/
- https://github.com/gremlin/chaos-engineering-tools
- https://github.com/greenlearner01/Chaos-Engineering/blob/master/Chaos-Engineering.md
Contributions welcome! Read the contribution guidelines first.
MIT License
This work is licensed under a Creative Commons Attribution 4.0 International License.