Introduction
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. In short, the goal is to build a more resilient system through deliberate destruction.
Roadmap
- 2010: The Netflix Eng Tools team creates Chaos Monkey. It was built in response to Netflix’s move from physical infrastructure to cloud infrastructure provided by Amazon Web Services (AWS), as a tool to test that the failure of an AWS instance would not affect the Netflix service
- 2011: The Simian Army adds additional failure injection modes on top of Chaos Monkey
- 2012: Chaos Monkey is open-sourced
- 2014: Netflix creates a new role, the Chaos Engineer: a person who helps the company avoid outages by running proactive chaos engineering experiments
- 2014: FIT (Failure Injection Testing), by Kolton Andrus, Naresh Gopalani, and Ben Schmaus (https://netflixtechblog.com/fit-failure-injection-testing-35d8e2a9bb2)
- 2015: Chaos Kong, which doesn’t just kill a server; it kills an entire AWS Region. It helps verify that the system is strong enough to handle a regional traffic failover
- 2016: Gremlin: the world’s first managed enterprise Chaos Engineering solution
- 2017: ChaosToolkit: the chaos engineering toolkit for developers
Benefits
Basically, chaos engineering improves the stability and resilience of a system. Different roles benefit in different ways:
- Architect: measure the fault tolerance of the system architecture
- Dev/SRE: improve the emergency response to failures
- QA: fill the gaps left by traditional testing methods; find live bugs in advance and reduce failures
- PM & UI: improve customer experience
- Customers: better experience
Advanced Principles

Tools
The following are some chaos tools:
Tool | Description | Dependency | Project State | Link |
Chaosmonkey | A resiliency tool that helps applications tolerate random instance failures. | Spinnaker | no update | chaosmonkey |
Simian Army | A suite of tools: Latency Monkey, Conformity Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, Chaos Gorilla. | None | no update | SimianArmy |
Litmuschaos | Helps Kubernetes SREs and developers practice chaos engineering in a Kubernetes-native way. | K8s | alive | litmus |
Chaosblade | An easy-to-use and powerful chaos engineering experiment toolkit. | None | alive | chaosblade-io chaosblade-help-zh-cn |
Chaoskube | Periodically kills random pods in your Kubernetes cluster. | K8s | alive | chaoskube |
Pumba | Chaos testing tool for Docker. | Docker | alive | pumba |
Chaosmesh | A cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments. | K8s | alive | chaos-mesh docs |
A comparison of the tools:

Practice
The following picture shows the chaos engineering process:


Chaosblade
Introduction
Chaosblade is an easy-to-use and powerful chaos engineering toolkit: an open-source fault injection tool from Alibaba. The table below lists each experiment target and the mechanism used to implement it.
Target | Implementation |
cpu | burncpu |
io | dd |
disk | dd |
dns | /etc/hosts |
network | tc, iptables |
process | kill: kill -9; stop: kill -19 (recover: kill -18) |
memory | dd, mount |
script | delay: sleep |
file | mkdir, touch, chmod, rm, mv |
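The process row can be reproduced by hand to see what the stop, recover, and kill experiments do to a target. The sketch below is not Chaosblade's own code. It assumes a Linux host with /proc mounted, uses a throwaway sleep process as a hypothetical target, and relies on the usual Linux signal numbers (19 = SIGSTOP, 18 = SIGCONT, 9 = SIGKILL), which differ on a few architectures.

```shell
#!/bin/sh
# Start a throwaway target process (a hypothetical stand-in for a real service).
sleep 60 &
target=$!

# Print the one-letter state from /proc/<pid>/stat (third field).
# T = stopped, S = sleeping, R = running. Linux only.
proc_state() {
    cut -d' ' -f3 "/proc/$1/stat"
}

# "stop" experiment: SIGSTOP freezes the process (same as kill -19 on most Linux systems).
kill -STOP "$target"
sleep 1
stopped_state=$(proc_state "$target")

# Recover: SIGCONT resumes it (same as kill -18).
kill -CONT "$target"
sleep 1
resumed_state=$(proc_state "$target")

echo "state after stop: $stopped_state, after recover: $resumed_state"

# "kill" experiment: SIGKILL (kill -9) cannot be caught or ignored.
kill -KILL "$target"
status=0
wait "$target" 2>/dev/null || status=$?
echo "target exit status: $status"
```

Using signal names instead of numbers keeps the script portable across architectures where the numeric values differ.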
How to use
By Release Package
- wget https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/1.2.0/chaosblade-1.2.0-linux-amd64.tar.gz
- tar zxf chaosblade-1.2.0-linux-amd64.tar.gz
- cd chaosblade-1.2.0
- ./blade help
By Docker
- docker pull chaosbladeio/chaosblade-demo
- docker run -it --privileged chaosbladeio/chaosblade-demo
Demo
OS – CPU
- Create an 80% CPU load
- ./blade create cpu load --cpu-percent 80
- Destroy the experiment, where {id} is the experiment id returned by the create command
- ./blade destroy {id}
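When the demo is scripted, the experiment id has to be captured from the create command's reply so that the experiment can be destroyed afterwards. The following is a minimal sketch, assuming the reply is a single-line JSON object shaped like {"code":200,"success":true,"result":"<id>"}; the helper names extract_uid and cpu_load_for are hypothetical and not part of Chaosblade.

```shell
#!/bin/sh
# Extract the experiment id from blade's JSON reply.
# Assumption: the reply looks like {"code":200,"success":true,"result":"<id>"}.
# Prints nothing if there is no "result" field (for example, on an error reply).
extract_uid() {
    printf '%s\n' "$1" | sed -n 's/.*"result":"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: hold the CPU at <percent> load for <seconds>,
# then destroy the experiment.
cpu_load_for() {
    percent=$1
    seconds=$2
    reply=$(./blade create cpu load --cpu-percent "$percent") || return 1
    uid=$(extract_uid "$reply")
    if [ -z "$uid" ]; then
        echo "no experiment id in reply: $reply" >&2
        return 1
    fi
    echo "experiment $uid started"
    sleep "$seconds"
    ./blade destroy "$uid"
}
```

From the chaosblade-1.2.0 directory, `cpu_load_for 80 30` would hold the load for 30 seconds and then clean up. A JSON-aware tool such as jq would be more robust than sed if it is available on the host.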


Pumba
Download pumba and make it executable:
wget https://github.com/alexei-led/pumba/releases/download/0.7.8/pumba_linux_amd64
chmod +x pumba_linux_amd64
Demo:
Use pumba to stop a container and restart it after 30s:
./pumba_linux_amd64 stop --duration 30s --restart {container}
Use pumba to pause a container’s processes for 60s:
./pumba_linux_amd64 pause --duration 60s {container}