Chaos Engineering Introduction

Introduction

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. The goal is a more resilient system, achieved by deliberately breaking things under controlled conditions.

Roadmap

  • 2010: The Netflix Eng Tools team creates Chaos Monkey. It was built in response to Netflix’s move from physical infrastructure to cloud infrastructure provided by Amazon Web Services (AWS), to test that the failure of an AWS instance would not affect the Netflix service.
  • 2011: The Simian Army adds further failure injection modes on top of Chaos Monkey.
  • 2012: Chaos Monkey is open-sourced.
  • 2014: Netflix creates a new role, the Chaos Engineer: a person who helps the company avoid outages by running proactive chaos engineering experiments.
  • 2014: FIT (Failure Injection Testing) by Kolton Andrus, Naresh Gopalani, and Ben Schmaus (https://netflixtechblog.com/fit-failure-injection-testing-35d8e2a9bb2).
  • 2015: Chaos Kong, which doesn’t just kill a server: it kills an entire AWS Region, to verify that the system is strong enough to handle a traffic failover.
  • 2016: Gremlin, the world’s first managed enterprise Chaos Engineering solution.
  • 2017: ChaosToolkit, the chaos engineering toolkit for developers.

Benefits

Chaos engineering improves the stability and resilience of a system. The benefits differ by role:

  • Architect: measure the fault tolerance of the system architecture
  • Dev/SRE: improve the emergency response to failures
  • QA: fill the gaps left by traditional testing methods; find production bugs in advance and reduce failures
  • PM & UI: improve the customer experience
  • Customers: a better experience

Advanced Principles

Tools

The following are some chaos tools:

| Tool | Description | Dependency | Project state | Link |
|------|-------------|------------|---------------|------|
| Chaos Monkey | A resiliency tool that helps applications tolerate random instance failures. | Spinnaker | no updates | chaosmonkey |
| Simian Army | Latency Monkey, Conformity Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, Chaos Gorilla | None | no updates | SimianArmy |
| LitmusChaos | Helps Kubernetes SREs and developers practice chaos engineering in a Kubernetes-native way. | K8s | active | litmus |
| ChaosBlade | An easy-to-use and powerful chaos engineering experiment toolkit. | None | active | chaosblade-io, chaosblade-help-zh-cn |
| Chaoskube | Periodically kills random pods in your Kubernetes cluster. | K8s | active | chaoskube |
| Pumba | Chaos testing tool for Docker. | Docker | active | pumba |
| Chaos Mesh | A cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments. | K8s | active | chaos-mesh, docs |

[Figure: comparison of the tools]

Practice

[Figure: the chaos engineering process]
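The process can also be read as a loop. The following is a minimal shell sketch, assuming the commonly described steps (confirm the steady state, inject a fault, check the steady state again, roll back); it is not taken from the original picture. The function name `run_experiment` and its three command-string arguments are hypothetical placeholders for your own probe, injection, and rollback commands.

```shell
# run_experiment PROBE INJECT ROLLBACK
# Each argument is a shell command string (hypothetical placeholders).
run_experiment() {
  local probe="$1" inject="$2" rollback="$3"

  # 1. Confirm the steady state before touching anything.
  if ! eval "$probe"; then
    echo "aborted: system not in steady state"
    return 2
  fi

  # 2. Inject the fault.
  eval "$inject"

  # 3. Check whether the steady state still holds under the fault.
  local result=0
  if eval "$probe"; then
    echo "hypothesis held"
  else
    echo "weakness found"
    result=1
  fi

  # 4. Always roll back the fault, whatever the outcome.
  eval "$rollback"
  return "$result"
}
```

The rollback runs on both outcomes, so a failed hypothesis does not leave the fault in place. The exit status (0, 1, or 2) lets a scheduler or CI job tell the three cases apart.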

Chaosblade

Introduction

ChaosBlade is an easy-to-use and powerful chaos engineering toolkit: an open-source fault injection tool from Alibaba.

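| Target | Implementation |
|--------|----------------|
| cpu | burncpu |
| io | dd |
| disk | dd |
| dns | /etc/hosts |
| network | tc, iptables |
| process | kill: kill -9; stop: kill -19 (recover: kill -18) |
| memory | dd, mount |
| script | delay: sleep |
| file | mkdir, touch, chmod, rm, mv |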
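To make the process entry concrete, here is a minimal sketch of the stop/recover pair using plain signals, outside ChaosBlade. It assumes Linux, where signal 19 is SIGSTOP and signal 18 is SIGCONT on x86; the sketch uses signal names rather than numbers because the numbers vary by platform. The helper names are illustrative, not part of ChaosBlade.

```shell
# Suspend a process (the effect of "kill -19" on Linux x86).
stop_process() { kill -s STOP "$1"; }

# Resume a suspended process (the effect of "kill -18" on Linux x86).
recover_process() { kill -s CONT "$1"; }

# Print the one-letter state from /proc ("T" means stopped). Linux only.
process_state() {
  sed -n 's/^State:[[:space:]]*\([A-Za-z]\).*/\1/p' "/proc/$1/status"
}
```

For example, after `sleep 60 &` and `stop_process $!`, `process_state $!` should report `T` until `recover_process $!` is called. Unlike `kill -9`, the stopped process keeps its memory and open connections, which makes this a useful way to simulate a hung service.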

How to use

By Release Package

By Docker 

  • docker pull chaosbladeio/chaosblade-demo
  • docker run -it --privileged chaosbladeio/chaosblade-demo

Demo

OS – CPU

  • Create a CPU load of 80%
    • ./blade create cpu load --cpu-percent 80
  • Destroy the experiment, using the id returned by the create command
    • ./blade destroy {id}
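The create and destroy steps can be chained in a small script. This is a sketch under one assumption: that `blade create` prints a JSON reply whose "result" field holds the experiment id, for example {"code":200,"success":true,"result":"..."}. Check the output of your own blade version before relying on it. The function names and the `BLADE` variable are hypothetical.

```shell
# Extract the "result" field (assumed to be the experiment id) from a JSON reply.
experiment_id() {
  printf '%s\n' "$1" | sed -n 's/.*"result":"\([^"]*\)".*/\1/p'
}

# Create an 80% CPU load, hold it for N seconds (default 30), then destroy it.
# BLADE may point at the blade binary; it defaults to ./blade.
cpu_experiment() {
  local blade="${BLADE:-./blade}" hold="${1:-30}" reply id
  reply=$("$blade" create cpu load --cpu-percent 80) || return 1
  id=$(experiment_id "$reply")
  # Give up if no id could be parsed, since there is nothing to destroy.
  [ -n "$id" ] || return 1
  sleep "$hold"
  "$blade" destroy "$id"
}
```

Calling `cpu_experiment 60` would hold the load for 60 seconds. Parsing with sed keeps the sketch dependency-free; a JSON tool such as jq would be more robust if it is available.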

Pumba

Download Pumba and make the binary executable (chmod +x pumba_linux_amd64):

wget https://github.com/alexei-led/pumba/releases/download/0.7.8/pumba_linux_amd64

Demo:

Use Pumba to stop a container and restart it after 30 seconds:

./pumba_linux_amd64 stop --duration 30s --restart {container}

Use Pumba to pause a container's processes for 60 seconds:

./pumba_linux_amd64 pause --duration 60s {container}
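The two demo commands can be wrapped in small helpers with a dry-run switch, so that a command can be reviewed before it touches a real container. This is a sketch: the function names and the `PUMBA` and `DRY_RUN` variables are hypothetical conveniences, not Pumba features.

```shell
# Path to the Pumba binary; override by setting PUMBA.
PUMBA="${PUMBA:-./pumba_linux_amd64}"

# Stop a container and restart it after the given duration.
# $1 = duration (for example 30s), $2 = container name
pumba_stop_restart() {
  # With DRY_RUN set, the command is printed instead of executed.
  ${DRY_RUN:+echo} "$PUMBA" stop --duration "$1" --restart "$2"
}

# Pause all processes in a container for the given duration.
# $1 = duration (for example 60s), $2 = container name
pumba_pause() {
  ${DRY_RUN:+echo} "$PUMBA" pause --duration "$1" "$2"
}
```

For example, `DRY_RUN=1 pumba_pause 60s web` prints the command it would run for a container named web, and the same call without `DRY_RUN` executes it.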
