This phrase is new and it originated at Netflix back in 2010. I was listening to Nora Jones, a Netflix engineer, at the AWS re:Invent conference a few weeks back, where she talked about this. The principle of chaos goes like this: “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Distributed systems have many moving parts, and failures can occur at various levels: hard disks can fail, the network can go down, and a sudden surge in customer traffic can overload a functional component. The list goes on. All too often, these events trigger outages, poor performance, and other undesirable behaviors. Chaos Engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light. This empirical process of verification leads to more resilient systems and builds confidence in the operational behavior of those systems.
Netflix moved its operations to the cloud back in 2008 and has practiced some form of resiliency testing ever since. It introduced Chaos Monkey, which systematically turned off services in production systems. Then came Chaos Kong for large-scale failures such as shutting off a whole data center. Another tool, FIT (Failure Injection Testing), was introduced to test the scenarios between the small (Chaos Monkey) and the very large (Chaos Kong). All these experiments culminated in what is called Chaos Engineering, a discipline now used across many large companies such as Google, Amazon, and Microsoft.
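To make the difference in blast radius concrete, here is a toy, in-memory sketch in Python. It is not how Netflix's tools are implemented; the `Fleet` and `Instance` classes, the data center names, and the rule that the service is "up" while any instance is healthy are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    datacenter: str
    healthy: bool = True


@dataclass
class Fleet:
    instances: list = field(default_factory=list)

    def healthy_instances(self):
        return [i for i in self.instances if i.healthy]

    def is_serving(self):
        # Assumption: the toy service is "up" while any instance is healthy.
        return len(self.healthy_instances()) > 0


def kill_random_instance(fleet, rng):
    """Small blast radius: take one healthy instance down (Chaos Monkey style)."""
    candidates = fleet.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def kill_datacenter(fleet, datacenter):
    """Large blast radius: take every instance in one data center down
    (Chaos Kong style)."""
    victims = [i for i in fleet.instances
               if i.datacenter == datacenter and i.healthy]
    for instance in victims:
        instance.healthy = False
    return victims


def build_fleet():
    # Hypothetical layout: two data centers with three instances each.
    return Fleet([Instance(f"{dc}-{n}", dc, True)
                  for dc in ("dc-a", "dc-b") for n in range(3)])


if __name__ == "__main__":
    rng = random.Random(42)
    fleet = build_fleet()
    victim = kill_random_instance(fleet, rng)
    print("killed", victim.name, "| still serving:", fleet.is_serving())
    kill_datacenter(fleet, "dc-a")
    print("dc-a is down | still serving:", fleet.is_serving())
```

In this sketch the service survives both failures only because a second data center still has healthy instances. That kind of redundancy is what these tools are meant to verify.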
Applying Chaos Engineering improves the resilience of a system. By designing and executing Chaos Engineering experiments, you will learn about weaknesses in your system that could potentially lead to outages that cause customer harm. You can then address those weaknesses proactively, going beyond the reactive processes that currently dominate most incident response models.
So what is the difference between Chaos Engineering (experimentation) and testing? In testing, an assertion is made: given specific conditions, a system will emit a specific output. Tests are typically binary and determine whether a property is true or false. Strictly speaking, this does not generate new knowledge about the system; it just assigns valence to a known property of it. Experimentation generates new knowledge and often suggests new avenues of exploration. Inputs for chaos experiments can range from maxing out CPU cores on an Elasticsearch cluster to partially deleting Kafka topics across a variety of instances in order to recreate an issue that occurred in production. Numerous experiments can be performed to understand system behavior ahead of time and take corrective actions.
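The contrast can be sketched in a few lines of Python. This is a simulation under stated assumptions, not a real chaos tool: `call_service` is a hypothetical stand-in for a remote call, the injected fault is simply a per-attempt failure probability, and the 1% tolerance is an arbitrary choice. The test checks one known property, while the experiment compares a control group with a fault-injected group and reports what it observed.

```python
import random


def call_service(rng, fault_rate=0.0, retries=0):
    """Hypothetical remote call: each attempt fails with probability
    `fault_rate`, and the client retries up to `retries` times."""
    for _ in range(retries + 1):
        if rng.random() >= fault_rate:
            return True
    return False


def success_rate(rng, requests, fault_rate, retries):
    ok = sum(call_service(rng, fault_rate, retries) for _ in range(requests))
    return ok / requests


def run_experiment(fault_rate, retries, requests=10_000, tolerance=0.01, seed=1):
    """Compare a control group (no fault) with an experimental group
    (fault injected) and check the steady-state hypothesis."""
    rng = random.Random(seed)
    control = success_rate(rng, requests, 0.0, retries)
    experiment = success_rate(rng, requests, fault_rate, retries)
    return {
        "control": control,
        "experiment": experiment,
        # Hypothesis: the injected fault costs at most `tolerance` success rate.
        "hypothesis_holds": control - experiment <= tolerance,
    }


if __name__ == "__main__":
    # A test: one known condition, one expected output, a binary verdict.
    assert call_service(random.Random(0), fault_rate=0.0) is True

    # An experiment: vary the conditions and observe how the system behaves.
    for retries in (0, 1, 3):
        print("retries =", retries, run_experiment(fault_rate=0.2, retries=retries))
```

Running the loop shows how much retrying the simulated client needs before a 20% fault rate stops being visible. That is new information about the system, not a pass or fail on a property already known.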
At Google, Kripa Krishnan leads a team that constantly breaks the system. A small group of testers from other big companies has started working together to share best practices, and they are currently looking for ways to automate some of the tests. “Right now, scale is our problem. We are doing hundreds of tests, but I cannot scale my team to hundreds of people. So we are exploring automating some of this. How do you constantly cause damage so systems are constantly recovering?”
As distributed systems get more complex with thousands of microservices providing various functions, chaos engineering is emerging as a key practice to make these systems more resilient to failure.