“The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100 percent uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link,” explains Netflix.
To know with certainty that the failure of an individual component won’t affect the availability of the entire system, it’s necessary to experience the failure in practice, preferably in a realistic and fully automated manner. When a system has been tested against a sufficient number of failures, and all the discovered weaknesses have been addressed, such a system is very likely resilient enough to survive use in production.
“This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact,” says Netflix, whose early experiments with resiliency testing in production have given birth to a new discipline in software engineering: Chaos Engineering.
What Is Chaos Engineering?

The core idea behind Chaos Engineering is to break things on purpose in order to discover and fix weaknesses. Netflix, a pioneer in the field of automated failure testing and the company that originally formalized Chaos Engineering as a discipline in the Principles of Chaos Engineering, defines it as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Chaos Engineering acknowledges that we live in an imperfect world where things break unexpectedly and often catastrophically. Knowing this, the most productive decision we can make is to accept this reality and focus on creating quality products and services that are resilient to failures.
Mathias Lafeldt, a professional infrastructure developer who’s currently working remotely for Gremlin Inc., says, “Building resilient systems requires experience with failure. Waiting for things to break in production is not an option. We should rather inject failures proactively in a controlled way to gain confidence that our production systems can withstand those failures. By simulating potential errors in advance, we can verify that our systems behave as we expect—and to fix them if they don’t.”
In doing so, we’re building systems that are antifragile, which is a term borrowed from Nassim Nicholas Taleb’s 2012 book titled “Antifragile: Things That Gain from Disorder.” Taleb, a Lebanese-American essayist, scholar, statistician, former trader, and risk analyst, introduces the book by saying, “Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”
On his blog, Lafeldt gives another example of antifragility, “Take the vaccine—we inject something harmful into a complex system (an organism) in order to build an immunity to it. This translates well to our distributed systems where we want to build immunity to hardware and network failures, our dependencies going down, or anything that might go wrong.”
Just as with vaccination, the exposure of a system to volatility, randomness, disorder, and stressors must be executed in a well-thought-out manner that won’t wreak havoc on the system should something go wrong. Automated failure testing should ideally start with the smallest possible impact that can still teach something, and gradually become more impactful as the tested system becomes more resilient.
The Five Principles of Chaos Engineering

“The term ‘chaos’ evokes a sense of randomness and disorder. However, that doesn’t mean Chaos Engineering is something that you do randomly or haphazardly. Nor does it mean that the job of a chaos engineer is to induce chaos. On the contrary: we view Chaos Engineering as a discipline. In particular, we view Chaos Engineering as an experimental discipline,” state Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri in “Chaos Engineering: Building Confidence in System Behavior through Experiments.”
In their book, the authors propose the following five principles of Chaos Engineering:
Hypothesize About Steady State

The Systems Thinking community uses the term “steady state” to refer to a property that a system tends to maintain within a certain range or pattern. In terms of failure testing, the normal operation of the tested system is its steady state, and we can determine what constitutes normal based on a number of metrics, such as CPU load, memory utilization, network I/O, how long it takes to service web requests, or how much time is spent in various database queries.
“Once you have your metrics and an understanding of their steady state behavior, you can use them to define the hypotheses for your experiment. Think about how the steady state behavior will change when you inject different types of events into your system. If you add requests to a mid-tier service, will the steady state be disrupted or stay the same? If disrupted, do you expect the system output to increase or decrease?” ask the authors.
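One way to think about such a hypothesis is as a set of acceptable ranges for the chosen metrics: the hypothesis holds if, after injecting an event, every metric stays in range. The following minimal sketch encodes this idea; the metric names and thresholds are illustrative assumptions, not a real monitoring API.

```python
# Hypothetical steady-state hypothesis: each metric must stay within
# its acceptable range. Names and thresholds are made up for illustration.
STEADY_STATE = {
    "cpu_load": (0.0, 0.75),          # fraction of capacity
    "p99_latency_ms": (0.0, 250.0),   # 99th-percentile request latency
    "error_rate": (0.0, 0.01),        # fraction of failed requests
}

def violations(observed):
    """Return the names of metrics whose observed value falls outside steady state."""
    return [name for name, (lo, hi) in STEADY_STATE.items()
            if not lo <= observed.get(name, float("inf")) <= hi]

def hypothesis_holds(observed):
    """The experiment's hypothesis: the injected event leaves steady state intact."""
    return not violations(observed)
```

A missing metric counts as a violation here, on the assumption that losing visibility into a metric is itself a deviation worth aborting on.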
Vary Real-World Events

Suitable events for a chaos experiment include any event capable of disrupting steady state. This includes hardware failures, functional bugs, state transmission errors (e.g., inconsistency of states between sender and receiver nodes), network latency and partitions, large fluctuations in input (up or down) and retry storms, resource exhaustion, unusual or unpredictable combinations of inter-service communication, Byzantine failures (e.g., a node believing it has the most current data when it actually does not), race conditions, malfunctioning downstream dependencies, and more.
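Many of these events can be simulated in software. As one hypothetical illustration of the network-latency case, a wrapper can make a fraction of service calls incur extra delay; the delay, probability, and call shape here are all assumptions for the sketch.

```python
import random
import time

def with_injected_latency(call, delay_s=0.05, probability=0.5, rng=random):
    """Wrap a service call so that a fraction of invocations incur extra
    latency, simulating a degraded network path (parameters are illustrative)."""
    def wrapped(*args, **kwargs):
        if rng.random() < probability:
            time.sleep(delay_s)  # inject the delay before the real call
        return call(*args, **kwargs)
    return wrapped
```

Passing in the random source (`rng`) keeps the injection deterministic in tests while leaving production behavior random.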
“Only induce events that you expect to be able to handle! Induce real-world events, not just failures and latency. While the examples provided have focused on the software part of systems, humans play a vital role in resiliency and availability. Experimenting on the human-controlled pieces of incident response (and their tools!) will also increase availability,” warn the authors.
Run Experiments in Production

Chaos Engineering prefers to experiment directly on production traffic to guarantee both the authenticity of the way in which the system is exercised and relevance to the currently deployed system. This goes against the commonly held tenet of classical testing, which strives to identify problems as far away from production as possible. Naturally, one needs a lot of confidence in the tested system’s resiliency to the injected events; known weaknesses indicate a lack of maturity that must be addressed before conducting any Chaos Engineering experiments.
“When we do traditional software testing, we’re verifying code correctness. We have a good sense about how functions and methods are supposed to behave, and we write tests to verify the behaviors of these components. When we run Chaos Engineering experiments, we are interested in the behavior of the entire overall system. The code is an important part of the system, but there’s a lot more to our system than just code. In particular, state and input and other people’s systems lead to all sorts of system behaviors that are difficult to foresee,” write the authors.
Automate Experiments to Run Continuously

Automation is a critical pillar of Chaos Engineering. Chaos engineers automate the execution of experiments, the analysis of experimental results, and sometimes even aspire to automate the creation of new experiments. That said, one-off manual experiments are a good place to start with failure testing. After a few batches of carefully designed manual experiments, the natural next step is to automate them.
“The challenge of designing Chaos Engineering experiments is not identifying what causes production to break, since the data in our incident tracker has that information. What we really want to do is identify the events that shouldn’t cause production to break, and that have never before caused production to break, and continuously design experiments that verify that this is still the case,” emphasize the authors.
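At its simplest, continuous execution is a registry of experiments run on a schedule, with each run recording whether the hypothesis held. The sketch below uses a plain loop in place of a real scheduler, and every name in it is an assumption.

```python
# Hypothetical continuous runner: each experiment verifies an event that
# should NOT break production and returns True if steady state held.
def run_continuously(experiments, rounds=1):
    """Run every registered experiment `rounds` times, collecting pass/fail results."""
    results = []
    for _ in range(rounds):
        for name, experiment in experiments.items():
            try:
                passed = experiment()
            except Exception:
                passed = False  # an unhandled error counts as a failed experiment
            results.append((name, passed))
    return results
```

In real tooling the loop would be driven by a scheduler, and results would feed dashboards and alerts rather than a list.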
Minimize Blast Radius

It’s important to realize that each chaos experiment has the potential to cause real damage. The difference between a badly designed and a well-designed chaos experiment is the blast radius. The most basic way to minimize the blast radius of any chaos experiment is to always have an emergency stop mechanism in place that instantly shuts down the experiment in case it goes out of control. Chaos experiments should build upon each other, taking careful, measured risks that gradually escalate the overall scope of the testing without causing unnecessary harm.
“The entire purpose of Chaos Engineering is undermined if the tooling and instrumentation of the experiment itself cause an undue impact on the metric of interest. We want to build confidence in the resilience of the system, one small and contained failure at a time,” caution the authors in the book.
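An emergency stop can be as simple as advancing the experiment in small steps and checking a guard metric after each one. The following sketch assumes an error-rate metric and a per-step injection hook; both are illustrative, not any particular tool's API.

```python
# Hypothetical kill switch: abort the experiment the moment the guard
# metric crosses its threshold. All names and thresholds are illustrative.
def run_with_kill_switch(steps, read_error_rate, abort_above=0.05):
    """Execute experiment steps one at a time; abort as soon as the observed
    error rate exceeds `abort_above`. Returns (steps_completed, aborted)."""
    completed = 0
    for step in steps:
        step()  # inject one small, contained failure
        if read_error_rate() > abort_above:
            return completed, True  # stop the experiment immediately
        completed += 1
    return completed, False
```

Checking the metric after every step, rather than at the end, is what keeps the blast radius small: the experiment never runs more than one step past the first sign of trouble.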
Chaos at Netflix

Netflix has been practicing some form of resiliency testing in production ever since the company began moving out of data centers into the cloud in 2008. The first Chaos Engineering tool to gain fame outside Netflix’s offices was Chaos Monkey, which is currently in version 2.0.
“Years ago, we decided to improve the resiliency of our microservice architecture. At our scale, it is guaranteed that servers on our cloud platform will sometimes suddenly fail or disappear without warning. If we don’t have proper redundancy and automation, these disappearing servers could cause service problems. The Freedom and Responsibility culture at Netflix doesn’t have a mechanism to force engineers to architect their code in any specific way. Instead, we found that we could build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward. We created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours,” explains Netflix.
The rate at which Chaos Monkey turns off servers is higher than the rate at which server outages happen normally, and Chaos Monkey is configured to act during business hours. Thus, engineers are forced to build resilient services through automation, redundancy, fallbacks, and other best practices of resilient design.
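The mechanism Netflix describes can be reduced to a very small core: pick one instance of a service at random and terminate it. The toy sketch below illustrates that idea only; it is not Netflix's actual code, and the terminate hook stands in for whatever cloud-provider API a real tool would call.

```python
import random

def terminate_random_instance(instances, terminate, rng=random):
    """Choose one instance at random and call the supplied terminate hook on it.
    Returns the chosen instance, or None if there is nothing to terminate."""
    if not instances:
        return None
    victim = rng.choice(instances)
    terminate(victim)  # in practice: a call to the cloud provider's API
    return victim
```

The point of the randomness is organizational as much as technical: because any instance can disappear at any time, no service owner can treat instance loss as an edge case.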
While previous versions of Chaos Monkey could additionally perform actions like burning up CPU and taking storage devices offline, Netflix uses Chaos Monkey 2.0 only to terminate instances. Chaos Monkey 2.0 is fully integrated with Spinnaker, Netflix’s open source multi-cloud continuous delivery platform, which is intended to make it easy to extend and enhance cloud deployment models. The integration allows service owners to set their Chaos Monkey 2.0 configs through the Spinnaker apps, and Chaos Monkey 2.0 to get information from Spinnaker about how services are deployed.
Once Netflix realized the enormous potential of breaking things on purpose to rebuild them better, the company decided to take things to the next level and move from the small scale to the very large scale with the 2013 release of Chaos Kong, a tool capable of testing how their services behave when a zone or an entire region is taken down. According to Nir Alfasi, a Netflix engineer, the company practices region outages using Kong almost every month.
“What we need is a way to limit the impact of failure testing while still breaking things in realistic ways. We need to control the outcome until we have confidence that the system degrades gracefully, and then increase it to exercise the failure at scale. This is where FIT (Failure Injection Testing) comes in,” stated Netflix in early 2014, after realizing that they needed a finer degree of control when deliberately breaking things than their existing tools allowed for at the time. FIT is a platform designed to simplify the creation of failure within Netflix’s ecosystem with a greater degree of precision. FIT also allows Netflix to propagate its failures across the entirety of Netflix in a consistent and controlled manner. “FIT has proven useful to bridge the gap between isolated testing and large-scale chaos exercises, and make such testing self-service.”
Once the Chaos Engineering team at Netflix believed that they had a good story at small scale (Chaos Monkey) and large scale (Chaos Kong) and in between (FIT), it was time to formalize Chaos Engineering as a practice, which happened in mid-2015 with the publication of the Principles of Chaos Engineering. “With this new formalization, we pushed Chaos Engineering forward at Netflix. We had a blueprint for what constituted chaos: we knew what the goals were, and we knew how to evaluate whether or not we were doing it well. The principles provided us with a foundation to take Chaos Engineering to the next level,” write Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri in “Chaos Engineering: Building Confidence in System Behavior through Experiments.”
The latest notable addition to Netflix’s Chaos Engineering family of tools is ChAP (Chaos Automation Platform), which was launched in late 2016. “We are excited to announce ChAP, the newest member of our chaos tooling family! Chaos Monkey and Chaos Kong ensure our resilience to instance and regional failures, but threats to availability can also come from disruptions at the microservice level. FIT was built to inject microservice-level failure in production, and ChAP was built to overcome the limitations of FIT so we can increase the safety, cadence, and breadth of experimentation,” wrote Netflix when introducing the new failure testing automation tool.
Although Netflix isn’t the only company interested in Chaos Engineering, their willingness to develop in the open and share with others has had a profound influence on the industry. Besides regularly speaking at various industry events, Netflix’s GitHub page contains a wealth of interesting open source projects that are ready for adoption.
Chaos Engineering is also being embraced by Etsy, Microsoft, Jet, Gremlin, Google, and Facebook, to name a few. These and other companies have developed a comprehensive range of open source tools for different use cases, including Simoorg (LinkedIn’s own failure-inducer framework), Pumba (a chaos testing and network emulation tool for Docker), Chaos Lemur (a self-hostable application that randomly destroys virtual machines in a BOSH-managed environment), and Blockade (a Docker-based utility for testing network failures and partitions in distributed applications).
Learn to Embrace Chaos

If you now feel inspired to embrace the principles and tools described above and create your own Chaos Engineering experiments, you may want to follow the experiment design process outlined in “Chaos Engineering: Building Confidence in System Behavior through Experiments”:
- Pick a hypothesis. Decide what hypothesis you’re going to test, and don’t forget that your system includes the humans involved in maintaining it.
- Choose the scope of the experiment. Strive to run experiments in production and minimize blast radius. The closer your test is to production, the more you’ll learn from the results.
- Identify the metrics you’re going to watch. Try to operationalize your hypothesis using your metrics as much as possible. Be ready to abort early if the experiment has a more serious impact than you expected.
- Notify the organization. Inform members of your organization about what you’re doing, and coordinate with the teams that are interested in the outcome or nervous about the impact of the experiment.
- Run the experiment. Keep an eye on your metrics in case you need to abort.
- Analyze the results. Carefully analyze the results of the experiment and feed the outcome to all the relevant teams.
- Increase the scope. Once you gain confidence running smaller-scale experiments, you may want to increase the scope to reveal systemic effects that aren’t noticeable at a smaller scale.

The more regularly you run your chaos experiments, the more value you can get out of them.
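The design process above can be sketched as a single loop: notify the organization, then run the experiment at gradually increasing scope, checking the hypothesis against the observed metrics and stopping escalation the first time it fails. Every name in this sketch is illustrative; real tooling would plug in actual metrics, notifications, and failure injectors.

```python
# Hypothetical orchestration of the experiment design process.
def chaos_experiment(hypothesis, inject, observe, notify, scopes):
    """Notify, then run the experiment at each scope in order, recording
    whether the hypothesis held; stop escalating at the first failure."""
    notify(f"starting chaos experiment across scopes: {scopes}")
    results = {}
    for scope in scopes:        # increase scope only after a success
        inject(scope)           # run the experiment at this scope
        metrics = observe()     # watch the metrics you identified
        results[scope] = hypothesis(metrics)
        if not results[scope]:  # abort escalation; analyze before retrying
            break
    return results
```

Feeding `results` back to the relevant teams, and re-running the loop regularly, closes the analyze-and-increase-scope steps of the process.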