Root Out Failures Before They Become Outages, with Chaos Engineering

Deven Samant

Senior Director – Digital Engineering Practice

Apr 27, 2020

Posted in DevOps

The COVID-19 pandemic has highlighted stark weaknesses inherent in systems we take for granted most of the time. An approach that is rapidly gaining ground in digital engineering circles is chaos engineering, a way of building resilience and protecting against unforeseen interruptions in service.

When it comes to the digital economy, it hardly needs to be stated that preventing downtime is paramount. Last year’s epic Facebook outage reportedly cost the company $90 million in lost revenue. Consider that for approximately a quarter of businesses worldwide, server downtime in 2019 cost an average of $300,000 to $400,000 per hour. And while companies, of course, measure the cost of downtime in lost dollars, there are also hidden costs associated with reputational damage, lost opportunity, and diminished competitiveness.

How can chaos engineering help?

Although it is not new in industrial and manufacturing settings, chaos engineering is a relatively new discipline in digital engineering. It involves experimenting with software in production to better understand faults and build confidence in the system’s overall capability to withstand turbulence. Netflix brought chaos engineering to the fore in 2011 during its cloud migration, when it created Chaos Monkey, a tool that tests the stability of Netflix’s systems by randomly terminating instances in production. In 2012, Netflix released Chaos Monkey under an open-source license. The lessons that emerged, along with Netflix’s wider suite of Simian Army tools, have been put to use ever since.

While the principles behind chaos engineering have been gaining traction, clients are often (understandably) apprehensive because of a misperception that chaos engineering is all about deliberately breaking things. Terms like “blast radius” or “random terminations” and references to “chaos” or “storms” (Facebook’s name for it) don’t exactly help soothe their concerns.

In reality, experiments are meticulously planned from initial scoping to execution, and the insights they deliver are far-reaching.

That’s why I wanted to set out how engineering teams can begin to apply chaos engineering to root out failures in their systems before they become outages, without blowing anything up, losing their jobs, or making enemies of the rest of the business.

Define “normal”

One of the main benefits of chaos engineering is the sheer amount enterprises get to learn about their business. Those findings lose much of their value if you cannot measure what went wrong or right. So the first step is to understand what constitutes “business as usual”, ideally as measurable indicators of healthy behavior.
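As a rough illustration of what measuring “business as usual” could look like, the sketch below summarizes a metric collected under normal conditions and checks whether a later reading still falls within that range. The metric (checkout latency), the sample values, the three-standard-deviation tolerance, and the names `Baseline` and `build_baseline` are all assumptions made for this example, not part of any particular tool.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Baseline:
    """Summary of one metric as observed under normal conditions."""
    center: float  # mean of the samples
    spread: float  # population standard deviation of the samples

    def is_normal(self, value: float, tolerance: float = 3.0) -> bool:
        # A reading counts as "normal" when it lies within `tolerance`
        # standard deviations of the baseline mean (an assumed rule of thumb).
        return abs(value - self.center) <= tolerance * self.spread


def build_baseline(samples: list[float]) -> Baseline:
    if len(samples) < 2:
        raise ValueError("need at least two samples to describe normal behavior")
    return Baseline(
        center=statistics.mean(samples),
        spread=statistics.pstdev(samples),
    )


# Hypothetical checkout latencies (ms) gathered during business as usual.
baseline = build_baseline([120.0, 118.0, 125.0, 122.0, 115.0])
within_range = baseline.is_normal(124.0)   # close to the usual values
out_of_range = baseline.is_normal(400.0)   # far outside the usual values
```

In practice a team would probably track several such indicators, such as error rate, throughput, and latency, and agree on the tolerances with the service owners before any experiment runs.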

Get buy-in from the business

Believe me, you do not want the first conversation about chaos engineering to happen after you have demonstrated to a co-worker just how unreliable one of their services is, or worse still, after an outage you just created. Gaining trust and taking people with you on this journey are critical to success.

Plan carefully

Chaos engineering is not about randomly breaking things. Indeed, if you are confident that something will fail, then the answer is to fix it. Instead, you should put forward a hypothesis about what will happen when a specific component fails, then design an experiment to test it. The only time anything at all should happen randomly is if you plan for it to take place randomly.

Contain the chaos

You are in control. You can limit the “blast radius” by planning carefully, and if necessary, reducing the scope of the experiment. These simulations are primarily a learning opportunity.
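The sketch below shows one way a planned, contained experiment might be structured: a bounded and reproducible choice of targets stands in for the “blast radius”, a steady-state check stands in for the hypothesis, and every injected failure is rolled back whether or not the hypothesis holds. The function names, the host list, and the callback design are assumptions for illustration. Real fault injection would go through whatever tooling your platform provides.

```python
import random
from typing import Callable


def pick_targets(hosts: list[str], fraction: float, seed: int) -> list[str]:
    """Choose a bounded, reproducible subset of hosts: the blast radius."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, int(len(hosts) * fraction))
    # A seeded generator makes the "random" choice a planned one.
    return random.Random(seed).sample(hosts, count)


def run_experiment(
    targets: list[str],
    inject: Callable[[str], None],
    rollback: Callable[[str], None],
    steady_state_ok: Callable[[], bool],
) -> bool:
    """Return True if the hypothesis (steady state holds) survived."""
    injected: list[str] = []
    try:
        for host in targets:
            inject(host)
            injected.append(host)
            if not steady_state_ok():
                return False  # abort early: the hypothesis was disproved
        return True
    finally:
        # Always undo the injected failures, in reverse order.
        for host in reversed(injected):
            rollback(host)
```

Shrinking `fraction`, or passing a stricter `steady_state_ok` check, is the code-level equivalent of reducing the scope of the experiment.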

Learn and adapt

Take your findings and continue to experiment! Scale the hypothesis, introduce more than one failure, take account of side effects such as latency, and carefully monitor not only how these variables affect adjacent services but also how their effects are felt further downstream.
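Latency is often easier to experiment with than outright failure. As a minimal sketch, assuming the dependency you want to slow down is called through a Python function, a wrapper like the one below delays a chosen share of calls. The name `with_latency`, its parameters, and the `lookup_price` function are hypothetical and exist only for this example.

```python
import random
import time
from functools import wraps


def with_latency(delay_s: float, probability: float,
                 rng: random.Random, sleep=time.sleep):
    """Wrap a function so that a share of its calls is delayed by `delay_s` seconds."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Delay this call with the configured probability, then proceed.
            if rng.random() < probability:
                sleep(delay_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical downstream call: roughly 10% of lookups gain 200 ms of delay.
@with_latency(delay_s=0.2, probability=0.1, rng=random.Random(42))
def lookup_price(item_id: int) -> float:
    return item_id * 1.5
```

Passing the random generator and the sleep function in as arguments keeps the experiment repeatable and lets the wrapper be exercised in tests without actually waiting.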

The bottom line?

Chaos engineering teaches enterprises valuable lessons about their organization. It has the power to bring people together in a positive way to build confidence, improve experiences, and ultimately drive innovation. Conversations about chaos engineering usually center on how the discipline bolsters resilience, reduces risk, and otherwise affects the business. But the ultimate beneficiaries are customers, who get an improved user experience. Chaos engineering doesn’t just teach companies how to avoid technical glitches; it informs how they operate, communicate, and compete.
