Chaos Engineering: Navigating Turbulence in Production
This article is the second in a three-part blog series on chaos engineering. You can read the first part here.
It is not hard to understand why digital engineers get a little bit nervous when we start to talk about chaos engineering. ‘Shift Left’ has taught engineering teams the value of testing and fixing bugs as early as possible in the digital lifecycle, especially if they have ever spent significant time unraveling problems that were not discovered early.
So, when we discover faults earlier on, that must mean better quality software, and fewer late nights for hard-working development teams fixing issues, right? If only it worked that way.
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” (Leslie Lamport)
With the rise of more complex software, IoT, cloud, distributed systems, and microservices, a new approach to quality and resilience is required to account for the many permutations and interdependencies between all the constituent parts. This is where chaos engineering comes in. Traditional software testing verifies the code is doing what we want it to (and continues to be an essential part of digital engineering). Chaos engineering, meanwhile, is a way of testing that the entire system is doing what we want it to, and code is just one part of the mix. To do this effectively, we must test the system in production. This is because many other factors, like state, inputs, and how external systems behave, all play a part in the way a system runs.
This complexity has given rise to the idea of “dark debt,” referring to the unforeseen anomalies that happen in complex systems when different parts of the software and hardware interact with one another in ways we can’t predict. The term borrows from the concepts behind “technical debt” (IT) and “dark matter” (space) to suggest the inevitable, unseen complications that arise in complex systems. This is exactly what chaos engineering seeks to identify. How that turbulence in production is managed is a critical part of the planning that needs to go into every experiment. Navigating safely through these stormy waters will ensure greater confidence in and resilience of the whole system.
No surprises
The approach Apexon advocates is one we outlined in the first of our chaos engineering blogs. Talk to co-workers, explain your plans, and don’t run an experiment if you already suspect the system will fail it. (In that case, fix the weakness first.) Chaos engineering is no substitute for resiliency planning and patterns. Instead, organizations embarking on chaos engineering should carefully create hypotheses they wish to prove, considering how to limit their blast radius. The meticulously planned reality of chaos engineering is a far cry from how it was once described by Amazon’s Werner Vogels: “Break everything to see how your systems respond.”
Small is beautiful
Start small and limit the blast radius of your experiments. That includes taking into consideration when the experiment runs, and which departments and resources are available after the experiment runs. By now, I hope it is clear that when we talk about chaos engineering, it’s never about cutting a cable or unplugging a machine randomly to see what happens. The goal is to prove a hypothesis. Even when fault tolerance is within acceptable margins, there are always insights to be gained from examining how the system responded.
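To make the idea of a hypothesis with a limited blast radius concrete, here is a minimal Python sketch. It is a simulation only, not a real chaos tool: the names Experiment, run_experiment, resilient_service, and fragile_service are hypothetical, and the "fault" is just a flag passed to a stand-in request handler.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical experiment definition; all fields are illustrative."""
    hypothesis: str
    blast_radius: float    # fraction of requests exposed to the fault (0.0 to 1.0)
    max_error_rate: float  # error rate above which the hypothesis is rejected


def resilient_service(fault_injected: bool) -> bool:
    """Stand-in for a service whose fallback absorbs the fault (assumption)."""
    return True


def fragile_service(fault_injected: bool) -> bool:
    """Stand-in for a service that fails whenever the fault is injected."""
    return not fault_injected


def run_experiment(exp, handle_request, total_requests=1000, seed=0):
    """Inject the fault into a limited share of requests and test the hypothesis."""
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    injected = 0
    failures = 0
    for _ in range(total_requests):
        inject = rng.random() < exp.blast_radius
        if inject:
            injected += 1
        if not handle_request(inject):
            failures += 1
    error_rate = failures / total_requests
    return {
        "injected": injected,
        "error_rate": error_rate,
        "hypothesis_holds": error_rate <= exp.max_error_rate,
    }


if __name__ == "__main__":
    exp = Experiment(
        hypothesis="Checkout stays within its error budget if the cache is down",
        blast_radius=0.05,
        max_error_rate=0.01,
    )
    print(run_experiment(exp, resilient_service))
    print(run_experiment(exp, fragile_service))
```

Even when the hypothesis holds, the returned numbers are worth a look: they show how many requests were actually exposed and how the system behaved under the fault.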
The environment matters
If running experiments in a full production environment feels like a step too far into the abyss, that’s OK. For an organization taking its first steps in chaos engineering, production may be too risky. In that case, start in a different environment, but one that is as close to production as possible. Quite simply, unless the environment is very similar, the findings will not be relevant enough to shed light on potential failures of the real system.
Keep going
Software and systems are continuously being tweaked, so chaos engineering experiments should mirror this. It is not safe to assume that if a system responded to a fault injection test (FIT) in a particular way a month ago, the same holds true today. Many of these experiments can be automated, which enables engineers to focus on increasing the scope, intensity, and variety of tests.
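One way to picture this kind of automation is a small routine that re-runs every experiment and flags any hypothesis that held last time but no longer does. The Python sketch below is an assumption-laden illustration: find_regressions and the experiment names are hypothetical, and each check is a plain function standing in for a real fault injection run.

```python
from typing import Callable, Dict, List


def find_regressions(
    previous: Dict[str, bool],
    checks: Dict[str, Callable[[], bool]],
) -> List[str]:
    """Re-run each experiment and report hypotheses that no longer hold.

    `previous` maps experiment name to whether its hypothesis held on the
    last run, and is updated in place. An experiment with no history is
    treated as having held, so a failure on its first run is also reported.
    """
    regressions = []
    for name, check in checks.items():
        holds_now = check()
        if previous.get(name, True) and not holds_now:
            regressions.append(name)
        previous[name] = holds_now
    return regressions


if __name__ == "__main__":
    # Hypothetical results from last month's run.
    history = {"cache-outage": True, "slow-database": True}
    # Stand-ins for real experiments; each returns True if the hypothesis held.
    suite = {
        "cache-outage": lambda: True,
        "slow-database": lambda: False,  # the system has changed since last run
    }
    print(find_regressions(history, suite))
```

Run on a schedule or after each release, a check like this turns last month's result into a baseline rather than an assumption.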
Expanding efforts
Once you’ve tested the system for one type of fault, it’s time to adapt the hypothesis. It may also be time to try other hypotheses. Organizations that embark on chaos engineering sometimes get “stage fright” after the initial few tests, especially if these have been fairly minor. The thinking goes a little like this: “I don’t think there’s a problem in service X, but it’s too big a deal to risk.” Wrong! Remember dark debt and the unforeseen anomalies inherent in complex systems? As Nora Jones from the original Netflix chaos engineering team has said, “Chaos engineering doesn’t cause problems. It reveals them.” Instead of getting cold feet when it matters most, organizations should absolutely tackle the big, important services, but do so in a careful, cautious way. When it comes to improving resiliency and confidence in systems, knowledge is power.