Approaches, best practices & case studies
Eight Fallacies of Modern-Day Distributed Computing
When IT infrastructure, networks, or applications unexpectedly fail or crash, the impact on the business can be significant.
The actual cost varies greatly by business or organization, but just a few years ago Gartner estimated the damage at anywhere from $140,000 to $540,000 per hour. The impact shows up in revenue loss and operational costs as well as customer dissatisfaction, lost productivity, poor brand image, and even derailed IT careers.
No matter how you measure it, IT downtime is costly. It is also largely unavoidable due to the increasing complexity and interdependence of today’s distributed IT systems. The combination of cloud computing, microservices architectures, and bare-metal infrastructure creates a lot of moving parts and potential points of failure, making those systems anything but predictable.
Environmental behavior is beyond your control: the moment you launch a new software service, you are at the mercy of the environment it runs in, which is full of unknowns. Unpredictable events are bound to happen, and cascading failures often lie dormant for a long time, waiting for a trigger.
Chaos Engineering is a new approach to software development and testing designed to eliminate some of that unpredictability by putting that complexity and interdependence to the test.
The idea is to perform controlled experiments in a distributed environment that help you build confidence in the system’s ability to tolerate the inevitable failures. In other words, break your system on purpose to find out where the weaknesses are. That way, you can fix them before they break unexpectedly and hurt the business and your users.
As a result, you will better understand how your IT systems really behave when they fail. You can exercise contingency plans at scale to ensure those plans work as designed. Chaos Engineering services also provide the ability to revert systems to their original states without impacting users, saving much of the time and money that would otherwise be spent responding to system outages.
Testing:
A specific approach to testing known conditions.
Assertion: given specific conditions, a system will emit a specific output.
Tests are typically binary; they determine whether a property is true or false.
Chaos experimentation:
A practice for generating new information.
More exploratory in nature, with unknown outcomes.
Tests the effects of various conditions and generates more subjective information (a small contrast in code follows below).
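To make the contrast concrete, here is a minimal, purely illustrative sketch. The checkout function and its numbers are hypothetical stand-ins for a real system: the first check is a binary test assertion, while the second is an experiment that injects a variable (latency) and reports an observed error rate rather than a pass/fail answer.

```python
import random

# Hypothetical system under test: names and behavior are illustrative only.
def checkout(order_total, latency_ms=0):
    """Pretend checkout call; fails if a simulated dependency is too slow."""
    if latency_ms > 800:                     # simulated downstream timeout budget
        raise TimeoutError("payment service timed out")
    return round(order_total * 1.08, 2)      # total with tax

# 1. Traditional test: known conditions, binary pass/fail assertion.
def test_checkout_adds_tax():
    assert checkout(100.00) == 108.00

# 2. Chaos-style experiment: inject latency, observe how often the steady state
#    (successful checkouts) still holds, and record the result.
def latency_experiment(runs=1000):
    failures = 0
    for _ in range(runs):
        injected = max(0, random.gauss(mu=400, sigma=300))   # simulated latency, ms
        try:
            checkout(100.00, latency_ms=injected)
        except TimeoutError:
            failures += 1
    return failures / runs    # new information: an error rate, not true/false

if __name__ == "__main__":
    test_checkout_adds_tax()
    print(f"checkout error rate under injected latency: {latency_experiment():.1%}")
```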
Any system is only as strong as its weakest point. Chaos Engineering practices help identify the weak points of a complex system proactively.
The purpose is not to cause problems or chaos. It is to reveal them before they cause disruption so you can ensure higher availability.
The more chaos experiments (tests) you do, the more knowledge you generate about system resilience. This helps minimize downtime, thereby reducing SLA breaches and improving revenue outcomes.
At Apexon, we believe that a key element of Continuous Testing is monitoring and testing throughout the development, deployment, and release cycles. Chaos Engineering integrated into DevOps value chains plays a vital role in achieving this.
There are a number of different tools available to support your Chaos Engineering efforts.
Which ones you use depends on the size of your environment and how automated you want the process to be. Below are just a few to be aware of.
Tests IT infrastructure resilience.
Provides tools to orchestrate chaos on Kubernetes to help SREs find bugs and vulnerabilities in both staging and production.
Enables experimentation at different levels: infrastructure, platform and application.
Is a “failure-as-a-service” platform built to make the Internet more reliable. It turns failure into resilience by offering engineers a fully hosted solution to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.
Simulates network conditions to support deterministic tampering with connections, with support for randomized chaos and customization. It can help determine whether an application has a single point of failure.
Anticipates production failures and mitigates them by simulating the failure of virtual instances, availability zones, regions, etc.
Primarily done in production or production-like environments.
Tools:
Chaos Monkey
Litmus
ToxiProxy
Swabbie (formerly Janitor Monkey)
Conformity Monkey (now part of Spinnaker)
Chaos Lambda (lower scale)
Limitations:
Simian Army is deprecated, and its tools are being made part of Spinnaker
Chaos Monkey does not support deployments that are managed by anything other than Spinnaker
No abort or rollback function available
Limited support/coordination
No UI
Ensure the application doesn’t have single points of failure; simulate network and system conditions with deterministic tampering of connections, plus support for randomized chaos and customization.
Simulate network degradation and intermittent connectivity to see how applications behave under these conditions early in development (a minimal driver sketch follows this list).
Mobile apps with offline functionality
SPA web apps that work without network connectivity
Latency (with optional jitter)
Complete service unavailability
Reduced bandwidth
Timeouts
Slow-to-close connections
Data delivered piecemeal, with optional delays
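As a minimal sketch of how these conditions can be driven from Python, the snippet below talks to a Toxiproxy server over its HTTP API. It assumes Toxiproxy is running on its default port (8474) and that a Redis dependency listens on 6379; the proxy and toxic names are made up for illustration.

```python
import requests

TOXIPROXY = "http://localhost:8474"

# 1. Create a proxy that sits between the application and its dependency.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "redis_chaos",
    "listen": "127.0.0.1:26379",      # point the app at this address
    "upstream": "127.0.0.1:6379",     # the real dependency
}).raise_for_status()

# 2. Inject latency with jitter on responses flowing back to the app.
requests.post(f"{TOXIPROXY}/proxies/redis_chaos/toxics", json={
    "name": "slow_redis",
    "type": "latency",
    "stream": "downstream",
    "toxicity": 1.0,                                   # apply to all connections
    "attributes": {"latency": 1000, "jitter": 500},    # milliseconds
}).raise_for_status()

# ... run the application or test suite against 127.0.0.1:26379 here ...

# 3. Clean up: remove the toxic and the proxy to restore normal conditions.
requests.delete(f"{TOXIPROXY}/proxies/redis_chaos/toxics/slow_redis").raise_for_status()
requests.delete(f"{TOXIPROXY}/proxies/redis_chaos").raise_for_status()
```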
Instill Chaos Engineering principles early in the development stage; build for resilience and stability
Developers & SDETs primarily lead this activity in this stage but consult and involve business/product owners for expected results. Ops can also be consulted or informed
Dev Environment/Local Machine
Observe how the component/service under test behaves in the absence of a dependent service running in another Docker container (see the sketch below).
Tools: Docker, KubeMonkey
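The sketch below shows one way to run this kind of local experiment with the Docker SDK for Python. It assumes the Docker daemon is local and that a dependency container exists; the container name ("orders-db") and the 30-second observation window are hypothetical.

```python
import time
import docker

client = docker.from_env()
dependency = client.containers.get("orders-db")   # the dependent service's container

dependency.stop()                                  # take the dependency away
try:
    # Exercise the component under test while its dependency is down, e.g. hit a
    # health endpoint or run a local test suite, and record what you observe:
    # graceful degradation, sensible errors, retries — or a crash.
    time.sleep(30)                                 # placeholder for the observation window
finally:
    dependency.start()                             # always restore steady state
```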
Lower-Level Environments
Introduce chaos at the container level: killing, stopping, and removing running containers.
Tools: Pumba (similar to Chaos Monkey, but works at the container level)
Mimic service failures and latency between service calls (an illustrative example follows below).
Tools: service meshes like Istio, and Chaos Monkey for Spring Boot
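As an illustration of the service-mesh approach, the sketch below builds an Istio VirtualService manifest with fault injection: it delays half of the calls to a service and aborts a further slice with HTTP 503. The service name ("orders") and the percentages are hypothetical; in practice this manifest is usually written as YAML and applied with kubectl.

```python
import json

# Istio VirtualService with fault injection, expressed as a Python dict.
virtual_service = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "orders"},
    "spec": {
        "hosts": ["orders"],
        "http": [{
            "fault": {
                "delay": {                          # inject latency into service calls
                    "percentage": {"value": 50.0},
                    "fixedDelay": "5s",
                },
                "abort": {                          # fail a slice of requests outright
                    "percentage": {"value": 10.0},
                    "httpStatus": 503,
                },
            },
            "route": [{"destination": {"host": "orders"}}],
        }],
    },
}

if __name__ == "__main__":
    # Kubernetes also accepts JSON manifests, so this output can be applied directly.
    print(json.dumps(virtual_service, indent=2))
```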
The Maturity Model below provides a map for software delivery teams getting started with Chaos Engineering and evolving their use of it over time. It’s a useful way to track your progress and compare yourself to other organizational adopters.
Apexon uses this model when we work with clients to lay out the most effective approach that will deliver the most productive results.
Apexon follows a disciplined process with several key steps that dictate how we design Chaos experiments. The degree to which we can adhere to these steps correlates directly with the confidence we can have in a distributed system at scale.
Identify metrics and values that define the steady state of the system
Hypothesize that the steady state will hold for both the control group and the experimental group
Introduce variables that reflect real-world events, such as servers that crash, dependencies that fail, etc.
Perturb the environment using the introduced variables and try to disprove the hypothesis
Manage the blast radius by ensuring that the fallout from experiments is minimized and contained (a minimal sketch of these steps follows this list)
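The sketch below is a purely illustrative rendering of those steps in code. Every function it relies on (get_error_rate, kill_random_instance, restore_instance) is hypothetical; in practice these would call your monitoring stack and a fault-injection tool.

```python
STEADY_STATE_MAX_ERROR_RATE = 0.01     # 1. metric + value that defines steady state

def steady_state_ok(get_error_rate):
    return get_error_rate() <= STEADY_STATE_MAX_ERROR_RATE

def run_experiment(get_error_rate, kill_random_instance, restore_instance):
    # 2. Hypothesis: the steady state holds before and after the injected failure.
    assert steady_state_ok(get_error_rate), "system not in steady state; abort experiment"

    # 3 + 5. Introduce a real-world event with a deliberately small blast radius.
    instance = kill_random_instance(scope="one-instance")
    try:
        # 4. Try to disprove the hypothesis: did the steady state survive?
        if steady_state_ok(get_error_rate):
            return "hypothesis held: system tolerated the failure"
        return "hypothesis disproved: weakness found, record and fix it"
    finally:
        restore_instance(instance)      # contain the fallout and restore steady state
```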
With those steps as our roadmap, the workflow outlined below ensures that critical Chaos experiment information is passed along at each stage, informing the next.
Experimental Lifecycle & Best Practices
Apexon helped the customer design a microservices platform based on a popular container orchestration engine.
As a telecommunications provider, the most critical aspect of their service is the SLA.
Platform components recovered within 2-4 minutes. Because they were stateful components, the failure count was not stretched beyond the quorum.
The AWS autoscaling group replaced the killed instance with a new instance within 2-5 minutes, and the container orchestration platform started scheduling containers to this new instance.
We experimented with CPU and memory resource exhaustion and did notice performance degradation on those VMs. We also found that the container orchestration platform stopped scheduling containers to those instances due to resource saturation.
Apexon helped this customer design and develop a Python SDK.
The SDK code was responsible for downloading or uploading terabytes of data, so it was critical for the SDK to work even under inconsistent network conditions. The team proactively developed and verified chaos experiments for these conditions (an illustrative sketch of the kind of resilience logic involved follows).
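The snippet below is a hypothetical sketch, not the customer's SDK, of the kind of network-resilience logic such experiments exercise: a chunked download with retries, exponential backoff, and resume via HTTP Range requests, so a flaky connection degrades throughput instead of failing the transfer. The URL handling, chunk size, and retry limits are illustrative, and it assumes the server honors Range headers.

```python
import time
import requests

def resilient_download(url, dest, chunk_size=8 * 1024 * 1024, max_retries=5):
    downloaded = 0
    for attempt in range(max_retries):
        try:
            headers = {"Range": f"bytes={downloaded}-"}        # resume where we left off
            with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
                resp.raise_for_status()
                with open(dest, "ab") as f:                    # append to keep earlier chunks
                    for chunk in resp.iter_content(chunk_size=chunk_size):
                        f.write(chunk)
                        downloaded += len(chunk)
            return downloaded                                  # success
        except (requests.ConnectionError, requests.Timeout):
            time.sleep(2 ** attempt)                           # exponential backoff, then retry
    raise RuntimeError(f"download failed after {max_retries} attempts")
```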
Chaos testing is a practice in software development and system administration where deliberate and controlled disruptions are introduced into a system to observe how it behaves under stressful conditions. The purpose of chaos testing is to proactively identify weaknesses or vulnerabilities in a system’s design or architecture before they manifest in real-world scenarios, ultimately leading to more resilient and reliable systems.
There are several Chaos testing tools available in the market catering to different needs and preferences of developers and system administrators. Some popular Chaos testing tools include Chaos Monkey, Gremlin, Pumba, Chaos Toolkit, Litmus, and ChaosBlade, among others. These tools offer various features such as fault injection, latency injection, network partitioning, and chaos orchestration to simulate different types of failures and disruptions in a system.
While both Chaos testing and Chaos engineering aim to improve system resilience and reliability, they differ in their scope and approach. Chaos testing is a specific practice within Chaos engineering, focusing on intentionally injecting failures and disruptions into a system to observe its behavior under stress. On the other hand, Chaos engineering is a broader discipline that encompasses not only chaos testing but also the principles, practices, and culture around building and operating resilient systems. Chaos engineering emphasizes creating a culture of experimentation, automation, and learning from failures to build more robust systems over time.
While Chaos testing and performance testing share some similarities, they serve different purposes and focus on different aspects of system behavior. Performance testing primarily evaluates how a system performs under normal conditions, focusing on metrics such as response time, throughput, and resource utilization. On the other hand, Chaos testing specifically aims to assess how a system behaves under abnormal or stressful conditions by intentionally introducing failures, latency, or other disruptions. While Chaos testing can uncover performance-related issues, its primary goal is to identify weaknesses in system resilience rather than performance optimization.
Examples of Chaos testing scenarios include introducing network latency or packet loss, simulating hardware failures such as disk or CPU failures, inducing service unavailability or timeouts, triggering resource exhaustion such as running out of memory or disk space, and simulating unexpected spikes in traffic or load (one such scenario is sketched below). By subjecting a system to these controlled disruptions, organizations can gain insight into how their systems behave under adverse conditions and identify potential weaknesses or vulnerabilities that need to be addressed to improve overall resilience.
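As a minimal sketch of one such scenario, the snippet below saturates every CPU core of a test machine for a fixed window while you watch how the system, its autoscaling, and its alerts respond. The 60-second duration is an arbitrary illustration, and this should be run against a test environment, never blindly in production.

```python
import multiprocessing
import time

def burn_cpu(stop_at):
    while time.time() < stop_at:     # busy loop to saturate one core
        pass

if __name__ == "__main__":
    duration = 60                    # observation window in seconds (illustrative)
    stop_at = time.time() + duration
    workers = [multiprocessing.Process(target=burn_cpu, args=(stop_at,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    print(f"saturating {len(workers)} cores for {duration}s; watch dashboards and alerts now")
    for w in workers:
        w.join()
```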