Approaches, best practices & case studies
Eight Fallacies of Modern-Day Distributed Computing
When IT infrastructure, networks, or applications unexpectedly fail or crash, the impact on the business can be significant.
The actual cost varies greatly by business or organization, but just a few years ago Gartner estimated the damage at anywhere from $140,000 to $540,000 per hour. The impact shows up in revenue loss and operational costs as well as customer dissatisfaction, lost productivity, poor brand image, and even derailed IT careers.
No matter how you measure it, IT downtime is costly. It is also largely unavoidable due to the increasing complexity and interdependence of today’s distributed IT systems. The combination of cloud computing, microservices architectures, and bare-metal infrastructure creates a lot of moving parts and potential points of failure, making those systems anything but predictable.
Environmental behavior is beyond your control: the moment you launch a new software service, you are at the mercy of the environment it runs in, which is full of unknowns. Unpredictable events are bound to happen, and cascading failures often lie dormant for a long time, waiting for a trigger.
Chaos Engineering is a new approach to software development and testing designed to eliminate some of that unpredictability by putting that complexity and interdependence to the test.
The idea is to perform controlled experiments in a distributed environment that help you build confidence in the system’s ability to tolerate the inevitable failures. In other words, break your system on purpose to find out where the weaknesses are. That way, you can fix them before they break unexpectedly and hurt the business and your users.
As a result, you will better understand how your IT systems really behave when they fail. You can exercise contingency plans at scale to ensure those plans work as designed. Chaos Engineering services also provide the ability to revert systems to their original states without impacting users, saving much of the time and money that would otherwise be spent responding to system outages.
Testing:
A specific approach to testing known conditions.
Assertion: given specific conditions, a system will emit a specific output.
Tests are typically binary; they determine whether a property is true or false.
Chaos experimentation:
A practice for generating new information.
More exploratory in nature, with unknown outcomes.
Tests the effects of various conditions and generates more subjective information (a small contrast in code follows below).
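To make the contrast concrete, here is a minimal, purely illustrative sketch. The checkout function and its numbers are hypothetical stand-ins for a real system: the first check is a binary test assertion, while the second is an experiment that injects a variable (latency) and reports an observed error rate rather than a pass/fail answer.

```python
import random

# Hypothetical system under test: names and behavior are illustrative only.
def checkout(order_total, latency_ms=0):
    """Pretend checkout call; fails if a simulated dependency is too slow."""
    if latency_ms > 800:                     # simulated downstream timeout budget
        raise TimeoutError("payment service timed out")
    return round(order_total * 1.08, 2)      # total with tax

# 1. Traditional test: known conditions, binary pass/fail assertion.
def test_checkout_adds_tax():
    assert checkout(100.00) == 108.00

# 2. Chaos-style experiment: inject latency, observe how often the steady state
#    (successful checkouts) still holds, and record the result.
def latency_experiment(runs=1000):
    failures = 0
    for _ in range(runs):
        injected = max(0, random.gauss(mu=400, sigma=300))   # simulated latency, ms
        try:
            checkout(100.00, latency_ms=injected)
        except TimeoutError:
            failures += 1
    return failures / runs    # new information: an error rate, not true/false

if __name__ == "__main__":
    test_checkout_adds_tax()
    print(f"checkout error rate under injected latency: {latency_experiment():.1%}")
```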
Any system is only as strong as its weakest point. Chaos Engineering practices help identify the weak points of a complex system proactively.
The purpose is not to cause problems or chaos. It is to reveal them before they cause disruption so you can ensure higher availability.
The more chaos experiments (tests) you do, the more knowledge you generate about system resilience. This helps minimize downtime, thereby reducing SLA breaches and improving revenue outcomes.
At Apexon, we believe that a key element of Continuous Testing is monitoring and testing throughout the development, deployment, and release cycles. Chaos Engineering integrated into DevOps value chains plays a vital role in achieving this.
There are a number of different tools available to support your Chaos Engineering efforts.
Which ones you use depends on the size of your environment and how automated you want the process to be. Below are just a few to be aware of.
Tests IT infrastructure resilience.
Provides tools to orchestrate chaos on Kubernetes to help SREs find bugs and vulnerabilities in both staging and production.
Enables experimentation at different levels: infrastructure, platform and application.
Is a “failure-as-a-service” platform built to make the Internet more reliable. It turns failure into resilience by offering engineers a fully hosted solution to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.
Simulates network conditions to support deterministic tampering with connections, with support for randomized chaos and customization. It can help determine whether an application has a single point of failure.
Anticipates production failures and mitigates them by simulating the failure of virtual instances, availability zones, regions, etc.
Primarily done in production or production-like environments.
Tools:
Chaos Monkey
Litmus
ToxiProxy
Swabbie (formerly Janitor Monkey)
Conformity Monkey (now part of Spinnaker)
Chaos Lambda (lower scale)
Limitations:
Simian Army is deprecated, and its tools are being made part of Spinnaker
Chaos Monkey does not support deployments that are managed by anything other than Spinnaker
No abort or rollback function available
Limited support/coordination
No UI
Ensure the application doesn’t have single points of failure; simulate network and system conditions with deterministic tampering of connections, plus support for randomized chaos and customization.
Simulate network degradation and intermittent connectivity to see how applications behave under these conditions early in development (a minimal driver sketch follows this list).
Mobile apps with offline functionality
SPA web apps that work without network connectivity
Latency (with optional jitter)
Complete service unavailability
Reduced bandwidth
Timeouts
Slow-to-close connections
Data delivered piecemeal, with optional delays
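As a minimal sketch of how these conditions can be driven from Python, the snippet below talks to a Toxiproxy server over its HTTP API. It assumes Toxiproxy is running on its default port (8474) and that a Redis dependency listens on 6379; the proxy and toxic names are made up for illustration.

```python
import requests

TOXIPROXY = "http://localhost:8474"

# 1. Create a proxy that sits between the application and its dependency.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "redis_chaos",
    "listen": "127.0.0.1:26379",      # point the app at this address
    "upstream": "127.0.0.1:6379",     # the real dependency
}).raise_for_status()

# 2. Inject latency with jitter on responses flowing back to the app.
requests.post(f"{TOXIPROXY}/proxies/redis_chaos/toxics", json={
    "name": "slow_redis",
    "type": "latency",
    "stream": "downstream",
    "toxicity": 1.0,                                   # apply to all connections
    "attributes": {"latency": 1000, "jitter": 500},    # milliseconds
}).raise_for_status()

# ... run the application or test suite against 127.0.0.1:26379 here ...

# 3. Clean up: remove the toxic and the proxy to restore normal conditions.
requests.delete(f"{TOXIPROXY}/proxies/redis_chaos/toxics/slow_redis").raise_for_status()
requests.delete(f"{TOXIPROXY}/proxies/redis_chaos").raise_for_status()
```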
Instill Chaos Engineering principles early in the development stage; build for resilience and stability
Developers & SDETs primarily lead this activity in this stage but consult and involve business/product owners for expected results. Ops can also be consulted or informed
Dev Environment/Local Machine
Observe how the component/service under test behaves in the absence of a dependent service running in another Docker container (see the sketch below).
Tools: Docker, KubeMonkey
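The sketch below shows one way to run this kind of local experiment with the Docker SDK for Python. It assumes the Docker daemon is local and that a dependency container exists; the container name ("orders-db") and the 30-second observation window are hypothetical.

```python
import time
import docker

client = docker.from_env()
dependency = client.containers.get("orders-db")   # the dependent service's container

dependency.stop()                                  # take the dependency away
try:
    # Exercise the component under test while its dependency is down, e.g. hit a
    # health endpoint or run a local test suite, and record what you observe:
    # graceful degradation, sensible errors, retries — or a crash.
    time.sleep(30)                                 # placeholder for the observation window
finally:
    dependency.start()                             # always restore steady state
```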
Lower-Level Environments
Introduce chaos at the container level: killing, stopping, and removing running containers.
Tools: Pumba (similar to Chaos Monkey, but works at the container level)
Mimic service failures and latency between service calls (an illustrative example follows below).
Tools: service meshes like Istio, and Chaos Monkey for Spring Boot
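As an illustration of the service-mesh approach, the sketch below builds an Istio VirtualService manifest with fault injection: it delays half of the calls to a service and aborts a further slice with HTTP 503. The service name ("orders") and the percentages are hypothetical; in practice this manifest is usually written as YAML and applied with kubectl.

```python
import json

# Istio VirtualService with fault injection, expressed as a Python dict.
virtual_service = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "orders"},
    "spec": {
        "hosts": ["orders"],
        "http": [{
            "fault": {
                "delay": {                          # inject latency into service calls
                    "percentage": {"value": 50.0},
                    "fixedDelay": "5s",
                },
                "abort": {                          # fail a slice of requests outright
                    "percentage": {"value": 10.0},
                    "httpStatus": 503,
                },
            },
            "route": [{"destination": {"host": "orders"}}],
        }],
    },
}

if __name__ == "__main__":
    # Kubernetes also accepts JSON manifests, so this output can be applied directly.
    print(json.dumps(virtual_service, indent=2))
```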
The Maturity Model below provides a map for software delivery teams getting started with Chaos Engineering and evolving their use of it over time. It’s a useful way to track your progress and compare yourself to other organizational adopters.
Apexon uses this model when we work with clients to lay out the most effective approach that will deliver the most productive results.
Apexon follows a disciplined process with several key steps that dictate how we design Chaos experiments. The degree to which we can adhere to these steps correlates directly with the confidence we can have in a distributed system at scale.
Identify metrics and values that define the steady state of the system
Hypothesize that the steady state will hold for both the control group and the experimental group
Introduce variables that reflect real-world events, such as servers that crash, dependencies that fail, etc.
Perturb the environment using the introduced variables and try to disprove the hypothesis
Manage the blast radius by ensuring that the fallout from experiments is minimized and contained (a minimal sketch of these steps follows this list)
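The sketch below is a purely illustrative rendering of those steps in code. Every function it relies on (get_error_rate, kill_random_instance, restore_instance) is hypothetical; in practice these would call your monitoring stack and a fault-injection tool.

```python
STEADY_STATE_MAX_ERROR_RATE = 0.01     # 1. metric + value that defines steady state

def steady_state_ok(get_error_rate):
    return get_error_rate() <= STEADY_STATE_MAX_ERROR_RATE

def run_experiment(get_error_rate, kill_random_instance, restore_instance):
    # 2. Hypothesis: the steady state holds before and after the injected failure.
    assert steady_state_ok(get_error_rate), "system not in steady state; abort experiment"

    # 3 + 5. Introduce a real-world event with a deliberately small blast radius.
    instance = kill_random_instance(scope="one-instance")
    try:
        # 4. Try to disprove the hypothesis: did the steady state survive?
        if steady_state_ok(get_error_rate):
            return "hypothesis held: system tolerated the failure"
        return "hypothesis disproved: weakness found, record and fix it"
    finally:
        restore_instance(instance)      # contain the fallout and restore steady state
```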
With those steps as our roadmap, the workflow outlined below ensures that critical Chaos experiment information is passed along at each stage, informing the next.
Experimental Lifecycle & Best Practices
Apexon helped the customer design a microservices platform based on a popular container orchestration engine.
As a telecommunications provider, the most critical aspect of their service is the SLA.
Platform components recovered within 2-4 minutes. Because they were stateful components, the failure count was not stretched beyond the quorum.
The AWS autoscaling group replaced the killed instance with a new instance within 2-5 minutes, and the container orchestration platform started scheduling containers to this new instance.
We experimented with CPU and memory resource exhaustion and did notice performance degradation on those VMs. We also found that the container orchestration platform stopped scheduling containers to those instances due to resource saturation.
Apexon helped this customer design and develop a Python SDK.
The SDK code was responsible for downloading or uploading terabytes of data, so it was critical for the SDK to work even under inconsistent network conditions. The team proactively developed and verified chaos experiments for these conditions (an illustrative sketch of the kind of resilience logic involved follows).
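The snippet below is a hypothetical sketch, not the customer's SDK, of the kind of network-resilience logic such experiments exercise: a chunked download with retries, exponential backoff, and resume via HTTP Range requests, so a flaky connection degrades throughput instead of failing the transfer. The URL handling, chunk size, and retry limits are illustrative, and it assumes the server honors Range headers.

```python
import time
import requests

def resilient_download(url, dest, chunk_size=8 * 1024 * 1024, max_retries=5):
    downloaded = 0
    for attempt in range(max_retries):
        try:
            headers = {"Range": f"bytes={downloaded}-"}        # resume where we left off
            with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
                resp.raise_for_status()
                with open(dest, "ab") as f:                    # append to keep earlier chunks
                    for chunk in resp.iter_content(chunk_size=chunk_size):
                        f.write(chunk)
                        downloaded += len(chunk)
            return downloaded                                  # success
        except (requests.ConnectionError, requests.Timeout):
            time.sleep(2 ** attempt)                           # exponential backoff, then retry
    raise RuntimeError(f"download failed after {max_retries} attempts")
```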
Chaos testing is a practice in software development and system administration where deliberate and controlled disruptions are introduced into a system to observe how it behaves under stressful conditions. The purpose of chaos testing is to proactively identify weaknesses or vulnerabilities in a system’s design or architecture before they manifest in real-world scenarios, ultimately leading to more resilient and reliable systems.
There are several Chaos testing tools available in the market catering to different needs and preferences of developers and system administrators. Some popular Chaos testing tools include Chaos Monkey, Gremlin, Pumba, Chaos Toolkit, Litmus, and ChaosBlade, among others. These tools offer various features such as fault injection, latency injection, network partitioning, and chaos orchestration to simulate different types of failures and disruptions in a system.
While both Chaos testing and Chaos engineering aim to improve system resilience and reliability, they differ in their scope and approach. Chaos testing is a specific practice within Chaos engineering, focusing on intentionally injecting failures and disruptions into a system to observe its behavior under stress. On the other hand, Chaos engineering is a broader discipline that encompasses not only chaos testing but also the principles, practices, and culture around building and operating resilient systems. Chaos engineering emphasizes creating a culture of experimentation, automation, and learning from failures to build more robust systems over time.
While Chaos testing and performance testing share some similarities, they serve different purposes and focus on different aspects of system behavior. Performance testing primarily evaluates how a system performs under normal conditions, focusing on metrics such as response time, throughput, and resource utilization. On the other hand, Chaos testing specifically aims to assess how a system behaves under abnormal or stressful conditions by intentionally introducing failures, latency, or other disruptions. While Chaos testing can uncover performance-related issues, its primary goal is to identify weaknesses in system resilience rather than performance optimization.
Examples of Chaos testing scenarios include introducing network latency or packet loss, simulating hardware failures such as disk or CPU failures, inducing service unavailability or timeouts, triggering resource exhaustion such as running out of memory or disk space, and simulating unexpected spikes in traffic or load (one such scenario is sketched below). By subjecting a system to these controlled disruptions, organizations can gain insight into how their systems behave under adverse conditions and identify potential weaknesses or vulnerabilities that need to be addressed to improve overall resilience.
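As a minimal sketch of one such scenario, the snippet below saturates every CPU core of a test machine for a fixed window while you watch how the system, its autoscaling, and its alerts respond. The 60-second duration is an arbitrary illustration, and this should be run against a test environment, never blindly in production.

```python
import multiprocessing
import time

def burn_cpu(stop_at):
    while time.time() < stop_at:     # busy loop to saturate one core
        pass

if __name__ == "__main__":
    duration = 60                    # observation window in seconds (illustrative)
    stop_at = time.time() + duration
    workers = [multiprocessing.Process(target=burn_cpu, args=(stop_at,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    print(f"saturating {len(workers)} cores for {duration}s; watch dashboards and alerts now")
    for w in workers:
        w.join()
```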