We help businesses break things purposefully to discover and fix unknowns
Distributed systems contain a lot of moving parts. Environmental behavior is beyond your control. The moment you launch a new software service, you are at the mercy of the environment it runs in, which is full of unknowns.
Unpredictable events are bound to happen. Cascading failures often lie dormant for a long time, waiting for a trigger. The impact can be seen in revenue loss and operational costs, as well as customer dissatisfaction, lost productivity, poor brand image, and even derailed IT careers.
No matter how you measure it, IT downtime is costly. It’s also largely unavoidable due to the increasing complexity and interdependence of today’s distributed IT systems. The combination of cloud computing, microservices architectures, and bare-metal infrastructure creates many moving parts and potential points of failure, making those systems anything but predictable.
Apexon is a pure-play digital engineering services company. For over 18 years, we’ve focused on helping companies accelerate the pace and success of their digital efforts from concept to market. We apply our digital-native perspective to help businesses rethink work processes, modernize customer engagement, and leverage digital technology to create a competitive advantage.
Apexon offers Chaos Engineering solutions to help organizations build fault-tolerant and robust cloud-native and multi-cloud applications to accelerate digital transformation. These services are an integral part of progressive delivery, based on the principle of experimenting with new functionality on distributed systems in order to test true performance. Our approach is built on the principles of continuous testing and software test automation and is segmented by the infrastructure, network, and application layers.
Apexon Chaos Engineering services make complex digital systems fault-tolerant and robust, delivering important business advantages including:
Weaknesses addressed proactively
Accelerated innovation
Business continuity
Operational efficiency
Reduced SLA breaches
Improved revenue outcomes
Apexon follows a disciplined process with several key steps that dictate how we design Chaos Engineering experiments. The degree to which we adhere to these steps correlates directly with the confidence we can have in a distributed system at scale. A minimal sketch of such an experiment follows the steps below.
Identify metrics and values that define the steady state of the system
Hypothesize that this steady state will hold for both the control group and the experimental group
Introduce variables that reflect real-world events, such as servers that crash, dependencies that fail, and so on
Stress the environment using the introduced variables and attempt to disprove the hypothesis
Manage the blast radius by ensuring that the fallout from experiments is minimized and contained
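To make these steps concrete, here is a minimal, illustrative sketch of the experiment loop in Python. It is not Apexon’s tooling: the health-check URL, the error-rate threshold, and the `inject_fault` helper are hypothetical placeholders you would replace with your own service and fault-injection mechanism.

```python
# Minimal chaos-experiment skeleton (illustrative only).
# Assumptions: STEADY_STATE_URL, MAX_ERROR_RATE, and inject_fault() are
# hypothetical placeholders, not part of any specific product.
import time
import requests

STEADY_STATE_URL = "https://example.internal/healthz"  # hypothetical endpoint
MAX_ERROR_RATE = 0.01                                   # steady-state definition
PROBES = 50

def measure_error_rate() -> float:
    """Probe the service and return the fraction of failed requests."""
    failures = 0
    for _ in range(PROBES):
        try:
            if requests.get(STEADY_STATE_URL, timeout=2).status_code >= 500:
                failures += 1
        except requests.RequestException:
            failures += 1
        time.sleep(0.1)
    return failures / PROBES

def inject_fault():
    """Placeholder for a real-world variable: kill a host, add latency, etc."""
    print("(placeholder) inject a fault here via your tooling of choice")

def run_experiment():
    # 1. Establish the steady state for the control measurement.
    control = measure_error_rate()
    assert control <= MAX_ERROR_RATE, "system is not in steady state; abort"

    # 2. Hypothesis: the steady state also holds for the experimental group.
    # 3. Introduce the variable.
    inject_fault()

    # 4. Try to disprove the hypothesis by re-measuring.
    experimental = measure_error_rate()

    # 5. Contain the blast radius: halt if degradation is severe.
    if experimental > 10 * MAX_ERROR_RATE:
        print("Severe degradation detected -- halt the experiment and roll back.")
    print(f"control={control:.2%} experimental={experimental:.2%}")

if __name__ == "__main__":
    run_experiment()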
Infrastructure layer: Ensure the distributed application continues to work in the event of a host failure and scales up and down per the required configuration
Network layer: Simulate network latency, bandwidth constraints, and jitter to verify the resilience of distributed applications
Application layer: Simulate application crashes via exceptions or process kills (a minimal sketch of the network- and application-layer faults follows)
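The sketch below is a simplified illustration, not Apexon’s tooling: it uses the standard Linux `tc netem` command to add latency and jitter on a network interface and `pkill` to crash a target process. The interface name `eth0` and the process name `target-service` are assumptions, and the script must run with privileges sufficient to modify qdiscs.

```python
# Illustrative network- and application-layer fault injection (Linux only).
# Assumptions: "eth0" and "target-service" are placeholders.
import subprocess

IFACE = "eth0"                      # hypothetical network interface
TARGET_PROCESS = "target-service"   # hypothetical process to crash

def add_latency(delay_ms: int = 200, jitter_ms: int = 50):
    """Add latency and jitter to outbound traffic using tc netem."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms"],
        check=True)

def remove_latency():
    """Remove the netem qdisc to restore normal network behavior."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
                   check=True)

def kill_process():
    """Simulate an application crash by killing the target process."""
    subprocess.run(["pkill", "-9", TARGET_PROCESS], check=False)

if __name__ == "__main__":
    add_latency()
    try:
        kill_process()       # observe how the system recovers
    finally:
        remove_latency()     # always clean up the injected fault
```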
Apexon helped the customer design a microservices platform based on a popular container orchestration engine.
Because the customer is a telecommunications provider, the most critical aspect of its service is the SLA.
Platform components recovered within 2-4 minutes. Because they were stateful components, we did not push the failure count beyond the quorum
The AWS Auto Scaling group replaced the killed instance with a new instance within 2-5 minutes. The container orchestration platform then started scheduling containers to this new instance (a sketch of this type of instance-kill experiment follows these findings)
We experimented with CPU and memory resource exhaustion and noticed performance degradation on those VMs
We also found that the container orchestration platform stopped scheduling containers to those instances due to resource saturation
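For illustration, the sketch below (not the customer’s actual tooling) terminates a random EC2 instance in an Auto Scaling group with boto3 and then polls the group until a replacement instance appears. The group name `demo-asg` and the region are assumptions, and AWS credentials must already be configured.

```python
# Illustrative instance-kill experiment against an AWS Auto Scaling group.
# Assumptions: "demo-asg" and "us-east-1" are placeholders.
import random
import time
import boto3

ASG_NAME = "demo-asg"     # hypothetical Auto Scaling group
REGION = "us-east-1"

asg = boto3.client("autoscaling", region_name=REGION)
ec2 = boto3.client("ec2", region_name=REGION)

def instance_ids():
    """Return the IDs of instances currently in the Auto Scaling group."""
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
    return [i["InstanceId"] for i in group["Instances"]]

def kill_random_instance():
    """Terminate one instance and measure how long replacement takes."""
    before = instance_ids()
    victim = random.choice(before)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"Terminated {victim}; waiting for the group to recover...")

    start = time.time()
    while True:
        current = set(instance_ids())
        # Recovery = a new instance ID appears that was not in the original set.
        if current - set(before):
            print(f"Replacement appeared after {time.time() - start:.0f}s")
            break
        time.sleep(15)

if __name__ == "__main__":
    kill_random_instance()
```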
Apexon helped this customer design and develop a Python SDK.
The SDK code was responsible for downloading and uploading terabytes of data, so it was critical for the SDK to work even under inconsistent network conditions. The team proactively developed and verified the following experiments:
Increasing latency resulted in more time for uploading or downloading the blob
We experimented with a decrease in bandwidth, which resulted in more time for uploading or downloading the blob
The SDK continued the upload or download from the stalled point of the blob whenever a timeout was introduced
The SDK retried for the pre-defined number of attempts and resumed from the stalled point whenever an API crash was simulated (a simplified sketch of this resume-and-retry behavior follows)
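As a simplified illustration of the resume-and-retry behavior verified above (not the actual SDK code), the sketch below downloads a blob with `requests`, resuming from the last received byte via an HTTP Range header whenever a timeout or connection error occurs. The URL, retry count, and chunk size are assumptions, and the server must support Range requests for resuming to work.

```python
# Illustrative resumable download with bounded retries (not the real SDK).
# Assumptions: BLOB_URL, MAX_ATTEMPTS, and CHUNK_SIZE are placeholders.
import requests

BLOB_URL = "https://example.com/large-blob"   # hypothetical blob location
MAX_ATTEMPTS = 5                               # pre-defined retry attempts
CHUNK_SIZE = 1024 * 1024                       # 1 MiB

def download_blob(path: str):
    offset = 0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            # Resume from the last byte we already have on disk.
            headers = {"Range": f"bytes={offset}-"} if offset else {}
            with requests.get(BLOB_URL, headers=headers,
                              stream=True, timeout=30) as resp:
                resp.raise_for_status()
                mode = "ab" if offset else "wb"
                with open(path, mode) as fh:
                    for chunk in resp.iter_content(CHUNK_SIZE):
                        fh.write(chunk)
                        offset += len(chunk)
            return  # completed successfully
        except requests.RequestException as exc:
            # Timeout, crash, or dropped connection: retry and resume at `offset`.
            print(f"Attempt {attempt} failed at byte {offset}: {exc}")
    raise RuntimeError(f"download failed after {MAX_ATTEMPTS} attempts")

if __name__ == "__main__":
    download_blob("blob.bin")
```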