We help businesses break things purposefully to discover and fix unknowns
Distributed systems contain a lot of moving parts. Environmental behavior is beyond your control. The moment you launch a new software service, you are at the mercy of the environment it runs in, which is full of unknowns.
Unpredictable events are bound to happen. Cascading failures often lie dormant for a long time, waiting for a trigger. The impact can be seen in revenue loss and operational costs, as well as customer dissatisfaction, lost productivity, poor brand image, and even derailed IT careers.
No matter how you measure it, IT downtime is costly. It’s also largely unavoidable due to the increasing complexity and interdependence of today’s distributed IT systems. The combination of cloud computing, microservices architectures, and bare-metal infrastructure creates many moving parts and potential points of failure, making those systems anything but predictable.
Apexon is a pure-play digital engineering services company. For over 18 years, we’ve focused on helping companies accelerate the pace and success of their digital efforts from concept to market. We apply our digital-native perspective to help businesses rethink work processes, modernize customer engagement, and leverage digital technology to create a competitive advantage.
Apexon offers Chaos Engineering solutions to help organizations build fault-tolerant and robust cloud-native and multi-cloud applications to accelerate digital transformation. These services are an integral part of progressive delivery, based on the principle of experimenting with new functionality on distributed systems in order to test true performance. Our approach is built on the principles of continuous testing and software test automation and is segmented by the infrastructure, network, and application layers.
Apexon Chaos Engineering services make complex digital systems fault-tolerant and robust, delivering important business advantages including:
Weaknesses addressed proactively
Accelerated innovation
Business continuity
Operational efficiency
Reduced SLA breaches
Improved revenue outcomes
Apexon follows a disciplined process with several key steps that dictate how we design Chaos Engineering experiments. The degree to which we adhere to these steps correlates directly with the confidence we can have in a distributed system at scale. A minimal sketch of such an experiment follows the steps below.
Identify metrics and values that define the steady state of the system
Hypothesize that this steady state will hold for both the control group and the experimental group
Introduce variables that reflect real-world events, such as servers that crash, dependencies that fail, and so on
Stress the environment using the introduced variables and attempt to disprove the hypothesis
Manage the blast radius by ensuring that the fallout from experiments is minimized and contained
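To make these steps concrete, here is a minimal, illustrative sketch of the experiment loop in Python. It is not Apexon’s tooling: the health-check URL, the error-rate threshold, and the `inject_fault` helper are hypothetical placeholders you would replace with your own service and fault-injection mechanism.

```python
# Minimal chaos-experiment skeleton (illustrative only).
# Assumptions: STEADY_STATE_URL, MAX_ERROR_RATE, and inject_fault() are
# hypothetical placeholders, not part of any specific product.
import time
import requests

STEADY_STATE_URL = "https://example.internal/healthz"  # hypothetical endpoint
MAX_ERROR_RATE = 0.01                                   # steady-state definition
PROBES = 50

def measure_error_rate() -> float:
    """Probe the service and return the fraction of failed requests."""
    failures = 0
    for _ in range(PROBES):
        try:
            if requests.get(STEADY_STATE_URL, timeout=2).status_code >= 500:
                failures += 1
        except requests.RequestException:
            failures += 1
        time.sleep(0.1)
    return failures / PROBES

def inject_fault():
    """Placeholder for a real-world variable: kill a host, add latency, etc."""
    print("(placeholder) inject a fault here via your tooling of choice")

def run_experiment():
    # 1. Establish the steady state for the control measurement.
    control = measure_error_rate()
    assert control <= MAX_ERROR_RATE, "system is not in steady state; abort"

    # 2. Hypothesis: the steady state also holds for the experimental group.
    # 3. Introduce the variable.
    inject_fault()

    # 4. Try to disprove the hypothesis by re-measuring.
    experimental = measure_error_rate()

    # 5. Contain the blast radius: halt if degradation is severe.
    if experimental > 10 * MAX_ERROR_RATE:
        print("Severe degradation detected -- halt the experiment and roll back.")
    print(f"control={control:.2%} experimental={experimental:.2%}")

if __name__ == "__main__":
    run_experiment()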
Infrastructure layer: Ensure the distributed application continues to work in the event of a host failure and scales up and down per the required configuration
Network layer: Simulate network latency, bandwidth constraints, and jitter to verify the resilience of distributed applications
Application layer: Simulate application crashes via exceptions or process kills (a minimal sketch of the network- and application-layer faults follows)
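The sketch below is a simplified illustration, not Apexon’s tooling: it uses the standard Linux `tc netem` command to add latency and jitter on a network interface and `pkill` to crash a target process. The interface name `eth0` and the process name `target-service` are assumptions, and the script must run with privileges sufficient to modify qdiscs.

```python
# Illustrative network- and application-layer fault injection (Linux only).
# Assumptions: "eth0" and "target-service" are placeholders.
import subprocess

IFACE = "eth0"                      # hypothetical network interface
TARGET_PROCESS = "target-service"   # hypothetical process to crash

def add_latency(delay_ms: int = 200, jitter_ms: int = 50):
    """Add latency and jitter to outbound traffic using tc netem."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms"],
        check=True)

def remove_latency():
    """Remove the netem qdisc to restore normal network behavior."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
                   check=True)

def kill_process():
    """Simulate an application crash by killing the target process."""
    subprocess.run(["pkill", "-9", TARGET_PROCESS], check=False)

if __name__ == "__main__":
    add_latency()
    try:
        kill_process()       # observe how the system recovers
    finally:
        remove_latency()     # always clean up the injected fault
```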
Apexon helped the customer design a microservices platform based on a popular container orchestration engine.
Because the customer is a telecommunications provider, the most critical aspect of its service is the SLA.
Platform components recovered within 2-4 minutes. Because they were stateful components, we did not push the failure count beyond the quorum
The AWS Auto Scaling group replaced the killed instance with a new instance within 2-5 minutes. The container orchestration platform then started scheduling containers to this new instance (a sketch of this type of instance-kill experiment follows these findings)
We experimented with CPU and memory resource exhaustion and noticed performance degradation on those VMs
We also found that the container orchestration platform stopped scheduling containers to those instances due to resource saturation
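For illustration, the sketch below (not the customer’s actual tooling) terminates a random EC2 instance in an Auto Scaling group with boto3 and then polls the group until a replacement instance appears. The group name `demo-asg` and the region are assumptions, and AWS credentials must already be configured.

```python
# Illustrative instance-kill experiment against an AWS Auto Scaling group.
# Assumptions: "demo-asg" and "us-east-1" are placeholders.
import random
import time
import boto3

ASG_NAME = "demo-asg"     # hypothetical Auto Scaling group
REGION = "us-east-1"

asg = boto3.client("autoscaling", region_name=REGION)
ec2 = boto3.client("ec2", region_name=REGION)

def instance_ids():
    """Return the IDs of instances currently in the Auto Scaling group."""
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
    return [i["InstanceId"] for i in group["Instances"]]

def kill_random_instance():
    """Terminate one instance and measure how long replacement takes."""
    before = instance_ids()
    victim = random.choice(before)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"Terminated {victim}; waiting for the group to recover...")

    start = time.time()
    while True:
        current = set(instance_ids())
        # Recovery = a new instance ID appears that was not in the original set.
        if current - set(before):
            print(f"Replacement appeared after {time.time() - start:.0f}s")
            break
        time.sleep(15)

if __name__ == "__main__":
    kill_random_instance()
```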
Apexon helped this customer design and develop a Python SDK.
The SDK code was responsible for downloading and uploading terabytes of data, so it was critical for the SDK to work even under inconsistent network conditions. The team proactively developed and verified the following experiments:
Increasing latency resulted in more time for uploading or downloading the blob
We experimented with a decrease in bandwidth, which resulted in more time for uploading or downloading the blob
The SDK continued the upload or download from the stalled point of the blob whenever a timeout was introduced
The SDK retried for the pre-defined number of attempts and resumed from the stalled point whenever an API crash was simulated (a simplified sketch of this resume-and-retry behavior follows)
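As a simplified illustration of the resume-and-retry behavior verified above (not the actual SDK code), the sketch below downloads a blob with `requests`, resuming from the last received byte via an HTTP Range header whenever a timeout or connection error occurs. The URL, retry count, and chunk size are assumptions, and the server must support Range requests for resuming to work.

```python
# Illustrative resumable download with bounded retries (not the real SDK).
# Assumptions: BLOB_URL, MAX_ATTEMPTS, and CHUNK_SIZE are placeholders.
import requests

BLOB_URL = "https://example.com/large-blob"   # hypothetical blob location
MAX_ATTEMPTS = 5                               # pre-defined retry attempts
CHUNK_SIZE = 1024 * 1024                       # 1 MiB

def download_blob(path: str):
    offset = 0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            # Resume from the last byte we already have on disk.
            headers = {"Range": f"bytes={offset}-"} if offset else {}
            with requests.get(BLOB_URL, headers=headers,
                              stream=True, timeout=30) as resp:
                resp.raise_for_status()
                mode = "ab" if offset else "wb"
                with open(path, mode) as fh:
                    for chunk in resp.iter_content(CHUNK_SIZE):
                        fh.write(chunk)
                        offset += len(chunk)
            return  # completed successfully
        except requests.RequestException as exc:
            # Timeout, crash, or dropped connection: retry and resume at `offset`.
            print(f"Attempt {attempt} failed at byte {offset}: {exc}")
    raise RuntimeError(f"download failed after {MAX_ATTEMPTS} attempts")

if __name__ == "__main__":
    download_blob("blob.bin")
```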