The cloud offers undeniable benefits for building cost-effective, agile solutions. Enterprise-wide pushes to modernize application stacks have added fuel to cloud migration initiatives, and as monoliths are decomposed and redesigned as distributed microservices, the large portfolio of cloud-native services makes developing and deploying in the cloud faster and more economical. However, teams that build and plan for the cloud often assume the high availability of cloud services as advertised by the provider, and by extension presume their solution will remain available through a disaster. To put this in perspective: cloud migration and the adoption of cloud-native services do not automatically make your application architecture resilient. An unintended consequence of this assumption is an increase in overall enterprise risk.
To account for resiliency properly, the cloud itself should be treated as a separate entity with functional and non-functional aspects that are defined, assessed, and monitored in their own right. Most implementations optimize for economical use of the cloud, with little or no deliberate planning for resiliency.
A resiliency-driven enterprise function or focused team enables better upfront assessment and planning of the application and physical architecture. Such a function goes a long way toward striking the right balance between the agility the cloud provides and the organization's tolerance for risk. Organizations that fail to build a resiliency function into their application development lifecycle willingly accept the risk of unplanned downtime, especially for business-critical workloads.
Assessing this risk requires reliably quantifying the cost of downtime. According to one estimate, "on average, an infrastructure failure can cost $100,000 an hour and a critical application failure can cost $500,000 to $1 million per hour."* Businesses cannot afford repeated, unpredictable downtime events. A resiliency function is therefore of paramount importance to any organization whose business-critical workloads are either developed in the cloud or being re-platformed and migrated to it.
A resiliency function strengthens business resiliency while simultaneously assessing, measuring, and rectifying technical issues that pose a risk to business-critical applications. The team tasked with this function builds hypotheses around edge cases that can affect application availability.
At the end of this blog, I have posted links to articles detailing the consequences when companies fail to account for edge cases in critical workloads.
Resiliency Approach in the Cloud
A resiliency-focused team should look critically at every component tied to the application design. Failure mode and effects analysis (FMEA) is a good starting point for assessing how failures arise and how severe they are. The resiliency team should work with application, networking, security, and infrastructure architects to develop an interaction diagram of all components in the overall application stack. Each interaction point should then be assessed individually for all possible failures, and each failure scored for severity, probability, and observability.
The outcome of this exercise is a risk profile for each failure, captured as a risk priority number (RPN) calculated from those three scores. The highest-risk failures are those with high probability, high severity, and low observability. The resiliency team should identify a finite set of high-RPN failures that can be replicated via proofs of concept (PoCs), then recommend solutions to the relevant stakeholders based on the observable output of those PoCs. In a more mature setup, resiliency teams become an integral part of application development teams, building utilities and frameworks that specifically target failure points and bridge structural deficiencies in application code.
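To make the scoring concrete, here is a minimal sketch of FMEA scoring in Python. The scales, failure modes, and numbers are illustrative assumptions, not data from any real assessment; RPN is computed in the conventional way as the product of the three scores, with a high observability score meaning the failure is hard to detect:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int       # 1 (negligible) .. 10 (catastrophic)
    probability: int    # 1 (rare) .. 10 (frequent)
    observability: int  # 1 (easily detected) .. 10 (hard to detect)

    @property
    def rpn(self) -> int:
        # Risk priority number: higher means higher remediation priority.
        return self.severity * self.probability * self.observability

# Hypothetical failure modes for one interaction point.
failures = [
    FailureMode("Region-wide object-store outage", 9, 2, 3),
    FailureMode("Stale DNS entry after failover", 6, 5, 8),
    FailureMode("Connection-pool exhaustion under load", 7, 6, 4),
]

# High-RPN failures are the candidates for PoC replication.
for f in sorted(failures, key=lambda f: f.rpn, reverse=True):
    print(f"{f.rpn:4d}  {f.name}")
```

Note how the hard-to-observe DNS failure outranks the more severe but rare and easily detected regional outage; this is exactly the prioritization the RPN is designed to surface.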
Chaos engineering and cloud service verification are two other offshoots of the resiliency function. Chaos engineering builds on the failure hypotheses and injects failures at all vulnerable sections of the application architecture and its associated infrastructure. The experiments reveal your application's ability to sustain inadvertent failures and test its built-in defenses against them.
Ideally, chaos experiments are conducted in live production environments, but if your organization is new to chaos testing, running them in a production-like environment (UAT, performance, etc.) is a good starting point.
Cloud service verification consists of deep, targeted experiments on the cloud-native services that are part of your application architecture. It includes understanding the end-to-end design of the provisioned service and how it operates under a stress event. Failures or issues encountered during this process are fixed by building custom utilities or tuning service configuration, and any issues found during verification should also be discussed in detail with the cloud provider.
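A minimal verification probe might exercise a service operation repeatedly, record latency and errors, and flag degradation beyond a budget. This is a sketch under assumptions: `call_service` stands in for a real cloud SDK call (an object-store put, a queue publish, etc.), and the thresholds are placeholders:

```python
import statistics
import time

def verify_service(call_service, attempts=50, p95_budget_s=0.5):
    """Run repeated calls against a service operation and summarize behavior."""
    latencies, errors = [], 0
    for i in range(attempts):
        start = time.perf_counter()
        try:
            call_service(i)
        except Exception:
            errors += 1
        latencies.append(time.perf_counter() - start)
    # quantiles(n=20) yields 19 cut points; the last approximates the p95.
    p95 = statistics.quantiles(latencies, n=20)[-1]
    return {
        "p95_s": p95,
        "error_rate": errors / attempts,
        "within_budget": p95 <= p95_budget_s and errors == 0,
    }
```

Running the same probe while the service is deliberately stressed (throttled, degraded, or failed over) shows whether its behavior still matches the design assumptions your architecture depends on.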
An always-on world in the cloud comes with inherent risks. The complexity and impact of outages demand detailed, focused attention. A typical cloud application architecture has many components, and the application's availability is an aggregate of the availability of all of them. Understanding the interaction points, building a failure mode for every failure scenario tied to each interaction point, and testing those scenarios to understand their impact and close the gaps can significantly strengthen your disaster recovery plans.
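The aggregation point is easy to underestimate: for components in a serial dependency chain, overall availability is the product of the individual availabilities, so it is always lower than the weakest component. The figures below are illustrative assumptions, not vendor SLAs:

```python
def composite_availability(availabilities):
    """Availability of a serial chain: the product of component availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Hypothetical four-component stack where every request touches each tier.
stack = {
    "load balancer": 0.9999,
    "app tier":      0.999,
    "database":      0.999,
    "object store":  0.9999,
}

overall = composite_availability(stack.values())
print(f"Composite availability: {overall:.4%}")
```

With two "three nines" tiers in the chain, the composite drops to roughly 99.78%, below any single component's figure, which is why redundancy and failure-mode analysis at each interaction point matter.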
At Perficient, we continue to monitor cloud technologies and trends. We understand the challenges of embracing cloud technologies and have developed proven cloud-based solutions, platforms, architectures, and methodologies to aid a smoother migration. If you're interested in learning more, please reach out to one of our specialists at firstname.lastname@example.org
References and links
* IDC, "DevOps and the Cost of Downtime: Fortune 1000 Best Practice Metrics Quantified," Stephen Elliot, December 2014, IDC #253155.