AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

Achieving Maximum Availability in the Cloud

We often hear clients confidently tell us that Amazon’s SLA of 99.95% for EC2 instances provides them with plenty of availability. They believe that combining auto-scaling compute instances with a persistent data store such as RDS provides all the scalability and availability they will ever need. Unfortunately, this isn’t always the case. In this article we focus on AWS services, but the same reasoning applies to any IaaS or PaaS provider.

The SLA of 99.95% is not the guaranteed availability of your services. It is the availability of EC2 across an entire AWS region. From the AWS EC2 SLA:

“Monthly Uptime Percentage” is calculated by subtracting from 100% the percentage of minutes during the month in which Amazon EC2 or Amazon EBS, as applicable, was in the state of “Region Unavailable.”

Even this limit on downtime, about 21.6 minutes in a 30-day month, is not always met. Over the last few years there have been several outages that lasted much longer. Additionally, your service’s availability can be affected by third-party services used in your product and, even more likely, by simple human error on your engineering team.
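The downtime budget behind an uptime percentage is simple arithmetic, and the exact figure depends on the length of the month. The sketch below is illustrative only: the function names are ours, not part of any AWS tooling, and it follows the SLA wording quoted above, where uptime is 100% minus the percentage of unavailable minutes in the month.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month that still satisfy the given uptime percentage."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (100.0 - sla_percent) / 100.0


def monthly_uptime_percent(unavailable_minutes: float, days_in_month: int = 30) -> float:
    """Uptime percentage: 100% minus the share of unavailable minutes in the month."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return 100.0 - 100.0 * unavailable_minutes / total_minutes


if __name__ == "__main__":
    # Budget for a 99.95% target in a 30-day and a 31-day month.
    print(round(allowed_downtime_minutes(99.95, 30), 2))
    print(round(allowed_downtime_minutes(99.95, 31), 2))
    # A single one-hour outage in a 30-day month, expressed as an uptime percentage.
    print(round(monthly_uptime_percent(60, 30), 3))
```

A single one-hour outage in a 30-day month already puts the monthly figure at roughly 99.86%, below the 99.95% target.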

To reduce the likelihood of a customer-impacting event or performance degradation, we recommend the following deployment patterns.

  • Calls made across regions should be asynchronous. This reduces latency and the likelihood of a failure.
  • A given service should be completely deployed within multiple Availability Zones.
  • Use abstraction layers with AWS managed services so that your product is not tied to them. These managed services are at different stages of maturity today, and an abstraction layer allows a later migration to more robust solutions.
  • Services, or microservices depending on your architecture, should each have a dedicated data store. Allowing app server pools to communicate with multiple data stores reduces availability.
  • To protect against region failures, deploy all services needed for a product into a second region and run active/active, designating a home region for each user.
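Two of the patterns above can be checked with basic probability. This is a minimal sketch under a strong assumption: component failures are independent, which real Availability Zones and data stores only approximate. The function names and the 99.9% figures are hypothetical and chosen for illustration.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a request path that needs every listed component to be up."""
    return prod(availabilities)


def redundant_availability(availability: float, copies: int) -> float:
    """Availability when any one of `copies` identical deployments is sufficient.

    Assumes failures are independent, which is an idealization.
    """
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    app_pool = 0.999      # hypothetical availability of an app server pool
    data_store = 0.999    # hypothetical availability of one data store

    # A pool with one dedicated data store versus a pool that needs three stores.
    print(serial_availability([app_pool, data_store]))
    print(serial_availability([app_pool, data_store, data_store, data_store]))

    # The same service fully deployed in one, two, or three Availability Zones.
    for zones in (1, 2, 3):
        print(zones, redundant_availability(0.999, zones))
```

Every additional data store in the request path multiplies in another factor below 1, so availability can only go down. Every additional complete deployment multiplies the probability of total failure by another small number, so availability goes up.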

We have come across clients who have invested in migrating from one cloud provider to another because of what seemed like infrastructure outages in a particular region. Before making such an investment of time and money, make sure that the way your product is architected is not the root cause of these outages. If you are concerned about your architecture or about deploying your product in the cloud, contact us and we can provide you with some options.