AKF Partners

Abbott, Keeven & Fisher Partners | Partners In Hyper Growth


Cascading Failures

I was chatting with Nanda Kishore (@nkishore), the ShareThis CTO, about the recent problems Amazon had in one of their availability zones. Even though ShareThis is 100% in the cloud, because they have properly architected their system the outage didn't affect ShareThis services at all. Kudos to Nanda and his team for their design and implementation, but more interesting was our discussion about this being a cascading failure, in which one small problem cascades into a much bigger one. A few days later Amazon provided a bit of a postmortem confirming that a simple error during a network change started the problem. The incorrect traffic shift left the primary and secondary EBS nodes isolated, each thinking the other had failed. When they were reconnected they rapidly searched for free space to re-mirror, which exhausted spare capacity and led to a "re-mirroring storm."

As we were discussing the Amazon issue, I brought up another recent outage of a major service, Facebook. In September 2010 they had a several-hour outage for many users, caused by an invalid configuration value in their caching tier. Every client that saw the value attempted to fix it, which involved a query to the database, and the databases were quickly overwhelmed by hundreds of thousands of queries per second.

Both of these are prime examples of how, in complex systems, small problems can cascade into large incidents. There has been a good deal of research on cascading failures, including models of the probability distributions of outages to predict their occurrence. What I don't believe exists, but should, is a framework to prevent them. As Chapter 9 of Scalability Rules states, the most common scalability-related failure is not designing to scale, and the second most common is not designing to fail. Everything fails; plan for it! Utilizing swim lanes or fault isolation zones will certainly minimize the impact of these issues, but there is a need for handling this at the application layer as well.

As an example, say we have a large number of components (storage devices, caching services, etc.) that have a failsafe behavior such as refreshing the cache or re-mirroring data. Before these actions are executed, the component should check in with an authority that determines whether the request should proceed or whether too many other components are already doing similar work. Alternatively, a service could watch for these requests on the network and throttle or rate limit them, much as we do in an API. This way a small problem that would otherwise trigger a huge cascade of reactions can be paused and handled in a controlled, more graceful manner.
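As a minimal sketch of the first approach, each component could ask a central throttle service for permission before starting an expensive recovery action. Everything below (the endpoint, the "allow" response, the action name) is a hypothetical illustration, not an existing product or API.

#!/bin/bash
# Hypothetical check-in with a central throttle authority before running an
# expensive failsafe action such as a re-mirror. THROTTLE_URL and the
# "allow" response format are assumptions for illustration only.
THROTTLE_URL="http://throttle.internal.example.com/acquire?action=re-mirror"

RESPONSE=$(curl -s --max-time 2 "$THROTTLE_URL")

if [ "$RESPONSE" = "allow" ]
then
    echo "Permission granted, starting re-mirror"
    # ... kick off the actual re-mirror here ...
else
    # Denied or no answer: back off and retry later instead of piling on
    echo "Too many recovery actions in flight, deferring re-mirror"
    sleep $(( RANDOM % 60 ))   # randomized backoff to avoid a thundering herd
fi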



Rules for Surviving an Amazon Outage

Because of recent issues with Amazon's services there is a lot of interest in why some companies are able to keep their sites up despite their IaaS or PaaS providers experiencing issues. Here is an InformIT article we wrote outlining a few rules for surviving an Amazon or other cloud provider outage.



How to Set Up a Failover Server on EC2

We started working with Amazon's EC2 instances several years ago. Eventually we moved several of our hosted environments to the cloud and used scripts to back up the MySQL DBs and file systems to S3. While the EC2 instances are pretty stable, like everything else they occasionally fail. Since Amazon offers an elastic load balancer, I started there. The setup is incredibly simple through the AWS UI and the cost is pretty reasonable at $0.025 per hour and $0.008 per GB. The problem with Amazon's elastic load balancer is that you can't associate an IP address with it; you can only address it by the domain name that Amazon assigns. This prevents the elastic LB from being used for a primary domain; you can only use it for subdomains. This wasn't acceptable, so I started looking at alternatives.

HAProxy was at the top of my list for an open source LB because of its ease of configuration, performance, and wide adoption. What I didn't like about this solution is that, because it sits in the path of traffic, it requires two servers set up in HA mode, lest I cause more issues than I solve. This unfortunately doubles the cost of server instances. Additionally, several environments that I was considering load balancing were running CMS systems not designed for active-active, so without some hacking they would be running in active-passive mode. I started thinking about an alternative solution.

What I came up with was a failover server with a script to monitor the site and control the failover execution. I believe this solution balances cost, complexity, and availability for small sites that are not critical, e.g. a company's blog. If your site IS your business then you need to move forward with a properly load balanced, active-active solution.

The first thing you'll need to do is set up two additional servers. One is your replica or failover server, which you'll host your site/DB from when the primary fails. The second server is for monitoring and controlling the failover. For the failover server I used MySQL master-slave replication, which is pretty straightforward to set up and not covered here. On the monitoring server my plan was to rely on Amazon's AWS API tools to disassociate my Elastic IP and re-associate it with the failover server. In order to use these tools you need a JRE on your monitoring server; for setting this up I followed the instructions on this site.
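For reference, getting the EC2 API command line tools working on the monitoring server mostly comes down to a few environment variables along these lines; the paths and credential file names below are placeholders, not the ones I used.

# Example environment for the EC2 API tools on the monitoring server.
# Paths and credential file names are placeholders; adjust to your install.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export EC2_HOME=/opt/ec2-api-tools
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=/opt/aws/pk-XXXXXXXX.pem
export EC2_CERT=/opt/aws/cert-XXXXXXXX.pem

# Quick sanity check that the tools and credentials work
ec2-describe-addresses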

Once you've set up the replica and monitoring servers, you need a script to monitor the site and control the failover. I used a bash shell script that curls the desired test page and greps for something that I know loads at the bottom of the page, such as a Google Analytics ID. If the load fails, the script appends the current timestamp to a file; if the page loads successfully, it empties the file. The reason for this is that I didn't want to alert or fail over because of one missed page load, or because of missed page loads that were not sequential.

#!/bin/sh
# Check that the test page loads and contains the Google Analytics ID.
FILE=akf_blog_err_cnt.txt
if curl -s http://mysite.com/ | grep -c UA-12345 > /dev/null 2>&1
then
    # Success: truncate the file so only consecutive failures are counted
    : > "$FILE"
else
    # Failure: append a timestamp for this missed page load
    echo "$(date)" >> "$FILE"
fi

The next step is to add the logic for counting the number of timestamps in the file.

# Count the consecutive failures recorded in the file (one timestamp per line)
ERR_CNT=$(wc -l < "$FILE")

Now compare that count to a maximum allowable number of failures. In my case, if I don't get a successful page response in 5 attempts I want to initiate the alert and failover. Since this script is designed to run periodically via cron and not as a persistent service, I've added a semaphore file to record that the site has failed over. This prevents the script from repeatedly trying to fail over.

The actual failover control has a few steps. The first is to send out an email alert so that I know something has gone awry. The next is to stop the MySQL slave on the failover server; since it is about to start taking traffic, I don't want it applying any more logs from the master. I'm using SSH with a key to execute the remote command. The last two steps are to disassociate the Elastic IP from the failed server and re-associate it with the failover server; these commands are part of the AWS API tools.

MAX_ERR=5
FAILED_FLAG=akf_blog_fail.txt
if [ $ERR_CNT -ge $MAX_ERR ] && [ ! -f "$FAILED_FLAG" ]
then
    # Send email about the failure
    echo "The page failed to load $MAX_ERR times in a row. Shifting to backup server." | /bin/mail -s "Site NOT Loading" michael@akfpartners.com
    # Stop the slave so it stops applying logs from the failed master
    ssh -i /key.pem user@ec2-IP-address.amazonaws.com 'mysql -Bse "stop slave"'
    # Shift the Elastic IP to the secondary server
    ec2-disassociate-address 50.72.23.173
    ec2-associate-address 50.72.23.173 -i i-3950994
    # Mark as failed over so we don't try again on the next run
    touch "$FAILED_FLAG"
else
    echo "The test page has fewer than $MAX_ERR consecutive errors, or failover has already occurred"
fi

Now, place this script in your cron jobs to run every minute. That's it for setting up the failover monitor and control script. Because the monitoring server is not in the direct route of traffic, I don't need to set it up as HA; a total failure of the system would require both the monitoring server and the primary site server to fail simultaneously. But because I'm pretty paranoid, I do have an external monitoring service watching over both the site and the monitoring server.
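For reference, a crontab entry along these lines runs the check every minute; the script path and log file are placeholders.

# Run the failover check every minute; path and log file are examples
* * * * * /home/ec2-user/failover_check.sh >> /var/log/failover_check.log 2>&1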



Setting Up CloudFront with an Origin Server

We have a couple of sites hosted on Amazon's EC2, and I wanted to implement Amazon's CDN product, CloudFront, to see what performance improvements we could achieve. Having set up other CDNs for sites, I figured this would be a pretty straightforward setup, not worthy of a post. Unfortunately, that turned out not to be the case, so I thought I should write something up for anyone else interested in a similar setup.

As background, a CDN (Content Delivery Network) is used to host mostly static content (files that don't change often) on what are called "edge servers" instead of just your own servers, called origin servers. Typically there are many hundreds or thousands of edge servers, geographically distributed across multiple backbone providers. This puts them much closer to your customers, resulting in faster downloads of your files to their browsers and thus better page performance while on your site.

CloudFront is designed to use Amazon's S3 storage as its source for objects (static files like images or videos). I didn't want to pay for the additional storage, although it is very cheap, but most importantly I did not want another failure point in the architecture. (This setup might also be useful for sites not hosted on EC2 but wanting to use CloudFront.) Wanting CloudFront to pull objects directly from my server, I went looking for how others had solved this problem. It turns out it is possible to set up a CloudFront "distribution" (the term Amazon uses for an implementation) using an origin server instead of S3, but only through Amazon's CloudFront API, documentation here. Once the distribution is set up you can administer it from the AWS web interface.

I started playing with the API using cURL but realized after a few attempts that the process was a little more complicated, and in order to have something repeatable I'd need to write a little code. Since I had already borrowed the HMAC-SHA1 function required for API authorization from here, which was in PHP, I continued with PHP. Here is the complete program if you're interested, but below are the major steps.

Major Steps
Here are the major steps in the program; a rough curl-based sketch of the full request follows the list.

1) Define XML Payload: Using the "DistributionConfig" document, you specify a "CustomOrigin" instead of "S3" and define the following elements:

  • DNSName – this is the domain you are setting up the CDN for.
  • HTTPPort – which ports are your secure and non-secure traffic on?
  • CNAME – what subdomain will you use in DNS to refer to the CDN? I used “cdn1.akfpartners.com” because I planned on changing all my references to static items (images, js, css, etc) to call this subdomain.
  • Enabled – do you want this enabled right away?
  • CallerReference – this is an ID to keep your requests unique.
  • DefaultRootObject – this is the default file that will be requested if no file is explicitly called.

2) Encode Authorization String: The CloudFront API requires that you sign the date, formatted like "Thu, 30 Dec 2010 16:05:21 EST", using HMAC-SHA1 with your secret access key.

3) Set Headers: The most important header is the “Authorization” header that requires the following format “Authorization: AWS public_access_key:encoded_date”.

4) Set cURL Options: There are a few cURL options that are required:

  • URL – the URL to be called is “https://cloudfront.amazonaws.com/2010-11-01/distribution/”
  • POST – the API is RESTful, so you need to set cURL to POST
  • TIMEOUT – how long before the request times out

5) Execute API Request: I wrapped the request in microtime calls to see how long the transaction took and captured the results of the request.

6) Parse Results: If successful, the result will be a 201 response, meaning "created". Otherwise any of a number of errors can be sent back.
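For anyone who would rather stay in the shell than PHP, the whole request can be sketched with curl and openssl roughly as follows. This is a minimal sketch, assuming the 2010-11-01 API version above, AWS credentials exported in the environment, and a payload.xml file containing the DistributionConfig/CustomOrigin XML from step 1; it is not the PHP program linked above.

#!/bin/bash
# Rough sketch of a signed CloudFront distribution request (2010-11-01 API).
# Assumes AWS_ACCESS_KEY and AWS_SECRET_KEY are exported and payload.xml
# holds the DistributionConfig/CustomOrigin XML described in step 1.
DATE=$(date -u "+%a, %d %b %Y %H:%M:%S GMT")

# Sign the date with HMAC-SHA1 and base64-encode it (step 2)
SIGNATURE=$(printf "%s" "$DATE" | openssl dgst -sha1 -hmac "$AWS_SECRET_KEY" -binary | openssl base64)

# POST the payload with the Authorization header set (steps 3 through 5)
curl -s -w "\nHTTP status: %{http_code}\n" \
     --max-time 30 \
     -X POST "https://cloudfront.amazonaws.com/2010-11-01/distribution/" \
     -H "Date: $DATE" \
     -H "Authorization: AWS $AWS_ACCESS_KEY:$SIGNATURE" \
     -H "Content-Type: text/xml" \
     --data-binary @payload.xml

A 201 status from this request corresponds to the "created" result described in step 6.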

Once your program is ready just execute it and hopefully you get back a 201. Once you’re successful jump into the AWS console and you should see your distribution being created. It usually takes about 5 minutes until the distribution is completely ready.

DNS & Application Changes
The next step is to set up your DNS to use this CloudFront distribution. In the AWS console you will see the URL that Amazon has assigned to your distribution, something like "d75x0jxgmx7op.cloudfront.net". Simply create a CNAME record through your DNS provider pointing your subdomain to that Amazon URL. My entry looked like this:

cdn1.akfpartners.com Alias (CNAME) d75x0jxgmx7op.cloudfront.net

Once you have DNS set up and propagated (remember that, depending on your DNS provider's TTL, this might take 24 hours or more), you can change your application's references to static images. For the sites I was implementing this on we used MediaWiki, Expression Engine (EE), and WordPress. The wiki just required a change to one PHP file, LocalSettings.php. EE took a change to a CSS file and, in several templates, replacing {site_url} with a reference to the CNAME. For WordPress there is a plugin that helps with this reference replacement if you don't want to hack the files by hand.
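To confirm the CNAME has propagated before switching the application over, a quick check with dig (using the subdomain and distribution from the example above) looks like this:

# Should return the cloudfront.net hostname once the record has propagated
dig +short cdn1.akfpartners.com CNAME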

That’s it! Your site should now be up and running with Amazon’s CloudFront CDN.

Was There a Performance Improvement?

This is really the big question: was this exercise, which turned out to be slightly more than the point-and-click I thought it would be, worth it? Well, the wiki that I set this up for was ridiculously fast already and has almost no images, so the results weren't that impressive. Our site, akfpartners.com, was already pretty fast as well, but it does contain numerous images, JS, and CSS files. Using webpagetest.org, I ran the test several times and averaged the results. The table below shows the results.

Here is a screenshot of the output of WebPageTest.org for a run with CloudFront enabled. Notice that it assigned us an “A” for use of a CDN whereas before we received an “F”.


A 6.1% improvement doesn't seem like much until you consider Google's finding that decreasing web search latency from 400 ms to 100 ms increases the daily number of searches per user by up to 0.6%. Increasing your site's speed by even a small amount can significantly increase repeat visits and time on site.

Good luck with your CloudFront implementation.

