AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

How to Set Up a Failover Server on EC2

We started working with Amazon’s EC2 instances several years ago. Eventually we moved several of our hosted environments to the cloud and used scripts to back up the MySQL databases and file systems to S3. While the EC2 instances are pretty stable, like everything else they do occasionally fail. Since Amazon offers an elastic load balancer solution I started there. The setup is incredibly simple through the AWS UI and the cost is pretty reasonable at $0.025 per hour and $0.008 per GB. The problem with Amazon’s elastic load balancer is that you can’t associate an IP address with it; you can only address it by the domain name that Amazon has assigned. Because a primary (root) domain has to resolve to an IP address rather than point at another domain name, this prevents Amazon’s elastic LB from being used for a primary domain. You can only use it for sub-domains. This wasn’t acceptable so I started looking at alternatives.

HAProxy was top of my list for an open source LB because of its ease of configuration, performance, and wide adoption. What I didn’t like about this solution is that, because it sits in the path of traffic, it requires two servers set up in HA mode, lest I cause more issues than I solve. This unfortunately doubles the cost of server instances. Additionally, several environments that I was considering load balancing were running CMS systems not designed for active-active, so without some hacking they would be running in active-passive mode anyway. I started thinking about an alternative solution.

What I came up with was setting up a failover server with a script to monitor and control the failover execution. I believe this solution balances cost, complexity, and availability for small sites that are not critical, e.g. a company’s blog. If your site IS your business then you need to move forward with a properly load balanced, active-active solution.

The first thing you’ll need to do is set up two additional servers. One is your replica or failover server, from which you’ll host your site and DB when the primary fails. The second server is for monitoring and controlling the failover. For my failover server I used MySQL master-slave replication, which is pretty straightforward to set up and is not going to be covered here. On the monitoring server my plan was to rely on Amazon’s AWS API tools to disassociate my IP from the primary and re-associate it with my failover server. In order to use these tools you need a JRE on your monitoring server. For setting this up I followed the instructions on this site.

Once you’ve set up the replica and monitoring servers, you need a script to monitor and control the failover. I used a shell script that curls the desired test page and greps for something that I know loads at the bottom of the page, such as a Google Analytics ID. If the load fails, the script appends the current timestamp to a file. If the page loads successfully, it empties the file. The reason for this is that I didn’t want to alert or fail over just because of one missed page load, or because of missed page loads that were not sequential.

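#!/bin/sh
# File that accumulates one timestamp per consecutive failed check
FILE=akf_blog_err_cnt.txt
# grep -q prints nothing; its exit status is 0 only if the marker is in the page
if curl -s http://mysite.com/ | grep -q UA-12345
then : > "$FILE"        # success: empty the file
else date >> "$FILE"    # failure: append the current timestamp
fi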
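Because the check runs from cron, a request that hangs could still be running when the next check starts. One possible refinement, which is not part of the original script, is to bound the request with curl's --max-time option and wrap the check in a function. This is only a sketch: the function names page_is_up and contains_marker and the 10 second limit are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the names and the 10 second limit are illustrative only.

# contains_marker MARKER: reads a page body on stdin and succeeds
# only if MARKER appears somewhere in it.
contains_marker() {
    grep -q -- "$1"
}

# page_is_up URL MARKER: fetches URL, giving up after 10 seconds, and
# succeeds only if MARKER appears in the response body.
page_is_up() {
    curl -s --max-time 10 "$1" | contains_marker "$2"
}

# Example use inside the monitoring script:
#   if page_is_up http://mysite.com/ UA-12345; then ... ; fi
```

A timed-out or failed request produces no matching output, so it is treated the same way as a page that loaded without the marker.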

The next step is to add the logic for counting the number of timestamps in the file.

# Each non-empty line in the file is one consecutive failure
ERR_CNT=$(grep -c . "$FILE")

Now compare that count to a maximum allowable number of failures. In my case, if I don’t get a successful page response in 5 consecutive attempts I want to initiate the alert and failover. Since this script is designed to run periodically via cron and not as a persistent service, I’ve added a semaphore file that records whether the site has already failed over. This prevents the script from trying to fail over again on every run.

The actual failover control has a few steps. The first is to send out an email alert so that I know something has gone awry. The next is to stop the MySQL slave on the failover server. Since this server is about to start taking traffic I don’t want it applying any more logs from the master. I’m using SSH with a key to execute the command remotely. The last two steps are to disassociate the IP from the failed server and re-associate it with the failover server. These commands are part of the AWS API tools.

MAX_ERR=5
FAILED_FLAG=akf_blog_fail.txt
if [ "$ERR_CNT" -ge "$MAX_ERR" ] && [ ! -f "$FAILED_FLAG" ]
then
# Send email about failure
echo "The page failed to load $MAX_ERR times in a row. Shifting to backup server." | /bin/mail -s "Site NOT Loading" michael@akfpartners.com
# Stop slave so the failover server stops applying logs from the master
ssh -i /key.pem user@ec2-IP-address.amazonaws.com 'mysql -Bse "stop slave"'
# Shift IP to secondary server
ec2-disassociate-address 50.72.23.173
ec2-associate-address 50.72.23.173 -i i-3950994
# Mark as failed over
touch "$FAILED_FLAG"
else echo "Fewer than $MAX_ERR consecutive errors, or the site has already failed over"
fi
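The counting and threshold logic can be tried out on its own, without curl, SSH, or the AWS tools. The following is a minimal dry-run sketch, assuming that failover should trigger on the fifth consecutive failure. The function names (record_result, failure_count, should_fail_over) and the temporary files are hypothetical stand-ins, not part of the script above.

```shell
#!/bin/sh
# Dry run of the consecutive-failure logic; nothing here contacts a server.
# All function names and file paths are illustrative assumptions.

MAX_ERR=5
FILE=$(mktemp)               # stands in for akf_blog_err_cnt.txt
FAILED_FLAG="$FILE.failed"   # stands in for akf_blog_fail.txt; not created yet

# Record one check: "ok" resets the streak, anything else appends a timestamp.
record_result() {
    if [ "$1" = ok ]; then
        : > "$FILE"
    else
        date >> "$FILE"
    fi
}

# The number of consecutive failures is the number of lines in the file.
failure_count() {
    wc -l < "$FILE" | tr -d ' '
}

# Succeeds only when the threshold is reached and no failover has happened yet.
should_fail_over() {
    [ "$(failure_count)" -ge "$MAX_ERR" ] && [ ! -f "$FAILED_FLAG" ]
}

# Simulate four failures, one success, then five failures in a row.
for result in fail fail fail fail ok fail fail fail fail fail; do
    record_result "$result"
    if should_fail_over; then
        echo "would fail over after $(failure_count) consecutive failures"
        touch "$FAILED_FLAG"
    fi
done
```

In this simulation the first four failures do not trigger anything because the success in the middle resets the streak. Only the final run of five does, and the flag file keeps any later check from triggering it again.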

Now, place this script in your cron jobs to run every minute. That’s it for setting up the failover monitor and control script. Because this monitoring server is not in the direct path of traffic I don’t need it set up as HA. The site only goes down completely if both the monitoring server and the primary site server fail at the same time. But because I’m pretty paranoid I do have an external monitoring service watching over both the site and the monitoring server.
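As an illustration of the cron step, a crontab entry along these lines would run the check once a minute. The directory /home/user/monitor, the script name failover_check.sh, and the log file name are hypothetical; substitute your own.

```
# min hour day-of-month month day-of-week  command
* * * * * cd /home/user/monitor && ./failover_check.sh >> failover_check.log 2>&1
```

The cd matters because the script refers to its counter and flag files by relative name. Cron also runs jobs with a minimal environment, so any variables the AWS API tools rely on would need to be set in the script itself or in the crontab.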

