How To Restore Service in Less Than 5 Minutes
What’s the first thing you do when your site is down? Most people pull up Nagios, or the like, and check all the servers, databases, and storage systems. Someone else might start tail’ing or grep’ing the log files. Tech executives by now are answering phone calls or sending email updates about the outage and expected downtime. Software developers are called in to go over the log files in more detail, and network engineers are asked to jump on devices to make sure they are responding properly.
What’s missing from the above scenario? Nobody looked up the last change that went into production. In our experience, 90+% of the problems in production are caused by the latest change, be it a code release, a firewall change, or DDL or DML applied to the database. And it’s a sure bet that the latest change is the problem if the person who made it says “That couldn’t have caused the outage.” In fact, there is probably a high degree of correlation between how emphatically they make that statement and the probability that their change is the cause of the incident.
Just the other day one of our friends had an outage call where the network security team was arguing that their latest change could not possibly have caused the outage. Guess what caused the outage? That’s right, the firewall change.
So, how do you solve 90+% of your problems in less than 5 minutes? You immediately roll back the last change you made to your production environment. You might be saying to yourself “But how can I do that when I don’t know all the changes that are happening in my production environment?” And that (as Paul Harvey used to say) is the rest of the story.
You have to keep track of every single change that takes place in your production environment. This is called “change tracking” and is different from “change management”. Change tracking is simply keeping track, in any format, of all the changes that happen in production. These changes can be kept in a Word document, spreadsheet, database, IRC channel, or even an unmonitored email account. Anything will do as long as it 1) allows fast entry, so people have no excuse not to use it, and 2) can be retrieved immediately when needed during an outage.
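To make the two requirements concrete, here is a minimal sketch of a change log in Python. It is only one of many ways to do this, and nothing here is prescribed by the approach itself: the file format (one JSON object per line), the field names, and the function names `record_change` and `latest_changes` are all assumptions made for illustration. Entry is a single function call, and retrieval returns the most recent changes first, on the assumption that entries are appended in the order the changes were made.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_change(log_path, who, what):
    """Append one change entry to the log as a single JSON line.

    The fields (when, who, what) are illustrative, not a required schema.
    """
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "who": who,
        "what": what,
    }
    # Append mode keeps entry fast and never rewrites earlier history.
    with Path(log_path).open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")
    return entry


def latest_changes(log_path, count=1):
    """Return up to `count` entries, newest first.

    Assumes entries were appended in the order the changes happened,
    so the last line of the file is the latest change.
    """
    path = Path(log_path)
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]
    # Reverse the append order, then keep only the newest `count` entries.
    return entries[::-1][:count]
```

During an outage, a call such as `latest_changes("changes.jsonl", count=3)` (the file name is hypothetical) would show the first candidates for a rollback. A shared spreadsheet or chat channel can meet the same two requirements; the point is only that both writing and reading take seconds.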