AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth


How To Restore Service in Less Than 5 Minutes

What’s the first thing you do when your site is down? Most people pull up Nagios, or the like, and check all the servers, databases, and storage systems. Someone else might start tail’ing or grep’ing the log files. Tech executives by now are answering phone calls or sending email updates about the outage and expected downtime. Software developers are called in to go over the log files in more detail, and network engineers are asked to jump on devices to make sure they are responding properly.

What’s missing from the above scenario? Nobody looked up the last change that went into production. In our experience, 90+% of the problems in production are caused by the latest change, be it a code release, a firewall change, or DDL or DML applied to the database. And it’s a sure bet that the latest change is the problem if the person who made it says “That couldn’t have caused the outage.” In fact, there is probably a high degree of correlation between how emphatically they make that statement and the probability that their change is the cause of the incident.

Just the other day one of our friends had an outage call where the network security team was arguing that their latest change could not possibly have caused the outage. Guess what caused the outage? That’s right: the firewall change.

So, how do you solve 90+% of your problems in less than 5 minutes? You immediately roll back the last change you made to your production environment. You might be saying to yourself “But how can I do that when I don’t know all the changes that are happening in my production environment?” And that (as Paul Harvey used to say) is the rest of the story.

You have to keep track of every single change that takes place in your production environment. This is called “change tracking” and is different from “change management”. Change tracking is simply keeping track, in any format, of all the changes that happen in production. These changes can be kept in a Word document, spreadsheet, database, IRC channel, or even an unmonitored email account. Anything will do as long as it 1) allows fast entry, so people have no excuse not to use it, and 2) can be retrieved immediately when needed during an outage.
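As a rough illustration of those two requirements, here is a minimal sketch of a change tracker in Python. It assumes a flat file with one JSON record per line; the file name, the field names, and the functions `log_change` and `last_changes` are all made up for this example, and any of the formats listed above would serve just as well.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def log_change(path, who, system, what):
    """Append one change as a single JSON line (fast entry)."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "who": who,
        "system": system,
        "what": what,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def last_changes(path, n=5):
    """Return the n most recent changes, newest first (fast retrieval)."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return entries[-n:][::-1]


if __name__ == "__main__":
    # Hypothetical log location; a shared path would be used in practice.
    path = os.path.join(tempfile.mkdtemp(), "changes.jsonl")
    log_change(path, "alice", "app-servers", "deployed new release")
    log_change(path, "bob", "edge-firewall", "added IP range to ACL")
    # During an outage: what changed most recently?
    for change in last_changes(path, n=2):
        print(change["when"], change["who"], change["system"], change["what"])
```

The first entry printed is the most recent change, which is the first rollback candidate when an outage call starts.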



Log Every Change

It's 5 PM, do you know what the last 4 changes in your production environment were? You'd better!

In well-run technology organizations, any event that has the potential to impact customers will trigger an alert that brings a cross-disciplinary team together, in person or on the phone, to start troubleshooting the potential (or actual) problem. Ideally the person responsible for running the incident management and problem resolution process will ask what most recently changed and then listen (or read) as the operations team reads (or displays) the change log. We often joke that you only need to wait for someone to say “Yeah, but that change couldn’t possibly have caused this issue” to find the root cause and fix the problem.

In our experience, changes are one of the most common causes of customer- and revenue-impacting issues. Sometimes these changes are feature enhancements or functionality additions, and sometimes they are infrastructure or architectural changes. Very often, they are simple configuration changes like the addition of a range of IP addresses to an access control list, or a modification of DNS. In some companies, these changes (defined as any modification to a production environment other than one made by the actual software or system itself) happen at a rate of several thousand per day. It is virtually impossible to track them unless a change logging system is put in place. Very often, it is the undocumented change, which is therefore difficult to isolate and roll back, that costs the company the greatest downtime or revenue.

Too many companies allow too many changes to go undocumented. The most commonly cited reason for a lack of change logging is that it simply takes too long to log each and every change. But change logging doesn’t have to be cumbersome, and it need not always include the notion of risk management inherent to a change management system. Just the logging of a change for later identification can save between hundreds and millions of dollars of revenue and hundreds or thousands of customers, especially in a SaaS environment. Something as simple as always logging the time, date, reason for the change, person making the change, and system being modified can make a world of difference. Many web-enabled tools offered by companies like Service-Now make such logging very simple. Most tools offer SMTP interfaces that allow people to make a change and email it to the system. For a minute or two of time per change, hours can be saved in customer impact.
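To make the email route concrete, the sketch below builds a change-log message carrying the five fields mentioned above, using Python's standard `email.message.EmailMessage`. The address `changes@example.com`, the subject format, and the function `build_change_email` are assumptions made for illustration; they do not describe the interface of any particular tool.

```python
from datetime import datetime, timezone
from email.message import EmailMessage

# Hypothetical mailbox that the change logging system reads.
CHANGE_LOG_ADDRESS = "changes@example.com"


def build_change_email(person, system, reason, when=None):
    """Build an email recording time, date, reason, person, and system.

    `when` should be a timezone-aware datetime; it defaults to now.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    msg = EmailMessage()
    msg["From"] = person
    msg["To"] = CHANGE_LOG_ADDRESS
    msg["Subject"] = f"CHANGE {system}: {reason}"
    msg.set_content(
        f"Date: {when:%Y-%m-%d}\n"
        f"Time: {when:%H:%M:%S} UTC\n"
        f"Person: {person}\n"
        f"System: {system}\n"
        f"Reason: {reason}\n"
    )
    return msg


if __name__ == "__main__":
    msg = build_change_email(
        person="alice@example.com",
        system="edge-firewall",
        reason="added IP range to access control list",
    )
    print(msg["Subject"])
    print(msg.get_content())
```

Sending the message would be one more call, for example `smtplib.SMTP(host).send_message(msg)` with whatever mail host your environment uses; that step is left out here so the sketch runs without a mail server.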

Log your changes – every change, every time.

