One of the key components of high availability is the proper management of risk. Stopping all changes to a site will obviously improve availability in the short run. In the long run, however, no new features get deployed, so customers stop coming, and no required maintenance gets done, so you end up with more downtime fixing what preventative maintenance could have handled. Freezing changes is therefore not a good long-term strategy for improving availability. Luckily, we have found, and proven to ourselves, that it is possible to dramatically improve a site’s availability simply by understanding the riskiness of changes. Once you take the time to classify changes on a scale of risk and apply some basic rules, risk management becomes a very useful tool in the quest for uptime.
The tool that we recommend comes originally from the military and the space program, although we were taught it as part of Six Sigma. It is called a Failure Mode and Effects Analysis, or FMEA (pronounced fee’-ma). The first step is to identify the ways in which the code or application can fail, or in our terminology the “failure modes”. We typically asked each engineer or product manager to come up with three to five failure modes for each feature. Once these are gathered, the team should rate each failure mode using three questions: how severe is the impact of the failure (severity), how detectable is it if it fails (detectability; yes, we made up that word), and what is the probability that the failure will occur (likelihood)? We recommend using a scale of 1, 3, and 9, because it provides an exponential weighting that helps identify the riskier items. The three scores are then multiplied together to get a total risk score between 1 and 729, and the failure modes are ranked from highest risk to lowest. Here is an example risk matrix.
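The scoring itself is easy to automate. What follows is a minimal Python sketch of it, not a reproduction of our matrix: the class name FailureMode, the function rank_failure_modes, and the three sample failure modes with their ratings are all invented for illustration. It also assumes that a detectability rating of 9 means the failure is hardest to detect, so that a higher number always means more risk.

```python
from dataclasses import dataclass
from typing import Iterable, List

# The 1/3/9 scale recommended in the text.
VALID_RATINGS = (1, 3, 9)


@dataclass(frozen=True)
class FailureMode:
    """One way a feature can fail, rated on the 1/3/9 scale (hypothetical type)."""
    feature: str
    description: str
    severity: int
    detectability: int  # assumption: 9 = hardest to detect, 1 = easiest
    likelihood: int

    def __post_init__(self) -> None:
        # Reject any rating that is not on the 1/3/9 scale.
        for name in ("severity", "detectability", "likelihood"):
            value = getattr(self, name)
            if value not in VALID_RATINGS:
                raise ValueError(f"{name} must be 1, 3, or 9, got {value}")

    @property
    def risk_score(self) -> int:
        # Total risk is the product of the three ratings.
        return self.severity * self.detectability * self.likelihood


def rank_failure_modes(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Return the failure modes ordered from highest risk score to lowest."""
    return sorted(modes, key=lambda m: m.risk_score, reverse=True)


# Illustrative data only; these are not real ratings.
example_modes = [
    FailureMode("search", "results page renders slowly", 1, 3, 3),
    FailureMode("signup", "confirmation email is not sent", 3, 9, 3),
    FailureMode("checkout", "payment call times out", 9, 1, 3),
]

ranked = rank_failure_modes(example_modes)

if __name__ == "__main__":
    for mode in ranked:
        print(f"{mode.risk_score:>4}  {mode.feature}: {mode.description}")
```

Multiplying rather than adding the ratings is what makes the 1/3/9 scale pull the risky items apart: a failure mode that is severe but also hard to detect lands far above one that is merely severe.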
So, once you have developed this matrix, what can you do with it? For starters, the highest-risk items should always have mitigation plans associated with them. These mitigation plans are actions that help lower one or more of the three ratings (severity, detectability, or likelihood). The second thing you can do with the matrix is set a maximum level of risk that you will allow to be placed upon the site in any given time period (one day, one week, or one release are all good intervals). As an example, you might determine, as a guess at first and later through analysis of how past releases performed relative to their risk scores, that 275 is the maximum risk you feel comfortable with for any release or change to the site within one week. You can then only include features in this week’s release whose risk scores total less than 275. Lastly, this should be used in conjunction with other risk mitigation strategies, such as not mixing infrastructure changes with code releases.
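The risk budget check can be sketched the same way. This is a rough illustration under stated assumptions: the 275 limit is the example figure from the paragraph above, while the function names, the feature names, and the per-feature scores are made up. Each feature's score is taken to be the sum of the risk scores of its failure modes.

```python
from typing import Mapping

# Example limit from the text; tune it against your own release history.
MAX_RISK_PER_RELEASE = 275


def total_risk(feature_scores: Mapping[str, int]) -> int:
    """Sum the risk scores of every feature planned for one release."""
    return sum(feature_scores.values())


def fits_risk_budget(
    feature_scores: Mapping[str, int],
    max_risk: int = MAX_RISK_PER_RELEASE,
) -> bool:
    """True when the release total is strictly less than the allowed maximum."""
    return total_risk(feature_scores) < max_risk


# Hypothetical per-feature risk scores for this week's release.
planned_release = {
    "signup": 81,
    "checkout": 27,
    "search": 9,
    "profile page": 81,
}

if __name__ == "__main__":
    print("total risk:", total_risk(planned_release))
    print("within budget:", fits_risk_budget(planned_release))
```

When a release does not fit, the options are to move a feature to a later release or to apply a mitigation that lowers one of its ratings and then rescore it.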