We define risk as being composed of three components: the severity of impact, the probability of occurrence, and the ability to detect the event. Because an event that is harder to detect carries more risk, the third component is scored as the inability to detect. The combination of these provides an overall risk assessment of a failure mode, that is, a manner in which something can fail. We often apply this formula to code releases or, for a more granular approach, to individual features within a release. By identifying the ways (failure modes) in which a feature could fail in production and scoring each one on these three factors, we can quantify how much risk we have. We often teach this technique as an FMEA (failure mode and effects analysis).
Risk = severity * probability * inability to detect
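As a rough illustration, the formula can be applied to a list of failure modes to rank them. This is only a sketch: the 1 to 10 scale for each factor is an assumption (the formula itself does not fix a scale), and the FailureMode class, the rank_by_risk helper, and the example failure modes and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One way a feature could fail, scored on three factors.

    Assumption: each factor is scored from 1 (best) to 10 (worst).
    """
    description: str
    severity: int         # 1 = negligible impact, 10 = catastrophic
    probability: int      # 1 = very unlikely, 10 = nearly certain
    undetectability: int  # 1 = noticed immediately, 10 = effectively invisible

    def risk(self) -> int:
        # Risk = severity * probability * inability to detect
        for value in (self.severity, self.probability, self.undetectability):
            if not 1 <= value <= 10:
                raise ValueError("scores must be between 1 and 10")
        return self.severity * self.probability * self.undetectability


def rank_by_risk(modes):
    """Return the failure modes ordered from highest to lowest risk."""
    return sorted(modes, key=lambda mode: mode.risk(), reverse=True)


# Hypothetical failure modes for a single release.
modes = [
    FailureMode("signup email is delayed", 3, 5, 2),
    FailureMode("checkout fails for some browser configurations", 8, 3, 4),
]

for mode in rank_by_risk(modes):
    print(mode.risk(), mode.description)
```

Ranking by the product rather than by any single factor shows where mitigation effort is best spent, whether that is lowering the probability, the severity, or the inability to detect.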
The most typical approach to risk mitigation has been to reduce the probability of failure. This is often done through rigorous requirements definition, thorough design reviews, and, most often, lots of testing. The problem with this approach is that no matter how good we are, bugs will slip through. Users have different configurations than we have in our test environments, or they use the product differently than we expect. A more recent approach to reducing the probability has been to deploy smaller changes. This is done by shortening development cycles or sprints to 1-2 weeks or, in the extreme, by employing continuous deployment. The theory is that the smaller the release, the fewer the changes, and thus the lower the probability of failure.
A different approach to risk reduction and mitigation is to reduce the severity factor. There are several ways in which we can attempt to do this. The first is through the use of monitoring. By monitoring business metrics (checkouts, listings, signups, etc.) we can quickly identify whether there is a problem. Continuous deployment requires a rigorous approach to monitoring in order to quickly identify the problem and roll back the changes, thus reducing the severity or impact of the problem to your customers.
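One simple way to turn a business metric into a rollback signal is to compare its current rate against a baseline. The sketch below assumes the metric is a per-minute count such as checkouts; the function names and the 20% threshold are hypothetical and would need tuning for a real product.

```python
def relative_drop(baseline: float, observed: float) -> float:
    """Fraction by which the observed rate has fallen below the baseline.

    Returns 0.0 when the metric is at or above the baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - observed) / baseline)


def should_roll_back(baseline: float, observed: float, max_drop: float = 0.20) -> bool:
    """Signal a rollback when the metric falls more than max_drop below baseline.

    Assumption: max_drop=0.20 (a 20% drop) is an arbitrary illustrative threshold.
    """
    return relative_drop(baseline, observed) > max_drop


# Hypothetical readings: checkouts per minute before and after a release.
if should_roll_back(baseline=100.0, observed=70.0):
    print("checkouts dropped sharply: roll back the release")
```

In practice the baseline would usually account for time of day and day of week, since business metrics rarely hold steady.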
Another approach to reducing the severity is to push code changes to a small set of your users. Ideally this should be done through “swim lanes” but it can also be accomplished manually. In a process that we call “incremental rollout”, you deploy new code to a small set of your servers (1-5%) and watch for issues. Once you’re satisfied that there are no issues, roll to a larger set of your servers. Continue this “roll, pause, and observe” cycle until you have the release completely deployed. Teams that employ this strategy often take days to deploy code changes, but by doing so they have much less risk of a customer-impacting problem.
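The “roll, pause, and observe” cycle can be sketched as a small loop. This is an illustration under stated assumptions, not a deployment tool: the deploy_to, is_healthy, and roll_back callables are hypothetical stand-ins for whatever your deployment and monitoring systems provide, and the stage fractions are examples only.

```python
from typing import Callable, Sequence


def incremental_rollout(
    stages: Sequence[float],
    deploy_to: Callable[[float], None],
    is_healthy: Callable[[], bool],
    roll_back: Callable[[], None],
) -> bool:
    """Deploy in growing stages, stopping and rolling back on the first problem.

    stages: fractions of the server fleet, smallest first (e.g. 0.05 for 5%).
    Returns True if every stage was deployed and observed to be healthy.
    """
    for fraction in stages:
        deploy_to(fraction)  # roll
        # pause: a real rollout would wait here while metrics accumulate
        if not is_healthy():  # observe
            roll_back()
            return False
    return True


# In-memory stand-ins that record what would have happened.
events = []
completed = incremental_rollout(
    stages=[0.05, 0.25, 1.0],
    deploy_to=lambda fraction: events.append(("deploy", fraction)),
    is_healthy=lambda: True,
    roll_back=lambda: events.append(("rollback", None)),
)
print(completed, events)
```

Passing the deploy, health-check, and rollback steps in as functions keeps the loop independent of any particular tooling, so the same cycle applies whether the stages are rolled manually or through swim lanes.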
There are lots of ways to approach risk mitigation and we should continue to add these approaches to our toolkits. Some approaches work better than others based on the team culture and product or service being offered.