We’ve written about risk before but wanted to revisit the topic. Generally people and organizations approach risk management or mitigation from one perspective, reducing the probability of failure. We typically do this with systems by testing extensively. While this is useful there is only so much that can be found by testing in a simulated environment or with simulated users. In addition to testing, the company should consider the duration of the failure and the percentage of customers impacted, as show in the figure below.
Let’s go through each one of the items and identify what specifically you can do to accomplish these.
- Payload Size – The smaller the change, the less risk. This is the concept behind continous deployment where every code commit is released to production, assuming it passes the automated build and test processes. While continuous deployment isn’t right for every organization the concept of releasing smaller more frequent releases to reduce risk is applicable to everyone.
- Testing – As we stated before, you cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. Responsibility for the quality of a feature resides with the engineer and begins with unit tests. Test driven development is the process of writing a failing automated test and then writing or modifying code in order to pass that test. There are mixed opinions about the pros and cons of TDD but anything that makes an engineer write more unit tests is likely to improve quality.
- Monitoring – The key to monitoring is to select a few key business metrics to monitor. Per our earlier article first you need to determine if there IS a problem and then use all the other monitoring that we are used to like Nagios, New Relic, Cacti, etc to determine WHERE and WHAT is the problem.
- Rollback – As much as we’ve written on the importance of being able to rollback code I’m not sure more that I can add except having lived through two major outages without the ability to rollback I will never push code again that can’t be rolled back.
- Architecture – By splitting your applications and database along Y and Z axes will allow parts of the service to continue functioning should one part fail. This swim lane or fault isolation approach will provide greater availability for your overall service.