AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

How to Reduce Risk

We’ve written about risk before but wanted to revisit the topic. People and organizations generally approach risk management or mitigation from one perspective: reducing the probability of failure. With systems, we typically do this by testing extensively. While this is useful, there is only so much that can be found by testing in a simulated environment or with simulated users. In addition to testing, a company should consider the duration of a failure and the percentage of customers impacted, as shown in the figure below.
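One way to see why all three factors matter is to combine them into a single number. The sketch below is only an illustration: it assumes the factors multiply (the article does not state a formula), and the function name and the sample figures are hypothetical.

```python
def expected_impact(probability, duration_minutes, fraction_impacted):
    """Rough per-release risk score.

    ASSUMPTION: risk is modeled as the product of the three factors.
    The article names the factors but does not give a formula.
    """
    for name, value in (("probability", probability),
                        ("fraction_impacted", fraction_impacted)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    if duration_minutes < 0:
        raise ValueError("duration_minutes must not be negative")
    return probability * duration_minutes * fraction_impacted


# Hypothetical numbers, chosen only to compare the three levers.
baseline = expected_impact(0.10, 60, 1.0)         # no mitigation
faster_recovery = expected_impact(0.10, 10, 1.0)  # shorter failure duration
isolated = expected_impact(0.10, 60, 0.25)        # fewer customers impacted
```

Under this assumption, shortening the failure or limiting who is affected lowers the score just as reducing the probability of failure does, which is the point of looking beyond testing alone.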

Let’s go through each of the items and identify what specifically you can do to accomplish them.

  • Payload Size – The smaller the change, the less risk. This is the concept behind continuous deployment, where every code commit is released to production, assuming it passes the automated build and test processes. While continuous deployment isn’t right for every organization, the concept of shipping smaller, more frequent releases to reduce risk is applicable to everyone.
  • Testing – As we stated before, you cannot test quality into a system, and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. Responsibility for the quality of a feature resides with the engineer and begins with unit tests. Test-driven development (TDD) is the process of writing a failing automated test and then writing or modifying code in order to pass that test. There are mixed opinions about the pros and cons of TDD, but anything that makes an engineer write more unit tests is likely to improve quality.
  • Monitoring – The key to monitoring is to select a few key business metrics to watch. Per our earlier article, you first need to determine whether there IS a problem, and then use the other monitoring tools we are used to, such as Nagios, New Relic, and Cacti, to determine WHERE the problem is and WHAT it is.
  • Rollback – We have written so much about the importance of being able to roll back code that I’m not sure what more I can add, except that, having lived through two major outages without the ability to roll back, I will never again push code that can’t be rolled back.
  • Architecture – Splitting your applications and databases along the Y and Z axes allows parts of the service to continue functioning should one part fail. This swim lane, or fault isolation, approach provides greater availability for your overall service.
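To make the swim lane idea from the Architecture item concrete, here is a minimal sketch of a Z-axis split, where each customer is pinned to one isolated lane. The lane names, the hashing scheme, and both function names are assumptions made for illustration, not a description of any particular system.

```python
import hashlib

# Hypothetical swim lanes; each would have its own application and database stack.
LANES = ["lane-a", "lane-b", "lane-c", "lane-d"]


def lane_for_customer(customer_id, lanes=LANES):
    """Map a customer to one lane using a stable hash (Z-axis split)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(lanes)
    return lanes[index]


def fraction_impacted(customer_ids, failed_lane, lanes=LANES):
    """Share of the given customers whose lane is the one that failed."""
    ids = list(customer_ids)
    if not ids:
        return 0.0
    hit = sum(1 for cid in ids if lane_for_customer(cid, lanes) == failed_lane)
    return hit / len(ids)


customers = [f"customer-{n}" for n in range(1000)]
share = fraction_impacted(customers, "lane-b")
```

With four lanes, a failure in one lane should affect roughly a quarter of customers rather than all of them, although the exact share depends on how the hash distributes the IDs. A Y-axis split would instead route by function or service, so a failure in one function leaves the others running.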

3 comments

  • Moon Loggins

    in October 12th, 2012 @ 20:47

    It’s actually a great and useful piece of information. I’m satisfied that you simply shared this useful info with us. Please stay us informed like this. Thank you for sharing.

  • Kandace Cranford

    in October 25th, 2012 @ 10:50

    I needed to draft you a very little note to be able to say thank you over again with your striking techniques you’ve documented at this time. It has been open-handed of you to grant extensively what exactly most of us could possibly have made available for an electronic book to help with making some cash for their own end, especially seeing that you could possibly have tried it if you ever decided. These guidelines additionally served as the great way to realize that other individuals have the same dreams just as my own to grasp lots more in terms of this problem. I’m sure there are lots of more pleasurable opportunities up front for individuals who scan your blog post.

  • The End of Scalability?

    in February 26th, 2013 @ 22:46

    […] Risk As we’ve written about before, risk has several high-level components (probability of an incident, duration, and % of customers […]