Crisis Management – Normal Accident Theory and High Reliability Theory
The partial meltdown of the TMI-2 reactor at Three Mile Island in 1979 is one of the best-known crises in the United States; it was the subject of several books and at least one movie. It also gave rise to two theories relevant to crisis management.
Charles Perrow’s Normal Accident Theory (NAT), described in his book Normal Accidents, states that the complexity inherent in tightly coupled technological systems makes accidents inevitable. Perrow’s hypothesis is that tight coupling allows interactions to escalate rapidly and without obstruction; “normal” is a nod to the inevitability of such accidents.
Todd LaPorte, who founded the Berkeley school of High Reliability Theory, believes there are organizational strategies that can achieve high reliability even in the face of such tight coupling. The two theories have been debated for quite some time. While their authors don’t completely agree on how they can coexist (LaPorte believes they are complementary, while Perrow believes they are useful for the purposes of comparison), we believe there is something to be gained from both.
One paradox arising from these debates is intuitively obvious in our pursuit of highly available and highly scalable systems: the better we are at building systems that avoid problems and crises, the less practice we get in solving them. Because the practice of resolving failures is critical to our learning, we become more and more inept at rapidly resolving failures as their frequency decreases. Therefore, as we get better at building fault-tolerant and scalable systems, we get worse at resolving the crisis situations that are almost certain to happen at some point.
Weick and Sutcliffe offer a solution to this paradox that we paraphrase as “organizational mindfulness”. They identify five practices for developing this mindfulness:
1) Preoccupation with failure. This practice is all about monitoring IT systems and reporting errors in a timely fashion. Success, they argue, narrows perceptions and breeds overconfidence. To combat the resulting complacency, organizations need complete transparency into system faults and failures. Reports should be widely distributed and discussed frequently, such as in our oft-recommended “operations review” process outlined in The Art of Scalability.
2) Reluctance to simplify interpretations. Take nothing for granted and seek input from diverse sources. Don’t try to box failures into expected behavior, and act with a healthy dose of paranoia.
3) Sensitivity to operations. Look at detailed data at the minute level, as we’ve suggested in our posts on monitoring. Use real-time data, and make ongoing assessments and continual updates of this data. We think our book and our post on monitoring strategies have some good suggestions on this topic.
4) Commitment to resilience. Build excess capability by rotating positions and training your people in new skills. Former members of eBay’s operations team can attest that DBAs, SAs, and network engineers used to be rotated through the operations center to do just this. Furthermore, once fixes are made, the organization should quickly return to a state of preparedness for the next situation.
5) Deference to expertise. During crisis events, shift the leadership role to the person possessing the greatest expertise to deal with the problem. Our book also suggests creating a competency around crisis management, such as a “technical duty officer” in the operations center.
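To make the “sensitivity to operations” practice concrete, here is a minimal sketch of the kind of minute-level, real-time assessment described above: a sliding window of per-minute error counts whose aggregate error rate triggers an alert when it crosses a threshold. The class name, window size, and threshold are our own illustrative assumptions, not anything prescribed by Weick and Sutcliffe or our book.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sketch: track per-minute error counts over a sliding
    window and flag when the windowed error rate exceeds a threshold."""

    def __init__(self, window_minutes=5, threshold=0.05):
        # Each entry is one minute's (errors, requests); deque(maxlen=...)
        # silently drops the oldest minute once the window is full.
        self.window = deque(maxlen=window_minutes)
        self.threshold = threshold

    def record_minute(self, errors, requests):
        """Append one minute's worth of counts to the window."""
        self.window.append((errors, requests))

    def error_rate(self):
        """Aggregate error rate across the current window."""
        errors = sum(e for e, _ in self.window)
        requests = sum(r for _, r in self.window)
        return errors / requests if requests else 0.0

    def should_alert(self):
        """True when the windowed error rate crosses the threshold."""
        return self.error_rate() > self.threshold


# Example: a sudden burst of errors pushes the windowed rate past 5%.
monitor = ErrorRateMonitor(window_minutes=3, threshold=0.05)
monitor.record_minute(1, 100)   # quiet minute
monitor.record_minute(2, 100)   # quiet minute
monitor.record_minute(20, 100)  # burst of failures
print(monitor.should_alert())   # prints True (23/300 ≈ 7.7% > 5%)
```

In a real deployment the counts would come from live telemetry rather than manual calls, but the point stands: the window must be short (minutes, not hours) for the organization to notice trouble while it is still small.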
We would add that every operations team should use every failure as a learning opportunity, especially in those environments where failures are infrequent. A good way to do this is to leverage the post-mortem process.