Incidents and Problems
On 19 April 1951, MacArthur gave a farewell speech to Congress upon being relieved of his command in Korea. It included the following: “But once war is forced upon us, there is no other alternative than to apply every available means to bring it to a swift end. War’s very object is victory, not prolonged indecision. In war there is no substitute for victory.” Reading this recently, I was reminded of how tech teams should approach service outages. Too often teams get confused about the priority of restoring service versus finding the root cause. We will be the first ones to tell you that you need to instill a culture of excellence that does not allow mistakes or issues to happen twice. However, during the outage, the first priority should be to restore service as quickly as possible. If you have time to gather data, like core dumps, that later will be valuable for determining root cause, great, but focus on getting the site or service restored.
The Information Technology Infrastructure Library (ITIL) does a great job explaining the difference between what it refers to as Incidents and Problems. An Incident is “an event which is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services…”, while a Problem is “the unknown root cause of one or more existing or potential Incidents.” ITIL has a different process for managing each. The goal of Incident Management is to “restore normal operations as quickly as possible…” while the goal of Problem Management is “to minimize the impact of problems…”
As you can imagine, there is often conflict between these two goals. A possible solution offered by ITIL is to form a plan of attack for the next occurrence of the problem that outlines the following:
- What diagnostics to collect
- How long to allow for diagnostics before service is restored
- What resources (people, process, and technology) to prepare before the incident
- How to communicate the plan to the stakeholders
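The first two items of such a plan can be captured in a small runbook script. What follows is only a rough sketch, not anything ITIL prescribes: the function name `handle_incident`, its parameters, and the idea of passing diagnostics as an ordered list of callables are all assumptions made for illustration. It collects diagnostics in priority order until a time budget runs out, then restores service no matter what.

```python
import time
from typing import Callable, Dict, List, Tuple


def handle_incident(
    diagnostics: List[Tuple[str, Callable[[], str]]],
    restore: Callable[[], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Collect diagnostics within a time budget, then restore service.

    All names here are hypothetical. `diagnostics` is ordered from most
    to least valuable, so the budget is spent on the best data first.
    """
    deadline = clock() + budget_seconds
    collected: Dict[str, str] = {}
    try:
        for name, collect in diagnostics:
            # The budget is checked between steps only; a collector that
            # is already running is not interrupted by this sketch.
            if clock() >= deadline:
                break
            try:
                collected[name] = collect()
            except Exception as exc:
                # A failed diagnostic must never delay the restore.
                collected[name] = f"failed: {exc}"
    finally:
        # Restoring service is the first priority, so it always runs.
        restore()
    return collected
```

In practice each collector might grab a core dump or a thread dump, and `restore` might restart the service or fail over. The point of the design is that the time allowed for diagnostics is decided before the incident, not argued about during it.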
If you like this topic, you’ll enjoy Chapters 8 and 9 of The Art of Scalability, where the management of issues and crises is discussed in detail.
