Everything fails! This is a mantra that we are always espousing at AKF. At some point, these failures will manifest themselves as an outage. In a SaaS world, restoring service as quickly as possible is critical. It requires having the right people available and being able to communicate with them effectively. A lack of good communications can cause an incident to drag on.
For startups and smaller companies, problems with communications during incidents are less of an issue. Systems tend to be smaller or monolithic. Teams supporting these systems also tend to be small. When something happens, everyone jumps on a call to figure out the problem. As companies grow, the number of people needed to resolve an incident grows. Coordinating communications between a large group of people becomes difficult. Adding to the chaos are executives joining the conference bridges demanding updates about service restoration.
In order to minimize the time to restore a system during an incident, companies need the right people on the call. For large, complex systems, identifying the right resources to solve a problem can be difficult. We recommend swarming an issue with everyone who could be needed to resolve an incident, and then releasing those who are no longer needed. But with such a large number of people, it can be difficult to coordinate communications, especially on a single conference call bridge.
Managing the communications of a large group of people working an incident is critical to minimizing the restoration time. We recommend a communication method that many of us at AKF learned in the military. It involves using multiple voice and chat channels to coordinate work and the flow of information. Before we get into the details of managing communications, we need to first look at the leadership required to effectively work the incident.
Technical Incident Manager and Incident Communications Manager
Managing a large incident is usually too much for a single individual. She cannot coordinate the work of resolving the incident while also reporting status to, and answering questions from, executives eager to know what is going on. We recommend that companies manage incidents with two people. The first person is the individual who is responsible for directing all activities geared towards restoration of service. We call this person the Technical Incident Manager. This individual’s main job is to reduce the mean time to restoration. She needs an overall architectural knowledge of the product and systems to direct the work. She is responsible for leading the call and for de-escalating, releasing people from the call once diagnosis shows who needs to be involved. She identifies and diagnoses the service issues and engages the appropriate subject matter experts to assist in restoration.
The second individual is the Incident Communications Manager. He is responsible for supporting the Technical Incident Manager by listening to the technical resolution chatter and summarizing it for a non-technical audience. His focus is on communications speed, quality, and accuracy. He is the primary communications channel for both internal and external messaging. He owns the incident communications process.
Incident Communications Process
This process involves using multiple communication channels to control information and work performed. The first channel established is the Control Channel. It consists of a conference bridge and a chat channel. The Technical Incident Manager controls both. The second channel created is the Status Channel. It also has a voice bridge and a chat channel. The Incident Communications Manager is responsible for managing this channel.
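The channel layout described above can be written down as a small data structure. The following is only an illustrative sketch in Python: the class name, the function name, and the bridge and chat room identifiers are all hypothetical, and a real setup would point at whatever conferencing and chat tools a company already uses.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Channel:
    """One logical channel: a voice bridge paired with a chat room."""
    name: str
    owner: str         # role that manages the channel
    voice_bridge: str  # hypothetical identifier, not a real dial-in
    chat_room: str     # hypothetical identifier, not a real chat room


def build_incident_channels(incident_id: str) -> Dict[str, Channel]:
    """Create the two channels, each with its own voice and chat component."""
    return {
        "control": Channel(
            name="Control Channel",
            owner="Technical Incident Manager",
            voice_bridge=f"{incident_id}-control-voice",
            chat_room=f"{incident_id}-control-chat",
        ),
        "status": Channel(
            name="Status Channel",
            owner="Incident Communications Manager",
            voice_bridge=f"{incident_id}-status-voice",
            chat_room=f"{incident_id}-status-chat",
        ),
    }


channels = build_incident_channels("inc-001")
for channel in channels.values():
    print(f"{channel.name}: managed by the {channel.owner}")
```

Giving each channel exactly one owning role mirrors the split of responsibilities between the two managers.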
The Control Channel is used for all communication related to the restoration of service. People use the voice channel only for immediate communication: to announce work that is occurring or to address questions that need an immediate answer. Details of the work being performed go in the chat channel. This reduces the chatter on the voice channel to command and control messages. The chat channel also serves as a record of actions taken that can be referenced in the post mortem/RCA process. If specific teams need to discuss the work they are performing, separate voice and chat breakout channels are created for them. They move off the main channel into their breakout channels to perform the work. The leader of each team periodically communicates status back up to the Control Channel.
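The voice-versus-chat rule on the Control Channel can be illustrated with a short sketch. This is a simplified model under stated assumptions, not a description of any particular tool: the three message kinds, the class name, and the method name are hypothetical labels chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Kind(Enum):
    """Hypothetical message categories for the Control Channel."""
    COMMAND = "command"    # direction or announcement of work in progress
    QUESTION = "question"  # question that needs an immediate answer
    DETAIL = "detail"      # detailed notes on the work being performed


@dataclass
class ControlChannel:
    # The chat log doubles as the record used later in the post mortem.
    chat_log: List[str] = field(default_factory=list)

    def post(self, sender: str, kind: Kind, text: str) -> str:
        """Return the medium a message belongs on: 'voice' or 'chat'."""
        if kind is Kind.DETAIL:
            self.chat_log.append(f"{sender}: {text}")
            return "chat"
        # Commands and immediate questions stay on the voice bridge.
        return "voice"


control = ControlChannel()
print(control.post("Technical Incident Manager", Kind.COMMAND,
                   "Database team, fail over to the replica."))
print(control.post("Database lead", Kind.DETAIL,
                   "Promoted the replica and repointed the application."))
print(control.chat_log)
```

In practice the people on the call make this choice themselves; the sketch only makes explicit which kind of message goes where and why the chat log ends up as the written record.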
As the work progresses, the Incident Communications Manager monitors the Control Channel to provide the basis for his messaging. He formulates updates that he delivers over the Status Channel's voice bridge and chat channel. He keeps executives and customers informed of progress and status, which keeps the Control Channel free of requests for frequent updates and dedicated to restoring service.
This method of communications has worked well in the military for years and has been adopted by many large companies to manage their incident communications. While it is overkill for small companies, it becomes an effective process as companies grow and systems become more complex.