AKF Partners

Abbott, Keeven & Fisher Partners | Partners In Hyper Growth


P-I-C Process for Issue Prioritization

The separation of problems and incidents within SaaS products is critical to success. But to truly maximize value, you must also add an evaluation of the cost or impact of incidents.

As we describe in our book and as outlined in the ITIL toolkit, all organizations can benefit greatly from the separation of Incidents and Problems. Incidents are customer-impacting events in your production environment, or as the ITIL defines them, “an event which is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services and Customer productivity”. Problems are the causes of one or more incidents.

The separation of these is important, as most of us wish to quickly resolve incidents (reduce or minimize customer impact) while permanently resolving the underlying problems causing them. The actions we take to resolve an incident may include workarounds or band-aids that restore service while the team works to eliminate the root cause of the problem. We strive to restore service in whatever way possible, as quickly as possible, while working to find the true root cause of the service disruption.

There is another important piece we typically recommend to our clients: map incidents to customer complaints or customer cost. This cost may include the real cost of handling customer contacts through phone, chat, and email. It should also include the risk of customer departure, the engineering cost of workarounds or permanent fixes, overall customer satisfaction, and the opportunity cost of working on fixes versus other revenue-enhancing features.

We know that a problem may cause one or more incidents and that an incident might be caused by one or more problems. But that information alone isn’t enough to prioritize, with limited resources, what we attack first in short-, medium-, and long-term product and architecture changes. Because not every incident costs us the same, we need to identify which 20% of our problems drive 80% of our incident cost (assuming the Pareto Principle applies). At the very least, we should be working on those incidents and associated problems that are high in customer cost and risk relative to other incidents and problems.

By adding Customer Cost (the “C” in the P-I-C process) to our morning operations meetings and evaluating it alongside incidents and their problems, we can make better decisions. Classifying the severity of each incident by this “C” and using that classification to drive effort and resolution aligns your engineering operations with your business objectives.
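To make the idea concrete, here is a minimal sketch of cost-weighted problem prioritization. The data model, identifiers, and dollar figures are illustrative assumptions, not a prescribed format: each incident carries an estimated customer cost, each problem maps to the incidents it caused, and problems are ranked by total cost.

```python
# A sketch of cost-weighted problem prioritization; the data model, IDs, and
# dollar figures are illustrative assumptions, not a prescribed format.

# incident_id -> estimated customer cost (contacts, churn risk, lost revenue)
incidents = {
    "INC-101": 12_000,  # checkout errors: support contacts plus lost orders
    "INC-102": 3_500,   # slow search results
    "INC-103": 9_000,   # login failures for a subset of customers
}

# problem -> incidents it caused (a problem may cause one or more incidents)
problems = {
    "PRB-1 (db connection pool exhaustion)": ["INC-101", "INC-103"],
    "PRB-2 (unoptimized search query)": ["INC-102"],
}

def rank_problems_by_cost(problems, incidents):
    """Return (problem, total customer cost) pairs, highest cost first."""
    totals = {p: sum(incidents[i] for i in incs) for p, incs in problems.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# The top of this list is where short- and medium-term effort should go first.
for problem, cost in rank_problems_by_cost(problems, incidents):
    print(f"${cost:>8,}  {problem}")
```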



Log Every Change

It's 5 PM, do you know what the last 4 changes in your production environment were? You'd better!

In well-run technology organizations, any event that has the potential to impact customers will trigger an alert that brings a cross-disciplinary team together, in person or on the phone, to start troubleshooting the potential (or actual) problem. Ideally the person responsible for running the incident management and problem resolution process will ask what most recently changed and then listen (or read) as the operations team reads (or displays) the change log. We often joke that you only need to wait for someone to say “Yeah, but that change couldn’t possibly have caused this issue” to find the root cause and fix the problem.

In our experience, changes are one of the most common causes of customer- and revenue-impacting issues. Sometimes these changes are feature enhancements or functionality additions, and sometimes they are infrastructure or architectural changes. Very often, they are simple configuration changes, like the addition of a range of IP addresses to an access control list or the modification of DNS. In some companies, these changes (defined as any modification to a production environment other than that made by the software or system itself) happen at a rate of several thousand per day. It is virtually impossible to track them unless a change logging system is put in place. Very often, it is the change that is undocumented, and therefore difficult to isolate and roll back, that costs the company the greatest downtime or revenue.

Too many companies allow too many changes to go undocumented. The most commonly cited reason for a lack of change logging is that it simply takes too long to log each and every change. But change logging doesn’t have to be cumbersome, and it need not always include the notion of risk management inherent to a change management system. Just logging a change for later identification can save hundreds to millions of dollars in revenue and hundreds or thousands of customers, especially in a SaaS environment. Something as simple as always logging the time, date, and reason for a change, the person making the change, and the system being modified can make a world of difference. Many web-enabled tools offered by companies like Service-Now make such logging very simple. Most tools offer SMTP interfaces that allow people to make a change and email it to the system. For a minute or two of time per change, hours of customer impact can be saved.
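As a rough sketch of how low the friction can be, the snippet below emails a structured change record to a change-log inbox over SMTP, capturing only the fields named above. The addresses, host, and record format are assumptions for illustration, and it presumes an SMTP server is reachable.

```python
# A minimal sketch of low-friction change logging via a hypothetical
# change-log inbox (changelog@example.com) with an SMTP interface.
import getpass
import json
import smtplib
import socket
from datetime import datetime, timezone
from email.message import EmailMessage

def log_change(system: str, reason: str, smtp_host: str = "localhost") -> None:
    """Email a structured change record to the central change log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # time and date
        "person": getpass.getuser(),                          # who made the change
        "source_host": socket.gethostname(),
        "system": system,                                     # what was modified
        "reason": reason,                                     # why it was changed
    }
    msg = EmailMessage()
    msg["From"] = f"{record['person']}@example.com"
    msg["To"] = "changelog@example.com"   # hypothetical change-log inbox
    msg["Subject"] = f"CHANGE: {system}"
    msg.set_content(json.dumps(record, indent=2))
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)

# Example: a minute of effort that may save hours of troubleshooting later.
log_change("edge firewall ACL", "added IP range 10.1.2.0/24 for new partner")
```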

Log your changes – every change, every time.



Crisis Management – Normal Accident Theory and High Reliability Theory

The partial meltdown of TMI-2 at Three Mile Island in 1979 is one of the best-known crisis situations within the US and was the subject of several books and at least one movie. It also generated two theories relevant to crisis management.

Charles Perrow’s Normal Accident Theory (NAT), described in his book Normal Accidents, states that the complexity inherent to tightly coupled technology systems makes accidents inevitable.  Perrow’s hypothesis is that the tight coupling causes interactions to escalate rapidly and without obstruction.  “Normal” is a nod to the inevitability of such accidents.

Todd LaPorte, who founded the Berkeley school of High Reliability Theory, believes that there are organizational strategies to achieve high reliability even in the face of such tight coupling. The two theories have been debated for quite some time. While the authors don’t completely agree on how they can coexist (LaPorte believes they are complementary, while Perrow believes they are useful for the purposes of comparison), we believe there is something to be gained from them.

One paradox from these debates becomes intuitively obvious in our pursuit of highly available and highly scalable systems: the better we are at building systems that avoid problems and crises, the less practice we have in solving problems and crises. As the practice of resolving failures is critical to our learning, we become more and more inept at rapidly resolving these failures as their frequency decreases. Therefore, as we get better at building fault-tolerant and scalable systems, we get worse at resolving the crisis situations that are almost certain to happen at some point.

Weick and Sutcliffe have a solution to this paradox that we paraphrase as “organizational mindfulness”. They identify five practices for developing this mindfulness:

1) Preoccupation with failure. This practice is all about monitoring IT systems and reporting errors in a timely fashion. Success, they argue, narrows perceptions and breeds overconfidence. To combat the resulting complacency, organizations need complete transparency into system faults and failures. Reports should be widely distributed and discussed frequently, such as in the oft-recommended “operations review” process outlined within the Art of Scalability.

2) Reluctance to simplify interpretations. Take nothing for granted and seek input from diverse sources. Don’t try to box failures into expected behavior, and act with a healthy bit of paranoia.

3) Sensitivity to operations. Look at detailed data at the minute level, as we’ve suggested in our posts on monitoring. Include the usage of real-time data, and make ongoing assessments and continual updates of this data. We think our book and our post on monitoring strategies have some good suggestions on this topic.

4) Commitment to resilience. Build excess capability by rotating positions and training your people in new skills. Former employees of eBay operations can attest that DBAs, SAs, and Network Engineers used to be rotated through the operations center to do just this. Furthermore, once fixes are made, the organization should quickly return to a state of preparedness for the next situation.

5) Deference to expertise. During crisis events, shift the leadership role to the person possessing the greatest expertise to deal with the problem. Our book also suggests creating a competency around crisis management, such as a “technical duty officer” in the operations center.

We would add that every operations team should use every failure as a learning opportunity, especially in those environments in which failures are infrequent. A good way to do this is to leverage the postmortem process.



VP of Operations

One of the most common questions we get from individuals is “what is the path to becoming a CTO?” We posted about this before and focused on the skill sets required as opposed to the path to get there. We highlighted 1) a good knowledge of business in general, 2) great technical experience, 3) great leadership, 4) great management skills, 5) great communication skills, and 6) a willingness to let go. This time we’re going to look at one of the jobs that is often a stepping stone to the CTO job.

The VP of Operations is the person who leads the Technology Operations or Production Operations team. This team has responsibility for running the hardware and software systems of the company. For SaaS or Web 2.0 companies these are the revenue-generating systems; for corporate IT these are the ERP, CRM, HRM, and similar systems. The team is often comprised of project managers, operations managers, and technical leads. As the head of the Operations team, the VP of Operations has responsibility for monitoring, escalating, managing issues, and reporting on availability, capacity, and utilization. Incident and problem management, as well as root cause analysis (the postmortem), are some of the most important jobs this team accomplishes. To perform this role well, the VP of Operations must have good process skills, a strong leadership presence, the ability to remain calm under fire, and good overall knowledge of the system.

The VP of Operations is often also responsible for the Infrastructure team. This team is usually comprised of system administrators, database administrators, and network engineers. It procures, deploys, maintains, and retires systems. As the head of this team, the VP of Operations is responsible for budgeting and for balancing time between longer-term projects and daily operations on the systems. This team understands the system holistically and is often the most useful when performing scalability summits. To perform this role well, the VP of Operations must have a good understanding of each of the technical roles the team covers, including the databases, operating systems, and the network. This doesn’t mean that to succeed in this role a person must be able to do each of these jobs, but they do need a good, solid understanding in order to converse, brainstorm, debate, and make decisions in each of these technical realms.

If you compare the list of skills we mentioned at the top of this post with those necessary to succeed as the VP of Operations, you’ll see they overlap a good deal. Great technical experience, great leadership, and great management skills will serve you well as the head of operations and will also go a long way toward developing most of the skills you will need as a CTO.

We’re approaching the end of the year, a time that many people and organizations use to reflect on what they have accomplished and what they want to accomplish next year. A good idea as part of your personal growth is to use the list above and score yourself as honestly as possible on each skill. If you’re missing some of them, make sure you have goals in place that help you acquire a few more each year. Do this and not only will you succeed in one of the important jobs that lead to the CTO job, but when you do arrive at the CTO position you will be one of the successful ones.



Splunk

This time we have a guest post from a long-time friend and colleague, Chris Lalonde. During a conversation a couple of weeks ago, Chris told us what he was doing with a product called Splunk to provide monitoring, alerting, and visibility into log files. Given the importance that we place on logging and monitoring, what Chris was doing sounded interesting, and we asked if he would be willing to share a little about his implementation. Chris has over 15 years of experience in information technology and has provided technology solutions for Fortune 1000 companies such as eBay and TD Bank, as well as government agencies like the DoD, RCMP, and USSS. He holds a bachelor of mechanical engineering with a concentration in robotics from Carleton University. Chris also has three patents for authentication systems and several others pending. He was the recipient of the Director’s Award from the United States Secret Service. And now from Chris:

Having worked in technology for 15+ years, I understand how challenging it can be to get visibility across your entire platform; apparently the folks who started Splunk understood this as well. What is Splunk? It’s a tool you can use to collect data from virtually any server in virtually any format and carry out alerting, ad-hoc searches, and large-scale reporting.

You install the Splunk client on each of the servers you want to monitor. Currently Splunk supports Windows, Linux, OS X, Solaris, BSD, and AIX, so the vast majority of platforms are covered. The client typically installs in a few minutes, and if you spend a little time pre-configuring things it’s easy to automate the install. Once it’s installed you have a few options: you can either run a full server on each machine or turn them into lightweight forwarders and consolidate your logs. I’d recommend the latter, since it gives you the log aggregation you more than likely want, and the lightweight clients use fewer resources than the full-blown server. Additionally, you can point traditional data sources at Splunk; if you have lots of syslogs running around, why not point them at your central Splunk server and get them indexed and searchable as well?
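As a rough illustration, a minimal forwarder setup might look something like the stanzas below. The host name, port, index, and paths are assumptions, and the exact syntax may vary by Splunk version, so treat this as a sketch rather than a reference configuration.

```ini
# outputs.conf on each lightweight forwarder: send events to the central server
[tcpout]
defaultGroup = central

[tcpout:central]
server = splunk-central.example.com:9997   # hypothetical central indexer

# inputs.conf on the same forwarder: watch a log directory and tag the events
[monitor:///var/log]
index = main
sourcetype = syslog
```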

Once you’ve got Splunk installed on your servers, the uses are pretty much endless. The simplest and most obvious is system/log monitoring and alerting. You can just point Splunk at a log directory and it will automatically analyze and index all the files in that directory. Splunk will also attempt to parse the elements of the files so you can search on virtually every element of your file. Need to know how many times that URL was called with the variable “foo” in it? Just go into the search app and type in something like “http://” “foo”, and the results are shown in two ways: 1) in a graphical timeline showing the count per unit of time (hours, days, minutes) and 2) in detail below the graph, where you can expand out and literally get the section of the specific log file with those results.
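For a sense of what that might look like as a saved search, here is a hedged sketch in Splunk’s search language; scoping with something like sourcetype=access_combined (an assumption about how your web logs are tagged) narrows the same search to web access logs.

```
"http://" "foo" | timechart span=1h count
```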

That’s the simple version; let’s try something more interesting. Say you automatically roll code to your site and you’d like a way to verify that the code rolling tool did its job. Just point Splunk at the directory your code rolls to and configure Splunk to monitor changes to that directory; once the code is in place, Splunk will see the file changes. Now you can either manually search for those files or have Splunk send you an alert with the list of files that have changed in those directories.
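A sketch of what that monitoring stanza might look like, assuming the file-system change input available in Splunk releases of this era; the path and attribute values are assumptions:

```ini
# inputs.conf: watch the deployment directory for file changes
[fschange:/opt/myapp/releases]    # hypothetical code-roll target directory
pollPeriod = 60                   # check for changes every 60 seconds
fullEvent = true                  # include the full change event details
```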

Not enough? Splunk has a PCI compliance suite that covers all twelve PCI DSS requirements and all 228 sub-requirements, including live controls monitoring, process workflow, checklists, and reporting. How about indexing firewall logs? Yes. How about sending those IDS logs to Splunk? Sure, it’ll swallow those as well. Would you like to get an alert when, say, you suddenly get a large increase in logs from a firewall or an IDS? Sure, no problem.

OK, that’s all great for the system administrators, security folks, and build-and-release teams, but what about the network folks? Splunk has you covered there as well. Have some Cisco gear? Splunk has a Cisco suite that covers it. How about something for the DBAs? Yes, MySQL and Oracle are covered too. Again, all this data is indexed and is now not only searchable, but you can create custom dashboards and alerts off of all of it.

You are probably saying, “Yes, I’m sure you can do that, but it’s probably a nightmare to configure.” In fact, nothing could be further from the truth. You can use the UI to configure a report in less than five minutes. If you don’t like to use a GUI, the same thing can be done via the CLI in about the same amount of time.

OK, now for the bad news: Splunk is free up to the first 500MB of indexes per day; after that it starts costing money, and the free version is missing some of the reporting extras. 500MB sounds like a lot until you actually start indexing things. Also, by default Splunk indexes many, many things that you might not want, so you need to be very clear about what you do and do not want it to index. There are several ways of dealing with the indexes, such as aging data out more quickly or limiting the size of the indexes. Currently I’m restricting index sizes so that I’m keeping about 30 days’ worth of data, and I haven’t had any issues.
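As one sketch of how that retention might be configured, the stanza below bounds an index by size and age; the index name and limits are assumptions, and attribute names may differ across versions.

```ini
# indexes.conf: bound the index so roughly 30 days of data is retained
[main]
maxTotalDataSizeMB = 51200        # cap the index at ~50 GB (assumed budget)
frozenTimePeriodInSecs = 2592000  # age out data older than 30 days (30*24*3600)
```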

Having said all of this, I haven’t yet explored the limits of Splunk. In my current configuration I’m only indexing data from about 50 servers, and I’ve not run into any issues. I should note that Splunk is designed to scale horizontally, and I know of people indexing data from thousands of boxes, so I’m not expecting a scale issue. In my experience the data is indexed in less than five minutes, so I am currently using Splunk as part of our monitoring and alerting system, and it has helped identify issues that would otherwise have been hidden in the logs.

Splunk saved me from having to build a centralized logging infrastructure, plus all the tools one needs to monitor and manage that infrastructure. It has allowed me to instantly search across all our logs and systems to identify system issues, db issues, and code issues. Something that would have taken me weeks took one day to install and get working on 30 boxes, and within one week I’d found critical events that saved my company $$.

NOTE: Thanks to Steve from Splunk for identifying that it is 500MB of indexes per day.



A Lightweight Postmortem Process

We discussed the need to perform postmortems or AARs in our post entitled “After Action Reviews”. Our new book includes a description of how these meetings should be run, but given the amount of time we spend teaching companies our lightweight postmortem process, we thought it useful to describe it in a blog post as well.

First, please understand that we think onerous processes result in the death of an organization. We’ve often said that the point at which a company begins to hire “process engineers” is the point at which its processes have gone a bit too far. Startups need light, adaptable processes that can grow as their needs grow over time. The postmortem process described here is one such process.

Ideally everyone will be gathered in a single room, and the room will have whiteboards that can be used during the process. Attendees should include everyone involved with the issue or crisis who can contribute either to a complete and accurate timeline or to the issues identified within that timeline. Managers who might be assigned action items, be they process, organizational, or technical, should also attend the postmortem. A single person should be identified as the postmortem process facilitator.

Our postmortem process consists of three phases:

  1. Phase 1 focuses on generating a timeline of the events leading up to the issue or crisis. Nothing is discussed other than the timeline during this first phase. The phase is complete once everyone in the room agrees that there are no more items to be added to the timeline. We typically find that even after we’ve completed the timeline phase, people will continue to remember or identify timeline-worthy events in the next phase of the postmortem.
  2. Phase 2 of the postmortem consists of issue identification. The process facilitator walks through the timeline and works with the team to identify issues. Was it OK that the first monitor identified customer failures at 8 AM but that no one responded until noon? Why didn’t the auto-failover of the database occur as expected? Why did we believe that dropping the user_authorization table would allow the application to start running again? Each and every issue is identified from the timeline, but no corrections or actions are allowed until the team is done identifying issues. Invariably, team members will start to suggest actions, but it is the responsibility of the process facilitator to keep the team focused on issue identification during Phase 2.
  3. Phase 3 of the postmortem focuses on actions. Each issue should have at least one action associated with it. The process facilitator walks down the list of issues and works with the team to identify an action, an owner, an expected result, and a time by which it should be completed. Using the SMART principles, each action should be specific, measurable, attainable, realistic, and timely. A single owner should be identified, even though the action may take a group or team to accomplish (one way of recording such actions is sketched below).
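Here is a minimal sketch of how those Phase 3 action items might be recorded, using the SMART fields above; the structure, names, and date are illustrative assumptions rather than a prescribed format.

```python
# A minimal sketch of recording Phase 3 action items with the SMART fields
# named above; the structure, names, and date are illustrative assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    issue: str            # the Phase 2 issue this action addresses
    action: str           # specific, measurable, attainable, realistic
    owner: str            # exactly one owner, even if a team does the work
    expected_result: str  # how we will know the action worked
    due: date             # timely: a concrete completion date

actions = [
    ActionItem(
        issue="Monitor flagged customer failures at 8 AM; no response until noon",
        action="Automatically page the on-call engineer on customer-facing alerts",
        owner="ops-manager",
        expected_result="Alert-to-response time under 15 minutes",
        due=date(2010, 1, 15),
    ),
]
```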


Scaling and Monitoring the Clouds

The usefulness of Amazon’s EC2 cloud took a step forward recently with the introduction of Amazon’s own real-time monitoring, auto-scaling, and load balancing product offerings. Most of these were already services offered by third parties built on top of Amazon’s and other providers’ clouds, such as Mosso, or through custom implementations of HAProxy, etc. However, the integration should allow for easier administration and better support.

There continue to be reservations by many companies over the feasibility of running critical systems or placing sensitive data on third-party clouds. While we have not lost a major cloud computing provider yet, undoubtedly because they are all still so new, other third-party storage providers have recently shut down, as noted in PCWorld’s article Will Your Data Disappear When Your Online Storage Site Shuts Down? Granted, these storage providers’ business models were very different, often giving away storage for free in hopes of upselling users on other products such as the printing of photos. But ISPs and hosting providers do go out of business all the time, leaving customers in the lurch, and failure is not reserved for small businesses, as we’ve seen recently with banks and car companies. As Alan Williamson, co-founder of the cloud computing firm AW2.0, stated, “Users cannot absolve themselves from being 100 percent responsible for their own data.” The cloud computing offerings are becoming more mature, but they still require companies to understand the pros and cons in order to make wise decisions and plans in the event of service outages or business failures.



Incidents and Problems

On 19 April 1951, MacArthur gave a farewell speech to Congress upon being relieved of his command in Korea. It included the following: “But once war is forced upon us, there is no other alternative than to apply every available means to bring it to a swift end. War’s very object is victory, not prolonged indecision. In war there is no substitute for victory.” Reading this recently, I was reminded of how tech teams should approach service outages. Too often teams get confused about the priority of restoring service versus finding the root cause. We will be the first ones to tell you that you need to instill a culture of excellence that does not allow mistakes or issues to happen twice. However, during the outage, the first priority should be to restore service as quickly as possible. If you have time to gather data, like core dumps, that later will be valuable for determining root cause, great, but focus on getting the site or service restored. 

The Information Technology Infrastructure Library does a great job explaining the differences between what it refers to as Incidents and Problems. An Incident is “an event which is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services…” while a Problem is “the unknown root cause of one or more existing or potential Incidents.” The ITIL has different processes for managing each. The goal of Incident Management is to “restore normal operations as quickly as possible…” while the goal of Problem Management is “to minimize the impact of problems…”

As you can imagine, there is often conflict between these two goals. A possible solution offered by the ITIL is to form a plan of attack for the next occurrence of the problem that outlines the following (a sketch of such a plan appears after the list):

  • What diagnostics to collect
  • How long to allow for diagnostics before service is restored
  • What resources (people, process, and technology) to prepare prior to the incident
  • How to communicate the plan to the stakeholders
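To make the plan concrete, here is a minimal sketch of what such a pre-agreed plan of attack might look like as a simple record; every value below is an illustrative assumption.

```python
# A sketch of a pre-agreed "plan of attack" for the next occurrence of a
# known problem, following the four points above (all values are assumptions).
PLAN_OF_ATTACK = {
    "problem": "intermittent database failover hang",
    "diagnostics_to_collect": ["core dump", "db process list", "replication lag"],
    "max_diagnostic_minutes": 10,  # restore service once this budget is spent
    "resources": {
        "people": ["on-call DBA", "incident manager"],
        "process": "open an incident bridge and freeze changes",
        "technology": "standby replica ready for promotion",
    },
    "stakeholder_communication": "notify stakeholders at start and at resolution",
}
```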

If you like this topic, you’ll enjoy Chapters 8 and 9 of The Art of Scalability, where the management of issues and crises is discussed in detail.



After Action Review

Is your company a “learning organization”, committed to continuous improvement and unwilling to repeat mistakes? If not, you should be; and if you are, you should be performing postmortems or After Action Reviews (AARs) on all your projects and releases. Before we get into the purpose of the AAR, we should address those organizations that are not dedicated to learning. If your organization continues to stumble in the dark, stubbing its proverbial toe on the same piece of furniture but refusing to move it, stop and move the furniture! If your site continues to have availability issues, then apologize, mean it, and fix it. Sooner or later your customers will leave, frustrated at your inability to learn from your mistakes.

If you’re part of the other type of organization that strives to learn from mistakes and not repeat them, After Action Reviews are for you.  As covered in the Inc.com article Leadership: Armed with Data, companies repeat mistakes because they either fail to figure out what went wrong or they fail to institutionalize the fix.

In a typical AAR, the project’s stated goals or objectives are compared with the observed results by the project team, and a discussion is conducted to identify why the results differed. Sometimes the team can identify what went wrong and why. Other times the team will know what went wrong but perhaps not the reason why. It is okay to leave with open action items for investigation. It is not okay to leave without people assigned to document, implement, study, or report back to the team later on how the discrepancy is going to be addressed. If the team only identifies the problem but doesn’t do something to keep from experiencing it again, it is only halfway done. Don’t let these lessons learned drop out of your organization’s collective memory. Complete the process by institutionalizing the solution.

There are lots of resources for learning how to perform an effective AAR.  Get in the habit of conducting them after projects and institutionalizing the solutions.



Checklists

“Annals of Medicine: The Checklist” is an article from the New Yorker from December 2007. Besides reminding us that we really want to avoid a trip to the Intensive Care Unit, it also spells out how checklists are important when performing complex tasks, even if those tasks tend to be routine. One study showed that the implementation of a five-step process that was strictly adhered to prevented eight deaths in just over a year’s time. The article states, “Checklists established a higher standard of baseline performance.”

Another article, “Study: A Simple Surgery Checklist Saves Lives” in Time, describes similar studies and findings. In the study described, death rates dropped from 1.5% to 0.8%. Both articles mention the use of checklists by pilots, due to the complexity of the systems and machines they operate.

Your system, including the application as well as the entire development and deployment process, is likely to be very complex. The lesson our technology teams should take away from these articles is that checklists are important: they reduce the number of problems caused by human error. You don’t need hundreds of steps; a few key steps, followed with strict adherence, are all that is required. When you finish a release at 2 AM you’re probably not thinking as clearly as you normally do, so don’t rely on your memory for checking the site. Have a checklist for the critical parts of the application to verify before you head to bed.
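As a small illustration, a post-release checklist can even be executable. The sketch below probes a short, fixed list of health checks; the URLs and checks are hypothetical placeholders for whatever is critical in your application.

```python
# A minimal executable post-release checklist (URLs are hypothetical).
import urllib.request

CHECKLIST = [
    ("home page loads", "https://www.example.com/"),
    ("login page responds", "https://www.example.com/login"),
    ("search returns results", "https://www.example.com/search?q=test"),
]

def run_checklist() -> None:
    """Probe each item and print OK/FAIL; run this before heading to bed."""
    for name, url in CHECKLIST:
        try:
            code = urllib.request.urlopen(url, timeout=10).getcode()
            print(f"[{'OK' if code == 200 else 'FAIL'}] {name} (HTTP {code})")
        except Exception as exc:  # network errors, HTTP errors, timeouts
            print(f"[FAIL] {name}: {exc}")

if __name__ == "__main__":
    run_checklist()
```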

