AKF Partners

Abbott, Keeven & Fisher Partners - Partners In Hyper Growth


Risk Mitigation

We define risk as composed of three components: the severity of impact, the probability of occurrence, and the ability to detect the event. The combination of these provides an overall risk assessment of a failure mode, that is, a manner in which something can fail. We often apply this formula to code releases or, for a more granular approach, to individual features within a release. By identifying the ways (failure modes) in which a feature could fail in production and scoring each one on these three factors, we can quantify how much risk we have. We often teach this technique as an FMEA (failure mode and effects analysis).

Risk = severity * probability * inability to detect
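
As a minimal sketch of how this formula can be applied to a release, assume each factor is scored on a 1-10 scale (a common FMEA convention, not something this post specifies). The failure modes and scores below are hypothetical and exist only to illustrate the ranking.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One way a feature could fail in production (examples are hypothetical)."""
    description: str
    severity: int             # 1 (negligible) .. 10 (catastrophic), assumed scale
    probability: int          # 1 (very unlikely) .. 10 (almost certain)
    inability_to_detect: int  # 1 (caught immediately) .. 10 (invisible)

    def risk(self) -> int:
        # Risk = severity * probability * inability to detect
        return self.severity * self.probability * self.inability_to_detect


def rank_by_risk(modes):
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk(), reverse=True)


modes = [
    FailureMode("checkout double-charges the card", 9, 2, 6),
    FailureMode("search results page renders slowly", 3, 5, 2),
    FailureMode("signup email is never sent", 5, 3, 8),
]

for mode in rank_by_risk(modes):
    print(f"{mode.risk():4d}  {mode.description}")
```

Note that the hard-to-detect signup failure outranks the more severe checkout failure here, which is the kind of trade-off the three-factor score is meant to surface.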

The most typical approach to risk mitigation has been to reduce the risk by reducing the probability. This is often done through rigorous requirements definition, thorough design reviews, and most often, lots of testing. The problem with this approach is that no matter how good we are, bugs will slip through. Users have different configurations than we have in our test environments, or they use the product differently than we expect. A more recent approach to reducing the probability has been to deploy smaller changes. This is done by reducing development cycles or sprints to 1-2 weeks or, in the extreme, employing continuous deployment. The theory is that the smaller the release, the fewer the changes, and thus the lower the probability of failure.

A different approach to risk reduction and mitigation is to reduce the severity factor. There are several ways in which we can attempt to do this. The first is through the use of monitoring. By monitoring business metrics (checkouts, listings, signups, etc.) we can quickly identify if there is a problem. Continuous deployment requires a rigorous approach to monitoring in order to quickly identify the problem and roll back the changes, thus reducing the severity or impact of the problem to your customers.

Another approach to reducing the severity is to push code changes to a small set of your users. Ideally this should be done through “swim lanes” but it can also be accomplished manually. In a process that we call “incremental rollout”, you deploy new code to a small set of your servers (1-5%) and watch for issues. Once you’re satisfied that there are no issues, roll to a larger set of your servers. Continue this “roll, pause, and observe” cycle until you have the release completely deployed. Teams that employ this strategy often take days to deploy code changes, but by doing so they have much less risk of a customer-impacting problem.
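
The “roll, pause, and observe” cycle can be sketched roughly as follows. This is an illustration under stated assumptions, not a deployment tool: `deploy`, `is_healthy`, and `rollback` are hypothetical hooks you would supply, and the stage sizes after the first 5% are invented for the example.

```python
import time


def incremental_rollout(servers, deploy, is_healthy, rollback,
                        stages=(0.05, 0.25, 0.50, 1.0), pause_seconds=0):
    """Roll, pause, and observe. Returns True if fully deployed, False if rolled back.

    deploy, is_healthy, and rollback are hypothetical callables supplied by the
    caller; stages are cumulative fractions of the fleet (assumed values).
    """
    deployed = []
    for fraction in stages:
        target = max(1, int(len(servers) * fraction))
        for server in servers[len(deployed):target]:   # roll
            deploy(server)
            deployed.append(server)
        time.sleep(pause_seconds)                      # pause
        if not is_healthy(deployed):                   # observe
            for server in reversed(deployed):
                rollback(server)
            return False
    return True


# Usage with stand-in hooks: the health check starts failing at the 50% stage.
fleet = [f"web{i:02d}" for i in range(20)]
log = []
ok = incremental_rollout(
    fleet,
    deploy=lambda s: log.append(("deploy", s)),
    is_healthy=lambda deployed: len(deployed) < 10,
    rollback=lambda s: log.append(("rollback", s)),
)
print(ok, len(log))
```

In practice the health check would look at the business metrics mentioned above (checkouts, signups, and so on) for the servers running the new code, and the pause would be long enough for a problem to show up in them.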

There are lots of ways to approach risk mitigation and we should continue to add these approaches to our toolkits. Some approaches work better than others based on the team culture and product or service being offered.



Battle Captains and Outage Managers

The other day at a client, we were trying to describe what an outage manager does and a term from my time in the military came back to me: battle captain. The best description I could come up with for an outage manager was that they perform the same duties during an outage that a battle captain does for a unit in battle. For those non-military types, a battle captain resides in the tactical operations center (TOC) of a unit and takes care of tasks such as tracking the battle, enforcing orders, managing information, and making decisions based on commander’s intent when the commander is unavailable. This is exactly what an outage manager does for an outage: keeps track of the outage (timeline), follows up with people to make sure tasks are completed (e.g. investigating logs for errors), makes sure information is retained and passed along, and, when the VP of Ops or CTO is briefing the CEO or on the phone with a vendor, makes decisions.

From the article “What Now, Battle Captain? The Who, What and How of the Job on Nobody’s Books, but Found in Every Unit’s TOC” by CPT Marcus F. de Oliveira, Deputy Chief, Leaders’ Training Program, JRTC, here is the definition of the role:

The battle captain should be capable of assisting the command group in controlling the brigade or battalion. Remember, the commander commands the unit, and the XO is the chief of staff; BUT, those officers and the S3 must rest. They will also get pulled away from current operations to plan future operations, or receive orders from higher headquarters. The battle captain’s role then is to serve as a constant in the CP, someone who keeps his head in the current battle, and continuously assists commanders in the command and control of the fight.

A great battle captain can provide a tactical advantage to units in combat. If you have a great outage manager or have seen one work, you know how important they can be in reducing the duration of the outage. Most outage managers have primary jobs such as managing a shift in the NOC or managing an ops team, but when an outage occurs they jump into the role of an outage manager. If you don’t currently have an outage manager, junior military officers (JMOs) just leaving the service often make great ones.



DevOps

What do you call a set of processes or systems for coordination between development and operations teams? Give up? Try “DevOps”. The concept is not new (we’ve been living and suggesting ARB and JAD as cornerstones of this coordination for years), but it has recently grown into a discipline of its own. Wikipedia states that DevOps “relates to the emerging understanding of the interdependence of development and operations in meeting a business’ goal to producing timely software products and services.” Tracking down the history of the DevOps Wikipedia page shows that this topic is a recent entry.

There are a lot of other resources on the web that may not have been using this exact term but have certainly been dealing with the development and operations coordination challenge for years. Dev2Ops.org is one such group and earlier this year posted their definition of DevOps: “an umbrella concept that refers to anything that smoothes out the interaction between development and operations.” They continue in their post by highlighting that the concept of DevOps is a response to the growing awareness of a disconnect between development and operations. While I think that is correct, I think it’s only part of the reason for the recent interest in defining DevOps.

With ideas such as continuous deployment and Amazon’s two-pizza rule for highly autonomous dev/ops teams, there is a blurring of roles between development and operations. Another driver of this movement is cloud computing. With the advent of GUI- or API-based cloud control interfaces, developers can procure, deploy, and support virtual instances much more easily than ever before. What used to be clearly defined career paths and sets of responsibilities are now being blended to create a new, more efficient and highly sought-after technologist. A developer who understands operations support or a system administrator who understands programming is a very valuable utility player.

While DevOps is perhaps a new term for an old problem, it is promising that organizations are taking an interest in the challenges of coordination between development and operations. It is even more important that organizations pay attention to this topic given the blurring of roles.



How To Restore Service in Less Than 5 Minutes

What’s the first thing you do when your site is down? Most people pull up Nagios, or the like, and check all the servers, databases, and storage systems. Someone else might start tail’ing or grep’ing the log files. Tech executives by now are answering phone calls or sending email updates about the outage and expected downtime. Software developers are called in to go over the log files in more detail and network engineers are asked to jump on devices to make sure they are responding properly.

What’s missing from the above scenario? Nobody looked up the last change that went into production. In our experience, 90+% of the problems in production are caused by the latest change, be it a code release, firewall change, or applying DDL or DML to the database. And it’s a sure bet that latest change is the problem if the person who made it says “That couldn’t have caused the outage.” In fact there is probably a high degree of correlation between how emphatically they make their statement and the probability that it is the cause of the incident.

Just the other day one of our friends had an outage call where the network security team was arguing that their latest change could not possibly have caused the outage. Guess what caused the outage… that’s right, the firewall change.

So, how do you solve 90+% of your problems in less than 5 minutes? You immediately roll back the last change you made to your production environment. You might be saying to yourself “But how can I do that when I don’t know all the changes that are happening in my production environment?” And that (as Paul Harvey used to say) is the rest of the story.

You have to keep track of every single change that takes place in your production environment. This is called “change tracking” and is different from “change management”. Change tracking is simply keeping track, in any format, of all the changes that happen in production. These changes can be kept in a Word document, spreadsheet, database, IRC channel, or even an unmonitored email account. Anything works that 1) allows fast entry, so people have no excuse not to use it, and 2) can be retrieved immediately when needed during an outage.



Morning Operations Meeting

If you deliver a service through software, you need to discuss your service delivery quality every day! Here's how:

You get paid to deliver a service.  You want to deliver that service to the level of your customers’ expectations, or at least to some internally defined level.  So how often do you meet to discuss your service delivery quality?

In our experience, most companies meet only when there is a problem. Day in and day out, many Software as a Service companies operate their services and simply do not take the time to step back, look at the last day’s worth of issues and all open issues that are not yet resolved, and diagnose service delivery problems. How could this be, you ask? Well, we honestly don’t know!

As we’ve written before, if you are a SaaS company your business is predicated first and foremost on SERVICE DELIVERY!  Developing software is important – but what makes you money is the delivery of a service.  Get this straight folks, because it is a major mind shift.

In our view, it is absolutely critical to start the business day with a review of the past day’s service delivery.  We call this the “Morning Operations Meeting” or “Morning Operations Review”.  Every day we ask our clients to review major issues from the previous day, overall service quality (response times, availability, major interruptions or bugs live on the site, etc), and all major open issues identified in past days.  Ideally the notion of an incident (a thing that happens in production and causes customer complaint) and the notion of a problem (a thing that causes an incident) are separated in this meeting.  Both should be discussed – but they are really two separate things.

Ideally this meeting will have representatives from your customer support organization, technical operations and infrastructure teams and software development teams.  Inputs to the meeting are a representation of customer complaints, complaints regarding service within the company, manual identification of issues, automated identification of issues (such as through a monitoring system to include Service Level metrics), predictive identification of future problems (such as might be the case from a capacity management team) and all appropriate service level information.

Open incidents and problems from the issue tracking system are discussed and updated. Owners are assigned to new incidents and problems (if they haven’t been already), and new issues are added if any were missed from the previous day’s operations.

Outputs from the meeting are updated service level reports, scheduling of post mortems for large incidents, updated problem reports and data for monthly or quarterly look backs or reviews (more on this later).

If done well, the morning meeting helps inform architectural changes that are necessary in the scalability summits or in other product development and architecture meetings. Recurring problems should be easily identified within the issue management system as a result of heightened oversight and analysis of the system.



Delayed Replication

Do you think your database replica will save your data in a disaster? Think again because there are a lot of scenarios that will cause you to corrupt all your data.

A recent post on the MySQL Performance Blog did a great job of explaining a problem that we often try to warn our clients about. The crux of the problem is that if you are relying only on a replica for disaster recovery, then you are going to lose data when something bad happens.

To minimize the impact of eventual consistency in our BASE applications, we want our replicas to be very near real time. This unfortunately can have unintended consequences in a disaster. Whether you’re relying on MySQL’s statement-based replication or Oracle’s redo apply replicating at the block level, both are vulnerable to data corruption.

Any scenario resulting in data corruption on the primary will immediately be replicated to the standby. If a DBA drops a table, by the time he stops cursing the drop table has been replicated to the standby. Storage subsystem failures and HA failovers can both corrupt data files, and that corruption can get propagated to the standby.

The solution to this problem is to create a standby or replica that has a delay on applying the log files. We recommend a delay of between 6 and 12 hours, which gives you plenty of time to catch a logical corruption and stop the replication. You don’t need a large production-sized server for this, since you’ll never use this database in production but will simply recover the database from it. Do this simple thing and it might save your data.
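
To illustrate why the delay helps, here is a toy model of a delayed standby. It is not how MySQL or Oracle implement replication; the `DelayedReplica` class, its method names, and the statements are hypothetical, and time is simplified to hours since an arbitrary start.

```python
from collections import deque


class DelayedReplica:
    """Toy model of a standby that applies log events only after a fixed delay."""

    def __init__(self, delay_hours):
        self.delay_hours = delay_hours
        self.pending = deque()   # (timestamp_hours, statement), oldest first
        self.applied = []
        self.stopped = False

    def receive(self, timestamp_hours, statement):
        """Events are shipped immediately but only queued, not applied."""
        self.pending.append((timestamp_hours, statement))

    def apply_due(self, now_hours):
        """Apply every queued event that is at least delay_hours old."""
        while (not self.stopped and self.pending
               and now_hours - self.pending[0][0] >= self.delay_hours):
            self.applied.append(self.pending.popleft()[1])

    def stop(self):
        """Halt the apply process, e.g. once corruption is noticed on the primary."""
        self.stopped = True


replica = DelayedReplica(delay_hours=6)
replica.receive(0, "INSERT INTO orders ...")
replica.receive(1, "DROP TABLE orders")   # the mistake on the primary
replica.apply_due(now_hours=6.5)          # only the insert is old enough
replica.stop()                            # DBA notices and stops replication
replica.apply_due(now_hours=12)           # nothing more is applied
print(replica.applied)
```

The bad statement has reached the standby but is still sitting in the queue, so the standby's data remains intact and can be used for recovery up to the point just before the mistake.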



Probability

Imagine your team has just pushed a hot fix for a problem. Once the first 24 hours have passed, do you relax? After how many days of not seeing the problem do you conclude the fix worked? Let’s start by discussing coin tosses.

Assuming you have a fair coin, that is just as likely to land heads as tails, the probability of getting a heads on a single flip is 50%.  Now two questions.  First, what is the chance of getting two heads in a row?  Second, what is the chance of the next flip being heads?  While these two questions seem similar they are very different.  To answer the first question, two heads in a row, we can look at all the possible combinations of two coin tosses:

(H,H) (H,T) (T,H) (T,T)

With four possible outcomes one of which is our two heads (H,H), we can easily compute that we have a 25% chance of getting two heads in a row. Another way of computing this is by multiplying the probability of getting a head on each coin toss. Because each toss is independent the likelihood of getting a head is 50% for each. We therefore have 50% * 50% = 25%.

This gets us to the second question: what is the probability that the next flip is heads? As mentioned above, and contrary to what gamblers and sportscasters often believe, each flip is independent; there is no “law of averages” that would indicate heads or tails is more likely. The first flip and the second flip and the third and each subsequent flip are independent of each other. However, we do expect that as the number of flips gets large the proportion of heads approaches 50%. Another way to look at this is to go back to our list of outcomes above. If our first flip was heads, which of the scenarios could exist for our second flip? The answer is the scenarios (H,H) and (H,T), because they both have ‘H’ as their first flip. Therefore the chance that the second flip is tails is 1 out of 2, or 50%.

Back to our hot fix scenario. Let’s ignore the probability that our fix actually solved the problem and just focus on the likelihood of a problem occurring each day this week. We can get into prior probabilities in a later post. For simplicity, assume that on any given day the problem is as likely to occur as not, just like our coin. As we discussed above, independent events maintain their probability for each event, so the probability of the problem occurring today is 50%, that it occurs tomorrow is 50%, and so on. A different question is, what is the probability that the problem will not occur three days in a row? Let’s first look at this visually using N = No Problem and P = Problem.

(N, N, N) (N, N, P) (N, P, N) (P, N, N) (N, P, P) (P, N, P) (P, P, N) (P, P, P)

From the figure above we can see that there is only 1 out of 8 scenarios where we have No Problem three days in a row. This equates to 1/8 = 0.125 or 12.5%. We can also solve this as we did above by multiplying our probability each day 50% * 50% * 50% = 12.5%.
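
The same count can be checked by brute force. This short sketch enumerates every three-day sequence and compares counting against multiplying; it uses the same simplifying assumption as the text, namely independent days with a 50% chance of a problem on each.

```python
from fractions import Fraction
from itertools import product

# Enumerate every three-day sequence; N = No Problem, P = Problem.
# Assumes each day is independent with a 50% chance of a problem, as in the text.
sequences = list(product("NP", repeat=3))
clean_runs = [s for s in sequences if all(day == "N" for day in s)]

by_counting = Fraction(len(clean_runs), len(sequences))   # favorable / total
by_multiplying = Fraction(1, 2) ** 3                       # 50% * 50% * 50%

print(len(sequences), by_counting, float(by_multiplying))
```

Changing `repeat=3` to a larger number of days shows how quickly a problem-free streak becomes unlikely by chance alone under this assumption.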

One final note about independent failure events. We often tell clients to avoid synchronous calls because if one service fails (because of hardware or software) it causes others to fail. If you have 99% uptime on one service and 99% uptime on another but they are both required to service a request, the total system availability is 99% * 99% = 98.01%, or roughly 98%, unless of course they happen to fail at the exact same time every time. This is what we call the multiplicative effect of failure.
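
A small sketch of the multiplicative effect, assuming the services fail independently of one another. The function name `serial_availability` is made up for this illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request path that needs every service to be up.

    Assumes the services fail independently of one another.
    """
    return prod(availabilities)


two = serial_availability([0.99, 0.99])    # the two-service case from the text
five = serial_availability([0.99] * 5)     # five chained synchronous calls
print(f"{two:.4f} {five:.4f}")
```

Each additional synchronous dependency multiplies in another factor below 1, so the availability of the whole request path keeps dropping as the chain grows.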



P-I-C Process for Issue Prioritization

The separation of problems and incidents within SaaS products is critical to success. But to truly maximize value, you must also add an evaluation of the cost or impact of incidents.

As we describe in our book and as it is outlined in the ITIL toolkit, all organizations can benefit greatly from the separation of Incidents and Problems.  Incidents are customer impacting events in your production environment, or as the ITIL defines them “an event which is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services and Customer productivity”.   Problems are the cause of one or more incidents.

The separation of these is important, as most of us wish to quickly resolve incidents (reduce or minimize customer impact) while permanently resolving the underlying problems causing them. The actions we take to resolve an incident may include workarounds or band-aids to restore service while the team works to eliminate the root cause of the problem. We strive to restore service in whatever way possible as quickly as possible while working to find the true root cause of the service disruption.

There is another important piece we typically recommend to our clients, and that is to map incidents to customer complaints or customer cost. This cost may include the real cost of handling customer contacts through phone, chat and email. It should also include the risk of customer departure, engineering cost in workarounds or permanent fixes, overall customer satisfaction and the lost opportunity of working on fixes vs. other revenue-enhancing features.

We know that a problem may cause one or more incidents and that an incident might be caused by one or more problems. But that information alone isn’t enough to prioritize, with limited resources, what we attack first in short, medium and long term product and architecture changes. Because not every incident costs us the same to fix, we need to identify which 20% of incidents drive 80% of our problems (assuming that the Pareto Principle applies). At the very least, we should be working on those incidents and associated problems that are high in customer cost and risk relative to other incidents and problems.

By adding Customer Cost (the “C” in the P-I-C process) to our operations morning meetings, and evaluating it alongside incidents and their problems, we can help make better decisions. Classifying the severity of the incident by this “C” and using that classification to drive effort and resolution aligns your engineering operations with your business objectives.
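
One way to picture the prioritization is to roll incident costs up to their underlying problems and rank the result. The sketch below is a simplification: the incident records, identifiers, and dollar figures are hypothetical, each incident is assumed to map to a single problem, and customer cost is assumed to be expressible as one number.

```python
from collections import defaultdict

# Hypothetical incident records: (incident id, problem id, customer cost in dollars).
incidents = [
    ("INC-1", "PRB-A", 12000),
    ("INC-2", "PRB-B", 500),
    ("INC-3", "PRB-A", 8000),
    ("INC-4", "PRB-C", 1500),
    ("INC-5", "PRB-B", 700),
]


def cost_by_problem(incidents):
    """Sum customer cost per underlying problem, highest cost first."""
    totals = defaultdict(int)
    for _incident_id, problem_id, cost in incidents:
        totals[problem_id] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def problems_covering(ranked, share=0.8):
    """Smallest prefix of ranked problems accounting for `share` of total cost."""
    total = sum(cost for _, cost in ranked)
    running, selected = 0, []
    for problem_id, cost in ranked:
        selected.append(problem_id)
        running += cost
        if running >= share * total:
            break
    return selected


ranked = cost_by_problem(incidents)
print(ranked)
print(problems_covering(ranked))
```

In this made-up data a single problem accounts for well over 80% of the customer cost, which is the Pareto-style concentration the process is meant to expose.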



Log Every Change

It's 5 PM, do you know what the last 4 changes in your production environment were? You'd better!

In well run technology organizations, any event that has the potential of impacting customers will trigger an alert that brings a cross disciplinary team together in person or on the phone to start troubleshooting the potential (or actual) problem.  Ideally the person responsible for running the incident management and problem resolution process will ask what most recently changed and then listen (or read) as the operations team reads (or displays) the change log.  We often joke that you only need to wait for someone to say “Yeah, but that change couldn’t possibly have caused this issue” to find the root cause and fix the problem.

In our experience, changes are one of the most common causes of customer- and revenue-impacting issues. Sometimes these changes are feature enhancements or functionality additions, and sometimes they are infrastructure or architectural changes. Very often, they are simple configuration changes like the addition of a range of IP addresses to an access control list, or the modification of DNS. In some companies, these changes (identified as any modification to a production environment other than that made by the actual software or system itself) happen at a rate of several thousand per day. It is virtually impossible to track them unless a change logging system is put in place. Very often, it is the change that is undocumented, and therefore difficult to isolate and roll back, that costs the company the greatest downtime or revenue.

Too many companies allow too many changes to go undocumented. The most commonly cited reason for a lack of change logging is that it simply takes too long to log each and every change. But change logging doesn’t have to be cumbersome, and it need not always include the notion of risk management inherent to a change management system. Just the logging of a change for later identification can save anywhere from hundreds to millions of dollars of revenue and hundreds or thousands of customers, especially in a SaaS environment. Something as simple as always logging the time, date, reason for a change, person making the change and the system being modified can make a world of difference. Many web-enabled tools offered by companies like Service-Now make such logging very simple. Most tools offer SMTP interfaces that allow people to make a change and email it to the system. For a minute or two of time per change, hours can be saved in customer impact.
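
As a rough illustration of how little is needed, the sketch below appends each change as one JSON line and reads back the most recent entries. The file format, function names, and example entries are assumptions made for this example; any tool that records the same fields and can be queried quickly serves the purpose.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def log_change(path, person, system, reason, when=None):
    """Append one change record (time, person, system, reason) as a JSON line."""
    entry = {
        "timestamp": (when or datetime.now(timezone.utc)).isoformat(),
        "person": person,
        "system": system,
        "reason": reason,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def last_changes(path, count=4):
    """Return the most recent `count` changes, newest first."""
    with open(path, encoding="utf-8") as handle:
        entries = [json.loads(line) for line in handle if line.strip()]
    return entries[-count:][::-1]


# Usage with a throwaway file and hypothetical entries.
path = os.path.join(tempfile.mkdtemp(), "changes.jsonl")
log_change(path, "alice", "edge-firewall", "add IP range to ACL")
log_change(path, "bob", "dns", "update A record")
recent = last_changes(path)
for change in recent:
    print(change["timestamp"], change["person"], change["system"], change["reason"])
```

During an incident, the first question ("what changed most recently?") then becomes a single call to `last_changes`.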

Log your changes – every change, every time.



Crisis Management – Normal Accident Theory and High Reliability Theory

The partial meltdown of TMI-2 at Three Mile Island in 1979 is one of the best known crisis situations within the US and was the source of several books, and at least one movie.  It also generated two theories relevant to crisis management.

Charles Perrow’s Normal Accident Theory (NAT), described in his book Normal Accidents, states that the complexity inherent to tightly coupled technology systems makes accidents inevitable.  Perrow’s hypothesis is that the tight coupling causes interactions to escalate rapidly and without obstruction.  “Normal” is a nod to the inevitability of such accidents.

Todd LaPorte, who founded the Berkeley school of High Reliability Theory, believes that there are organizational strategies to achieve high reliability even in the face of such tight coupling. The two theories have been debated for quite some time. While the authors don’t completely agree as to how they can coexist (LaPorte believes that they are complementary and Perrow believes that they are useful for the purposes of comparison), we believe there is something to be gained from them.

One paradox from these debates is directly relevant to our pursuit of high availability and highly scalable systems: the better we are at building systems that avoid problems and crises, the less practice we have in solving problems and crises. As the practice of resolving failures is critical to our learning, we become more and more inept at rapidly resolving these failures as their frequency decreases. Therefore, as we get better at building fault tolerant and scalable systems, we get worse at resolving crisis situations that are almost certain to happen at some point.

Weick and Sutcliffe have a solution to this paradox that we paraphrase as “organizational mindfulness”.  They identify 5 practices for developing this mindfulness:

1) Preoccupation with failure. This practice is all about monitoring IT systems and reporting errors in a timely fashion. Success, they argue, narrows perceptions and breeds overconfidence. To combat the resulting complacency, organizations need complete transparency into system faults and failures. Reports should be widely distributed and discussed frequently, such as in our oft-recommended “operations review” process outlined within the Art of Scalability.

2) Reluctance to simplify interpretations. Take nothing for granted and seek input from diverse sources. Don’t try to box failures into expected behavior, and act with a healthy bit of paranoia.

3) Sensitivity to operations. Look at detailed data at the minute level, such as we’ve suggested in our posts on monitoring. Include the usage of real time data and make ongoing assessments and continual updates of this data. We think our book and our post on monitoring strategies have some good suggestions on this topic.

4) Commitment to resilience. Build excess capability by rotating positions and training your people in new skills. Former employees of eBay operations can attest that DBAs, SAs and Network Engineers used to be rotated through the operations center to do just this. Furthermore, once fixes are made the organization should be quickly returned to a sense of preparedness for the next situation.

5) Deference to expertise. During crisis events, shift the leadership role to the person possessing the greatest expertise to deal with the problem. Our book also suggests creating a competency around crisis management such as a “technical duty officer” in the operations center.

We would add that every operations team should use every failure as a learning opportunity, especially in those environments in which failures are infrequent.  A good way to do this is to leverage the post mortem process.

