Archive for the ‘Operations’ Category

Morning Operations Meeting

Sunday, August 29th, 2010

You get paid to deliver a service.  You want to deliver that service to the level of your customers’ expectations, or at least to some internally defined level.  So how often do you meet to discuss your service delivery quality?

In our experience, most companies meet only when there is a problem.  Day in and day out, many Software as a Service companies operate their services without taking the time to step back, review the last day’s worth of issues and all open issues that are not yet resolved, and diagnose service delivery problems.  How could this be, you ask?  Well, we honestly don’t know!

As we’ve written before, if you are a SaaS company your business is predicated first and foremost on SERVICE DELIVERY!  Developing software is important – but what makes you money is the delivery of a service.  Get this straight folks, because it is a major mind shift.

In our view, it is absolutely critical to start the business day with a review of the past day’s service delivery.  We call this the “Morning Operations Meeting” or “Morning Operations Review”.  Every day we ask our clients to review major issues from the previous day, overall service quality (response times, availability, major interruptions or bugs live on the site, etc.), and all major open issues identified in past days.  Ideally the notion of an incident (a thing that happens in production and causes customer complaint) and the notion of a problem (a thing that causes an incident) are separated in this meeting.  Both should be discussed, but they are really two separate things.

Ideally this meeting will have representatives from your customer support organization, technical operations and infrastructure teams and software development teams.  Inputs to the meeting are a representation of customer complaints, complaints regarding service within the company, manual identification of issues, automated identification of issues (such as through a monitoring system to include Service Level metrics), predictive identification of future problems (such as might be the case from a capacity management team) and all appropriate service level information.

Open incidents and problems from the issue tracking system are discussed and updated.  Owners are assigned to new incidents and problems (if they haven’t been already), and any issues that were missed from the previous day’s operations are added.

Outputs from the meeting are updated service level reports, scheduling of post mortems for large incidents, updated problem reports and data for monthly or quarterly look backs or reviews (more on this later).

If done well, the morning meeting helps inform architectural changes that are necessary in the scalability summits or in other product development and architecture meetings.  Recurring problems should be easily identified within the issue management system as a result of heightened oversight and analysis of the system.

Delayed Replication

Sunday, August 22nd, 2010

Recently the MySQL Performance Blog had a post that did a great job explaining a problem that we often try to warn our clients about. The crux of the problem is that if you are relying only on a replica for disaster recovery, then you are going to lose data when something bad happens.

To minimize the impact of eventual consistency in our BASE applications, we want our replicas to be very near real time. Unfortunately, this can have unintended consequences in a disaster. Whether you’re relying on MySQL’s statement-based replication or Oracle’s redo apply, which replicates at the block level, both are vulnerable to data corruption.

Any scenario resulting in data corruption on the primary will immediately be replicated to the standby. If a DBA drops a table, by the time he stops cursing the drop table has been replicated to the standby. Storage subsystem failures and HA failovers can both corrupt data files, and that corruption can be propagated to the standby.

The solution to this problem is to create a standby or replica that has a delay on applying the log files. We recommend a delay of between 6 and 12 hours, which gives you plenty of time to catch a logical corruption and stop the replication. You don’t need a large production-sized server for this, since you’ll never use this database in production; you will simply recover the database from it. Do this simple thing and it might save your data.
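To make the idea concrete, here is a minimal toy model of a delayed replica in Python. It is only a sketch of the concept, not how MySQL or Oracle implement delayed apply; the `DelayedReplica` class, its methods, and the sample statements are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DelayedReplica:
    """Toy model: a replica that applies changes only after a fixed delay."""
    delay_hours: float
    applied: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # (hour_written, statement)
    stopped: bool = False

    def receive(self, hour_written, statement):
        # Changes are shipped right away but held until they are old enough.
        self.pending.append((hour_written, statement))

    def apply_up_to(self, now_hour):
        if self.stopped:
            return
        ready = [e for e in self.pending if e[0] + self.delay_hours <= now_hour]
        self.pending = [e for e in self.pending if e[0] + self.delay_hours > now_hour]
        self.applied.extend(statement for _, statement in ready)

    def stop(self):
        # Someone noticed the mistake and halted replication.
        self.stopped = True


replica = DelayedReplica(delay_hours=6)
replica.receive(0, "INSERT INTO orders VALUES (1)")
replica.receive(1, "DROP TABLE orders")
replica.apply_up_to(6.5)  # the INSERT is old enough to apply, the DROP is not
replica.stop()            # caught inside the delay window
replica.apply_up_to(24)   # no effect: replication is stopped
print(replica.applied)    # ['INSERT INTO orders VALUES (1)']
```

With no delay, the `DROP TABLE` would have been applied as soon as it arrived; the delay is what creates the window in which replication can be stopped.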

Probability

Monday, June 28th, 2010
Imagine your team has just pushed a hot fix for a problem.  Once the first 24 hours have passed, do you relax?  After how many days of not seeing the problem do you conclude the fix worked?  Let’s start by discussing coin tosses.

Assuming you have a fair coin, one that is just as likely to land heads as tails, the probability of getting heads on a single flip is 50%.  Now two questions.  First, what is the chance of getting two heads in a row?  Second, what is the chance of the next flip being heads?  While these two questions seem similar, they are very different.  To answer the first question, two heads in a row, we can look at all the possible combinations of two coin tosses:

(H,H) (H,T) (T,H) (T,T)

With four possible outcomes one of which is our two heads (H,H), we can easily compute that we have a 25% chance of getting two heads in a row. Another way of computing this is by multiplying the probability of getting a head on each coin toss. Because each toss is independent the likelihood of getting a head is 50% for each. We therefore have 50% * 50% = 25%.

This gets us to the second question: what is the probability that the next flip is heads? As mentioned above, and contrary to what gamblers and sportscasters often believe, each flip is independent; there is no “law of averages” that would indicate heads or tails is more likely. The first flip and the second flip and the third and each subsequent flip are independent of each other. However, we do expect that as the number of flips gets large the proportion of heads approaches 50%. Another way to look at this is to go back to our diagram above.  If our first flip was heads, which of the scenarios could exist for our second flip? The answer is the scenarios (H,H) and (H,T), because they both have ‘H’ as their first flip.  Therefore the chance that the second flip is tails is 1 out of 2, or 50%.
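Both questions can be checked by enumerating the outcomes, as in this small Python sketch. The variable names are ours, and exact fractions are used only to keep the arithmetic free of rounding.

```python
from fractions import Fraction
from itertools import product

# All outcomes of two fair coin tosses: (H,H) (H,T) (T,H) (T,T)
outcomes = list(product("HT", repeat=2))

# Question 1: chance of two heads in a row.
p_two_heads = Fraction(sum(o == ("H", "H") for o in outcomes), len(outcomes))

# Question 2: chance the second flip is tails, given the first was heads.
first_heads = [o for o in outcomes if o[0] == "H"]
p_tails_given_heads = Fraction(sum(o[1] == "T" for o in first_heads), len(first_heads))

print(p_two_heads)          # 1/4
print(p_tails_given_heads)  # 1/2
```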

Back to our hot fix scenario. Let’s ignore the probability that our fix actually solved the problem and just focus on the likelihood of a problem occurring each day this week. We can get into prior probabilities in a later post. As we discussed above, independent events maintain their probability for each event, so if we assume, as with the coin, a 50% chance of the problem occurring on any given day, the probability that it occurs today is 50%, that it occurs tomorrow is 50%, and so on. A different question is, what is the probability that the problem will not occur three days in a row? Let’s first look at this visually using N = No Problem and P = Problem.

(N, N, N) (N, N, P) (N, P, N) (P, N, N) (N, P, P) (P, N, P) (P, P, N) (P, P, P)

From the figure above we can see that there is only 1 out of 8 scenarios where we have No Problem three days in a row. This equates to 1/8 = 0.125 or 12.5%. We can also solve this as we did above by multiplying our probability for each day: 50% * 50% * 50% = 12.5%.
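The same result falls out of a few lines of Python. This is a sketch under the 50% per-day assumption used in this post; the function name `probability_of_streak` is ours. Note that counting outcomes only matches the multiplication when each day is a 50/50 event, since only then are all eight scenarios equally likely.

```python
from itertools import product


def probability_of_streak(days, p_no_problem=0.5):
    """Probability of no problem on each of `days` independent days."""
    return p_no_problem ** days


# Enumerate every three-day outcome: N = No Problem, P = Problem.
outcomes = list(product("NP", repeat=3))
all_clear = [o for o in outcomes if set(o) == {"N"}]

print(len(outcomes))                   # 8
print(len(all_clear) / len(outcomes))  # 0.125
print(probability_of_streak(3))        # 0.125
```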

One final note about independent failure events. We often tell clients to avoid synchronous calls because if one service fails (because of hardware or software) it causes others to fail. If you have 99% uptime on one service and 99% uptime on another, and both are required to service a request, the total system availability is 99% * 99% = 98.01%, or roughly 98%, unless of course they happen to fail at the exact same time every time.  This is what we call the multiplicative effect of failure.
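A short Python sketch shows how quickly this compounds as more services are chained together. It assumes failures are independent; the function name `serial_availability` and the ten-service example are ours, added for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request path that needs every service to be up,
    assuming the services fail independently of one another."""
    return prod(availabilities)


two_services = serial_availability([0.99, 0.99])
ten_services = serial_availability([0.99] * 10)

print(f"{two_services:.4f}")  # 0.9801
print(f"{ten_services:.4f}")  # 0.9044
```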

P-I-C Process for Issue Prioritization

Monday, June 14th, 2010

As we describe in our book and as it is outlined in the ITIL toolkit, all organizations can benefit greatly from the separation of Incidents and Problems.  Incidents are customer impacting events in your production environment, or as the ITIL defines them “an event which is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services and Customer productivity”.   Problems are the cause of one or more incidents.

The separation of these is important, as most of us wish to quickly resolve incidents (reduce or minimize customer impact) while permanently resolving the underlying problems causing them.  The actions we take to resolve an incident may include workarounds or band-aids to restore service while the team works to eliminate the root cause of the problem.  We strive to restore service in whatever way possible as quickly as possible while working to find true root cause for the service disruption.

There is another important piece we typically recommend to our clients, and that is to map incidents to customer complaints or customer cost.  This cost may include the real cost of handling customer contacts through phone, chat and email.  It also should include the risk of customer departure, engineering cost in workarounds or permanent fixes, overall customer satisfaction and the lost opportunity of working on fixes versus other revenue-enhancing features.

We know that a problem may cause one or more incidents and that an incident might be caused by one or more problems.  But that information alone isn’t enough to prioritize, with limited resources, what we attack first in short, medium and long term product and architecture changes.  Because not every incident costs us the same to fix, we need to identify what 20% of incidents drive 80% of our problems (assuming that the Pareto Principle applies).  At the very least, we should be working on those incidents and associated problems that are high in customer cost and risk relative to other incidents and problems.

By adding Customer Cost (the “C” in the P-I-C process) to our operations morning meetings, and evaluating it alongside incidents and their problems, we can help make better decisions.  Classifying the severity of the incident by this “C” and using that classification to drive effort and resolution aligns your engineering operations with your business objectives.
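As a rough sketch of how this might look in practice, the Python below rolls customer cost up from incidents to problems and picks the problems that account for most of the cost. Everything here is hypothetical: the incident and problem IDs, the dollar figures, and the function names are invented, and for simplicity each incident is mapped to a single problem even though an incident can have more than one.

```python
from collections import defaultdict

# Hypothetical data: (incident id, problem id, customer cost in dollars).
incidents = [
    ("INC-1", "PRB-A", 12000),
    ("INC-2", "PRB-A", 8000),
    ("INC-3", "PRB-B", 500),
    ("INC-4", "PRB-C", 2500),
    ("INC-5", "PRB-B", 1000),
]


def rank_problems_by_cost(incidents):
    """Total the customer cost per problem, most expensive first."""
    totals = defaultdict(int)
    for _incident_id, problem_id, cost in incidents:
        totals[problem_id] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def problems_covering(ranked, share=0.8):
    """Shortest prefix of ranked problems whose cost reaches `share` of the total."""
    total = sum(cost for _, cost in ranked)
    running, selected = 0, []
    for problem_id, cost in ranked:
        selected.append(problem_id)
        running += cost
        if running >= share * total:
            break
    return selected


ranked = rank_problems_by_cost(incidents)
print(ranked)                     # [('PRB-A', 20000), ('PRB-C', 2500), ('PRB-B', 1500)]
print(problems_covering(ranked))  # ['PRB-A']
```

In this made-up data set a single problem accounts for more than 80% of the customer cost, which is the kind of signal the morning meeting can use to decide what to attack first.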

Log Every Change

Monday, May 24th, 2010

In well run technology organizations, any event that has the potential of impacting customers will trigger an alert that brings a cross disciplinary team together in person or on the phone to start troubleshooting the potential (or actual) problem.  Ideally the person responsible for running the incident management and problem resolution process will ask what most recently changed and then listen (or read) as the operations team reads (or displays) the change log.  We often joke that you only need to wait for someone to say “Yeah, but that change couldn’t possibly have caused this issue” to find the root cause and fix the problem.

In our experience, changes are one of the most common causes of customer and revenue impacting issues.  Sometimes these changes are feature enhancements or functionality additions, and sometimes they are infrastructure or architectural changes.  Very often, they are simple configuration changes like an addition of a range of IP addresses to an access control list, or the modification of DNS.  In some companies, these changes (identified as any modification to a production environment other than that made by the actual software or system itself) happen at a rate of several thousand per day.  It is virtually impossible to track them unless a change logging system is put in place.  Very often, it is the change that is undocumented, and therefore difficult to isolate and roll back, that costs the company the greatest downtime or revenue.

Too many companies allow too many changes to go undocumented.  The most commonly cited reason for a lack of change logging is that it simply takes too long to log each and every change.  But change logging doesn’t have to be cumbersome, and it need not always include the notion of risk management inherent to a change management system.  Just the logging of a change for later identification can save anywhere from hundreds to millions of dollars of revenue and hundreds or thousands of customers, especially in a SaaS environment.  Something as simple as always logging the time, date, reason for a change, person making the change and the system being modified can make a world of difference.  Many web enabled tools offered by companies like Service-Now make such logging very simple.  Most tools offer SMTP interfaces that allow people to make a change and email it to the system.  For a minute or two of time per change, hours can be saved in customer impact.
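To show how little is needed, here is a minimal in-memory change log sketched in Python. It captures only the fields listed above (time and date, person, system, reason) and answers the question “what changed recently?”. The `ChangeRecord` and `ChangeLog` names, the people, and the systems are all hypothetical; a real deployment would persist the records and would more likely use an existing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ChangeRecord:
    when: datetime  # time and date of the change
    person: str     # who made the change
    system: str     # what was modified
    reason: str     # why it was changed


class ChangeLog:
    def __init__(self):
        self._records = []

    def log(self, person, system, reason, when=None):
        """Record a change; defaults to the current UTC time."""
        record = ChangeRecord(when or datetime.now(timezone.utc), person, system, reason)
        self._records.append(record)
        return record

    def since(self, cutoff):
        """Changes made at or after `cutoff`, most recent first."""
        recent = [r for r in self._records if r.when >= cutoff]
        return sorted(recent, key=lambda r: r.when, reverse=True)


incident_start = datetime(2010, 5, 24, 14, 0, tzinfo=timezone.utc)
change_log = ChangeLog()
change_log.log("alice", "edge-acl", "Add IP range to access control list",
               incident_start - timedelta(hours=30))
change_log.log("bob", "dns", "Update DNS record for API host",
               incident_start - timedelta(minutes=20))

# During an incident: what changed in the last 24 hours?
for record in change_log.since(incident_start - timedelta(hours=24)):
    print(record.when.isoformat(), record.person, record.system, record.reason)
```

Only the DNS change made twenty minutes before the incident falls inside the 24-hour window, so it is the first thing the team would look at.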

Log your changes – every change, every time.