AKF Partners

Abbott, Keeven & Fisher Partners – Partners In Hyper Growth


Keep Asking Why

We’ve written about After Action Reviews and Postmortems before, but I thought it would be worth providing an updated example of the importance of getting to true root causes. Yes, that’s the plural “causes,” because there are almost always multiple root causes, not just one.

I recently watched a postmortem that followed AKF’s recommended T-I-A process of creating a timeline, identifying the issues, and assigning action items to owners. However, the list of issues included such things as “deployment script failed to stop when a node was down.” While this is certainly a contributing issue to the incident, it is not a root cause. If you fix the script but not the process that allowed the script to fail in the first place, you’re only solving a very narrow problem. If you dig deeper (keep asking “why”), you’ll soon realize that there are truer root causes: perhaps there is no process for code reviewing scripts, no process for testing scripts, or no build vs. buy decision for scripting tools, leaving you to home-grow all of your tooling. These root causes, once solved, will fix a much broader set of problems you will encounter, instead of just fixing one broken script.

In order to be a true learning organization, you need to dig deep. Get to true root causes by continuing to ask “why.” You know you’ve reached the correct depth when solving the issue will fix many failure modes, not just the one that caused the incident.


Comments Off on Keep Asking Why

WebScaleSQL

Just over a month ago, the WebScaleSQL collaboration project was launched.  This project aims to create a community-developed branch of the popular MySQL DBMS that incorporates features to make large-scale deployments easier.  As many of our clients run large clusters of MySQL on commodity hardware (as a means to reduce costs and improve scalability), the WebScaleSQL project naturally drew our attention.

The project is currently developed by a small collaboration of engineers from Facebook, Google, Twitter, and LinkedIn.  Certainly no strangers to scale, the developers at these web giants all face similar challenges and must work continuously to improve the performance and scalability of their MySQL deployments to remain competitive.  Aimed at minimizing the duplication of effort across these engineering teams, WebScaleSQL’s development is run as an open collaboration among the major contributors.  Contributions aren’t limited to these companies, however, and participation from outside engineers is encouraged.

Despite having been around only a few weeks, the WebScaleSQL project already boasts some significant improvements over its upstream parent (Oracle’s MySQL 5.6).  These advances include an automated testing framework, a stress-testing suite, query optimizations, and a host of other changes that promise to improve the performance, testing, and deployment of large-scale databases.

As WebScaleSQL matures, we’ll continue to track its development and report on our clients’  experiences (both good and bad) working with this new “scale friendly” branch of MySQL.

WebScaleSQL source is currently hosted on GitHub (https://github.com/webscalesql/webscalesql-5.6) and released under version 2 of the GNU General Public License.


Comments Off on WebScaleSQL

A False Sense Of Security and Complacency = Revenue Loss

It’s Monday morning, and this past Saturday evening issues in one of your datacenters triggered a failover to your second datacenter to restore service. In other words, all customer traffic has been routed to a single datacenter. The failover was executed flawlessly, and the team went back to bed to wait for Monday morning to permanently fix the issue so traffic could once again run out of both datacenters. On Monday morning you are expecting a flash sale that will make close to $8,000 a minute at peak. All is well and there is nothing to worry about. Right?

Hopefully you cringed at the above scenario. What if the datacenter you are running out of suffers a failure? Or what if the single datacenter now serving all of your traffic simply wasn’t sized correctly for acceptable performance during a traffic spike?

If it hasn’t happened yet, it will, and when it does your business stands to lose significant revenue. We see it over and over again with clients, and have experienced it ourselves in practice. Multiple datacenters can create a false sense of security, and teams can become complacent. Remember: assume everything will fail. If you are running out of only a single datacenter and the other is unable to take traffic, you now have a SPOF, and the datacenter as a whole is a monolith. As a tech ops leader you have to drive the right sense of urgency and lead your team toward the right mindset. Restoring service with a failover is perfectly acceptable, but the team cannot stop there. They must quickly diagnose the problem and return the site to normal service, which means you are once again running out of both datacenters. Don’t let a false sense of security creep into your ops teams. If you spot it, call it out and explain why.

To help combat complacency from setting in, we recommend considering the following:

  1. Run a Morning Ops meeting with your business and review issues from the past 24 hours. Determine which issues need to undergo a postmortem. See one of our earlier blogs for more information: http://akfpartners.com/techblog/2010/08/29/morning-operations-meeting/
  2. Communicate to your team and your business on the failure and what is being done about it.
  3. Run a postmortem to determine the multiple causes, and assign actions and owners to address those causes: http://akfpartners.com/techblog/2009/09/03/a-lightweight-post-mortem-process/
  4. Always restore your systems to normal service as quickly as possible. If you have split your architecture along the Y or Z axis and one of the swim lanes fails or an entire datacenter fails, you need to bring it back up as quickly as possible. See one of our past blogs for more details on splitting your architecture: http://akfpartners.com/techblog/2008/05/30/fault-isolative-architectures-or-“swimlaning”/

 


Comments Off on A False Sense Of Security and Complacency = Revenue Loss

Risk Mitigation

We define risk as being comprised of three components: the severity of impact, the probability of occurrence, and the ability to detect the event. The combination of these provides an overall risk assessment of a failure mode, or a manner in which something can fail. We often apply this formula to code releases or, for a more granular approach, to individual features within a release. By identifying the ways (failure modes) in which a feature could fail in production and giving each a score based on these three factors, we can quantify how much risk we have. We often teach this technique as FMEA (failure mode and effects analysis).

Risk = severity * probability * inability to detect
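
To make the arithmetic concrete, here is a minimal sketch of how a team might score failure modes this way; the example failure modes, the 1-5 scales, and the review threshold below are hypothetical illustrations rather than a formal FMEA worksheet.

```python
# Hypothetical FMEA-style scoring of failure modes for a release.
# The failure modes, 1-5 scales, and review threshold are illustrative
# assumptions only, not a prescribed standard.

failure_modes = [
    # (description, severity, probability, inability_to_detect), each scored 1 (low) to 5 (high)
    ("checkout fails when the new payment flag is enabled", 5, 2, 4),
    ("slow query introduced in listing search",             3, 3, 2),
    ("typo in the confirmation email copy",                 1, 4, 1),
]

REVIEW_THRESHOLD = 27  # hypothetical cutoff for "mitigate before release"

for description, severity, probability, undetectability in failure_modes:
    risk = severity * probability * undetectability
    action = "MITIGATE" if risk >= REVIEW_THRESHOLD else "accept"
    print(f"{risk:3d}  {action:8s}  {description}")
```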

The most typical approach to risk mitigation has been to reduce risk by reducing the probability of occurrence. This is often done through rigorous requirements definition, thorough design reviews, and, most often, lots of testing. The problem with this approach is that no matter how good we are, bugs will slip through. Users have different configurations than we have in our test environments, or they use the product differently than we expect. A more recent approach to reducing probability has been to deploy smaller changes, either by shortening development cycles or sprints to 1-2 weeks or, in the extreme, by employing continuous deployment. The theory is that the smaller the release, the fewer the changes, and thus the lower the probability of failure.

A different approach to risk reduction and mitigation is to attempt to reduce the severity factor. There are several ways to do this. The first is through the use of monitoring. By monitoring business metrics (checkouts, listings, signups, etc.) we can quickly identify when there is a problem. Continuous deployment requires a rigorous approach to monitoring in order to quickly identify the problem and roll back the changes, thus reducing the severity or impact of the problem on your customers.
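
As one hedged illustration of the idea, the sketch below compares a business metric (checkouts per minute) against a trailing baseline and flags a sharp drop; the window size and the 30% drop threshold are assumptions chosen for the example, not recommended values.

```python
from collections import deque

# Hypothetical detector: flag a release for rollback if checkouts per minute
# drop sharply versus the recent baseline. Window size and the 30% drop
# threshold are illustrative assumptions.

class MetricWatch:
    def __init__(self, baseline_window=60, drop_threshold=0.30):
        self.samples = deque(maxlen=baseline_window)  # trailing per-minute counts
        self.drop_threshold = drop_threshold

    def record(self, checkouts_per_minute):
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(checkouts_per_minute)
        if baseline and checkouts_per_minute < baseline * (1 - self.drop_threshold):
            return f"ALERT: {checkouts_per_minute}/min vs baseline {baseline:.0f}/min - consider rollback"
        return "ok"

watch = MetricWatch()
for count in [120, 118, 125, 122, 60]:  # simulated per-minute checkout counts
    print(watch.record(count))
```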

Another approach to reducing severity is to push code changes to a small set of your users. Ideally this should be done through “swim lanes,” but it can also be accomplished manually. In a process we call “incremental rollout,” you deploy new code to a small set of your servers (1-5%) and watch for issues. Once you’re satisfied that there are no issues, you roll to a larger set of servers. Continue this “roll, pause, and observe” cycle until the release is completely deployed. Teams that employ this strategy often take days to deploy code changes, but by doing so they carry much less risk of a customer-impacting problem.
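
Here is a minimal sketch of that roll, pause, and observe loop; deploy_to() and healthy() are placeholders for whatever deployment tooling and monitoring you already have, and the batch sizes and pause length are illustrative rather than prescriptive.

```python
import time

# Sketch of a "roll, pause, and observe" rollout. deploy_to() and healthy()
# stand in for your real deployment tooling and monitoring; the batch
# percentages and pause length are illustrative only.

def deploy_to(servers):
    print(f"deploying new build to {len(servers)} servers: {servers[:3]} ...")

def healthy():
    # In practice: check error rates, response times, and business metrics.
    return True

def incremental_rollout(all_servers, batch_percentages=(1, 5, 25, 100), pause_seconds=3600):
    done = 0
    for pct in batch_percentages:
        target = max(1, len(all_servers) * pct // 100)
        batch = all_servers[done:target]
        if not batch:
            continue
        deploy_to(batch)
        done = target
        time.sleep(pause_seconds)   # pause ...
        if not healthy():           # ... and observe
            raise RuntimeError(f"regression detected after {done} servers; roll back")
    print("rollout complete")

incremental_rollout([f"web{i:03d}" for i in range(1, 201)], pause_seconds=1)
```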

There are lots of ways to approach risk mitigation and we should continue to add these approaches to our toolkits. Some approaches work better than others based on the team culture and product or service being offered.


1 comment

Battle Captains and Outage Managers

The other day at a client, we were trying to describe what an outage manager does, and a term from my time in the military came back to me: battle captain. The best description I could come up with for an outage manager was that they perform the same duties during an outage that a battle captain performs for a unit in battle. For the non-military types, a battle captain resides in the tactical operations center (TOC) of a unit and takes care of tasks such as tracking the battle, enforcing orders, managing information, and making decisions based on the commander’s intent when the commander is unavailable. This is exactly what an outage manager does for an outage: keeps track of the outage (the timeline), follows up with people to make sure tasks are completed (e.g. investigating logs for errors), makes sure information is retained and passed along, and, when the VP of Ops or CTO is briefing the CEO or on the phone with a vendor, makes decisions.

From the article “What Now, Battle Captain? The Who, What and How of the Job on Nobody’s Books, but Found in Every Unit’s TOC” by CPT Marcus F. de Oliveira, Deputy Chief, Leaders’ Training Program, JRTC, here is the definition of the role:

The battle captain should be capable of assisting the command group in controlling the brigade or battalion. Remember, the commander commands the unit, and the XO is the chief of staff; BUT, those officers and the S3 must rest. They will also get pulled away from current operations to plan future operations, or receive orders from higher headquarters. The battle captain’s role then is to serve as a constant in the CP, someone who keeps his head in the current battle, and continuously assists commanders in the command and control of the fight.

A great battle captain can provide a tactical advantage to units in combat. If you have a great outage manager, or have seen one work, you know how important they can be in reducing the duration of an outage. Most outage managers have primary jobs, such as managing a shift in the NOC or managing an ops team, but when an outage occurs they jump into the role of outage manager. If you don’t currently have an outage manager, junior military officers (JMOs) just leaving the service often make great ones.


Comments Off on Battle Captains and Outage Managers

DevOps

What do you call a set of processes or systems for coordinating between development and operations teams? Give up? Try “DevOps.” While not a new concept (we’ve been living and recommending ARB and JAD as cornerstones of this coordination for years), it has recently grown into a discipline of its own. Wikipedia states that DevOps “relates to the emerging understanding of the interdependence of development and operations in meeting a business’ goal to producing timely software products and services.” Tracking down the history of the DevOps Wikipedia page shows that the topic is a recent entry.

There are a lot of other resources on the web that may not have been using this exact term but have certainly been dealing with the development and operations coordination challenge for years.  Dev2Ops.org is one such group; they posted their definition of DevOps earlier this year: “an umbrella concept that refers to anything that smoothes out the interaction between development and operations.”  They continue in their post by highlighting that the concept of DevOps is a response to the growing awareness of a disconnect between development and operations. While I think that is correct, I think it’s only part of the reason for the recent interest in defining DevOps.

With ideas such as continuous deployment and Amazon’s two-pizza rule for highly autonomous dev/ops teams, there is a blurring of roles between development and operations. Another driver of this movement is cloud computing. Developers can procure, deploy, and support virtual instances far more easily than ever before with the advent of GUI- or API-based cloud control interfaces. What used to be clearly defined career paths and sets of responsibilities are now being blended to create a new, more efficient, and highly sought after technologist. A developer who understands operations support, or a system administrator who understands programming, is a utility player who is very valuable.

While DevOps is perhaps a new term for an old problem, it is promising to see that organizations are taking an interest in the challenges of coordination between development and operations. It is even more important that organizations pay attention to this topic given the blurring of roles.


1 comment

How To Restore Service in Less Than 5 Minutes

What’s the first thing you do when your site is down? Most people pull up Nagios, or the like, and check all the servers, databases, and storage systems. Someone else might start tail’ing or grep’ing the log files. Tech executives by now are answering phone calls or sending email updates about the outage and expected downtime. Software developers are called in to go over the log files in more detail, and network engineers are asked to jump on devices to make sure they are responding properly.

What’s missing from the above scenario? Nobody looked up the last change that went into production. In our experience, 90+% of problems in production are caused by the latest change, be it a code release, a firewall change, or DDL or DML applied to the database. And it’s a sure bet that the latest change is the problem if the person who made it says “That couldn’t have caused the outage.” In fact, there is probably a high degree of correlation between how emphatically they make that statement and the probability that it is the cause of the incident.

Just the other day, one of our friends had an outage call where the network security team was arguing that their latest change could not possibly have caused the outage. Guess what caused the outage…that’s right, the firewall change.

So, how do you solve 90+% of your problems in less than 5 minutes? You immediately roll back the last change you made to your production environment. You might be saying to yourself, “But how can I do that when I don’t know all the changes that are happening in my production environment?” And that (as Paul Harvey used to say) is the rest of the story.

You have to keep track of every single change that takes place in your production environment. This is called “change tracking” and is different from “change management.” Change tracking is simply keeping track, in any format, of all the changes that happen in production. These changes can be kept in a Word document, spreadsheet, database, IRC channel, or even an unmonitored email account. Anything works as long as it 1) allows fast entry, so people have no excuse not to use it, and 2) can be retrieved immediately when needed during an outage.
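
To show how low the bar really is, here is a minimal sketch of such a change log; the file name, fields, helper names, and example entry are hypothetical, and the only point is fast entry plus fast retrieval.

```python
import csv
import datetime
import getpass

# Minimal, hypothetical change-tracking sketch: append one row per production
# change to a shared CSV file. The file name and fields are illustrative; what
# matters is that entry is fast and the last changes can be pulled up instantly.

CHANGE_LOG = "prod-changes.csv"

def record_change(system, description, rollback_hint=""):
    with open(CHANGE_LOG, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.datetime.utcnow().isoformat(timespec="seconds"),
            getpass.getuser(),
            system,
            description,
            rollback_hint,
        ])

def last_changes(n=5):
    with open(CHANGE_LOG, newline="") as f:
        return list(csv.reader(f))[-n:]

record_change("firewall", "opened port 8443 to payment vendor", "remove the rule added today")
for row in last_changes():
    print(" | ".join(row))
```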


1 comment

Morning Operations Meeting

If you deliver a service through software, you need to discuss your service delivery quality every day! Here's how:

You get paid to deliver a service.  You want to deliver that service to the level of your customers’ expectations, or at least to some internally defined level.  So how often do you meet to discuss your service delivery quality?

In our experience, most companies meet only when there is a problem.  Day in and day out, many Software as a Service companies operate their services without ever stepping back to review the last day’s worth of issues, follow up on open issues that are not yet resolved, and diagnose service delivery problems.  How could this be, you ask?  Well, we honestly don’t know!

As we’ve written before, if you are a SaaS company your business is predicated first and foremost on SERVICE DELIVERY!  Developing software is important – but what makes you money is the delivery of a service.  Get this straight folks, because it is a major mind shift.

In our view, it is absolutely critical to start the business day with a review of the past day’s service delivery.  We call this the “Morning Operations Meeting” or “Morning Operations Review”.  Every day we ask our clients to review major issues from the previous day, overall service quality (response times, availability, major interruptions or bugs live on the site, etc), and all major open issues identified in past days.  Ideally the notion of an incident (a thing that happens in production and causes customer complaint) and the notion of a problem (a thing that causes an incident) are separated in this meeting.  Both should be discussed – but they are really two separate things.

Ideally this meeting will have representatives from your customer support organization, technical operations and infrastructure teams and software development teams.  Inputs to the meeting are a representation of customer complaints, complaints regarding service within the company, manual identification of issues, automated identification of issues (such as through a monitoring system to include Service Level metrics), predictive identification of future problems (such as might be the case from a capacity management team) and all appropriate service level information.

Open incidents and problems from the issue tracking system are discussed, updated, etc.  Owners are assigned to new incidents and problems (if they haven’t been already) and new issues are updated if any were missed from the previous day’s operations.

Outputs from the meeting are updated service level reports, scheduling of post mortems for large incidents, updated problem reports and data for monthly or quarterly look backs or reviews (more on this later).

If done well, the morning meeting helps inform architectural changes that are necessary in the scalability summits or in other product development and architecture meetings.  Recurring problems should be easily identified within the issue management as a result of heightened oversight and analysis of the system.


2 comments

Delayed Replication

Do you think your database replica will save your data in a disaster? Think again because there are a lot of scenarios that will cause you to corrupt all your data.

Recently on the MySQL Performance Blog they had a post that did a great job explaining a problem that we often try to warn our clients about. The crux of the problem is that if you are relying only on a replica for disaster recovery then you are going to lose data when something bad happens.

To minimize the impact of eventual consistency in our BASE applications, we want our replicas to be very near real time. Unfortunately, this can have unintended consequences in a disaster. Whether you’re relying on MySQL’s statement-based replication or Oracle’s redo apply replicating at the block level, both are vulnerable to data corruption.

Any scenario resulting in data corruption on the primary will immediately be replicated to the standby. If a DBA drops a table, by the time he stops cursing the DROP TABLE has already been replicated to the standby. A storage subsystem failure or an HA failover can also corrupt data files, which can then be propagated to the standby.

The solution to this problem is to create a standby or replica that applies the log files with a delay. We recommend a delay of between 6 and 12 hours, which gives you plenty of time to catch a logical corruption and stop replication. You don’t need a large, production-sized server for this, since you’ll never run production traffic on this database; you’ll simply recover from it. Do this simple thing and it might save your data.
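
For MySQL specifically, here is a sketch of how this might be configured, assuming MySQL 5.6 or later, which supports MASTER_DELAY natively (older versions can achieve a similar effect with tools such as Percona’s pt-slave-delay). The connection details are placeholders for your environment.

```python
import mysql.connector  # MySQL Connector/Python

# Sketch of turning a replica into a delayed replica, assuming MySQL 5.6+
# where CHANGE MASTER TO supports MASTER_DELAY. Host and credentials are
# placeholders, not real values.

EIGHT_HOURS = 8 * 60 * 60  # within the recommended 6-12 hour window

conn = mysql.connector.connect(host="replica-host", user="repl_admin", password="...")
cur = conn.cursor()
cur.execute("STOP SLAVE SQL_THREAD")
cur.execute("CHANGE MASTER TO MASTER_DELAY = %d" % EIGHT_HOURS)
cur.execute("START SLAVE SQL_THREAD")
cur.close()
conn.close()
```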


3 comments

Probability

Imagine your team has just pushed a hot fix for a problem.  Once the first 24 hours have passed, do you relax?  After how many days of not seeing the problem do you conclude that the fix worked? Let’s start by discussing coin tosses.

Assuming you have a fair coin, one that is just as likely to land heads as tails, the probability of getting heads on a single flip is 50%.  Now, two questions.  First, what is the chance of getting two heads in a row?  Second, what is the chance of the next flip being heads?  While these two questions seem similar, they are very different.  To answer the first question, two heads in a row, we can look at all the possible combinations of two coin tosses:

(H,H) (H,T) (T,H) (T,T)

With four possible outcomes, one of which is our two heads (H,H), we can easily compute that we have a 25% chance of getting two heads in a row. Another way of computing this is by multiplying the probability of getting a head on each coin toss. Because each toss is independent, the likelihood of getting a head is 50% each time, so we have 50% * 50% = 25%.

This gets us to the second question: what is the probability that the next flip is heads? As mentioned above, and contrary to what gamblers and sportscasters often believe, each flip is independent; there is no “law of averages” that would make heads or tails more likely. The first flip, the second flip, the third, and each subsequent flip are independent of one another. However, we do expect that as the number of flips gets large, the proportion of heads and tails approaches 50%. Another way to look at this is to go back to our diagram above.  If our first flip was heads, which of the scenarios could exist for our second flip? The answer is (H,H) and (H,T), because they both have ‘H’ as their first flip.  Therefore the chance that the second flip is tails is 1 out of 2, or 50%.

Back to our hot fix scenario. Let’s ignore the probability that our fix actually solved the problem and just focus on the likelihood of the problem occurring on each day this week (we can get into prior probabilities in a later post). As we discussed above, independent events maintain their probability for each event, so the probability of the problem occurring today is 50%, tomorrow is 50%, and so on. A different question is: what is the probability that the problem will not occur three days in a row? Let’s first look at this visually, using N = No Problem and P = Problem.

(N, N, N) (N, N, P) (N, P, N) (P, N, N) (N, P, P) (P, N, P) (P, P, N) (P, P, P)

From the figure above we can see that there is only 1 out of 8 scenarios where we have No Problem three days in a row. This equates to 1/8 = 0.125, or 12.5%. We can also compute this as we did above by multiplying our probability for each day: 50% * 50% * 50% = 12.5%.
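
The same arithmetic generalizes to any number of days, and a few lines make it easy to check; the 50% daily probability below is just the illustrative figure used above, not an estimate of your actual bug.

```python
# Probability of seeing no problem for n consecutive days when the problem
# independently appears on any given day with probability p.
def prob_all_clear(p_problem_per_day, n_days):
    return (1 - p_problem_per_day) ** n_days

for days in (1, 3, 7):
    print(f"{days} clear day(s): {prob_all_clear(0.5, days):.4f}")
# 1 clear day(s): 0.5000
# 3 clear day(s): 0.1250
# 7 clear day(s): 0.0078
```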

One final note about independent failure events. We often tell clients to avoid synchronous calls, because if one service fails (due to hardware or software) it causes the others to fail. If you have 99% uptime on one service and 99% uptime on another, but both are required to service a request, the total system availability is 99% * 99% ≈ 98% (98.01%, to be precise), unless of course they happen to fail at exactly the same time every time.  This is what we call the multiplicative effect of failure.
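
The multiplicative effect is just as easy to compute for any chain of required services; the two 99% figures below are the ones from the paragraph above, and the third service is added only to show how quickly the chain degrades.

```python
from math import prod  # Python 3.8+

# Availability of a request that requires every service in a synchronous chain:
# with independent failures, the individual availabilities simply multiply.
def chain_availability(*availabilities):
    return prod(availabilities)

print(f"{chain_availability(0.99, 0.99):.4f}")        # 0.9801, roughly 98%
print(f"{chain_availability(0.99, 0.99, 0.99):.4f}")  # 0.9703 with a third required service
```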


1 comment