Archive for the ‘Operations’ Category

Crisis Management – Normal Accident Theory and High Reliability Theory

Wednesday, November 18th, 2009

The partial meltdown of TMI-2 at Three Mile Island in 1979 is one of the best known crisis situations within the US and was the source of several books, and at least one movie.  It also generated two theories relevant to crisis management.

Charles Perrow’s Normal Accident Theory (NAT), described in his book Normal Accidents, states that the complexity inherent to tightly coupled technology systems makes accidents inevitable.  Perrow’s hypothesis is that the tight coupling causes interactions to escalate rapidly and without obstruction.  “Normal” is a nod to the inevitability of such accidents.

Todd LaPorte, who founded the Berkeley school of High Reliability Theory, believes that there are organizational strategies to achieve high reliability even in the face of such tight coupling.  The two theories have been debated for quite some time.  While the authors don’t completely agree as to how they can coexist (LaPorte believes that they are complementary and Perrow believes that they are useful for the purposes of comparison), we believe there is something to be gained from them.

One paradox from these debates bears directly on our pursuit of highly available and highly scalable systems:  The better we are at building systems that avoid problems and crises, the less practice we have in solving problems and crises.  As the practice of resolving failures is critical to our learning, we become more and more inept at rapidly resolving these failures as their frequency decreases.  Therefore, as we get better at building fault tolerant and scalable systems, we get worse at resolving crisis situations that are almost certain to happen at some point.

Weick and Sutcliffe have a solution to this paradox that we paraphrase as “organizational mindfulness”.  They identify 5 practices for developing this mindfulness:

1)      Preoccupation with failure.  This practice is all about monitoring IT systems and reporting errors in a timely fashion.  Success, they argue, narrows perceptions and breeds overconfidence.   To combat the resulting complacency, organizations need complete transparency into system faults and failures.  Reports should be widely distributed and discussed frequently such as in our oft recommended “operations review” process outlined within the Art of Scalability.

2)      Reluctance to simplify interpretations.  Take nothing for granted and seek input from diverse sources.  Don’t try to box failures into expected behavior and act with a healthy bit of paranoia.

3)      Sensitivity to operations.  Look at detailed data at minute-level granularity, such as we’ve suggested in our posts on monitoring.  Use real time data, and assess and update this data continually.  We think our book and our post on monitoring strategies have some good suggestions on this topic.

4)      Commitment to resilience.  Build excess capability by rotating positions and training your people in new skills.  Former employees of eBay operations can attest that DBAs, SAs and Network Engineers used to be rotated through the operations center to do just this.  Furthermore, once fixes are made the organization should be quickly returned to a sense of preparedness for the next situation.

5)      Deference to expertise.  During crisis events, shift the leadership role to the person possessing the greatest expertise to deal with the problem.  Our book also suggests creating a competency around crisis management such as a “technical duty officer” in the operations center.

We would add that every operations team should use every failure as a learning opportunity, especially in those environments in which failures are infrequent.  A good way to do this is to leverage the post mortem process.

VP of Operations

Monday, November 16th, 2009

One of the most common questions we get from individuals is “what is the path to becoming a CTO?” We posted about this before and focused on the skill sets required as opposed to the path to get there.  We highlighted 1) good knowledge of business in general 2) great technical experience 3) great leadership 4) great management 5) great communication and 6) a willingness to let go.  This time we’re going to look at one of the jobs that is often a stepping stone to the CTO job.

The VP of Operations is the person who leads the Technology Operations or Production Operations team.  This team has responsibility for running the hardware and software systems of the company. For SaaS or Web2.0 companies these are the revenue generating systems. For corporate IT these are the ERP, CRM, HRM, etc. This team is often comprised of project managers, operations managers, and technical leads. As the head of the Operations team the VP of Operations has responsibility for monitoring, escalating, managing issues, and reporting on availability, capacity, and utilization. Incident and problem management as well as root cause analysis (postmortem) are some of the most important jobs that their team accomplishes. In order to perform this role well the VP of Operations must have good process skills, a strong leadership presence, the ability to remain calm under fire, and good overall knowledge of the system.

The VP of Operations is often also responsible for the Infrastructure team. This team is usually comprised of system administrators, database administrators, and network engineers. This team procures, deploys, maintains, and retires systems. As the head of this team the VP of Operations is responsible for budgeting and for balancing time between longer term projects and daily operations on the systems. This team understands the system holistically and is often the most useful when performing scalability summits. In order to perform this role well, the VP of Operations must have a good understanding of each of the technical roles that this team is responsible for, including the databases, operating systems, and the network. This doesn’t mean that in order to succeed in this role a person must be able to do each of these jobs, but they do need a good, solid understanding in order to converse, brainstorm, debate, and make decisions in each of these technical realms.

If you compare the list of skills that we mentioned at the top of this post with those mentioned as necessary to succeed as the VP of Operations you’ll see they overlap a good deal. Great technical experience, great leadership, and great management skills will serve you well as the head of operations and will also go a long way to developing most of the skills you will need as a CTO.

We’re approaching the end of the year, a time that many people and organizations use to reflect on what they have accomplished and what they want to accomplish next year.  A good idea as part of your personal growth is to use the list above and score yourself as honestly as possible in terms of skills.  If you’re missing some of them, make sure you have some goals in place that help you acquire a few more of these each year. Do this and not only will you succeed in one of the important jobs that lead to the CTO job, but when you do arrive at the CTO position you will be one of the successful ones.

Storage Headaches

Saturday, February 21st, 2009

There are numerous companies who decided a year or two ago to provide storage of user data as part of their product offering.  Usually this occurred with no foresight or cost calculations and so these companies decided that this storage was either unlimited in amount, perpetual in duration, or worse, both.  Fast forward to the present and these companies are scrambling to figure out ways to lower the storage cost or charge customers for this service.  Of course, hindsight is 20/20 but in our opinion this should be taken as a lesson to all companies that product roadmaps without consideration of the revenue versus cost equation are more than likely to result in future problems: features either not being used by customers or the use of a feature not generating enough revenue to cover its cost.

For companies with data storage problems our recommendations are very dependent on their business model, user agreements, customer contracts, etc. So unfortunately there is no panacea or one size fits all solution. In general we usually walk down the following steps attempting to achieve an acceptable solution:

  1. Delete what data you can
  2. Archive to very low cost storage data that is not being accessed
  3. Establish tiers of storage based on speed, reliability, and availability
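
As a rough illustration only, the three steps above could be expressed as a single classification rule applied to each piece of stored data. The thresholds (30 and 365 days), the tier names, and the `deletable` flag are hypothetical choices for this sketch; real values depend on your contracts and access patterns.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
ARCHIVE_AFTER = timedelta(days=365)
WARM_AFTER = timedelta(days=30)


def storage_action(last_accessed, now, deletable=False):
    """Return 'delete', 'archive', 'warm' or 'hot' for one data item."""
    if deletable:
        return "delete"       # step 1: delete what data you can
    age = now - last_accessed
    if age >= ARCHIVE_AFTER:
        return "archive"      # step 2: unaccessed data goes to very low cost storage
    if age >= WARM_AFTER:
        return "warm"         # step 3: a slower, cheaper tier
    return "hot"              # step 3: the fastest, most expensive tier


if __name__ == "__main__":
    now = datetime(2009, 2, 21)
    print(storage_action(now - timedelta(days=400), now))
```

Whether data is deletable is a business and legal decision, so the sketch takes it as an input rather than trying to infer it.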

Consider situations in which you have a significant amount of archival data such as former employees or customers who are no longer active.  The cost of keeping this on your primary storage is not only the space on your fastest and most expensive storage but also the backup and archiving of this data that occurs every day even though it never changes.  Incremental backups help this but more than likely you have full backups periodically as well.  If this data is in a primary database, you are likely to have one or more standby databases as well as a tape backup.  All of that unchanging and rarely accessed data continues to take up storage and bandwidth to move it around.  

Possible storage alternatives include the myriad of SAN offerings, NAS devices, open source storage, SATA drive farms, tape, and cloud storage.  We recommend that you implement one or more of these in your solution depending upon your particular needs.  We also encourage you to consider ahead of time your need for scalability and availability.  For a sample architecture of a scalable read or search subsystem check out our previous article.

A Framework for Maturing SaaS Monitoring

Tuesday, September 9th, 2008

Far too often we see clients attempting to implement monitoring solutions intended to tell them the root cause of any potential problem they might be facing.  This sounds great, but this monitoring panacea rarely works and the failures are largely attributed to two issues:
1) The systems they are attempting to monitor aren’t designed to be monitored.
2) The company does not approach monitoring in a methodical evolutionary fashion.

Designing Systems to be Monitored
Honestly, you should not expect a monitoring system to correctly identify the faults within your platform if you did not design your platform to be monitored with near-real time fault detection in mind.  This goes beyond logging events and errors; it is something that we often refer to as “real time application monitoring”. 

The best designed SaaS systems build the monitoring of their platform into their code and systems.  As an example, world class real time monitoring solutions have the capability to log the times and errors for each internal call to a service.  Here the service may be a call to a data store or another web service that exposes account information, etc.  The resulting times, rates and types of errors might be plotted in real time in a statistical process control chart (SPC) with out of bound conditions highlighted as an alert on some sort of monitoring panel.  The mean of the SPC chart may be calculated by the previous 30 similar calendar days (for instance the previous 30 Mondays) for that time of the day (say 12:10 PM).
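
To make the control chart idea concrete, here is a minimal sketch, assuming the history is a list of response times collected at the same time of day on the previous 30 similar calendar days. The function names and the 3 sigma control limits are assumptions for illustration, not a description of any particular monitoring product.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (center, lower, upper) control limits from historical samples."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation; needs >= 2 samples
    return center, center - sigmas * spread, center + sigmas * spread


def out_of_bounds(value, history, sigmas=3.0):
    """True when the new observation falls outside the control limits."""
    _, lower, upper = control_limits(history, sigmas)
    return value < lower or value > upper


if __name__ == "__main__":
    # Hypothetical response times (ms) at 12:10 PM on the previous 30 Mondays.
    previous_mondays = [100 + (i % 5) for i in range(30)]
    print(out_of_bounds(103, previous_mondays))
    print(out_of_bounds(150, previous_mondays))
```

The same check can be applied to error rates or error counts per call type; only the series being fed in changes.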

Additionally, world class teams include an architectural principle addressing the need to be monitored as a criterion for release for any new functionality.  The Architecture Review Board (ARB) is the process or meeting in which this criterion is evaluated.  Questions such as “How will we know the system is functioning properly?” are asked, and a bad answer is one that sounds like “Because we log errors to a log file” whereas a good answer might be “Because we plot the rate of errors and timeliness of responses in real time and alert on statistically significant anomalies”.

Maturing Monitoring
While having “Designed to be Monitored” as an architectural principle is necessary to be world class, it is not sufficient if you really want to resolve issues quickly.  The only silver bullet for monitoring solutions that help quickly identify and resolve issues is a combination of time, planning and a reaction to past events.

First you should plan a system that identifies that something is wrong from the perspective of your customer.  In this step you are answering the question of “Is there a problem my customers can see?”  Far too many companies bypass this step.  Incorporate a real time, third party system that interacts with your platform in the same fashion as your customers – from the “last mile” – and performs your most critical transactions.   Throw an alert when the system is outside of your internally generated SLAs. 
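
A minimal sketch of such a customer-perspective check follows. It assumes the critical transaction is wrapped in a callable that raises an exception on failure; `check_transaction` and the SLA value are hypothetical names for this example, and a real deployment would run the check from outside your own network.

```python
import time


def check_transaction(transaction, sla_seconds):
    """Run one synthetic transaction and return (ok, elapsed_seconds).

    `transaction` is any callable that performs a critical customer action
    and raises an exception on failure. The check fails when the call
    raises or takes longer than the SLA allows.
    """
    start = time.perf_counter()
    try:
        transaction()
        succeeded = True
    except Exception:
        succeeded = False
    elapsed = time.perf_counter() - start
    return succeeded and elapsed <= sla_seconds, elapsed


if __name__ == "__main__":
    # Stand-in for a real login or checkout performed over the network.
    ok, elapsed = check_transaction(lambda: time.sleep(0.01), sla_seconds=1.0)
    print("within SLA" if ok else "ALERT: outside SLA")
```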

The next step is to implement systems that answer the question of “Which systems are causing the problem?”  In the ideal world you will have developed a fault isolative architecture to create “failure domains” that will isolate failures and help you determine the systems causing the problem.  Failing that, you need monitoring that can help indicate the rough areas of concern.  These are typically aggregated system statistics and monitoring similar to the real time application monitoring above (subsystem X is throwing errors at a rate 3 standard deviations above normal) or aggregated load, cpu, etc for a group of systems (rather than a single system).  You want to ensure that this level of monitoring does not create a level of noise that forces your team to ignore the alerts.
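
One way to picture this aggregated level of monitoring is the sketch below. It assumes each host reports a `(subsystem, errors, requests)` sample and that a short history of past error rates is kept per subsystem; the data shapes and function names are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def subsystem_error_rates(samples):
    """Aggregate (subsystem, errors, requests) host samples per subsystem."""
    errors = defaultdict(int)
    requests = defaultdict(int)
    for subsystem, err, req in samples:
        errors[subsystem] += err
        requests[subsystem] += req
    return {s: errors[s] / requests[s] for s in requests if requests[s] > 0}


def suspect_subsystems(samples, baselines, sigmas=3.0):
    """Return subsystems whose error rate exceeds mean + sigmas * stdev."""
    suspects = []
    for subsystem, rate in subsystem_error_rates(samples).items():
        history = baselines.get(subsystem)
        if not history or len(history) < 2:
            continue  # not enough history to judge this subsystem
        if rate > mean(history) + sigmas * stdev(history):
            suspects.append(subsystem)
    return sorted(suspects)


if __name__ == "__main__":
    samples = [("search", 5, 1000), ("search", 7, 1000),
               ("login", 90, 1000), ("login", 110, 1000)]
    baselines = {"search": [0.005, 0.006, 0.007, 0.006],
                 "login": [0.010, 0.012, 0.011, 0.009]}
    print(suspect_subsystems(samples, baselines))
```

Alerting on the group rather than on each host is what keeps the noise down: one misbehaving host barely moves the aggregate, while a failing subsystem moves it a lot.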

The third step is to answer the question of “What exactly is the problem?”  This is the step that everyone immediately jumps to when they implement a host of alarms and monitors on everything from individual application logs to individual load, cpu utilization, memory utilization, port utilization, etc.  The problem with this is that these alerts have a high rate of false positives and aren’t necessarily useful in determining that there is a problem that needs to be resolved RIGHT NOW – they are more useful in helping to isolate and determine what the problem is.  If you alert based on aggregate subsystem and customer perceived data, you will have less noise in general and you can use this granular level of data to help pinpoint the problem, perform capacity analysis, etc.

The final step is to implement monitoring systems that help you identify that there will be a problem in the future.  This is the most mature step, but one that should be tackled only after you’ve implemented the prior three steps to include real time application monitoring.  These systems are predictive in nature and should use data collected from the third level of maturity (discrete and granular system monitoring) to feed into a modeling program that can ultimately help plan capacity, determine system break points, etc.
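
As a deliberately simple sketch of predictive monitoring, the example below fits a straight line to daily utilization samples and estimates how many days remain until a capacity limit is reached. The linear trend is an assumption made for illustration; a real modeling program would likely account for seasonality and non-linear growth. All names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_capacity(days, utilization, capacity):
    """Estimate days after the last sample until the trend reaches capacity.

    Returns None when utilization is flat or falling.
    """
    slope, intercept = fit_line(days, utilization)
    if slope <= 0:
        return None
    crossing_day = (capacity - intercept) / slope
    return max(0.0, crossing_day - days[-1])


if __name__ == "__main__":
    days = [0, 1, 2, 3, 4]
    cpu_percent = [50, 52, 54, 56, 58]  # hypothetical daily peak utilization
    print(days_until_capacity(days, cpu_percent, capacity=80))
```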