AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

Crisis Management – Normal Accident Theory and High Reliability Theory

The partial meltdown of TMI-2 at Three Mile Island in 1979 is one of the best-known crisis situations in the US and was the source of several books and at least one movie.  It also generated two theories relevant to crisis management.

Charles Perrow’s Normal Accident Theory (NAT), described in his book Normal Accidents, states that the complexity inherent to tightly coupled technology systems makes accidents inevitable.  Perrow’s hypothesis is that the tight coupling causes interactions to escalate rapidly and without obstruction.  “Normal” is a nod to the inevitability of such accidents.

Todd LaPorte, who founded the Berkeley school of High Reliability Theory, believes that there are organizational strategies to achieve high reliability even in the face of such tight coupling.  The two theories have been debated for quite some time.  While the authors don’t completely agree as to how they can coexist (LaPorte believes that they are complementary and Perrow believes that they are useful for the purposes of comparison), we believe there is something to be gained from them.

One paradox from these debates bears directly on our pursuit of highly available and highly scalable systems:  The better we are at building systems that avoid problems and crises, the less practice we have in solving problems and crises.  As the practice of resolving failures is critical to our learning, we become more and more inept at rapidly resolving these failures as their frequency decreases.  Therefore, as we get better at building fault-tolerant and scalable systems, we get worse at resolving crisis situations that are almost certain to happen at some point.

Weick and Sutcliffe have a solution to this paradox that we paraphrase as “organizational mindfulness”.  They identify 5 practices for developing this mindfulness:

1)      Preoccupation with failure.  This practice is all about monitoring IT systems and reporting errors in a timely fashion.  Success, they argue, narrows perceptions and breeds overconfidence.  To combat the resulting complacency, organizations need complete transparency into system faults and failures.  Reports should be widely distributed and discussed frequently, such as in our oft-recommended “operations review” process outlined in The Art of Scalability.

2)      Reluctance to simplify interpretations.  Take nothing for granted and seek input from diverse sources.  Don’t try to box failures into expected behavior and act with a healthy bit of paranoia.

3)      Sensitivity to operations.  Look at detailed data at the minute level, such as we’ve suggested in our posts on monitoring.  Use real-time data, and continually reassess and update it.  We think our book and our post on monitoring strategies have some good suggestions on this topic.

4)      Commitment to resilience.  Build excess capability by rotating positions and training your people in new skills.  Former employees of eBay operations can attest that DBAs, SAs and Network Engineers used to be rotated through the operations center to do just this.  Furthermore, once fixes are made, the organization should quickly return to a state of preparedness for the next situation.

5)      Deference to expertise.  During crisis events, shift the leadership role to the person possessing the greatest expertise to deal with the problem.  Our book also suggests creating a competency around crisis management such as a “technical duty officer” in the operations center.

We would add that every operations team should use every failure as a learning opportunity, especially in those environments in which failures are infrequent.  A good way to do this is to leverage the post mortem process.



This time we have a guest post from a longtime friend and colleague, Chris Lalonde. During a conversation a couple of weeks ago, Chris told us what he was doing with a product called Splunk to provide monitoring, alerting, and visibility into log files. Given the importance that we place on logging and monitoring, what Chris was doing sounded interesting, so we asked him if he would be willing to share a little about his implementation. Chris has over 15 years of experience in information technology and has provided technology solutions for Fortune 1000 companies such as eBay and TD Bank, as well as government agencies like DoD, RCMP and USSS. He holds a bachelor of mechanical engineering with a concentration in robotics from Carleton University. Chris also has three patents for authentication systems and has several others pending. He was the recipient of the Director’s Award from the United States Secret Service. And now from Chris:

Having worked in technology for 15+ years, I understand how challenging it can be to get visibility across your entire platform. Apparently the folks who started Splunk understood this as well. What is Splunk? It’s a tool you can use to collect data from virtually any server in virtually any format and carry out alerting, ad-hoc searches and large-scale reporting.

You install the Splunk client on each of the servers you want to monitor. Currently Splunk supports Windows, Linux, OSX, Solaris, BSD and AIX, so the vast majority of platforms are covered. The client typically installs in a few minutes, and if you spend a little time pre-configuring things it’s easy to automate the install. Once it’s installed you have a few options: you can either run a full server on each machine or turn the clients into lightweight forwarders and consolidate your logs. I’d recommend the latter, since it gives you the log aggregation you more than likely want and the lightweight forwarders use fewer resources than the full-blown server. Additionally, you can point traditional data sources at Splunk. Have lots of syslogs running around? Why not point them at your central Splunk server and get them indexed and searchable as well?

Once you’ve got Splunk installed on your servers, the uses are pretty much endless. The simplest and most obvious is system/log monitoring and alerting. You can just point Splunk at a log directory and it will automatically analyze and index all the files in that directory. Splunk will also attempt to parse the elements of the files so you can search on virtually every element of your file. Need to know how many times that URL was called with variable “foo” in it? Just go into the search app and type in something like “http://” “foo” and all the results are shown in two ways: 1) in a graphical timeline showing the count per unit of time (minutes, hours, days) and 2) in detail below the graph, where you can expand out and literally get the section of the specific log file with those results.
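To make the “count per unit of time” idea concrete, here is a minimal sketch in plain Python. It is not Splunk’s search language or API, just a stand-in that counts lines containing every search term and buckets them by hour. The function name `hourly_counts` and the assumption that each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp are ours, not anything from Splunk.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(lines, terms):
    """Count lines that contain every term, bucketed by hour.

    Assumption: each line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp.
    Lines without a parseable timestamp are skipped.
    """
    counts = Counter()
    for line in lines:
        # Keep only lines that contain all of the search terms.
        if not all(term in line for term in terms):
            continue
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
        counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    # Sort by bucket label so the result reads like a timeline.
    return dict(sorted(counts.items()))
```

Calling `hourly_counts(open("access.log"), ["http://", "foo"])` on a log in that shape would give one count per hour, which is roughly the data behind a timeline graph.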

That’s the simple version, so let’s try something more interesting. Say you automatically roll code to your site and you’d like a way to verify that the code rolling tool did its job. Just point Splunk at the directory your code rolls to and configure Splunk to monitor changes to that directory. Once the code is in place, Splunk will see the file changes. Now you can either manually search for those files or have Splunk send you an alert with the list of files that have changed in those directories.
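The underlying idea of verifying a code roll by watching for file changes can be sketched without any particular tool. The Python below is an illustration only and says nothing about how Splunk implements change monitoring; `snapshot` and `diff_snapshots` are hypothetical helper names. It hashes every file under a directory before and after a roll and reports what was added, removed or modified.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file under root (by relative path) to the SHA-256 of its contents."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def diff_snapshots(before, after):
    """Return sorted lists of added, removed and modified relative paths."""
    common = set(before) & set(after)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(p for p in common if before[p] != after[p]),
    }
```

Take one snapshot before the roll and one after, then compare the result of `diff_snapshots` against the list of files the release was supposed to touch.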

Not enough? Splunk has a PCI compliance suite that covers all twelve PCI DSS requirements and all 228 sub-requirements, including live controls monitoring, process workflow, checklists and reporting. How about indexing firewall logs? Yes. How about sending those IDS logs to Splunk? Sure, it’ll swallow those as well. Would you like to get an alert when, say, you suddenly get a large increase in logs from a firewall or an IDS? Sure, no problem.
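What a “large increase in logs” alert boils down to is comparing the current volume with a recent baseline. Here is a minimal sketch of that rule in Python. The function name, the threshold factor of 3 and the minimum event count are all assumptions chosen for illustration; they are not Splunk defaults.

```python
from statistics import mean


def is_volume_spike(history, current, factor=3.0, min_events=100):
    """Return True when current is well above the recent average.

    history: event counts for recent intervals (for example, the last few hours).
    current: event count for the interval being checked.
    factor and min_events are illustrative thresholds, not product defaults.
    """
    if not history:
        # No baseline yet, so there is nothing to compare against.
        return False
    baseline = mean(history)
    # min_events keeps tiny absolute numbers from triggering alerts.
    return current >= min_events and current > factor * baseline
```

With hourly firewall counts of 100, 120 and 110, a current hour of 500 events would trip the rule while 200 would not. A real deployment would want to tune both thresholds per data source.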

OK, that’s all great for the System Administrators, Security folks, and build and release, but how about the network folks? Splunk has you covered as well. Have some Cisco gear? Splunk has a Cisco suite that covers that. How about something for the DBAs? Yes, MySQL and Oracle are covered as well. Again, all this data is indexed and is not only searchable; you can also create custom dashboards and alerts off of all that data.

You are probably saying, “Yes, I’m sure you can do that, but it’s probably a nightmare to configure.” In fact, nothing could be further from the truth. You can use the UI to configure a report in less than 5 minutes. If you don’t like to use a GUI, the same thing can be done via the CLI in about the same amount of time.

OK, now for the bad news. Splunk is free up to the first 500MB of indexes per day; after that it starts costing money. The free version is also missing some of the reporting extras. 500MB sounds like a lot until you actually start indexing things. By default Splunk also indexes many, many things that you might not want, so you need to be very clear about what you do and do not want it to index. There are several ways of keeping the indexes in check, such as aging data out more quickly or limiting the size of the indexes. Currently I’m restricting index sizes so that I’m keeping about 30 days’ worth of data, and I haven’t had any issues.
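Before deciding what to index, it helps to add up how much each source produces per day and compare that with the 500MB figure. The sketch below does that arithmetic in Python. It is a planning aid only and does not talk to Splunk; the function name `daily_volume_report` is hypothetical, and treating 1MB as 1024 * 1024 bytes is an assumption about how the limit is counted.

```python
# 500MB per day, the free-tier figure mentioned above.
# Assumption: 1MB is counted as 1024 * 1024 bytes.
DAILY_LIMIT_BYTES = 500 * 1024 * 1024


def daily_volume_report(bytes_by_source, limit=DAILY_LIMIT_BYTES):
    """Summarize daily log volume per source against a daily indexing limit.

    bytes_by_source maps a source name to the bytes it produces per day.
    """
    total = sum(bytes_by_source.values())
    # Largest sources first: these are the first candidates to trim or exclude.
    largest_first = sorted(bytes_by_source, key=bytes_by_source.get, reverse=True)
    return {
        "total_bytes": total,
        "over_limit": total > limit,
        "headroom_bytes": limit - total,
        "largest_first": largest_first,
    }
```

If the report comes back over the limit, the `largest_first` list shows where excluding or filtering a source would buy the most room.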

Having said all of this, I haven’t yet explored the limits of Splunk. In my current configuration I’m only indexing data from about 50 servers, and I’ve not run into any issues. I should note that Splunk is designed to scale horizontally, and I know of people indexing data from thousands of boxes, so I’m not expecting there to be a scale issue. In my experience the data is indexed in less than 5 minutes, so I am currently using Splunk as part of our monitoring and alerting system, and it has helped identify issues that would otherwise have been hidden in the logs.

Splunk saved me from having to build a centralized logging infrastructure plus all the tools one needs to monitor and manage that infrastructure. It has allowed me to instantly search across all our logs and systems to identify system issues, DB issues, and code issues. Something that would have taken me weeks took me 1 day to install and get working on 30 boxes, and within 1 week I’d found critical events that saved my company $$.

NOTE: Thanks to Steve from Splunk for identifying that it is 500MB of indexes per day.


Fix Your Bugs

Most of you should be familiar with the Microsoft Error Reporting service. If you are not, this is a service that, when an error occurs in an application running on a Microsoft operating system such as Vista, offers to report the problem to Microsoft so that they can “improve” your experience. What’s interesting about this service is the data.  They have undoubtedly gathered millions of errors over the years and have some pretty interesting insight into application errors. What I found most compelling is that if you fix only 1% of the bugs, you will improve the experience for 50% of the users.


I’m not sure if this error / customer impact rate extends perfectly to Web 2.0 or Software as a Service applications, but I suspect it is not off by much. If you don’t mine your application’s error logs, you’re missing out on a plethora of insight into not only your application but, more importantly, your users’ experience.  Unless the error is coming from an offline process, each error or set of errors is resulting in a frustrated user.

We’ve talked in the past about monitoring your application, how much logging is necessary, and not relying on your customers to find problems. Custom application monitoring such as with SCAMP is ideal. However, unless you’ve turned off all logging you should still have web and app server logs to parse through starting today. There are lots of open source log parsers such as AWStats or Webalizer or if you’re the NIH-type there is always the option of building something custom using MySQL or Hadoop. 

Start today by looking through your log files for the top five errors, and file bugs to have them fixed before the next release goes out the door. Make investigating log files part of your process, especially after releases. The number of errors logged should, by itself, give you some indication of the application’s performance compared to previous versions. Your customers will thank you for it.
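If you don’t have a log parser handy, finding the top five errors takes only a few lines of script. The following is a minimal sketch in Python, assuming log lines that contain the word ERROR followed by a message; that line format, the `top_errors` name and the rule for collapsing numbers are our assumptions, so adjust the patterns to match your own logs.

```python
import re
from collections import Counter

# Assumed line shape: "<timestamp> ERROR <message>". Adjust for your own logs.
ERROR_LINE = re.compile(r"\bERROR\b\s+(?P<message>.+)")
# Collapse numbers and hex ids so the same error with different ids groups together.
VARIABLE_PARTS = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+)\b")


def top_errors(lines, n=5):
    """Return the n most frequent error signatures as (signature, count) pairs."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            message = match.group("message").strip()
            signature = VARIABLE_PARTS.sub("<n>", message)
            counts[signature] += 1
    return counts.most_common(n)
```

Run it over the same log before and after a release, for example `top_errors(open("app.log"))`, and compare both the signatures and the counts to see whether the new version made things better or worse.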
