I was on a flight the other day. Which airline doesn’t matter because this story applies to any of them. The flight was scheduled to depart at 1:30pm. The aircraft was an Embraer ERJ-145, which holds only about 50 passengers. At the request of the flight attendant we hurriedly boarded and shut down our electronic devices so that at 1:29pm the door was shut and the jetway was pulled back from the plane, thus chalking up another “on time departure” for this airline’s metrics. We then sat there for 20 minutes while the pilots recalculated their weight and balance. At one point they reattached the jetway in order to deplane an airline employee who was jump seating. After a few more calculations, they reconsidered and allowed him to re-board. Just as a side note, if we’re voting and the choice is between an extra 200 lbs of fuel and allowing an airline employee to jump seat, my vote is on the fuel. After a modest delay of about 25 minutes we were on our way.
The thing that irked me was that while they technically might have “departed” on time, from a customer’s perspective they didn’t come anywhere close to that. Operations teams fall prey to the same trap all the time. The first thing an operations team typically puts in place is something like Nagios to monitor the CPU, memory, and disk of all the servers. As we discuss in our post Monitoring Strategies, the first measurement to put in place should instead be one taken from the customer’s perspective that answers the question “Is there a problem?” The most important thing to know is whether your customers are being impacted, and how. The answer to that will determine who gets paged, how you should react, etc. Only after answering that do you need to figure out “Where is the problem?”, “What is the problem?”, and “Why is there a problem?”.
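To make the “Is there a problem?” check concrete, here is a minimal sketch of a customer-perspective probe in Python. It is only an illustration: the `customer_check` function, the `fetch` callable, and the two-second threshold are assumptions made for this example, not part of Nagios or any other tool. The idea is to perform one real customer action and judge it the way a customer would, by whether it worked and whether it was fast enough.

```python
import time
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    ok: bool
    reason: str


def customer_check(fetch: Callable[[], int], max_seconds: float = 2.0) -> CheckResult:
    """Answer "Is there a problem?" from the customer's side.

    `fetch` is a hypothetical callable that performs one real user action
    (for example, loading the home page) and returns the HTTP status code.
    `max_seconds` is an assumed threshold for "too slow".
    """
    start = time.monotonic()
    try:
        status = fetch()
    except Exception as exc:
        # Any exception (timeout, DNS, connection reset) is an outage to the customer.
        return CheckResult(False, f"request failed: {exc}")
    elapsed = time.monotonic() - start

    if status >= 400:
        return CheckResult(False, f"HTTP {status}")
    if elapsed > max_seconds:
        return CheckResult(False, f"too slow: {elapsed:.2f}s")
    return CheckResult(True, "ok")
```

A probe like this says nothing about where or why the problem is occurring. That is the job of the server-level monitoring that comes afterwards.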
Fail to heed this and you risk falling into the airline metrics trap. You’ll be satisfied that you’ve kept all the servers up and running 99.99% of the time, but your customers may have been able to access the site only 95% of the time because of problems with software, networking, database contention, etc. The result is unhappy customers despite your meeting all your stated performance metrics.
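To put those two percentages side by side, the short Python sketch below converts an availability figure into downtime over a 30-day month. The function name and the 30-day window are assumptions chosen for illustration. Over that window, 99.99% works out to a little over 4 minutes of downtime, while 95% works out to 36 hours.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed reporting window: one 30-day month


def downtime_minutes(availability_percent: float) -> float:
    """Minutes of unavailability in a 30-day month for a given availability."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_30_DAYS


server_view = downtime_minutes(99.99)   # what the server metrics report
customer_view = downtime_minutes(95.0)  # what the customers experienced

print(f"Servers:   {server_view:.2f} minutes down")
print(f"Customers: {customer_view / 60:.1f} hours down")
```

The gap between those two numbers is the time your customers were impacted while the server dashboards stayed green.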