As a frequent traveler who can exceed 250K air miles per year, I experience flight delays almost every week. Waiting for a delayed flight recently allowed me time to think about how airlines experience the multiplicative effect of failure just like our systems do. If you’re not familiar with this phenomenon, think back to holiday lights that your parents put up. If any one bulb burnt out none of the string of lights worked. The reason was that all the 50 lights in the string were wired in series, requiring all of them to work or none of them worked. Often our systems are designed such that they require many servers, databases, network devices, etc. to work for every single request. If any device or system is unavailable in that request-response chain, the entire process fails. The problem with this is that if five devices each have 99.99% availability, the total availability of the overall system is that availability raised to the power of the number of devices, i.e. 99.99%^5 = 99.95%. This has become even more of a problem with the advent of micro-services (see our paper on how to fix this problem, http://reports.akfpartners.com/2016/04/20/the-multiplicative-effect-of-failure-of-microservices/). Now, let’s get back to airlines.
Airlines spend a lot on airplanes, the average Boeing 737-900 costs over $100M (http://www.boeing.com/company/about-bca/index.page%23/prices), and they want to make use of those assets by keeping them in the air and not on the ground. To accomplish this, airlines schedule airplanes on back-to-back trips. For example, a recent United flight departing SFO for SAN was being serviced by an Airbus A320 #4246. This aircraft had the following schedule for the day.
If I’m on the last flight of the day (SFO->SAN), each previous flight has to be “available” (on-time) in order for me to have any chance of my request (flight) being serviced properly (on-time). As you can undoubtedly see, this is aircraft schedule (SFO->SAN->SFO->SEA->SFO->SAN) is very similar to a request-response chain for a website (browser->router->load balancer->firewall->web server->app server->DB). The availability of the entire system is the multiplied availability of each component. Is it any wonder why airlines have such dismal availability? See the chart below for Aug 2016, with El Al coming in at a dismal 36.81%, my preferred airline, United, coming in at 77.88%, and the very best KLM at only 91.07%.
So, how do you fix this? In our systems, we scale by splitting our systems by either common attributes such as customers (Z-axis) or different attributes such as services (Y-axis). In order to maintain high availability we ensure that we have fault isolation zones around the splits. Think of this as taking the string of 50 lights, breaking it into 5 sets of 10 lights and ensuring that no set affects another set. This allows us to scale up by adding sets of 10 lights while maintaining or even improving our availability.
Airlines could achieve improved customer perceived on-time reliability by essentially splitting their flights by customers (Z-axis). Instead of large airplanes such as the Boeing 737-900 with 167 passengers, they could fly two smaller planes such as the Embraer ERJ170-200 each with only 88 passengers. When one plane is delayed only half the passengers are affected. Airlines are not likely to do this because in general the larger airplanes have a lower cost per seat than smaller airplanes. The other way to increase on-time reliability would be to fault isolate the flight segments. Instead of stacking flights back-to-back with less than one hour between arrival and the next scheduled departure, increase this to a large enough window to makeup for minor delays. Alternatively, at hub airports, introduce a swap aircraft into the schedule to break the dependency chain. Both of these options are unlikely to appeal to airlines because they would also increase the cost per seat.
The bottom line is that there are clearly ways in which airlines could increase their on-time performance but they all increase costs. We advise our clients, that they need the amount of performance and availability that improves customer acquisition or retention but no more. Building more scalability or availability into the product than is needed takes resources away from features that serve the customer and the business better. Airlines are in a similar situation, they need the amount of on-time reliability that doesn’t drive customers to competitors but no more than that. Otherwise the airlines would have to either increase the price of tickets or reduce their profits. Given the dismal on-time reliability, I just might be willing to pay more for a better on-time reliability.
Comments Off on Why Airline On-Time Reliability Is Not Likely To Improve