I was chatting with Nanda Kishore (@nkishore), the ShareThis CTO, about the recent problems Amazon had in one of their zones. Even though ShareThis is 100% in the cloud, these outages didn’t affect ShareThis services at all because the team had properly architected their system. Kudos to Nanda and his team for their design and implementation, but the more interesting part was our discussion of this as a cascading failure, in which one small problem grows into a much bigger one. A few days later Amazon provided a bit of a postmortem confirming that a simple error during a network change started the problem. The incorrect traffic shift left the primary and secondary EBS nodes isolated, each thinking the other had failed. When they were reconnected, they rapidly searched for free space to re-mirror, which exhausted spare capacity and led to a “re-mirroring storm.”
As we were discussing the Amazon issue, I brought up another recent outage of a major service, Facebook. In September 2010 they had a several-hour outage for many users, caused by an invalid configuration value in their caching tier. Every client that saw the invalid value attempted to fix it, and each fix involved a query to the database. The DBs were quickly overwhelmed by hundreds of thousands of queries per second.
Both of these are prime examples of how, in complex systems, small problems can cascade into large incidents. There has been a good deal of research on cascading failures, including models of the probability distributions of outages to predict their occurrence. What I don’t believe exists, and should, is a framework to prevent them. As Chapter 9 in Scalability Rules states, the most common scalability-related failure is not designing to scale, and the second most common is not designing to fail. Everything fails, plan for it! Utilizing swim lanes or fault isolation zones will certainly minimize the impact of any of these issues, but there is a need for handling this at the application layer as well.
As an example, say we have a large number of components (storage devices, caching services, etc.) that have a failsafe plan such as refreshing the cache or re-mirroring the data. Before executing one of these actions, the component should check in with an authority that determines whether the request should go ahead or whether too many other components are already doing similar tasks. Alternatively, a service could monitor for these requests over the network and throttle or rate limit them, much like we do in an API. This way a small problem that would otherwise trigger a huge cascade of reactions can be paused and handled in a controlled and more graceful manner.
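To make the "check in with an authority" idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: the names `RecoveryAuthority` and `attempt_recovery`, the per-action limit, and the action labels are all hypothetical, and the authority runs in a single process rather than as the networked service a real deployment would need.

```python
import threading
from collections import defaultdict


class RecoveryAuthority:
    """Hypothetical authority that caps how many components may run
    the same kind of failsafe action at the same time."""

    def __init__(self, max_concurrent_per_action):
        self._limit = max_concurrent_per_action
        self._active = defaultdict(int)  # action name -> running count
        self._lock = threading.Lock()

    def request(self, action):
        """Return True if the caller may start `action` now."""
        with self._lock:
            if self._active[action] >= self._limit:
                return False  # too many similar tasks already running
            self._active[action] += 1
            return True

    def release(self, action):
        """Report that a previously granted `action` has finished."""
        with self._lock:
            if self._active[action] > 0:
                self._active[action] -= 1


def attempt_recovery(authority, action, do_recover):
    """Run `do_recover` only if the authority grants permission.

    Returns "executed" or "deferred"; a deferred caller would be
    expected to back off and ask again later.
    """
    if not authority.request(action):
        return "deferred"
    try:
        do_recover()
        return "executed"
    finally:
        authority.release(action)
```

A component would call something like `attempt_recovery(authority, "re-mirror", start_remirror)`, where `start_remirror` stands in for its own recovery routine, and retry after a delay when it gets back "deferred". The limit per action is the knob that keeps a burst of identical reactions from exhausting shared capacity.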