AKF Partners

Abbott, Keeven & Fisher Partners | Partners In Hyper Growth

Fault Isolative Architectures or “Swimlaning”

Two of our previous articles, Splitting Databases for Scale and Splitting Applications or Services for Scale, refer to a concept that we call “Swimlaning Architectures”.

The basics of this concept are covered in our two previous posts, but we have not spent a lot of time discussing the reasons for such a split or approach in technology architecture.

In our definition, a “Swimlane” is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of said boundary. The benefit of such a failure domain is two-fold:

1) Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain.

2) Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. As such, and depending upon the approach, only a portion of users or a portion of the product's functionality is affected.

A “swimlaned” architecture is one in which each failure domain is completely isolated. To achieve this, ideally there are no calls between swimlanes or failure domains. Synchronous calls are absolutely forbidden in this type of architecture, because any synchronous call between failure domains, even with appropriate timeout and detection mechanisms, is very likely to cause a series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain calls any service outside of itself, whether or not that service sits in another domain, or if it receives calls from other domains or services.
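The rule above can be checked mechanically if you keep an inventory of which service calls which. The following is a minimal sketch, not a tool we ship: the service names, the `Call` record, and the `lane_of` mapping are all hypothetical and stand in for whatever dependency inventory you maintain.

```python
from typing import NamedTuple


class Call(NamedTuple):
    caller: str
    callee: str
    synchronous: bool


def cross_lane_calls(lane_of: dict[str, str], calls: list[Call]) -> list[Call]:
    """Return every call whose caller and callee sit in different swimlanes."""
    return [c for c in calls if lane_of[c.caller] != lane_of[c.callee]]


def sync_violations(lane_of: dict[str, str], calls: list[Call]) -> list[Call]:
    """Return the cross-lane calls that are synchronous (forbidden outright)."""
    return [c for c in cross_lane_calls(lane_of, calls) if c.synchronous]


# Hypothetical inventory: two swimlanes, "login" and "cart".
lane_of = {
    "login-web": "login",
    "login-db": "login",
    "cart-web": "cart",
    "cart-db": "cart",
}
calls = [
    Call("login-web", "login-db", synchronous=True),   # inside one lane: fine
    Call("cart-web", "cart-db", synchronous=True),     # inside one lane: fine
    Call("cart-web", "login-web", synchronous=True),   # crosses lanes, synchronous
    Call("cart-web", "login-db", synchronous=False),   # crosses lanes, asynchronous
]

crossing = cross_lane_calls(lane_of, calls)
forbidden = sync_violations(lane_of, calls)
```

In this sketch `crossing` holds both cross-lane calls, which means neither lane is a failure domain in the strict sense, and `forbidden` holds the one synchronous call that has to go first.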

It is acceptable, but not advisable, to have asynchronous calls between domains. If such communication is necessary, it is very important to include failure detection and timeouts even with the asynchronous calls, to ensure that retries do not cause port overloads on any service. Here is an interesting blog post about runaway scripts and their impact on Apache, PHP, and MySQL.
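To illustrate what a timeout plus bounded retries might look like, here is a minimal sketch using Python's `asyncio`. The function name `call_with_limits`, its parameters, and the stand-in `unresponsive_peer` are assumptions made for the example; the point is only that each attempt has a deadline, the number of attempts is capped, and the caller carries on without the result when the other domain does not answer.

```python
import asyncio


async def call_with_limits(send, *, timeout_s=0.5, max_attempts=3, backoff_s=0.1):
    """Await send() with a per-attempt timeout and a hard cap on attempts.

    Returns the result, or None once every attempt has failed, so the
    caller can degrade gracefully instead of retrying without bound.
    """
    for attempt in range(max_attempts):
        try:
            return await asyncio.wait_for(send(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt + 1 < max_attempts:
                # Back off exponentially so retries do not pile onto a sick peer.
                await asyncio.sleep(backoff_s * 2 ** attempt)
    return None


attempts = 0


async def unresponsive_peer():
    """Stand-in for a service in another swimlane that never answers."""
    global attempts
    attempts += 1
    await asyncio.sleep(60)


result = asyncio.run(
    call_with_limits(unresponsive_peer, timeout_s=0.01, backoff_s=0.01)
)
```

Here the peer is tried three times and then abandoned, and `result` is `None`. The calling swimlane should treat that as "feature unavailable" rather than as an error to propagate.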

As we have previously indicated, a swimlane should have all of its services located within the failure domain. For instance, if database accesses are necessary, the database with all appropriate information for that swimlane should exist within the same failure domain as all of the application and web servers necessary to perform the function or functions of the swimlane. Furthermore, that database should not be used for requests or services from other swimlanes. Our rule is one production database on one host.
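One way to audit the "no shared database" part of this rule is to invert a map of swimlanes to the database hosts they touch and flag any host that shows up under more than one lane. This is a sketch only; the lane names and host names below are hypothetical.

```python
from collections import defaultdict


def shared_databases(db_access: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return database hosts that are used by more than one swimlane.

    db_access maps a swimlane name to the set of database hosts it reads
    from or writes to. The result maps each offending host to its lanes.
    """
    lanes_by_db: dict[str, set[str]] = defaultdict(set)
    for lane, hosts in db_access.items():
        for host in hosts:
            lanes_by_db[host].add(lane)
    return {host: lanes for host, lanes in lanes_by_db.items() if len(lanes) > 1}


# Hypothetical inventory: the cart lane also reads the login database.
db_access = {
    "login": {"db-login-01"},
    "cart": {"db-cart-01", "db-login-01"},
}

offenders = shared_databases(db_access)
```

In this example `offenders` names `db-login-01` as shared by both lanes, so a failure of that host would take down login and cart together.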

As we have indicated with our Scale Cube in the past, there are many ways to think about swimlaned architectures. You can think about them in terms of a separation of services, e.g. “login” and “shopping cart” (two separate swimlanes), each having its web and app servers as well as all data stores located within the swimlane and answering only to systems within that swimlane. In terms of the Scale Cube we have previously introduced, this would be a “Y” axis swimlane.

Another approach would be to perform a separation of your customer base, your order numbers, or your product catalog. Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a “Z” axis swimlane along customer, order number, or product id lines.
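As a small illustration of the modulus approach, the sketch below assigns customers to one of four swimlanes by customer id. The lane names and the choice of four lanes are assumptions for the example, not a recommendation.

```python
# Hypothetical Z axis swimlanes; each would hold its own web, app and data tiers.
LANES = ["customers-0", "customers-1", "customers-2", "customers-3"]


def lane_for_customer(customer_id: int) -> str:
    """Pick a swimlane by taking the customer id modulo the number of lanes."""
    return LANES[customer_id % len(LANES)]


# If one lane fails, only the customers mapped to it are affected.
customer_ids = range(100)
affected = [cid for cid in customer_ids if lane_for_customer(cid) == "customers-1"]
affected_share = len(affected) / len(customer_ids)
```

With evenly distributed ids, losing `customers-1` affects one customer in four (`affected_share` is 0.25 here), while the other three lanes keep serving. Note that a plain modulus remaps most ids when the number of lanes changes, so adding lanes needs a migration plan.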

Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform.

4 comments

  • Abbott, Keeven, Fisher & Fortuna Consulting

    in October 2nd, 2008 @ 14:34

    […] The next step is to implement systems that answer the question of “which systems are causing the problem?”  In the ideal world you will have developed a fault isolative architecture to create “failure domains” that will isolate failures and help you determine the systems causing the problem.  Failing that, you need monitoring that can help indicate the rough areas of concern.  These are typically aggregated system statistics and monitoring similar to the real time application monitoring above (subsystem X is throwing errors at a rate 3 standard deviations above normal) or aggregated load, CPU, etc. for a group of systems (rather than a single system).  You want to ensure that this level of monitoring does not create a level of noise that forces your team to ignore the alerts. […]

  • Top 10 Internet Startup Scalability Killers – GigaOM

    in December 20th, 2009 @ 20:50

    […] failures in certain components don’t impact other zones of functionality. We refer to these fault isolation zones as “swim […]

  • Tweets that mention Fault Isolative Architectures or “Swimlaning” | AKF Partners Blog -- Topsy.com

    in December 22nd, 2009 @ 03:30

    […] This post was mentioned on Twitter by Sergio Bossa, Baronne Mouton. Baronne Mouton said: Fault Isolative Architectures or “Swimlaning” | AKF Partners Blog http://ow.ly/16bU6D […]

  • Revisiting the 1:10:100 Rule | AKF Partners Blog

    in April 28th, 2010 @ 08:03

    […] as it applies to technology architectures.  Swimlaning, as we’ve written about on this blog as well as in the book, is the extreme application of the “shard” or “pod” concept to […]