Two of our previous articles, Splitting Databases for Scale and Splitting Applications or Services for Scale have made references to a concept that we call “Swimlaning Architectures”. We sometimes also call “swimlanes” fault isolation zones or fault isolated architecture.
The basics of this concept are covered in our two previous posts, but we have not spent a lot of time discussing the reasons for such a split or approach in technology architecture.
In our definition, a “Swimlane” is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of said boundary. The benefit of such a failure domain is twofold:
1) Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain.
2) Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. As such, and depending upon approach only a portion of users or a portion of functionality of the product is affected.
A “swimlaned” architecture is one in which each failure domain is completely isolated. In order to achieve this, ideally there are no calls between swimlanes or failure domains. Synchronous calls are absolutely forbidden in this type of architecture as any synchronous call between failure domains, even with appropriate timeout and detection mechanisms is very likely to cause a series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a call to any other service in another domain, to any service outside of the domain, or if the domain receives calls from other domains or services.
It is acceptable, but not advisable, to have asynchronous calls between domains. If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not call port overloads on any services. Here is an interesting blog post about runaway scripts and their impact on Apache, PHP, and MySQL.
As we have previously indicated, a swimlane should have all of its services located within the failure domain. For instance, if database accesses are necessary the database with all appropriate information for that swimlane should exist within the same failure domain as all of the application and webservers necessary to perform the function or functions of the swimlane. Furthermore, that database should not be used for other requests of service from other swimlanes. Our rule is one production database on one host.
As we have indicated with our Scale Cube in the past, there are many ways in which to think about swimlaned architectures. You can think about them in terms of a separation of services e.g. “login” and “shopping cart” (two separate swimlanes) each having the web and app servers as well as all data stores located within the swimlane and answering only to systems within that swimlane. Corresponding to the Scale Cube we have previously introduced this would be a “Y” axis swimlane.
Another approach would be to perform a separation of your customer base or a separation of your order numbers or product catalog.
Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z axis swimlane along customer, order number or product id lines.
Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform.