Our post on the AKF Scale Cube made reference to a concept that we call “Fault Isolation” and sometimes “Swim lanes” or “Swim-laned Architectures”. We sometimes also call “swim lanes” fault isolation zones or fault isolated architecture.
Fault Isolation Defined
A “Swim lane” or fault isolation zone is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of said boundary. Think of this as the “blast radius” of failure meant to answer the question of “What gets impacted should any service fail?” The benefit of fault isolation is twofold:
1) Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain. Once something breaks, because the failure is limited in scope, it can be more rapidly identified and fixed. Recovery time objectives (RTO) are subsequently decreased which increases overall availability.
2) Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. The “blast radius” of a failure is contained. As such, and depending upon approach, only a portion of users or a portion of functionality of the product is affected. This is akin to circuit breakers in your house - the breaker exists to limit the fault zone for any load that exceeds a limit imposed by the breaker. Failure propagation is contained by the breaker popping and other devices are not affected.
Architecting Fault Isolation
A fault isolated architecture is one in which each failure domain is completely isolated. We use the term “swim lanes” to depict the separations. In order to achieve this, ideally there are no synchronous calls between swim lanes or failure domains made pursuant to a user request. User initiated synchronous calls between failure domains are absolutely forbidden in this type of architecture as any user-initiated synchronous call between fault isolation zones, even with appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a synchronous call to any other service in another domain, to any service outside of the domain, or if the domain receives synchronous calls from other domains or services. Again, “synchronous” is meant to identify a synchronous call (call, wait for a response) pursuant to any user request.
It is acceptable, but not advisable, to have asynchronouss calls between domains and to have non-user initiated synchronous calls between domains (as in the case of a batch job collecting data for the purposes of reporting in another failure domain). If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not call port overloads on any services. Here is an interesting blog post about runaway scripts and their impact on Apache, PHP, and MySQL.
As previously indicated, a swim lane should have all of its services located within the failure domain. For instance, if database [read/writes] are necessary, the database with all appropriate information for that swim lane should exist within the same failure domain as all of the application and webservers necessary to perform the function or functions of the swim lane. Furthermore, that database should not be used for other requests of service from other swim lanes. Our rule is one production database on one host.
The figure below demonstrates the components of software and infrastructure that are typically fault isolated:
Rarely are shared higher level network components isolated (e.g. border systems and core routers).
Sometimes, if practical, firewalls and load balancers are isolated. These are especially the case under very high demand situations where a single pair of devices simply wouldn’t meet the demand.
The remainder of solutions are always isolated, with web-servers, top of rack switches (in non IaaS implementations), compute (app servers) and storage all being properly isolated.
Applying Fault Isolation with AKF’s Scale Cube
As we have indicated with our Scale Cube in the past, there are many ways in which to think about swim laned architectures. Swim lanes can be isolated along the axes of the Scale Cube as shown below with AKF’s circuit breaker analogy to fault isolation.
Fault isolation in X-axis would mean replicating everything for high availability - and performing the replication asynchronously and in an eventually consistent (rather than a consistent) fashion. For example, when a data center fails the fault will be isolated to the one failed data center or multiple availability zones. This is common with traditional disaster recovery approaches, though we do not often advise it as there are better and more cost effective solutions for recovering from disaster.
Fault Isolation in the Y-axis can be thought in terms of a separation of services e.g. “login” and “shopping cart” (two separate swim lanes) each having the web and app servers as well as all data stores located within the swim lane and answering only to systems within that swim lane. Each portion of a page is delivered from a separate service reducing the blast radius of a potential fault to it’s swim lane.
While purposely not legible (fuzzy) the fake example above shows different components of a fictional business account from a fictional bank. Components of the page are separated with one component showing a summary, another component displaying more detailed information and still other components showing dynamic or static links - each derived from properly isolated services.
Another approach would be to perform a separation of your customer base or a separation of your order numbers or product catalog. Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z axis swim lane along customer, order number or product id lines. More beneficially, if we are interested in fastest possible response times to customers, we may split along geographic boundaries. We may have data centers (or IaaS regions) serving the West and East Coasts of the US respectively, the “Fly-Over States” of the US, and regions serving the EU, Canada, Asia, etc. Besides contributing to faster perceived customer response times, these implementations can also help ensure we are compliant with data sovereignty laws unique to different countries or even states within the US.
Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform. AKF has helped achieve a high availability through fault isolation. Contact us to see how we can help you achieve the same fault tolerance.
AKF Partners helps companies create highly available, fault isolated solutions. Send us a note - we’d love to help you!