This is one of several articles on recommended architectural principles. It goes into more depth on a concept referenced in our post on the AKF Scale Cube: what we call “Fault Isolation” or, more commonly, "Swim lanes" or "Swim-laned Architectures". We also sometimes call swim lanes fault isolation zones or fault-isolated architectures.
Fault Isolation Defined
A “swim lane” or fault isolation zone is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of the said boundary. Think of this as the "blast radius" of failure meant to answer the question of "What gets impacted should any service fail?" The benefit of fault isolation is twofold:
- Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain. Once something breaks, because the failure is limited in scope, it can be more rapidly identified and fixed. Time to recover is therefore reduced, which makes recovery time objectives (RTO) easier to meet and increases overall availability.
- Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. The "blast radius" of failure is contained. As such, and depending upon approach, only a portion of users or a portion of the functionality of the product is affected. This is akin to circuit breakers in your house – the breaker exists to limit the fault zone for any load that exceeds a limit imposed by the breaker. Failure propagation is contained by the breaker tripping, preserving power to devices which are not affected.
Architecting Fault Isolation
A fault isolated architecture is one in which each failure domain is completely isolated. We use the term “swim lanes” to depict the separations, similar to how a floating line of buoys keeps each swimmer in his or her lane during a race. In order to achieve this in systems architecture, ideally there are no synchronous calls between swimlanes or failure domains made pursuant to a user request.
User-initiated synchronous calls between failure domains are absolutely forbidden in this type of architecture. Any such call between fault isolation zones, even with appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain makes a synchronous call to any service in another domain or outside of the domain, or if it receives synchronous calls from other domains or services.
It is acceptable, but not advisable, to have asynchronous calls between domains and to have non-user-initiated synchronous calls between domains (as in the case of a batch job collecting data for the purposes of reporting in another failure domain). If such communication is necessary, it is very important to include failure detection and timeouts, even with the asynchronous calls, to ensure that retries do not cause port overloads on any services.
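As a minimal sketch of the timeout and retry discipline described above, the Python helper below bounds both the number of attempts and the wait between them, so a struggling service in another failure domain is not flooded with retries. The function name `call_with_bounded_retries`, its parameters, and the default values are illustrative assumptions, not part of any AKF tooling; `send` stands in for whatever non-user-initiated call crosses the domain boundary.

```python
import random
import time
from typing import Callable


class CrossDomainCallFailed(Exception):
    """Raised when every bounded attempt has failed."""


def call_with_bounded_retries(
    send: Callable[[float], object],   # hypothetical: performs the call, given a timeout in seconds
    timeout_s: float = 2.0,            # assumed per-attempt timeout
    max_attempts: int = 3,             # hard cap so retries cannot pile up
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    last_error = None
    for attempt in range(max_attempts):
        try:
            # The callee must honor the timeout; an unbounded wait would
            # tie up connections (ports) on the calling side.
            return send(timeout_s)
        except Exception as exc:  # failure detection: any error counts as a failed attempt
            last_error = exc
            if attempt == max_attempts - 1:
                break
            # Exponential backoff with full jitter spreads retries out so
            # many callers do not retry against the same service in lockstep.
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(random.uniform(0, delay))
    raise CrossDomainCallFailed(
        f"gave up after {max_attempts} attempts"
    ) from last_error
```

Giving up after a small, fixed number of attempts is a deliberate choice here: the caller's own failure domain stays healthy, and the failed work can be picked up by a later batch run rather than retried indefinitely.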
As previously indicated, a swim lane should have all of its services located within the failure domain. For instance, if database reads and writes are necessary, the database with all appropriate information for that swim lane should exist within the same failure domain as all of the application and web servers necessary to perform the function or functions of the swim lane. Furthermore, that database should not serve requests from other swim lanes. Our rule is one production database on one host.
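The rules in this section lend themselves to an automated check against a declared topology. The following Python sketch flags two violations: a database used by more than one swim lane, and a user-initiated synchronous call that crosses lanes. The `Call` record, the `find_violations` function, and the shape of the inputs are assumptions made for illustration; a real inventory of services and calls would come from your own service catalog or tracing data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    caller_lane: str
    callee_lane: str
    synchronous: bool
    user_initiated: bool


def find_violations(databases: dict[str, list[str]], calls: list[Call]) -> list[str]:
    """Return human-readable fault isolation violations.

    `databases` maps a database name to the swim lanes that use it
    (a hypothetical inventory format).
    """
    violations = []
    # Rule: a swim lane's database must not serve other swim lanes.
    for db, lanes in sorted(databases.items()):
        distinct = sorted(set(lanes))
        if len(distinct) > 1:
            violations.append(f"database '{db}' is shared by lanes {distinct}")
    # Rule: no user-initiated synchronous calls between swim lanes.
    # Asynchronous and non-user-initiated calls are tolerated, not flagged.
    for c in calls:
        if c.caller_lane != c.callee_lane and c.synchronous and c.user_initiated:
            violations.append(
                f"user-initiated synchronous call {c.caller_lane} -> {c.callee_lane}"
            )
    return violations
```

A check like this could run in a build pipeline, so that a new cross-lane dependency is caught before it reaches production.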
The figure below demonstrates the components of software and infrastructure that are typically fault isolated:
Rarely are shared higher level network components isolated (e.g. border systems and core routers).
Sometimes, if practical, firewalls and load balancers are isolated. This is especially the case under very high demand, where a single pair of devices simply wouldn't meet the demand.
The remaining components are always isolated: web servers, top of rack switches (in non-IaaS implementations), compute (app servers) and storage are all properly isolated.
Applying Fault Isolation with AKF’s Scale Cube
As we have indicated with the AKF Scale Cube in the past, there are many ways in which to think about swimlaned architectures. Swim lanes can be isolated along the axes of the Scale Cube as shown below with AKF’s circuit breaker analogy to fault isolation.
Fault isolation in the X-Axis would mean replicating everything for high availability, and performing the replication asynchronously and in an eventually consistent (rather than a consistent) fashion. For example, when a data center fails, the fault is isolated to that one data center or to the affected availability zones. This is common with traditional disaster recovery approaches, though we do not often advise it as there are better and more cost-effective solutions for recovering from a disaster.
Fault Isolation in the Y-Axis can be thought of in terms of a separation of services, e.g. “login” and “shopping cart” (two separate swim lanes), with each having its web and app servers as well as all data stores located within the swim lane and answering only to systems within that swim lane. Each portion of a page is delivered from a separate service, reducing the blast radius of a potential fault to its swim lane.
The example above of a commerce site shows different components of the page broken down into sections for login, buy again, promotions, shopping cart, and checkout. Each component would reside within separate applications, hosted on different servers with properly isolated services.
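To illustrate the reduced blast radius, here is a small Python sketch of page assembly in which each section comes from its own swim lane and a failure in one lane degrades only that section. It models what the page-rendering client (for example, the browser) might do; the `assemble_page` function and the section names are hypothetical, and each `fetch` callable is assumed to enforce its own timeout.

```python
from typing import Callable, Dict


def assemble_page(sections: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Fetch each page section independently of the others."""
    page = {}
    for name, fetch in sections.items():
        try:
            page[name] = fetch()
        except Exception:
            # Only this section is affected; the other swim lanes
            # still deliver their portions of the page.
            page[name] = f"[{name} is temporarily unavailable]"
    return page
```

For instance, if the shopping cart lane is down, a customer can still log in and see promotions, and only the cart section shows the placeholder.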
Another approach would be to perform a separation of your customer base or separation of your order numbers or product catalog. Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z-Axis swimlane along customer, order number, or product ID lines. More beneficially, if we are interested in the fastest possible response times to customers, we may split along geographic boundaries with each pointing to the closest data center within that region. Besides contributing to faster customer response times, these implementations can also help ensure we are compliant with data sovereignty laws (GDPR for example) unique to different countries or even states within the US.
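As a rough sketch of the two Z-Axis splits just described, the Python functions below route a request to a swim lane either by a modulus of the customer ID or by the customer's region. The lane names, region codes, and function names are placeholders invented for this example.

```python
from typing import Dict, Sequence

# Hypothetical swim lanes; each would hold its own web, app, and data tiers.
SWIM_LANES = ["lane-0", "lane-1", "lane-2", "lane-3"]

# Hypothetical mapping of regions to the data center serving them.
REGION_LANES = {"eu": "eu-west-dc", "us": "us-east-dc"}


def lane_for_customer(customer_id: int, lanes: Sequence[str] = SWIM_LANES) -> str:
    """Indiscriminate split: a modulus of the ID picks the swim lane."""
    return lanes[customer_id % len(lanes)]


def lane_for_region(region: str, region_lanes: Dict[str, str] = REGION_LANES) -> str:
    """Geographic split: each region maps to exactly one swim lane."""
    try:
        return region_lanes[region]
    except KeyError:
        # Fail loudly rather than fall back to another region's lane,
        # which could place data outside its required jurisdiction.
        raise ValueError(f"no swim lane configured for region {region!r}")
```

With four lanes, `lane_for_customer(10)` returns `"lane-2"`, since 10 modulo 4 is 2. Note that a plain modulus remaps most customers whenever the number of lanes changes, so adding a lane under this scheme implies a data migration.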
Combining the concepts of service and database separation into several fault-isolated failure domains creates a platform that is both scalable and highly available. AKF has helped many companies achieve high availability through fault isolation. Contact us to see how we can help you achieve the same fault tolerance.
AKF Partners helps companies create highly available, fault-isolated swim lane solutions. Send us a note - we'd love to help you!