
Growth Blog

Scalability and Technology Consulting Advice for SaaS and Technology Companies

Architecture Principles: Messaging Systems – Smart End Points, Dumb Pipes

July 29, 2019  |  Posted By: Marty Abbott

Asynchronous messaging systems are a critical component of many highly scalable and highly available architectures.  But, as with any other architectural component, these solutions also need attention to ensure availability and scalability.  The solution should scale along one of the scale cube axes, either X, Y or Z.  The solution should also both include and enable the principle of fault isolation.  Finally, it should scale both gracefully and cost effectively while enabling high levels of organizational scale.  These requirements bring us to the principle of Smart End Points and Dumb Pipes.

Fast time to market within software development teams is best enabled when we align architectures and organizations such that coordination between teams is reduced (see Conway’s Law and our white paper on durable cross functional product teams).  When services within an architecture communicate, especially in the case of one service “publishing” information for the consumption of multiple services, the communication often needs to be modified or “transformed” for the benefit of the consumers.  This transformation can happen at the producer, within the transport mechanism, or at the consumer.  Transformation by the producer for the sake of the consumer makes little sense, as the producer service and its associated team have little knowledge of the consumer needs, and it creates an unnecessary coordination task between producer and consumer.  Transformation “in flight” within the pipe similarly implies a team of engineers who must be knowledgeable about all producers and consumers, along with another unnecessary coordination activity.  Transformation by the consumer makes the most sense, as the consumer knows best what it needs from the message, and this approach eliminates reliance upon and coordination with other teams.  The principle of smart end points and dumb pipes therefore yields the lowest coordination between teams, the highest level of organizational scale, and the best time to market.

To be successful in achieving a dumb pipe, we introduce the notion of a pipe contract.  Such a contract explains the format of messages produced on and consumed from the pipe.  It may indicate that the message will be in a tag delimited format (XML, YAML, etc), abide by certain start and end delimiters, and, for the sake of extensibility, allow custom tags for new information or attributes.  The contract may also require that consumption not be predicated on a strict order of elements (e.g. title is always first) but rather on strict adherence to tag and value regardless of where each tag appears in the message.

Smart End Points Dumb Pipes Message Contract
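As a minimal sketch of consuming against such a contract (the tag names and message shape below are hypothetical, not part of any AKF specification), a smart end point reads values by tag rather than by position and simply ignores extension tags it does not understand:

```python
import xml.etree.ElementTree as ET

# Hypothetical message conforming to a pipe contract: tag delimited,
# extensible via custom tags, and not dependent on element order.
message = """
<order>
  <custom_tag name="promo">SPRING19</custom_tag>
  <order_id>12345</order_id>
  <total currency="USD">42.50</total>
</order>
"""

def consume(raw_message):
    """Consumer-side transformation: read values by tag name, never by
    position, and ignore unknown or custom tags."""
    root = ET.fromstring(raw_message)
    order_id = root.findtext("order_id")   # found wherever it appears
    total = float(root.findtext("total"))
    # The custom_tag extension is simply ignored by this consumer.
    return {"order_id": order_id, "total": total}

print(consume(message))  # {'order_id': '12345', 'total': 42.5}
```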

By ensuring that the pipe remains dumb, the pipe can now scale both more predictably and cost effectively.  As no transformation compute happens within the pipe, its sole purpose becomes the delivery of the message conforming to the contract.  Large messages do not go through computationally complex transformation, meaning low compute requirements and therefore low cost.  The lack of computation also means no odd “spikes” as transforms start to stall delivery and eat up valuable resources.  Messages are delivered faster (lower latency).  An additional unintended benefit is that because transforms aren’t part of message transit, a type of failure (computational/logical) does not hinder message service availability.

The 2x2 matrix below summarizes the options here, clearly indicating smart end points and dumb pipes as the best choice.

Smart End Points Dumb Pipes Comparison 2x2 Matrix

One important callout here is that “streams processing”, which is off-message platform evaluation of message content, is not a violation of the smart end points, dumb pipes concept.  The solutions performing streams processing are simply consumers and producers of messages, subscribing to the contract and transport of the pipe.

Summarizing all of the above, the benefits of smart end points and dumb pipes are:

  1. Lower cost of messaging infrastructure – pushes the cost of goods sold closer to the producer and consumer.  Allows messaging infrastructure to scale by number of messages instead of computational complexity of messages.  License cost is reduced as fewer compute nodes are needed for message transit.
  2. Organizational Scalability – teams aren’t reliant on transforms created by a centralized team.
  3. Low Latency – because computation is limited, messages are delivered more quickly and predictably to end consumers.
  4. Capacity and scalability of messaging infrastructure – increased significantly as compute is not part of the scale of the platform.
  5. Availability of messaging infrastructure – because compute is removed, so is a type of failure.  As such, availability increases.

Two critical requirements for achieving smart end points and dumb pipes:

  • Message contracts – all messages need to be of defined form.  Producers must adhere to that form as must consumers.
  • Team behaviors – must assure adherence to contracts.

AKF Partners helps companies build scalable, highly available, cost effective, low-latency, fast time to market products.  Call us – we can help!


The Circuit Breaker Pattern - Dos and Don'ts

July 8, 2019  |  Posted By: Marty Abbott

Circuit Breaker Pattern Overview

The microservice Circuit Breaker pattern is an automated switch capable of detecting extremely long response times or failures when calling remote services or resources.  The circuit breaker pattern proxies or encapsulates service A making a call to remote service or resource B.  When error rates or response times exceed a desired threshold, the breaker “pops” and returns an appropriate error or message regarding the interface status.  Doing so allows calls to complete more quickly, without tying up TCP ports or waiting for traditional timeouts.  Ideally the breaker is “healing” and senses the recovery of B thereby resetting itself.
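As a minimal sketch of the behavior described above (the thresholds, timings, and error handling here are illustrative assumptions rather than a prescribed implementation), a breaker might look like this in Python:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after consecutive failures,
    fails fast while open, and probes for recovery after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, remote_fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                # Fail fast: do not tie up sockets waiting on a known-bad dependency.
                raise RuntimeError("circuit open: remote service B unavailable")
            # Cool-down elapsed: allow one probe call ("half open").
        try:
            result = remote_fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        # Success: reset ("heal") the breaker.
        self.failure_count = 0
        self.opened_at = None
        return result
```

Service A would wrap its calls to B with something like breaker.call(b_client.get, ...) so that a popped breaker returns immediately rather than waiting on a traditional timeout.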

Disambiguation

The circuit breaker analogy works well in that it protects a given circuit for calls in series.  Unfortunately, it misses the true analogy of tripping to protect against the propagation of a failure to other components on other circuits.  We often use the term circuit breaker in our practice to refer either to the technique of fault isolation or to the microservice pattern of handling service to service faults.  In this article, we use the term consistent with the microservice meaning.
Microservice Circuit Breaker Overview

Problems the Circuit Breaker Fixes

Generally speaking, we consider service to service calls to be an anti-pattern to be avoided whenever possible due to the multiplicative effect of failure and the resulting lowering of availability.  There are, however, times when you just can’t get around making distant calls.  Examples are:

  1. Resource (e.g. database) Calls: Necessary to interact with ACID or NoSQL Solutions.
  2. Third Party Integrations: Necessary to interact with any third party.  While we prefer these to be asynchronous, sometimes they must be synchronous.

In these cases, it makes sense to add a component, such as the circuit breaker, to help make the service more resilient.  While the breaker won't necessarily increase the availability of the service in question, it may help reduce other secondary and tertiary problems such as the inability to access a service for troubleshooting or restoration upon failure.

Principles to Apply

  1. Avoid the need for circuit breakers whenever possible by treating calls in series as an anti-pattern.
  2. When calls must be made in series, attempt to use an asynchronous and non-blocking approach.
  3. Use the circuit breaker to help speed recovery and identification of failure, and free up communication sockets more quickly.

When to use the Circuit Breaker Pattern

  • Useful for calls to resources such as databases (ACID or BASE).
  • Useful for third party synchronous calls over any distance.
  • When internal synchronous calls can't otherwise be avoided architecturally, useful for service to service calls under your control.

Key Takeaways

The circuit breaker won't fix availability problems resulting from a failed service or resource.  It will surface the effects of that failure more rapidly, which will hopefully:

  • Free up communication resources (like TCP sockets) and keep them from backing up.
  • Help keep shared upstream components (e.g. load balancers and firewalls) from similarly backing up and failing.
  • Help keep the failed component or service accessible for more rapid troubleshooting and alerting.
  • Always ensure alerts are fired on breaker-open situations to aid in a faster time to detect (TTD).

AKF Partners has helped hundreds of companies implement new microservice architectures and migrate existing monolithic products to microservice architectures.  Give us a call – we can help!


Microservice Bulkhead Pattern - Dos and Don'ts

June 27, 2019  |  Posted By: Marty Abbott

Bulkhead Pattern Overview

Bulkheads in ships separate components or sections of a ship such that if one portion of a ship is breached, flooding can be contained to that section.  Once contained, the ship can continue operations without risk of sinking.  In this fashion, ship bulkheads perform a similar function to physical building firewalls, where the firewall is meant to contain a fire to a specific section of the building.

The microservice bulkhead pattern is analogous to the bulkhead on a ship.  By separating both functionality and data, failures in one component of a solution do not propagate to other components.  While this separation can also help scale what might otherwise be monolithic datastores, the bulkhead is first and foremost a pattern for implementing the AKF principle of “swimlanes” or fault isolation.

Bulkhead pattern usage
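The pattern proper is about separating infrastructure and data, but an in-process analogy can make the idea concrete.  The sketch below (dependency names and pool sizes are hypothetical) gives each downstream dependency its own bounded worker pool so that a slow dependency cannot exhaust the resources needed to reach the others:

```python
from concurrent.futures import ThreadPoolExecutor

# In-process analogy of a bulkhead (illustrative only): each downstream
# dependency gets its own bounded pool, so a slow "catalog" dependency
# cannot exhaust the threads needed to reach "checkout".
bulkheads = {
    "checkout": ThreadPoolExecutor(max_workers=10),
    "catalog": ThreadPoolExecutor(max_workers=10),
}

def call_dependency(name, fn, *args):
    # Work for one dependency is confined to that dependency's pool.
    future = bulkheads[name].submit(fn, *args)
    return future.result(timeout=2.0)  # bound waiting as well
```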


Problems the Bulkhead Pattern Fixes

The bulkhead pattern helps to fix a number of different quality of service related issues.

  • Propagation of Failure:  Because solutions are contained and do not share resources (storage, synchronous service-to-service calls, etc), their associated failures are contained and do not propagate.  When a service suffers a programmatic (software) or infrastructure failure, no other service is disrupted.
  • Noisy Neighbors:  If implemented properly, network, storage and compute segmentation ensure that abnormally large resource utilization by a service does not affect other services outside of the bulkhead (fault isolation zone).
  • Unusual Demand:  The bulkhead protects other resources from services experiencing unpredicted or unusual demand.  Other resources do not suffer from TCP port saturation, resulting database deterioration, etc.

Principles to Apply

  1. Share Nearly Nothing:  As much as possible, services that are fault isolated or placed within a bulkhead should not share databases, firewalls, storage, load balancers, etc.  Budgetary constraints may limit the application of unique infrastructure to these services.  The diagram following this list helps explain what should never be shared and what may be shared for cost purposes.  The same principles apply, to the extent that they can be managed, within IaaS or PaaS implementations.
  2. Avoid synchronous calls to other services:  Service to service calls extend the failure domain of a bulkhead.  Failures and slowness transit blocking synchronous calls and therefore violate the protection offered by a bulkhead.

Bulkhead pattern usage

Put another way, the boundary of a bulkhead or failure domain is the largest boundary across which no critical infrastructure is shared and no synchronous inter-service calls exist. 

Anti-Patterns to Avoid

The following anti-patterns each rely on either synchronous service to service communication or the sharing of data solutions, and as such represent solutions that should not be present within a bulkhead: service calls in series, service and data fan out, service and data fuses, and the service mesh (each covered in our microservice anti-pattern articles).

When to use the Bulkhead Pattern

  • Apply the bulkhead pattern whenever you want to scale a service independent of other services.
  • Apply the bulkhead pattern to fault isolate components of varying risk or availability requirements.
  • Apply the bulkhead pattern to isolate geographies for the purposes of increased speed/reduced latency such that distant solutions do not share or communicate and thereby slow response times.

AKF Partners has helped hundreds of companies implement new microservice architectures and migrate existing monolithic products to microservice architectures.  Give us a call – we can help!


Command and Query Responsibility Segregation (CQRS): Dos and Don’ts

June 24, 2019  |  Posted By: Marty Abbott

CQRS Overview

The microservice CQRS pattern is most commonly employed to help scale what might be otherwise monolithic datastores.  Per the X-axis of the AKF Scale Cube, a write instance of a datastore receives all creates, updates and deletes (collectively the “commands” within CQRS) while one or more read instances receive reads (the “queries” within CQRS).

Simple Implementations

Most relational (ACID compliant) databases have asynchronous and eventually consistent transfer mechanisms available to create “replicants” or “replica sets” of a “master” database.  These are often called master databases (the write database) and slave databases (the read databases).  The easiest implementation for CQRS then is to rely on native replication technology to create one or more slave databases with the same schema.  The service is then separated into a write service and a read service, each with its own endpoint connected to the appropriate database.

Command and Query Responsibility Segregation Pattern
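A minimal sketch of this simple implementation follows.  The connection targets, table, and service names are assumptions, and SQLite stands in here for any relational engine whose native replication keeps the read replica eventually consistent with the master:

```python
import sqlite3  # stand-in for any relational engine with native replication

# Hypothetical connection targets: in production these would be the master
# (write) instance and a replica (read) instance kept in sync by the
# database's own asynchronous, eventually consistent replication.
WRITE_DSN = "master.db"
READ_DSN = "replica.db"

class CommandService:
    """Handles creates, updates and deletes against the write instance."""
    def create_order(self, order_id, total):
        # The orders table is assumed to already exist in both instances.
        with sqlite3.connect(WRITE_DSN) as conn:
            conn.execute(
                "INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total)
            )

class QueryService:
    """Handles reads against a replica; results may be eventually consistent."""
    def get_order(self, order_id):
        with sqlite3.connect(READ_DSN) as conn:
            return conn.execute(
                "SELECT id, total FROM orders WHERE id = ?", (order_id,)
            ).fetchone()
```

Each service is then exposed behind its own endpoint so that commands and queries scale and fail independently.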

Many NoSQL solutions also offer similar capabilities.  Attribute names vary by implementation, but the user may identify the number of copies or replica sets for any piece of information.  Further, the user can also often identify how these sets are to be used.  MongoDB, for instance, allows users to identify from which elements of a replica set a read should be performed as well as the level of “synchronicity” between a write and the remaining read elements.  Ideally, this level of synchronicity should be as loose as possible to maximize the value of BASE and avoid the limitations of Brewer’s CAP Theorem.
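With MongoDB and the PyMongo driver, for example, that looseness might be expressed roughly as follows (the replica set, database, and collection names are assumptions):

```python
from pymongo import MongoClient, ReadPreference, WriteConcern

# Hypothetical replica set; readPreference and write concern are standard
# MongoDB connection/collection options.
client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")
db = client["shop"]

# Reads may go to secondaries (looser consistency, better read scale);
# writes acknowledge from the primary only (w=1) to keep write latency low.
orders_read = db.get_collection(
    "orders", read_preference=ReadPreference.SECONDARY_PREFERRED
)
orders_write = db.get_collection("orders", write_concern=WriteConcern(w=1))
```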

Advanced Implementations

Highly tuned solutions, or solutions that need to support significant transaction volumes, may benefit from differentiation between the write and read schema instance.  There may be several reasons for this, including:

  • Only a subset of the data that is read needs to be written.  For instance, read data may include “static” elements that add no value to the (C)reate, (U)pdate or (D)elete process.
  • An elimination of certain relationships in the read or write schema where those relationships add no value.  Read databases, for instance, may need more relationships than the write database in order to perform complex reporting.
  • A reduction or significant difference in indices between those needed for writes (ideally small given that each index needs to be updated for each write, thereby increasing write response time) and those needed for reads.

In these cases, native replication capabilities in either ACID (relational) or BASE (NoSQL) datastores may not fulfill the desired outcomes.  Engineers may need to rely on eventually consistent transfer technologies including queues or buses along with transformation compute capacity. 

Another somewhat “advanced” concept within relational solutions is to use “Master-Master” replication technology as “Master-Slave”.  At AKF, we prefer not to rely on multi-master solutions for distributing writes across multiple database instances.  Very often, “in doubt” transactions can cause a pinging effect against multiple master databases and sometimes result in either starvation or deadlock (the Dining Philosopher’s Problem). 

But if we implement a multi-master replication solution and use it as master-slave, we avoid those concerns: transactional coherence and consistency are maintained because there is a single write database.  Should that write database fail, we can easily “swing” future transactions to the other master previously operating as a slave, with a very low probability of conflict.

Benefits of CQRS

  • Scalability of transactions at low development cost and comparatively low conceptual complexity (conceptual cost).
  • Availability/redundancy of solutions.
  • Distribution of X-axis (read copies) geographically for lower latency on read request.

Drawbacks to CQRS

  • Multiple copies of data for scale, leading to potentially higher costs of goods sold.
  • Higher number of failures in a production environment (but typically with increasing levels of customer perceived availability).
  • Reads are very often “eventually consistent” as opposed to immediately consistent.  This approach will work for greater than 99.9% of use cases in our experience.

How to Use CQRS

  • Split writes (commands) from reads (queries) both in actions/methods/services and the datastores they use.
  • Implement an eventual consistency solution between the datastores.
  • If implementing more than one read instance, implement a proxy or load balancer with a single service endpoint to “spray” requests across all instances.

What to NEVER do with CQRS

Never employ multiple “write masters” or “command masters” with CQRS where each master shares ownership of the same data elements.

AKF Partners has helped hundreds of companies implement new microservice architectures and migrate existing monolithic products to microservice architectures.  Give us a call – we can help!


Strangler Pattern: Dos and Don’ts

June 12, 2019  |  Posted By: Marty Abbott

Strangler Pattern Overview

The microservice Strangler pattern is employed when teams migrate functionality from an old solution to one or more new implementations.  While the pattern can be used for any old to new service migration, the diagram below depicts the common use case of migrating from a monolithic solution to a microservice implementation:

Depiction of the strangler pattern and progression over time during migration

The solution above starts with a single monolithic application.  Step one is to implement a proxy (or context switch) that can separate requests by type (Y-Axis or service segmentation) or, if attempting to evaluate the new solution for efficacy against an existing implementation, by transaction volume (X-Axis). 
The pattern progresses with the monolith growing increasingly smaller as newly implemented services begin to take requests previously bound for the monolith. 
Finally, the proxy is removed once the service disaggregation is completed.

Benefits of Strangler

  • The Strangler pattern allows for graceful migration from a service to one or more replacement services.
  • If implemented properly, with the ability to “roll back”, the pattern allows relatively low risk in migrating to new service(s).
  • The pattern can be used for versioning of APIs.
  • Similar to versioning above, the pattern can be used for legacy interactions (the old service remains for solutions that aren’t or won’t be upgraded).

Drawbacks to Strangler

  • If implemented with a “new” or additional service – something in addition to a prior proxy or load balancer – the solution decreases availability through the multiplicative effect of failure.
  • Per above, additional (new) services will increase latency.

How to Use Strangler

  • Use the Strangler for versioning or migration to new services.
  • Employ within an existing proxy – either a software-based load balancer (such as Nginx) or a layer 7 capable load balancer/context switch (F5 or Netscaler).
  • If using an existing proxy is not possible, ensure the client is responsible for choosing the resource.
  • Keep rules updated as services are migrated.
  • Prune rules as no longer needed.

What to NEVER do with Strangler

Never employ a new service and increase the call depth with Strangler.  Such a service is often called a “façade”.  Your existing proxy or load balancer solutions likely give you this flexibility today.

Facade anti-pattern used with Strangler Microservice Pattern

Never allow the solution to become a bottleneck or a single point of failure.  Always ensure that the solution is scalable along the X-axis for both availability and scalability.

X-Axis Approach

The X-Axis approach takes a percentage of calls for any unique service endpoint and splits them between the old service and the new service.  This approach is useful to A/B test the solution for end-user efficacy, response time, availability, and cost of operations (cost per transaction).  It also allows for graceful “dialing up” or “dialing down” of the transaction volume (say from 1% to 51%).
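One illustrative way a proxy rule might implement this dial (the endpoint names and percentage below are hypothetical) is to hash a stable key such as the user id; each user then stays pinned to one implementation, which keeps A/B comparisons and rollback clean:

```python
import hashlib

NEW_SERVICE_PERCENT = 10  # "dial" this up (1 -> 51 -> 100) as confidence grows

def route(user_id, path):
    """Deterministically send a fixed percentage of traffic to the new service.
    Hashing a stable key keeps each user on one implementation."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < NEW_SERVICE_PERCENT:
        return f"https://new-service.internal{path}"  # hypothetical endpoints
    return f"https://monolith.internal{path}"
```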

Y-Axis Approach

This is the traditional usage for Strangler.  It takes a monolith servicing N unique capabilities and partitions them along either verb/service boundaries (e.g. checkout separated from everything else) or noun/resource boundaries (e.g. catalog related actions from customer data actions).

Best Approach

If time allows, implement both the X and Y approaches.  X gives you rollback as well as A/B testing and significantly reduces your risk, while Y allows service disaggregation.

AKF Partners has helped hundreds of companies implement and improve microservice based architectures.  Give us a call, we can help you with your transition.


Sidecar Pattern: The Dos and Don’ts

June 5, 2019  |  Posted By: Marty Abbott


Sidecar Pattern Overview

The Sidecar Pattern is meant to allow the deployment of components of an application into separate, isolated, and encapsulated processes or containers.  This pattern is especially useful when there is a benefit to sharing common components between microservices, as in the case of logging utilities, monitoring utilities, configuration routines, etc.

The Sidecar Pattern name is an analogy referencing the one-seat car sometimes bolted alongside a motorcycle. 

Benefits of Sidecar

Sidecar comes with many benefits:

  • Use of multiple languages (polyglot) or technologies for each component.  This is especially useful if a language is especially strong in a necessary area (e.g. Python for Machine Learning, or R for statistical work) or if an opensource solution can be leveraged to eliminate in-house specialization (e.g. the use of NGINX for certain network-related functions).
  • Separation of what would otherwise be a monolith, and if used properly, fault isolation of associated services.
  • Conceptually easy interactions between components similar to those provided by libraries, or service calls between microservices.
  • Lower latency than traditional service calls to “other” services as the Sidecar lives in the same processing environment (VM or physical server) – albeit typically in a separate container.
  • Similar to the use of libraries, allows for ownership by individual teams and organizational scalability of a larger team.
  • Similar to the use of dynamically-loadable libraries, allows for independent release by teams of various shared usage components.

Drawbacks to Sidecar

Regardless of implementation (poly- or mono- glot), Sidecar has some drawbacks compared to the use of libraries:

  • Higher inter-process communication latency – because most implementations are service calls over the loopback interface (127.0.0.1), latency increases compared to passing the call through memory as a library call would.
  • Size – especially in polyglot implementations but even in monoglot implementations – Containerization leads to multiple copies of similar libraries and increased memory utilization for comparable operations relative to the use of libraries.
  • Environments – it is difficult to create any notion of fault isolation with Sidecar without containerization technologies.  VM technologies (a Sidecar in a VM separate from the host or calling solution) are not an option, as the result is then a Fan Out or Mesh anti-pattern rather than a local call.

When to Use Sidecar

Sidecar is a compelling alternative to libraries for cases where the increase in latency associated with local service messaging does not impact end-user response times.  Examples of these are asynchronous logging, out of band monitoring, and asynchronous messaging capabilities.  Circuit breakers (time-based request/response timeouts) are also a good example of a Sidecar implementation.
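As an illustrative sketch of such an asynchronous use (the sidecar port, path, and field names are assumptions), an application can hand log events to an in-memory queue and let a background thread ship them over the loopback interface to a co-located logging sidecar, keeping logging out of the user-facing request path:

```python
import json
import queue
import threading
import urllib.request

# Hypothetical logging sidecar listening on the loopback interface of the
# same host or pod; the application never blocks a user request on logging.
SIDECAR_URL = "http://127.0.0.1:9880/logs"
log_queue = queue.Queue()

def log(event, **fields):
    # Called from request-handling code: enqueue and return immediately.
    log_queue.put({"event": event, **fields})

def _ship_logs():
    while True:
        entry = log_queue.get()
        data = json.dumps(entry).encode()
        req = urllib.request.Request(
            SIDECAR_URL, data=data, headers={"Content-Type": "application/json"}
        )
        try:
            urllib.request.urlopen(req, timeout=1.0)
        except OSError:
            pass  # logging must never take down the primary service

threading.Thread(target=_ship_logs, daemon=True).start()
```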

When to Avoid Sidecar

Never use a Sidecar Pattern for synchronous activities that must complete prior to generating a user response.  Doing so will add some delay to end-user response times.

AKF also advises staying away from Sidecar for synchronous communications between services where doing so requires Sidecar to know all endpoints for each service.  A specific example we advise against is having every endpoint (instance) of Service A (e.g. add-to-cart) know of every endpoint (instance) of Service B (e.g. decrement-SKU).  A graphic of this example is given below:

Sidecar is useful for several components but do not use it for allowing every endpoint to communicate to every other endpoint

The above graphic indicates the coordination between just two services and the instances that comprise that service.  Imagine a case where all services may communicate to each other (as in the broader Mesh anti-pattern).  Attempting to isolate faults becomes nearly impossible.

If Service A sometimes fails while calling Service B, how do you know which component is failing?  Is it a failure in Service A, Service A’s Sidecar proxy, or Service B?  It is easier to have a smaller number of proxies (albeit at a higher latency cost given non-local communication) handle the transactions, allowing for easier fault identification.

AKF Partners has helped hundreds of companies move from monolithic solutions to services and microservice architectures.  Give us a call, we can help you with your transition.


Microservice Anti-Pattern: The Service Mesh

May 8, 2019  |  Posted By: Marty Abbott

This article is the sixth in a multi-part series on microservices (micro-services) anti-patterns.  The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture) and many of the mistakes or failure points teams create in service splits.  Articles two and three cover anti-patterns for service and data fan out respectively.  The fourth article covers an anti-pattern for disparate services sharing a common service deployment using the fuse metaphor.  The fifth article expands the fuse metaphor from service fuses to data fuses.

Howard Anton, the author of my college Calculus textbook, was fond of the following phrase:  “It should be intuitively obvious to the casual observer….”.  The clause immediately following that phrase was almost inevitably something that was not obvious to anyone – probably not even the author.  Nevertheless, the phrase stuck with me, and I think I finally found a place where it can live up to its promise. The Service Mesh, the topic of this microservice anti-pattern, is the amalgamation of all the anti-patterns to date.  It contains elements of calls in series, fuses and fan out.  As such, it follows the rules and availability problems of each of those patterns and should be avoided at all costs. 

This is where I need to be very clear, as I’m aware that the Service Mesh has a very large following.  This article refers to a mesh as a grouping of services with request/reply relationships.  Or, put another way, a “Mesh” is any solution that violates repeatedly the anti-patterns of “tree lights”, “fuses” or “fan out”.  If you use “mesh” to mean a grouping of services that never call each other, you are not violating this anti-pattern.

What constitutes a service mesh?

What is NOT a service mesh?

The reason mesh patterns are a bad idea are many-fold:

1)  Availability:  At the extreme, the mesh is subject to the equation N*(N-1)/2, the number of edges in a fully connected graph with N vertices or nodes.  Asymptotically, this grows as N^2.  To make availability calculations simple, the availability of a complete mesh can be approximated as the lowest service availability (A) raised to the power of N, the number of services in the critical path.  If the lowest availability of a service with appropriate X-axis cloning (multiple instances) is 99.9%, and the service mesh has 10 different services, the availability of your service mesh will approximate 99.9%^10.  That’s roughly 99% availability (a quick arithmetic check appears after this list) – perhaps good enough for some solutions but horrible by most modern standards.

2) Troubleshooting:  When every node can communicate with every other node, or when the “connectiveness” of a solution isn’t completely understood, how does one go about finding the ailing service causing a disruption?  Because failures and slowness transit synchronous links, a failure or slowness in one or more services will manifest itself as failures and slowness in all services.  Troubleshooting becomes very difficult.  Good luck in isolating the bad actor.

3) Hygiene:  I recall sitting through computer science classes 30 years ago and hearing the term “spaghetti code”.  These days we’d probably just call it “crap”, but it refers to the meandering paths of poorly constructed code.  Generally, it leads to difficulty in understanding, higher rates of defects, etc.  Somewhere along the line, some idiot has brought this same approach to deployments.  Again, borrowing from our friend Anton, it should be intuitively obvious to the casual observer that if it’s a bad practice in code it’s also a bad practice in deployment architectures.

4) Cost to Fix: If points 1 through 3 above aren’t enough to keep you away from connected service meshes, point 4 will hopefully help tip the scales.  If you implement a connected mesh in an environment in which you require high availability, you will spend a significant amount of time and money refactoring it to relieve the symptoms it will cause.  This amount may approximate your initial development effort as you remove each dependent anti-pattern (series, fuse, fan-out) with an appropriate pattern.
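As a quick check of the availability arithmetic in point 1 above (assuming each service in the critical path simply multiplies availability):

```python
# 10 services, each at 99.9% availability, multiplied together.
lowest_availability = 0.999
services = 10
print(lowest_availability ** services)  # ~0.990, i.e. roughly 99%
```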


Microservice Anti-Pattern:  The Service Mesh

Fixing a mesh is not an easy task.  One solution is to ensure that no service blocks while waiting for a request to any other service to complete.  Unfortunately, this pattern is not always easy or appropriate to implement.

Microservice Anti-Pattern Service Mesh Fix - Async Interactions

Another solution is to deploy each service as a service when it is responding to an end-user request, and as a library within any other service that needs it.

Microservice Anti-Pattern Service Mesh Fix - Libraries

Finally, you can traverse each service node and determine where services can be collapsed, or apply any of the other fixes identified within the tree light, fuse, or fan out anti-pattern articles.


AKF Partners helps companies create scalable, fault tolerant, highly available and cost effective architectures to meet their product needs.  Give us a call, we can help


Microservice Anti-Pattern: Data Fuse

May 8, 2019  |  Posted By: Marty Abbott

This article is the fifth in a multi-part series on microservices (micro-services) anti-patterns.  The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture) and many of the mistakes or failure points teams create in service splits.  Articles two and three cover anti-patterns for service and data fan out respectively.  The fourth article covers an anti-pattern for disparate services sharing a common service deployment using the fuse metaphor.

The Data Fuse, the topic of this microservice anti-pattern, exists when two or more unique services share a commonly deployed data store.  This data store can be any persistence solution from physical file services, to a common storage area network, to relational (ACID) or NoSQL (BASE) databases.  When the shared data solution “C” fails, service A and B fail as well.  Similarly, when data solution “C” becomes slow, slowness under high demand propagates to services A and B. 

As is the case with any group of services connected in series, Service A’s theoretical availability is the product of its individual availability combined with the availability of data service C.  Service B’s theoretical availability is calculated similarly.  Problems with service A can propagate to service B through the “fused” data element.  For instance, if service A experiences a runaway scenario that completely consumes the capacity of data store C, service B will suffer either severe slowness or will become unavailable. 

Microservices Anti-Pattern - The Data Fuse

The easiest pattern solution for the data fuse is simply to merge the separate services.  This makes the most sense if the services can be owned by the same team.  While availability doesn’t significantly increase (service A can still affect service B, and the data store C still affects both), we don’t have the confusion of two services interacting through a fuse.  But if the rate of change for each service indicates that it needs separate teams, we need to evaluate other options (see “when to split services” for a discussion on drivers of service splits).

Data Fuse Microservices Anti-Pattern Fix:  Merge Services

Another way to fix the anti-pattern is to use the X axis of the Scale Cube as it relates to databases. An easy example of this is the sharing of account data between a sign-up service and a sign-in (AUTHN and AUTHZ) service.  In this example, given that sign-up is a write-based service and sign-in is a read based service we can use the X axis of the Scale Cube and split the services on a read and write basis.  To the extent that B must also log activity, it can have separate tables or a separate schema that allows that logging.  Note that the services supporting this split need not be unique - they can in fact be the exact same service - but the traffic they serve is properly segmented such that the read deployment receives only read traffic and the write deployment receives only write traffic.

Data Fuse Microservices Anti-Pattern Fix:  X Axis Read-Write Splits

 

If reads and writes aren’t an easily created X axis split, or if we need the organizational scale engendered by a Y-axis split, we need to be a bit more creative.  An example pattern comes from the differences between add-to-cart and checkout in a commerce solution.  Some functionality is shared between the components, including the notion of showing calculated sales tax and estimated shipping.  Other functionality may be unique, such as heavy computation in add-to-cart for related and recommended items, and up-sale opportunities such as gift wrapping or expedited shipping in checkout.  We also want to keep carts (session data) around in order to reach out to customers who have abandoned carts, but we don’t want this ephemeral clutter clogging the data of checkout.  This argues for separation of data for temporal (response time) reasons.  It also allows us to limit PCI compliance boundaries, removing services (add to cart) from the PCI evaluation landscape.

Data Fuse Microservices Anti-Pattern Fix:  Y Axis Data Split


Transition from add-to-cart to checkout may be accomplished through the client browser, or done as an asynchronous back-end transfer of state with the browser polling for completion, so as to allow for good fault isolation.  We refactor the datastore to separate data by service along the Y axis of the scale cube.

Data Fuse Microservices Anti-Pattern Fix:  Moving Data when necessary for Y Axis Data Split

AKF Partners helps companies create scalable, fault tolerant, highly available and cost-effective architectures to meet their product needs.  Give us a call, we can help.


Microservice Anti-Pattern: Service Fuse

April 27, 2019  |  Posted By: Marty Abbott

This article is the fourth in a multi-part series on microservices (micro-services) anti-patterns.  The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture) and many of the mistakes or failure points teams create in service splits.  Articles two and three cover anti-patterns for service and data fan out respectively. 

The Service Fuse, the topic of this microservice anti-pattern, exists when two or more unique services share a commonly deployed service pool.  When the shared service “C” fails, service A and B fail as well.  Similarly, when service “C” becomes slow, slowness under high demand propagates to services A and B. 

As is the case with any group of services connected in series, Service A’s theoretical availability is the product of its individual availability combined with the availability of service C.  Service B’s theoretical availability is calculated similarly.  Under unusual conditions, the availability of A could also impact B similar to the way in which service fan out works.  Such would be the case if A somehow holds threads for C, thereby starving it of threads to serve B.

Because overall availability is negatively impacted, we consider the Service Fuse to be a microservice anti-pattern.

Microservice Anti-Pattern Sharing a common service deployment


The easiest and most common method to fault isolate the failure and response time propagation of Service C is to deploy it separately (in separate pools) for both Service A and B.  In doing so, we ensure that C does not fail for one service as a result of unusual demand from the other.  We also isolate failures due to unique requests that might be made by either A or B.  In doing so, we do incur some additional operational costs and additional coordination and overhead in releases.  But assuming proper automation, the availability and response time improvements are often worth the minor effort.


Solution to Service Fuse Anti-Pattern - deploy same service separately

As with many of our other anti-patterns, we can also employ dynamically loadable libraries rather than separate service deployments.  While this approach carries some of the slight overhead (again assuming proper automation) of the separate service deployments above, it often also benefits from significant server-side response time decreases by eliminating network transit. 

Solution to Service fuse Anti-Pattern - deploy service separately as libraries

We often see teams over-emphasizing the cost of additional deployments.  But the separate service deployment or dynamically loadable library deployment seldom results in significantly greater effort.  The real implication of such a split is dividing the capacity of a shared pool relative to the demand split between services A and B (e.g. 50/50, 90/10, etc) and adding a small number of additional instances for capacity.  Is 5 to 10% additional operational cost and seconds of additional deployment time worth the significant increase in availability?  Our experience is that most of the time it is.


Microservice Anti-Pattern: Data Fan Out

April 21, 2019  |  Posted By: Marty Abbott

This article is the third in a multi-part series on microservices (micro-services) anti-patterns.  The introduction of the first article, Service Calls In Series, covers the benefits of splitting services, many of the mistakes or failure points teams create in service splits, and the first anti-pattern.  The second article, Service Fan Out, discusses the anti-pattern of a single service acting as a proxy or aggregator of multiple services.

Data Fan Out, the topic of this microservice anti-pattern, exists when a service relies on two or more persistence engines with categorically unique data, or categorically similar data that is not meant to be processed in parallel.  “Categorically unique” means that the data is in no way related.  Examples of categorical uniqueness would be a database that stores customer data and a separate database that stores catalog data.  Instances of the same data, such as two separate databases each storing half of a product catalog, are not categorically unique.  Splitting of similar data is often known as sharding.  Such “sharded” instances only violate the Data Fan Out pattern if:

1) They are accessed in series (database 1 is accessed and subsequently database 2 is accessed) –or-

2) A failure or slowness in either database, even if accessed in parallel, will result in a very slow or unavailable service.

Persistence engine means anything that stores data as in the case of a relational database, a NoSQL database, a persistent off-system cache, etc. 

Anytime a service relies on more than one persistence engine to perform a task, it is subject to lower availability and a response time equivalent to the slower of the N data stores to which it is connected.  Like the Service Fan Out anti-pattern, the availability of the resulting service (“Service A”) is the product of the availability of the service and its constituent infrastructure multiplied by the availability of each of the N data stores to which it is connected. 

Further, the response time of the service is tied to the runtime of Service A added to the response time of the slowest of the connected solutions.  If any of the N databases becomes slow enough, Service A may not respond at all. 

Because overall availability is negatively impacted, we consider Data Fan Out to be a microservice anti-pattern.

Microservice Anti-Pattern - Data Fan Out

One clear exception to the Data Fan Out anti-pattern is the highly parallelized querying done of multiple shards for the purpose of getting near linear response times out of large data sets (similar to one component of the MapReduce algorithm).  In a highly parallelized case such as this, we propose that each of the connections have a time-out set to disregard results from slowly responding data sets.  For this to work, the result set must be impervious to missing data.  As an example of an impervious result set, having most shards return for any internet search query is “good enough”.  A search for “plumber near me” returns 19/20ths of the “complete data”, where one shard out of 20 is either unavailable or very slow.  But having some transactions not present in an account query of transactions for a checking account may be a problem and therefore is not an example of a resilient data set.
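An illustrative sketch of this highly parallelized, “good enough” approach follows (the shard objects and their search method are hypothetical): query every shard in parallel, keep whatever returns within the deadline, and disregard slow or failed shards:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

def search_all_shards(query, shards, deadline=0.2):
    """Query every shard in parallel and keep whatever answers within the
    deadline; slow or failed shards are simply left out of the result set.
    Only appropriate when a partial result set is acceptable (e.g. web search)."""
    pool = ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(shard.search, query) for shard in shards]
    results = []
    try:
        for future in as_completed(futures, timeout=deadline):
            try:
                results.extend(future.result())
            except Exception:
                pass  # a failed shard is disregarded
    except TimeoutError:
        pass  # deadline hit: return what we have so far
    pool.shutdown(wait=False)  # do not wait on stragglers
    return results
```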

Our preferred approach to resolve the Data Fan Out anti-pattern is to dedicate services to each unique data set.  This is possible whenever the two data sets do not need to be merged and when the service is performing two separate and otherwise isolatable functions (e.g. “Customer_Lookup” and “Catalog_Lookup”). 

Microservice Anti-Pattern Data Fan Out Solution - Split Service

When data sets are split for scale reasons, as is the case with data sets that have both an incredibly high volume of requests and a large amount of data, one can attempt to merge the queried data sets in the client.  The browser or mobile client can request each dataset in parallel and merge if successful.  This works when computational complexity of the merge is relatively low.

Microservice Anti-Pattern Data Fan Out Solution Client Side Aggregation

When client-side merging is not possible, we turn to the X Axis of the Scale Cube for resolution.  Merge the data sets within the data store/persistence engine and rely on a split of reads and writes.  All writes occur to a single merged data store, and read replicas are employed for all reads.  The write and read services should be split accordingly, and our infrastructure needs to correctly route writes to the write service and reads to the read service.  This is a valuable approach when we have high read to write ratios – fortunately the case in many solutions.  Note that we prefer to use asynchronous replication and allow the “slave” solutions to be “eventually consistent” – but ideally still within a tolerable time frame of milliseconds or a handful of seconds.

Microservice Anti-Pattern Data Fan Out Solution - Scale Cube X Axis Read Write Split


What about the case where a solution may have a high write to read ratio (exceptionally high writes), and data needs to be aggregated?  This rather unique case may be best solved by the Z axis of the AKF Scale Cube, splitting transactions along customer boundaries but ensuring the unification of the database for each customer (or region, or whatever “shard key” makes sense).  As with all Z axis shards, this not only allows faster response times (smaller data segments) but engenders high scalability and availability while also allowing us to put data “closer to the customer” using the service. 

Microservice Anti-Pattern Data Fan Out Solution - Scale Cube Y Axis Customer Split

AKF Partners helps companies create highly available, highly scalable, easily maintained and easily developed microservice architectures.  Give us a call - we can help!
