
Growth Blog

Scalability and Technology Consulting Advice for SaaS and Technology Companies

The Scale Cube: Achieve Security Through Scalability

May 14, 2019  |  Posted By: James Fritz

[Figure: The AKF Scale Cube]
If AKF Partners had to be known for one thing and one thing only, it would be the Scale Cube: an ingenious little model that helps companies identify how scalable they are and set goals along any of the three axes to make their product more scalable.  Based on the number of times I have said scalable, or a derivative of the word scale, you might conclude that the AKF Scale Cube is only about scale.  And you would mostly be right.  However, the beauty of the cube is that it is also applicable to security.

Xtra Secure

The X-Axis is usually the first axis companies look at for scalability purposes.  Horizontal duplication is usually the easiest reach from a technological standpoint; however, it tends to be fairly costly.  This replication across various tiers (web, application, or database) also insulates companies when the inevitable breach does occur.  Planning only for security, without also bracing for a data breach, is a naive approach.  With replication across the tiers – and even delayed replication to protect against data corruption – not only are you able to accommodate more customers, you also potentially have a clean copy replicated elsewhere if one of your systems is compromised, assuming you are able to identify the breach early enough.

One of the costliest parts of a breach is recovery to a secure copy.  Your company may take a hit publicity-wise, but if you can bring your system back up to a clean state, identify the compromise, and fix it, then you can be back on your way to being fully operational.  People have slowly and reluctantly come to accept that breaches occur.  If you are open and forthright with them, the publicity problem around a breach tends to be lessened.  Showing them that your system is back up, running, and now more secure will help drive business in the right direction.

SecuritY

Splitting across services (the Y-Axis) has many benefits beyond just scalability.  It provides ownership, accountability, and segregation.  Although difficult to implement, especially if you are coming from a monolithic base, these microservice splits help with security as well.  Services that communicate via asynchronous calls not only allow one service to fail without a major impact on other services; they also create another layer a potential intruder must traverse.

Steps that provide defense in depth for your environment help slow or mitigate attackers.  If asynchronous calls are used between microservices, each lateral or vertical movement is another opportunity for an attacker to be stopped or detected.  And if services are small enough, an intruder who gains access to one of them reaches far less data than they need for whatever they are trying to accomplish.

HackerZ

Segmenting customers based upon similar characteristics (be it geography, spending habits, or even just random selection) helps to achieve Z-Axis scalability.  These pods of customers also provide protection from a full data breach.  Ideally no customer data would ever be exposed, but if you have four pods, exposing 25% of your customer data is better than exposing 100%.  And just like the Y-Axis, these splits help isolate attackers within only a subset of your environment.  Various governing bodies also impose different procedures depending upon the nationality of the customer data exposed.  If you segment on that basis (e.g., EU vs. USA), you can manage your response to a breach for each region accordingly.

[Figure: AKF Security]

Now I Know My X, Y, Z’s

Sometimes security can take a back seat to product development and other functions within a company.  It tends to be an afterthought until that fateful day when something truly bad happens and someone gains unauthorized access to your network.  Implementing a scalable environment via the AKF Scale Cube achieves a better overall product as well as a more secure one. 

If you need assistance in reaching a more scalable and secure environment, AKF is capable of helping.


Microservice Anti-Pattern: The Service Mesh

May 8, 2019  |  Posted By: Marty Abbott

This article is the sixth in a multi-part series on microservices (micro-services) anti-patterns.  The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture) as well as many of the mistakes or failure points teams create in service splits.  Articles two and three cover anti-patterns for service and data fan out respectively.  The fourth article covers an anti-pattern for disparate services sharing a common service deployment using the fuse metaphor.  The fifth article expands the fuse metaphor from service fuses to data fuses.

Howard Anton, the author of my college Calculus textbook, was fond of the following phrase:  “It should be intuitively obvious to the casual observer….”.  The clause immediately following that phrase was almost inevitably something that was not obvious to anyone – probably not even the author.  Nevertheless, the phrase stuck with me, and I think I finally found a place where it can live up to its promise. The Service Mesh, the topic of this microservice anti-pattern, is the amalgamation of all the anti-patterns to date.  It contains elements of calls in series, fuses and fan out.  As such, it follows the rules and availability problems of each of those patterns and should be avoided at all costs. 

This is where I need to be very clear, as I’m aware that the Service Mesh has a very large following.  This article refers to a mesh as a grouping of services with request/reply relationships.  Or, put another way, a “Mesh” is any solution that repeatedly violates the “tree lights”, “fuses”, or “fan out” anti-patterns.  If you use “mesh” to mean a grouping of services that never call each other, you are not violating this anti-pattern.

[Figure: What constitutes a service mesh]

[Figure: What is NOT a service mesh]

The reasons mesh patterns are a bad idea are many-fold:

1)  Availability:  At the extreme, the mesh is subject to the equation N(N−1)/2 – the number of edges in a fully connected graph with N vertices (services).  Asymptotically, this grows as N².  To make availability calculations simple, the availability of a complete mesh can be approximated as the lowest single-service availability (A) raised to the power of the number of services, or A^N.  If the lowest availability of a service with appropriate X-axis cloning (multiple instances) is 99.9%, and the service mesh has 10 different services, the availability of your service mesh will approximate 99.9%^10.  That’s roughly a 99% availability – perhaps good enough for some solutions but horrible by most modern standards.  (See the sketch following this list for the arithmetic.)

2) Troubleshooting:  When every node can communicate with every other node, or when the “connectedness” of a solution isn’t completely understood, how does one go about finding the ailing service causing a disruption?  Because failures and slowness transit synchronous links, a failure or slowness in one or more services will manifest itself as failures and slowness in all services.  Troubleshooting becomes very difficult.  Good luck isolating the bad actor.

3) Hygiene:  I recall sitting through computer science classes 30 years ago and hearing the term “spaghetti code”.  These days we’d probably just call it “crap”, but it refers to the meandering paths of poorly constructed code.  Generally, it leads to difficulty in understanding, higher rates of defects, etc.  Somewhere along the line, some idiot has brought this same approach to deployments.  Again, borrowing from our friend Anton, it should be intuitively obvious to the casual observer that if it’s a bad practice in code it’s also a bad practice in deployment architectures.

4) Cost to Fix: If points 1 through 3 above aren’t enough to keep you away from connected service meshes, point 4 will hopefully help tip the scales.  If you implement a connected mesh in an environment in which you require high availability, you will spend a significant amount of time and money refactoring it to relieve the symptoms it will cause.  This amount may approximate your initial development effort as you remove each dependent anti-pattern (series, fuse, fan-out) with an appropriate pattern.
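
A back-of-the-envelope sketch of point 1 in TypeScript.  The 99.9% per-service availability and the 10-service mesh are the numbers from the example above; the compounding rule A^N is the simplification it describes:

```typescript
// Approximate availability of a fully connected mesh of N services, assuming
// the lowest single-service availability dominates and failures compound
// multiplicatively across request/reply dependencies.
function meshAvailability(lowestServiceAvailability: number, serviceCount: number): number {
  return Math.pow(lowestServiceAvailability, serviceCount);
}

console.log(meshAvailability(0.999, 10).toFixed(4)); // "0.9900" – roughly 99%
```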


[Figure: Microservice Anti-Pattern: The Service Mesh]

Fixing a mesh is not an easy task.  One solution is to ensure that no service blocks waiting for a request to any other service to complete.  Unfortunately, this pattern is not always easy or appropriate to implement.

[Figure: Service Mesh Anti-Pattern Fix: Async Interactions]

Another solution is to deploy each component as a service where it responds to end-user requests, and as a library inside any other service that needs it.

[Figure: Service Mesh Anti-Pattern Fix: Libraries]

Finally, you can traverse each service node and determine where services can be collapsed, or apply any of the other fixes identified in the tree light, fuse, or fan out anti-pattern articles.


AKF Partners helps companies create scalable, fault tolerant, highly available and cost effective architectures to meet their product needs.  Give us a call – we can help.


Microservice Anti-Pattern: Data Fuse

May 8, 2019  |  Posted By: Marty Abbott

This article is the fifth in a multi-part series on microservices (micro-services) anti-patterns.  The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture) as well as many of the mistakes or failure points teams create in service splits.  Articles two and three cover anti-patterns for service and data fan out respectively.  The fourth article covers an anti-pattern for disparate services sharing a common service deployment using the fuse metaphor.

The Data Fuse, the topic of this microservice anti-pattern, exists when two or more unique services share a commonly deployed data store.  This data store can be any persistence solution from physical file services, to a common storage area network, to relational (ACID) or NoSQL (BASE) databases.  When the shared data solution “C” fails, service A and B fail as well.  Similarly, when data solution “C” becomes slow, slowness under high demand propagates to services A and B. 

As is the case with any group of services connected in series, Service A’s theoretical availability is the product of its individual availability combined with the availability of data service C.  Service B’s theoretical availability is calculated similarly.  Problems with service A can propagate to service B through the “fused” data element.  For instance, if service A experiences a runaway scenario that completely consumes the capacity of data store C, service B will suffer either severe slowness or will become unavailable. 

[Figure: Microservices Anti-Pattern: The Data Fuse]

The easiest pattern solution for the data fuse is simply to merge the separate services.  This makes the most sense if the services can be owned by the same team.  While availability doesn’t significantly increase (service A can still affect service B, and the data store C still affects both), we don’t have the confusion of two services interacting through a fuse.  But if the rate of change for each service indicates that it needs separate teams, we need to evaluate other options (see “when to split services” for a discussion of the drivers of service splits).

[Figure: Data Fuse Anti-Pattern Fix: Merge Services]

Another way to fix the anti-pattern is to use the X axis of the Scale Cube as it relates to databases.  An easy example of this is the sharing of account data between a sign-up service and a sign-in (AUTHN and AUTHZ) service.  In this example, given that sign-up is a write-based service and sign-in is a read-based service, we can use the X axis of the Scale Cube and split the services on a read and write basis.  To the extent that the sign-in service (B) must also log activity, it can have separate tables or a separate schema for that logging.  Note that the services supporting this split need not be unique – they can in fact be the exact same service – but the traffic they serve is properly segmented such that the read deployment receives only read traffic and the write deployment receives only write traffic.
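
As a rough illustration of this read/write split, a thin routing layer can pin write traffic (sign-up) to the write deployment and read traffic (sign-in) to the read replicas.  The pool names and route test below are illustrative assumptions, not a prescription:

```typescript
// Hypothetical traffic router for the X-axis split: sign-up (writes) goes to
// the write pool; sign-in and other account reads go to the read pool.
type AccountPool = "account-write-pool" | "account-read-pool";

function routeAccountRequest(method: string, path: string): AccountPool {
  // Only sign-up mutates account data in this simplified example.
  const isWrite = method === "POST" && path.startsWith("/signup");
  return isWrite ? "account-write-pool" : "account-read-pool";
}

console.log(routeAccountRequest("POST", "/signup")); // account-write-pool
console.log(routeAccountRequest("POST", "/signin")); // account-read-pool
```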

[Figure: Data Fuse Anti-Pattern Fix: X-Axis Read/Write Splits]

If reads and writes aren’t an easily created X axis split, or if we need the organizational scale engendered by a Y-axis split, we need to be a bit more creative.  An example pattern comes from the differences between add-to-cart and checkout in a commerce solution.  Some functionality is shared between the components, including the notion of showing calculated sales tax and estimated shipping.  Other functionality may be unique, such as heavy computation in add-to-cart for related and recommended items, and up-sell opportunities such as gift wrapping or expedited shipping in checkout.  We also want to keep carts (session data) around in order to reach out to customers who have abandoned them, but we don’t want this ephemeral clutter clogging the data of checkout.  This argues for separation of data for temporal (response time) reasons.  It also allows us to limit PCI compliance boundaries, removing services (add-to-cart) from the PCI evaluation landscape.

[Figure: Data Fuse Anti-Pattern Fix: Y-Axis Data Split]


The transition from add-to-cart to checkout may be accomplished through the client browser, or done as an asynchronous back-end transfer of state with the browser polling for completion, so as to allow for good fault isolation.  We refactor the datastore to dedicate data to services along the Y axis of the Scale Cube.
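
A minimal sketch of the polling half of that handoff, assuming a hypothetical /checkout/status endpoint that reports when the asynchronous transfer of cart state has completed:

```typescript
// Poll until the back-end transfer of cart state to checkout completes.
// The endpoint, payload shape, and retry budget are illustrative assumptions.
async function waitForCheckoutReady(cartId: string, attempts = 10): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch(`/checkout/status?cartId=${encodeURIComponent(cartId)}`);
    if (res.ok && (await res.json()).ready === true) return true;
    await new Promise((resolve) => setTimeout(resolve, 500)); // brief back-off
  }
  return false; // checkout can rebuild state itself if polling gives up
}
```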

[Figure: Data Fuse Anti-Pattern Fix: Moving Data When Necessary for a Y-Axis Data Split]

AKF Partners helps companies create scalable, fault tolerant, highly available and cost-effective architectures to meet their product needs.  Give us a call, we can help.


Microservice Anti-Pattern: Service Fuse

April 27, 2019  |  Posted By: Marty Abbott

This article is the fourth in a multi-part series on microservices (micro-services) anti-patterns.  The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture) as well as many of the mistakes or failure points teams create in service splits.  Articles two and three cover anti-patterns for service and data fan out respectively.

The Service Fuse, the topic of this microservice anti-pattern, exists when two or more unique services share a commonly deployed service pool.  When the shared service “C” fails, service A and B fail as well.  Similarly, when service “C” becomes slow, slowness under high demand propagates to services A and B. 

As is the case with any group of services connected in series, Service A’s theoretical availability is the product of its individual availability and the availability of service C.  Service B’s theoretical availability is calculated similarly.  Under unusual conditions, the availability of A could also impact B, similar to the way service fan out works.  Such would be the case if A somehow holds threads on C, starving C of the threads it needs to serve B.

Because overall availability is negatively impacted, we consider the Service Fuse to be a microservice anti-pattern.

[Figure: Microservice Anti-Pattern: Sharing a Common Service Deployment]


The easiest and most common method to fault isolate the failure and response time propagation of Service C is to deploy it separately (in separate pools) for both Service A and B.  In doing so, we ensure that C does not fail for one service as a result of unusual demand from the other.  We also isolate failures due to unique requests that might be made by either A or B.  This does incur some additional operational costs and additional coordination and overhead in releases.  But assuming proper automation, the availability and response time improvements are often worth the minor effort.


[Figure: Service Fuse Anti-Pattern Fix: Deploy the Same Service Separately]

As with many of our other anti-patterns, we can also employ dynamically loadable libraries rather than separate service deployments.  While this approach shares some of the slight overhead (again assuming proper automation) of the separate service deployments above, it often also benefits from significant server-side response time decreases by eliminating network transit.

[Figure: Service Fuse Anti-Pattern Fix: Deploy the Service Separately as Libraries]

We often see teams overemphasizing the cost of additional deployments.  But separate service deployments or dynamically loadable library deployments seldom result in significantly greater effort.  The real implication of such a split is dividing the capacity of the shared pool relative to the demand split between services A and B (e.g., 50/50, 90/10) and adding a small number of additional instances for capacity headroom.  For example, a shared pool of 20 instances serving a 90/10 demand split becomes pools of 18 and 2, plus an extra instance in each pool – a 10% increase.  Is 5 to 10% additional operational cost and seconds of additional deployment time worth the significant increase in availability?  Our experience is that most of the time it is.


Microservice Anti-Pattern: Data Fan Out

April 21, 2019  |  Posted By: Marty Abbott

This article is the third in a multi-part series on microservices (micro-services) anti-patterns.  The introduction of the first article, Service Calls In Series, covers the benefits of splitting services, many of the mistakes or failure points teams create in service splits, and the first anti-pattern.  The second article, Service Fan Out, discusses the anti-pattern of a single service acting as a proxy or aggregator of multiple services.

Data Fan Out, the topic of this microservice anti-pattern, exists when a service relies on two or more persistence engines with categorically unique data, or categorically similar data that is not meant to be processed in parallel.  “Categorically unique” means that the data is in no way related.  An example of categorical uniqueness would be a database that stores customer data and a separate database that stores catalog data.  Instances of the same data, such as two separate databases each storing half of a product catalog, are not categorically unique.  Splitting of similar data is often known as sharding.  Such “sharded” instances only violate the Data Fan Out pattern if:

1) They are accessed in series (database 1 is accessed and subsequently database 2 is accessed), or

2) A failure or slowness in either database, even if accessed in parallel, will result in a very slow or unavailable service.

Persistence engine means anything that stores data, as in the case of a relational database, a NoSQL database, a persistent off-system cache, etc.

Anytime a service relies on more than one persistence engine to perform a task, it is subject to lower availability and a response time bounded by the slowest of the N data stores to which it is connected.  As with the Service Fan Out anti-pattern, the availability of the resulting service (“Service A”) is the product of the availability of the service and its constituent infrastructure multiplied by the availability of each of the N data stores to which it is connected.

Further, the service’s response time is the runtime of Service A plus the response time of the slowest of the connected data stores.  If any of the N databases becomes slow enough, Service A may not respond at all.

Because overall availability is negatively impacted, we consider Data Fan Out to be a microservice anti-pattern.

[Figure: Microservice Anti-Pattern: Data Fan Out]

One clear exception to the Data Fan Out anti-pattern is the highly parallelized querying of multiple shards for the purpose of getting near-linear response times out of large data sets (similar to one component of the MapReduce algorithm).  In such a highly parallelized case, we propose that each of the connections have a timeout set to disregard results from slowly responding data sets.  For this to work, the result set must be impervious to missing data.  As an example of an impervious result set, having most shards return for an internet search query is “good enough”: a search for “plumber near me” that returns 19/20ths of the “complete data”, because one shard out of 20 is unavailable or very slow, still has value.  But having some transactions missing from a query of a checking account’s history would be a problem; that result set is not impervious to missing data.
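
A sketch of that parallelized, timeout-guarded shard query, under the assumption that the result set tolerates missing shards (the shard URLs and the 200 ms budget are illustrative):

```typescript
// Query all shards in parallel, abort any shard that misses the deadline,
// and merge whatever returned – "good enough" for impervious result sets.
async function queryShards(shardUrls: string[], timeoutMs = 200): Promise<unknown[]> {
  const results = await Promise.allSettled(
    shardUrls.map(async (url) => {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), timeoutMs);
      try {
        const res = await fetch(url, { signal: controller.signal });
        return await res.json();
      } finally {
        clearTimeout(timer);
      }
    })
  );
  return results
    .filter((r): r is PromiseFulfilledResult<unknown> => r.status === "fulfilled")
    .map((r) => r.value); // slow or failed shards are simply disregarded
}
```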

Our preferred approach to resolve the Data Fan Out anti-pattern is to dedicate services to each unique data set.  This is possible whenever the two data sets do not need to be merged and when the service is performing two separate and otherwise isolatable functions (e.g. “Customer_Lookup” and “Catalog_Lookup”). 

[Figure: Data Fan Out Fix: Split Service]

When data sets are split for scale reasons, as is the case with data sets that have both an incredibly high volume of requests and a large amount of data, one can attempt to merge the queried data sets in the client.  The browser or mobile client can request each data set in parallel and merge them if successful.  This works when the computational complexity of the merge is relatively low.
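
As a sketch of such client-side aggregation, the browser can issue both requests in parallel and merge the results itself; the endpoints and merge shape here are illustrative assumptions:

```typescript
// Fetch two categorically split data sets in parallel and merge in the client,
// so neither data store sits behind the other on the server side.
async function loadProductPage(productId: string) {
  const [catalog, reviews] = await Promise.all([
    fetch(`/catalog/${productId}`).then((r) => r.json()),
    fetch(`/reviews/${productId}`).then((r) => r.json()),
  ]);
  return { ...catalog, reviews }; // works when the merge is computationally cheap
}
```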

[Figure: Data Fan Out Fix: Client-Side Aggregation]

When client-side merging is not possible, we turn to the X axis of the Scale Cube for resolution.  Merge the data sets within the data store/persistence engine and rely on a split of reads and writes.  All writes occur to a single merged data store, and read replicas are employed for all reads.  The write and read services should be split accordingly, and our infrastructure needs to correctly route writes to the write service and reads to the read service.  This is a valuable approach when we have high read-to-write ratios – fortunately the case in many solutions.  Note that we prefer to use asynchronous replication and allow the “slave” solutions to be “eventually consistent” – but ideally still within a tolerable time frame of milliseconds or a handful of seconds.

[Figure: Data Fan Out Fix: Scale Cube X-Axis Read/Write Split]


What about the case where a solution may have a high write to read ratio (exceptionally high writes), and data needs to be aggregated?  This rather unique case may be best solved by the Z axis of the AKF Scale Cube, splitting transactions along customer boundaries but ensuring the unification of the database for each customer (or region, or whatever “shard key” makes sense).  As with all Z axis shards, this not only allows faster response times (smaller data segments) but engenders high scalability and availability while also allowing us to put data “closer to the customer” using the service. 

[Figure: Data Fan Out Fix: Scale Cube Z-Axis Customer Split]

AKF Partners helps companies create highly available, highly scalable, easily maintained and easily developed microservice architectures.  Give us a call - we can help!


Microservice Anti-Pattern: Service Fan Out

April 8, 2019  |  Posted By: Marty Abbott

This article is the second in a multi-part series on microservices (micro-services) anti-patterns.  The introduction of the first article, Service Calls In Series, covers the benefits of splitting services, many of the mistakes or failure points teams create in service splits, and the first anti-pattern.

Fan Out, the topic of this microservice anti-pattern, exists when one service either serves as a proxy to two or more downstream services, or integrates the results of two or more subsequent service calls.  Any of the services (the proxy/integration service “A”, or constituent services “B” and “C”) can cause a failure of all services.  When service A fails, services B and C clearly can’t be called.  If either service B or C fails or becomes slow, it can affect service A by tying up communication ports.  Ultimately, under high call volume, service A may become unavailable due to problems with either B or C.

Further, the response time of the composite may be tied to the slowest responding service.  If A needs both B and C to respond to a request (as in the case of integration), then the speed at which A responds is tied to the slower of B’s and C’s response times.  If service A merely proxies B or C, then extreme slowness in either may cause slowness in A and therefore in all calls.

Because overall availability is negatively impacted, we consider Service Fan Out to be a microservice anti-pattern.

[Figure: Microservice Anti-Pattern: Service Fan Out]


One approach to resolve the above anti-pattern is to employ true asynchronous messaging between services.  For this to be successful, the requesting service A must be capable of responding to a request without receiving any constituent service responses.  Unfortunately, this solution only works in some cases, such as when service B returns data that merely adds value to service A’s response.  One such example is a recommendation engine that returns other items a user might like to purchase.  The absence of service B responding to A’s request for recommendations is unfortunate but doesn’t eliminate the value of A’s response completely.
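
A sketch of that pattern, assuming a hypothetical recommendations endpoint and a 100 ms budget: service A asks for recommendations but answers regardless of whether they arrive in time.

```typescript
// Service A responds whether or not the recommendation service (B) answers.
// If B is slow or down, the response simply omits B's optional data.
async function buildProductResponse(productId: string) {
  const giveUp = new Promise<null>((resolve) => setTimeout(() => resolve(null), 100));
  const recommendations = await Promise.race([
    fetch(`/recommendations/${productId}`).then((r) => r.json()).catch(() => null),
    giveUp,
  ]);
  return { productId, recommendations }; // recommendations may be null
}
```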

[Figure: Service Fan Out Fix: Async Calls]

As was the case with the Calls In Series anti-pattern, we may also be able to solve this anti-pattern with the “Libraries for Depth” pattern.

[Figure: Service Fan Out Fix: Libraries]

Of course, each of the libraries also represents a constituent part that may fail for any call – but the number of moving parts for each constituent part decreases significantly relative to a separately deployed service call.  For instance, no network interface is required, no additional host or virtual machine is employed during the call, etc.  Additionally, call latency goes down without network transit.

The most common complaint about this pattern is that development teams cannot release independently.  But, as we all know, this problem was solved long ago with dynamically loadable libraries (DLLs on Windows, shared objects on Unix and Linux) and the like.

Finally, we can move the proxy/integration logic into the browser and make multiple browser requests.  Data returned from service B or service C can either be displayed in separate browser frames/divisions or can be evaluated and integrated using browser scripting (e.g., JavaScript).  We prefer this method whenever possible.  If A is simply serving as a proxy, the solution is relatively simple.  If A was serving as an integration/aggregation service, then Service A’s logic must be moved into the browser/client.  Doing so creates complete fault isolation and allows the services to fail independently without impacting each other.

[Figure: Service Fan Out Fix: Browser Fan Out]

AKF Partners has helped to architect some of the most scalable, highly available, fault-tolerant and fastest response time solutions on the internet.  Give us a call - we can help.


Microservice Anti-Pattern: Calls in Series (The Xmas Tree Light Anti-Pattern)

March 25, 2019  |  Posted By: Marty Abbott

This article is the first in a multi-part series on microservices (micro-services) anti-patterns. 

There are several benefits to carving up very large applications into service-oriented architectures.  These benefits can include many of the following:

  • Higher availability through fault isolation
  • Higher organizational scalability through lower coordination
  • Lower cost of development through lower overhead (coordination)
  • Faster time to market achieved again through lower overhead of coordination
  • Higher scalability through the ability to independently scale services
  • Lower cost of operations (cost of goods sold) through independent scalability
  • Lower latency/response time through better cacheability

The above should be considered only a partial list.  See our articles on the AKF Scale Cube, and when you should split services for more information.

In order to achieve any of the above benefits, you must be very careful to avoid common mistakes. 

Most of the failures that we see in microservices stem from a lack of understanding of the multiplicative effect of failure, or “MEF”.  Put simply, MEF means that the availability of any solution in series is the product of the availabilities of all components in that series.

Service A has an availability calculated as the product of its constituent parts.  Those parts include all of the software and infrastructure necessary to run service A: server availability, application availability, the availabilities of associated libraries and the runtime environment, operating system availability, virtualization software availability, and so on.  Let’s say those availabilities somehow achieve a “service” availability of “five 9s”, or 99.999%, as measured by duration of outages.  To achieve 99.999% we are assuming that we have made the service “highly available” through multiple copies, each being “stateless” in its operation.

Service B has a similar availability calculated in a similar fashion.  Again, let’s assume 99.999.

If, for a request from any customer to Service A, Service B must also be called, the two availabilities are multiplied together.  The new calculated availability is by definition lower than that of either service in isolation: 99.999% × 99.999% moves our availability from 99.999% to 99.998%.

When calls in series between services become long, availability starts to decline swiftly and by definition is always much smaller than the lowest availability of any service or the constituent part of any service (e.g. hardware, OS, app, etc).

This creates our first anti-pattern.  Just as a single failed bulb in the old serially wired Christmas tree lights would cause the entire string to fail, any single service failure causes the entire call stream to fail.  Hence the multiple names for this first anti-pattern: the Christmas Tree Light Anti-Pattern, the Microservice Calls in Series Anti-Pattern, etc.

[Figure: Microservice Anti-Pattern: Calls in Series]

The multiplicative effect of failure is sometimes worse with slowly responding solutions than with outright failures.  We can easily detect and respond to failures through “heartbeat” transactions.  But slow responses are more difficult.  While we can use circuit breaker constructs such as Hystrix switches, these assume that we know the threshold at which our call string will break.  Unfortunately, under intense flash load situations (unforeseen high demand), small spikes in demand can cause failure scenarios.
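
For illustration, here is a toy circuit breaker in the spirit of the Hystrix switches mentioned above; the failure threshold is an assumption, and picking it correctly is exactly the hard part under flash load:

```typescript
// After maxFailures consecutive failures, the breaker opens and calls fail
// fast rather than queueing behind a slow or dead downstream service.
class CircuitBreaker {
  private consecutiveFailures = 0;
  constructor(private readonly maxFailures = 5) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.consecutiveFailures >= this.maxFailures) {
      throw new Error("circuit open: failing fast"); // protect the call string
    }
    try {
      const result = await fn();
      this.consecutiveFailures = 0; // success closes the breaker
      return result;
    } catch (err) {
      this.consecutiveFailures++;
      throw err;
    }
  }
}
```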

One pattern to resolve the above issue is to employ true asynchronous messaging between services.  To make this effective, the requesting service must not care whether it receives a response; it must be capable of responding to a request without receiving any downstream response.  Unfortunately, this solution only works in some cases, such as when service B returns data that merely adds value to service A’s response.  One such example is a recommendation engine that returns other items a user might like to purchase.  The absence of service B responding to A’s request for recommendations is unfortunate, but doesn’t eliminate the value of A’s response completely.

[Figure: Calls in Series Anti-Pattern Fix: Async Calls]

While the above pattern can resolve some use cases, it doesn’t resolve most of them.  Most often, downstream services are doing more than adding optional value for the calling service: they are providing specific necessary functions.  These functions may be mail services, print services, data access services, or even component parts of a value stream such as “add to cart” and “compute tax” during checkout.

In these cases, we believe in employing the Libraries for Depth pattern.

[Figure: Calls in Series Anti-Pattern Fix: Use Libraries]

Of course, each of the libraries also represents a constituent part that may fail for any call – but the number of moving parts for each constituent part decreases significantly relative to another service call.  For instance, no network interface is required, no additional host or virtual machine is employed during the call, etc.  Additionally, call latency goes down without network transit.

The most common complaint about this pattern is that development teams cannot release independently.  But, as we all know, this problem was solved long ago with dynamically loadable libraries (DLLs on Windows, shared objects on Unix and Linux) and the like.


The AKF Partners Session State Cube

March 19, 2019  |  Posted By: Marty Abbott

Tim Berners-Lee and his colleagues at CERN, the IETF and the W3C consortium all understood the value of being stateless when they developed the Hyper Text Transfer Protocol.  Stateless systems are more resilient to multiple failure types, as no transaction needs to have information regarding the previous transaction.  It’s as if each transaction is the first (and last) of its type.

First let’s quickly review three different types of state.  This overview is meant to be broad and shallow.  Certain state types (such as the notion of View state in .Net development) are not covered.

[Figure: High-level overview of application, connection, and session state]

The Penalty (or Cost) of State

State costs us in multiple ways.  State unique to a user interaction, or session state, requires memory.  The larger the state, the greater the memory requirement, the higher the cost of the server, and the greater the number of servers we need.  As cost of goods sold increases, margins decrease.  Further, that state either needs to be replicated for high availability – at additional cost – or we face the cost of user dissatisfaction when discrete components, and ultimately sessions, fail.

When application state is maintained, the cost of failure is high: we either pay the price of replicating that state or we lose it, negatively impacting customer experience.  As memory associated with application state increases, so do the memory requirement and associated costs of the server upon which it runs.  At high scale, that means more servers, greater costs, and lower gross margins.  In many cases, we simply have no choice but to allow application state.  Interpreters and Java virtual machines need memory.  Most applications also require information regarding their overall transactions distinct from those of users.  As such, our goal here is not to eliminate application state but rather to minimize it where possible.

When connection state is maintained, cost increases as more servers are required to service the same number of requests.  Failures also become more common, as failure probability increases with the duration of any connection over distance.

Our ideal outcome is to eliminate session state, minimize application state and eliminate connection state.


[Figure: Desired state outcomes for application, connection, and session state]

But What if I Really, Really, Really Need State?

Our experience is that simply saying “No” once or twice will force an engineer to find an innovative way to eliminate state.  Another interesting approach is to challenge an engineer with a statement like “Huh, I heard the engineers at XYZ company figured out how to do this…”.  Engineers hate to feel like another engineer is better than them…

We also recognize however that the complete elimination of state isn’t possible.  Here are three examples (not meant to be all inclusive) of when we believe the principle of stateless systems should be violated:

Shopping Cart - Approved State Example

Shopping carts need state to work.  Information regarding a past transaction – add_to_cart, for instance – needs to be held somewhere prior to check_out.  Given that we need state, it’s just a question of where to store it.  Cookies are a good place.  Distributed object caches are another.  Passing it through the URL in HTTP GET methods is a third.  A final solution is to store it in a database.
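
As a minimal sketch of the cookie option, assuming a hypothetical cart shape (a real implementation would also want signing and a size cap):

```typescript
// Keep cart state client-side in a cookie – one of the storage options above.
function saveCartToCookie(items: { sku: string; qty: number }[]): void {
  const value = encodeURIComponent(JSON.stringify(items));
  document.cookie = `cart=${value}; path=/; max-age=${7 * 24 * 3600}`; // one week
}

function readCartFromCookie(): { sku: string; qty: number }[] {
  const match = document.cookie.match(/(?:^|; )cart=([^;]*)/);
  return match ? JSON.parse(decodeURIComponent(match[1])) : [];
}
```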

Debit Credit Approved State Example

No sane person wants to wrap debits and credits across distributed servers in a single, two-phase commit transaction.  Banks have had a solution for this for years – the eventually consistent account transaction.  Using a tiny workflow or state machine, debit in one transaction and eventually (ideally quickly) credit in a second transaction.  That brings us to the notion of workflow and state machines in general.
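
A tiny sketch of such a workflow, with hypothetical debit and credit steps; a real system would persist the intermediate state so the credit can be retried:

```typescript
// Eventually consistent transfer: the debit commits first, the credit follows.
type TransferState = "pending" | "debited" | "completed" | "failed";

async function transfer(
  debit: () => Promise<void>,
  credit: () => Promise<void>
): Promise<TransferState> {
  let state: TransferState = "pending";
  await debit();
  state = "debited"; // durable checkpoint: retry the credit from here
  try {
    await credit();
    state = "completed";
  } catch {
    state = "failed"; // a retry/compensation job picks this up later
  }
  return state;
}
```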

Workflow Manager Approved State Example

What good is a state machine if it can’t maintain state?  Whether application state or session state, the notion of state is critical to the success of each solution.  Workflow systems are a very specific implementation of a state machine and as such require state.  The trick with these is simply to ensure that the memory used for state is “just enough”.  Govern against ever-increasing session or application state size.

This brings us to the newest cube model in the AKF model repository: 

The Session State Cube

[Figure: The AKF Session State Cube model]

The AKF State Cube is useful both for thinking through how to achieve the best possible state posture, and for evaluating how well we are doing against the aspirational goal (top right corner) of “Stateless”.

X Axis

The X axis describes the size of state.  It moves from very large (XL) state size to the ideal position of zero size, or “No State”.  Very large state sizes suffer from higher cost, higher impact upon failure, and higher probability of failure.

Y Axis

The Y axis describes the degree of distribution of state.  The worst position, lower left, is where state is a singleton.  While we prefer not to have state, having only one copy of it leaves us open to large – and difficult to recover from – failures and dissatisfied customers.  Imagine nearly completing your taxes only to have a crash wipe out all of your work!  Ughh!

Progressing vertically along the Y axis, the singleton state object in the lower left is replicated into N copies of that state for high availability.  While resolving the recovery and failure issues, performing replication is costly both in extra memory and network transit.  This is an option we hope to avoid for cost reasons.

Following replication are several methods of distribution in increasing order of value.  Segmenting the data by some value “N” has increasing value as N increases.  When N is 2, a failure of state impacts 50% of our customers.  When N is 100, only 1% of our customers suffer from a state failure.  Ideally, state is also “rebuildable” if we have properly scattered state segments by a shard key – allowing customers to only have to re-complete a portion of their past work. 
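
The blast-radius arithmetic here is simple enough to sketch, using the N=2 and N=100 figures from the text:

```typescript
// Fraction of customers impacted when one of N state segments fails.
const blastRadius = (segments: number): number => 1 / segments;

console.log(blastRadius(2));   // 0.5  – 50% of customers
console.log(blastRadius(100)); // 0.01 – 1% of customers
```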

Finally, of course, we hope to have “no state” (think of this as division by infinite segmentation approaching zero on this axis).

Z Axis

The Z Axis describes where we position state “physically”. 

The worst location is “on the same server as the application”.  While necessary for application state, placing session data on a server co-resident with the application using it compounds the impact of any application fault: a single failure takes out both the application and its users’ session state.  There are better places to locate state, and better solutions than your application to maintain it.

A costly but better solution from an impact perspective is to place state within your favorite database.  To keep costs low, this could be an open source SQL or NoSQL database.  But remember to replicate it for high availability.

A less costly solution is to place state in an object cache, off server from the application.  Ideally this cache is distributed per the Y axis.

The least costly solution is to have the client (browser or mobile app) maintain state.  Use a cookie, pass the state through a GET method, etc.

Finally, of course the best solution is that it is kept “nowhere” because we have no state.

Summary

The AKF State Cube serves two purposes:

  1. Prescriptive:  It helps to guide your team to the aspirational goal of “stateless”.  Where stateless isn’t possible, choose the X, Y and Z axis closest to the notion of no state to achieve a low cost, highly available solution for your minimized state needs.
  2. Descriptive: The model helps you evaluate numerically, how you are performing with respect to stateless initiatives on a per application/service basis.  Use the guide on the right side of the model to evaluate component state on a scale of 1 to 10.

AKF Partners helps companies develop world class, low cost of operations, fast time to market, stateless solutions every day.  Give us a call!  We can help!


Is the Co-location (colo) Industry Dying?

March 15, 2019  |  Posted By: Marty Abbott

[Figure: An empty data room, racks removed and floor tiles pulled]
I’m no Nostradamus when it comes to predicting the future of technology, but some trends are just too blatantly obvious to ignore.  Unfortunately, they are only easy to spot if you have a job where you are allowed (I might argue required) to observe broader industry trends.  AKF Partners must do that on behalf of our clients as our clients are just too busy fighting the day-to-day battles of their individual businesses.

One such very concerning probability is the eventual decline – and one day potentially the elimination – of the colocation (hosting) business.  Make no mistake about it: if you lease space from a colocation provider, the probability is high that your business will need to move locations, move providers, or experience a service disruption soon.

Let’s walk through the factors and trends that indicate, at least to me, that the industry is in trouble, and that your business faces considerable risks:

Sources of Demand for Colocation (Macro)

Broadly speaking, the colocation industry was built on the backs of young companies needing to lease space for compute, storage, and the like.  As time progressed, more established companies started to augment privately-owned data centers with colocation facilities to avoid the burden of large assets (buildings, capital improvements and in some cases even servers) on their balance sheets.

The first source of demand, small companies, has largely dried up for colocation facilities.  Small companies seek to be “asset light” and most frequently start their businesses on Infrastructure as a Service (IaaS) providers (AWS, GCP, Azure, etc.).  The ease and flexibility of these providers enable faster time to market and easier operational configuration of systems.  Platform as a Service (PaaS) offerings in many cases eliminate the need for specialized infrastructure and DevOps skill sets, allowing small companies to focus limited funds on software engineers who will help create differentiating experiences and capabilities.  Five years ago, successful startups might have migrated into colocation facilities to lower the cost of goods sold (COGS) for their products, and in so doing increase gross margin (GM).  While this is still an opportunity for many successful companies, few seem to take advantage of it.  Whether due to vendor lock-in through PaaS services, or a preference for speed and flexibility over expense reduction, these companies tend to stay with their IaaS provider.

Larger, more established companies continue to use colocation facilities to augment privately-owned data centers.  That said, in most cases technology refresh results in faster and more efficient compute.  When the rate of compute increases faster than the rate of growth in transactions and revenue within these companies, they start to collapse the infrastructure assets back into wholly-owned facilities (assuming power, space, and cooling of the facilities are not constraints).  Bringing assets back in-house to owned facilities lowers costs of goods sold as the company makes more efficient use of existing assets. 

Simultaneously these larger firms also seek the flexibility and elasticity of IaaS services.  Where they have new demand for new solutions, or as companies embark upon a digital transformation strategy, they often do so leveraging IaaS.

The result of these forces across the spectrum of small to large firms reduces overall demand.  Reduced demand means a contraction in the colocation industry overall.

Minimum Efficient Scale and the Colocation Industry (Micro)

Data centers are essentially factories.  To achieve optimum profitability, fixed costs such as the facility itself, and the associated taxes, must be spread across the largest possible units of production.  In the case of data centers, this means achieving maximum utilization of the constraining factors (space, power, and cooling capacity) across the largest possible revenue base.  Maximizing utilization against the aforementioned constraints drops the LRAC (long run average cost) as fixed costs are spread across a larger number of paying customers.  This is the notion of Minimum Efficient Scale in economics.

[Figure: Minimum Efficient Scale]

As demand decreases on a per data center (colocation facility) basis, fixed costs per customer increase.  This is because less space is used, and the cost of the facility is allocated across fewer customers.  At some point, on a per data center basis, the facility becomes unprofitable.  As profits dwindle across the enterprise, and as debt service on the facilities becomes more difficult, the colocation provider is forced to shut down data centers and consolidate customers.  Assets are sold, or leases are terminated with the appropriate termination penalties.

[Figure: Minimum Efficient Scale: the data center failure line]

Customers who wish to remain with a provider are forced to relocate.  This in turn causes customers to reconsider colocation facilities, and somewhere between a handful to a majority on a per location basis will decide to move to IaaS instead.  Thus begins a vicious cycle of data center shutdowns engendering ever-decreasing demand for colocation facilities. 

Excluding other macroeconomic or secular events like another real estate collapse, smaller providers start to exit the colocation service industry.  Larger providers benefit from the exit of smaller players and the remaining data centers benefit from increased demand on a dwindling supply, allowing those providers to regain MES and profitability.

Does the Trend Stop at a Smaller Industry?

We are likely to see the colocation industry continue to exist for quite some time – but it will get increasingly smaller.  The consolidation of providers and dwindling supply of facilities will stop at some point, but just for a period.  Those that remain in colocation facilities will either not have the means or the will to move.  In some cases, a lack of skills within the remaining companies will keep them “locked into” a colocation facility.  In other cases, competing priorities will keep an exit on the distant horizon.  These “lock in” factors will give the colocation industry an opportunity to increase pricing for a time.

But make no mistake about it, customers will continue to leave – just at a decreased rate relative to today’s departures.  Some companies will simply go out of business or contract in size and depart the data centers.  Others will finally decide that the increasing cost of service is too high.

While it’s doubtful that the industry will go away in its entirety, it will be small and comparatively expensive.  The difference between costs of colocation and costs to run in an IaaS solution will start to dwindle.

Risks to Your Firm

The risk to your firm comes in three forms, listed in increasing order of risk as measured by a function of probability of occurrence and impact upon occurrence:

  1. Pricing of service per facility.  If you are lucky enough that your facility does not close, there is a high probability that your cost for service will increase.  This in turn increases your cost of goods sold and decreases your gross margin.
  2. Risk of facility dissolution.  There exists an increasingly high probability that the facilities in which you are located will be shut down.  While you are likely to be given some advance notice, you will be required to move to another facility with the same provider, or another provider.  There is both a real cost in the move, and an opportunity cost associated with service interruption and effort.
  3. Risk of firm as a going concern.  Some providers of colocation services will simply exit the business.  In some cases, you may be given very little notice as in the case of a company filing bankruptcy.  Service interruption risk is high.

Strategies You Must Employ Today

In our view, you have no choice but to ensure that you are ready and able to easily move out of colocation facilities.  Whether this be to existing data centers you own, IaaS providers, or a mix matters not.  At the very least, we suggest your development and operations processes enable the following principles:

  1. Environment Agnosticism:  Ensure that you can run in owned, leased, managed service, or IaaS locations.  Ensuring consistency in deployment platforms, using container technologies, and employing orchestration systems all aid in this endeavor.
  2. Hybrid Hosting:  Operate out of at least two of the following three options as a course of normal business operations: owned data centers, leased/colocation facilities, IaaS.
  3. Dynamic Allocation of Demand: Prove on at least a weekly basis that you can operate any functionality within your product out of any location in which you operate – especially those located within colocation facilities.

AKF Partners helps companies think through technology, process, organization, location, and hosting strategies.  Let us help you architect a hybrid hosting solution that limits your risk to any single provider.


Don't Let the Tail Wag the Dog

February 22, 2019  |  Posted By: Greg Fennewald

On multiple occasions over the years, we have heard our clients state a use case they want to avoid in product design sessions or as a reason for architectural choices made for existing products. These use cases can be given more credence than they deserve based on objective data – they become boogeyman legends, edge cases that can result in poor architectural choices. 


One of our clients was debating the benefit of multiple live sites with customers pinned to the nearest site to minimize latency.  The availability benefits of multiple live sites are irrefutable, but the customer experience benefit of lower latency was questioned.  This client had millions of customers spread across the country.  The notion of pinning a client to a “home” site nearest them raised the question: “what happens when the client travels across the country?”  The answer is to direct them to that same home site.  That client will experience more latency for the duration of the trip.  The proportion of clients that spend 50% of their time on either coast is vanishingly small – keep it simple.  Have a workaround for clients that permanently move to a location served by a different site – client data resides in more than one location for DR purposes anyway, right?

This client also had hundreds of service specialists that would at times access client accounts and take actions on their behalf, and these service specialists were located near the west coast.  Objections were made based on the latency a west coast service specialist would encounter when acting on the behalf of an east coast client whose data was hosted near the east coast.  Millions of clients.  Hundreds of service specialists.  The math is not hard.  The needs of the many outweigh the needs of the few.

A different client had a concern about data consistency upon new user registration for their service.  To ensure a new customer could immediately transact, the team decided to deploy a single authentication server, precluding the possibility of a transaction following registration hitting an authentication server that had not yet received the registration data.  Intentionally deploying a SPOF (single point of failure) should have raised immediate objections, but did not.  The team deployed a passive backup server that required manual intervention to work.

The new user registration flow was later revealed to be less than 3% of overall transactions.  The other 97% of transactions suffered an impactful outage, along with the 3% of new users, when the SPOF authentication server failed.  Designing a workaround for new users while employing a write master with multiple load-balanced, read-only slaves would have provided far better availability.  The needs of the many outweigh the needs of the few.

It is important to remain open minded during early design sessions.  It is also important to follow architectural principles in the face of such use cases.  How can one balance potentially conflicting concepts?

• Ask questions best answered with objective data.
• Strive for simplicity; shave with Occam’s Razor.
• Validate whether the edge case is a deal breaker for the product owner.
• Propose a workaround that addresses the edge case while optimizing the architecture for the majority use case and sound principles.

Catering to the needs of the business while adhering to architectural standards is a delicate balancing act and compromises will be made.  Everyone looks at the technologist when a product encounters a failure.  Know when to hold the line on sound architectural principles that safeguard product availability and user experience.  The product owner must understand and acknowledge the architectural risks resulting from product design decisions.  The technologist must communicate these risks to the product owner along with objective data and options.  A failure to communicate effectively can lead to the tail wagging the dog – do not let that happen.

With 12 years of product architecture and strategy experience, AKF Partners is uniquely positioned to be your technology partner.  Learn more here.
