GROWTH BLOG: AKF Partners Announces European Expansion
AKF Partners Logo Technology ConsultingScalability - We wrote the book on it ℠

Growth Blog

Scalability and Technology Consulting Advice for SaaS and Technology Companies

Architectural Principles: Build vs. Buy

September 6, 2019  |  Posted By: Pete Ferguson

build vs buy graphic
In many of our technical due diligence engagements, it is common to find that companies are building tools with considerable development effort (and ongoing maintenance) for something that is not part of their core strength and thus providing a competitive advantage. What criteria does your organization us in deciding when to build vs. buy?

If you perform a simple web search for “build vs. buy” you will find hundreds of articles, process flows, and decision trees on when to build and when to buy. Many of these are cost-centric decisions including discounted cash flows for maintenance of internal development and others are focused on strategy. Some of the articles blend the two.

Buy When Non-Core – If you aren’t the best at building it and it doesn’t offer a competitive differentiation, buy it. Regardless of how smart you and your team are – you simply aren’t the best at everything

We have many examples from our customers developing load balancing software, building their own databases, etc. In nearly every case, a significant percentage of the engineering team (and engineering cost) go into a solution that:

  1. Does not offer long term competitive differentiation
  2. Costs more than purchasing an existing product
  3. Steals focus away from the engineering team
  4. Is not aligned with the skills or business level outcomes of the team

If You Can’t Beat Them - Join Them
(or buy, rent, or license from them)

Here is a simple set of questions that we often ask our customers to help them with the build v. buy decision:

Shiny object distraction is a very real thing we observe regularly. Companies start - innocently enough - building a custom tool in a pinch to get them by, but never go back and reassess the decision. Over time the solution snowballs and consumes more and more resources that should be focused on innovating strategic differentiation.

  • We have yet to hear a tech exec say “we just have too many developers, we aren’t sure what to do with them.”
  • More often than not “resource constraints” is mentioned within the first few hours of our engagements.
  • If building instead of buying is going to distract from focusing efforts on the next “big thing” – then 99% of the time you should just stop here and attempt to find a packaged product, open-source solution, or outsourcing vendor to build what you need.

If after reviewing these points, if the answer is “Yes, it will provide a strategic differentiation,” then proceed to question 2.

This question helps inform whether you can effectively build it and achieve the value you need. This is a “core v. context” question; it asks both whether your business model supports building the item in question and also if you have the appropriate skills to build it better than anyone else.

For instance, if you are a social networking site, you probably don’t have any business building relational databases for your own use. Go to question number (3) if you can answer “Yes” to this question and stop here and find an outside solution if the answer is “No”.

And please, don’t fool yourself – if you answer “Yes” because you believe you have the smartest people in the world (and you may), do you really need to dilute their efforts by focusing on more than just the things that will guarantee your success?

We can differentiate with our own solution, but don’t need to build everything <br />
(e.g. shopping cart or reviews)

We know the question is awkwardly worded – but the intent is to be able to exit these four questions by answering “yes” everywhere in order to get to a “build” decision.

  • If there are many providers of the “thing” to be created, it is a potential indication that the space might become a commodity.
  • Commodity products differ little in feature sets over time and ultimately compete on price which in turn also lowers over time.
  • A “build” decision today will look bad tomorrow as features converge and pricing declines.

If you answer “Yes” (i.e. “Yes, there are few or no competing products”), proceed to question (4).


  • Is it cheaper to build than buy when considering the total lifecycle (implementation through end-of-life) of the “thing” in question? Many companies use cost as a justification, but all too often they miss the key points of how much it costs to maintain a proprietary “thing”, “widget”, “function”, etc
  • If your business REALLY grows and is extremely successful, do you want to be continuing to support internally-developed monitoring and logging solutions, mobile architecture, payments, etc. through the life of your product?

Don’t fool yourself into answering this affirmatively just because you want to work on something “neat.” Your job is to create shareholder value – not work on “neat things” – unless your “neat thing” creates shareholder value.

There are many more complex questions that can be asked and may justify the building rather than purchasing of your “thing,” but we feel these four questions are sufficient for most cases.

A “build” decision is indicated when the answers to all 4 questions are “Yes.”

We suggest seriously considering buying or outsourcing (with appropriate contractual protection when intellectual property is a concern) anytime you answer “No” to any question above.


While startups and small companies roll their own tools early on to get product out the door, as they grow, the timeline of planning (and related costs) needs to increase from the next sprint to a longer-term annual and multi-year strategy. That, plus growth, tips the scale to buy instead of build. The more internal products produced and supported, the more tech debt is required and distracts medium-to-large organizations from competing against the next startup.

While building custom tools and products seems to make sense in the immediate term, looking at the long-term strategy and desired outcome of your organization needs to be fully-weighted in the decision process. Distraction from focus is the number one harm we have seen many times with our clients as they fall behind the competition and burn sprint cycles on maintaining products that don’t move the needle with their customers. The crippling cost of distractions is what causes successful companies from losing their competitive advantage as well as slipping into oblivion.

Like the ugly couch your auntie gave you for your first apartment, it can often be difficult to assess what makes sense without an outside opinion. Contact us, we can help!

Subscribe to the AKF Newsletter

Contact Us

Migrating To The Cloud: Lift and Shift Versus Cloud Native

August 20, 2019  |  Posted By: Dave Berardi

If your company doesn’t utilize one of the big cloud providers for either IaaS or PaaS as part of product infrastructure, it’s only a matter of time. We often find our clients in situations where they are pressured to move quickly for benefit-realization to improve many aspects of their business.

Drivers of this trend that exist across our client base and the industry include:

  • The Need For Speed and Time To Market: The need to scale capacity quickly without waiting weeks or months for hardware procurement and provisioning in your own datacenter or colo.
  • Traditional On-Prem Software Dying by 1000 Cuts: Demand-side (buyer) forces are encouraging companies to get services and software out of data centers. Cloud-native SaaS competition is pressuring what’s left of the on-prem software providers.
  • Legacy Company Talent Challenges: The inability of the old economy companies to hire engineering talent to support on-prem software in house.

Several different approaches can be used for migration. We’ve seen many of them and there are two on opposite ends of the spectrum – Lift and Shift and Cloud-Native – that we want to unpack.

The Lift and Shift Approach:

What is it?

waiting for challenger launch - normalization of deviance

Put simply, this is when the same architecture, resources, and services from an on-prem or colo data center are moved up into a cloud provider. Often VMs from on-prem hosting centers are converted and dropped into reserved virtual compute instances. Tools such as AWS Connector for vCenter or GCP’s Velostrata, in theory, allow for an easy transition.


  • Fastest path to cloud
  • Same architecture and tech stack minimizes training need – infrastructure management does require knowledge of the console
  • Least costly in terms of planning, architecture changes, refactoring


  • Monolithic nature of the architecture can prove to be costly thru BYOL and compute requirements
  • Minimal use of native elasticity and resources create cost-inefficient use of compute, memory, and storage and may not perform as needed
  • Technical debt migrates with the product and cost could be magnified with additional problems and a shift to the pay for use model

While Lift and Shift seems to be the easiest path, you need to be aware of the strong potential for an increase in cost in the cloud. Running VMs in your own DC and colo masks the cost inefficiencies since they are all part of Capex for your compute, storage, and network. When you move to public cloud the provider will promise to be cheaper. But in the cloud you will pay for every reserved CPU that isn’t utilized, storage that isn’t used, and other idle resources. Further, your availability can only be as good as the provider’s uptime for a given Region and/or Availability Zone.

Cloud Native Approach:

What is it?

waiting for challenger launch - normalization of deviance

Cloud-Native approach ultimately allows for the use of a provider’s cloud services as long as there are requests and demand being created by product users. This approach almost always requires investment into splitting the monolith and moving to a services-separated architecture. In addition, it could require you to use native services in your provider of choice. Doing so lets you move from paying for provisioned infrastructure to consumption-based services with better cost-efficiency.


  • Less time needed to manage infrastructure and more time for features and experimentation
  • Easier to scale out using native services
  • Most cost-efficient


  • Slowest path to cloud
  • More discovery and training - this approach requires your teams to understand the current tech stack in order to recreate them in cloud. From a cloud perspective they must understand how the provider of choice works so that decisions can be made on native services.
  • Increased risk of vendor lock-in (eg. Building out event-driven services with rules inside of native serverless)

The Cloud Native path is a longer one, but provides several benefits that will yield more value over time. With this approach you must spend time determining how to split up your monolith and how to best leverage the right combination of Availability Zones, Regions, and use of native services depending on your Recovery Time Objective (RTO) and Recovery Point Objectives (RPO). We prefer to solve scalability and availability problems with systems and software architecture to avoid vendor lock-in. All of the trade-offs on such a journey must be understood.

We have helped several companies of various sizes move to the cloud going thru SaaS transformations and have engaged in reviewing proposed architectures. Contact us to see how we can help.

Subscribe to the AKF Newsletter

Contact Us

AKF Scale Cube Explained – Cloud Scalability Rules

August 7, 2019  |  Posted By: Pete Ferguson

AKF Scale Cube Diagram

Scalability doesn’t somehow magically appear when you trust a cloud provider to host your systems.  While Amazon, Google, Microsoft, and others likely will be able to provide a lot more redundancy in power, network, cooling, and expertise in infrastructure than hosting yourself – how you are set up using their tools is still very much up to your budget and which tools you choose to utilize.  Additionally, how well your code is written to take advantage of additional resources will affect scalability and availability.

We see more and more new startups in AWS, Google, and Azure – in addition to assisting well-established companies make the transition to the cloud.  Regardless of the hosting platform, in our technical due diligence reviews, we often see the same scalability gaps common to hosted solutions written about in our first edition of “Scalability Rules.” (Abbott, Martin L.. Scalability Rules: Principles for Scaling Web Sites. Pearson Education.)

This blog is a summary recap of the AKF Scale Cube (much of the content contains direct quotes from the original text), an explanation of each axis, and how you can be better prepared to scale within the cloud.

Scalability Rules – Chapter 2: Distribute Your Work

Using ServiceNow as an early example of designing, implementing, and deploying for scale early in its life, we outlined how building in fault tolerance helped scale in early development – and a decade + later the once little known company has been able to keep up with fast growth with over $2B in revenue and some forecasts expecting that number to climb to $15B in the coming years.

So how did they do it?  ServiceNow contracted with AKF Partners over a number of engagements to help them think through their future architectural needs and ultimately hired one of the founding partners to augment their already-talented engineering staff.

“The AKF Scale Cube was helpful in offsetting both the increasing size of our customers and the increased demands of rapid functionality extensions and value creation.”
~ Tom Keevan (Founding Partner, AKF Partners & former VP of Architecture at eBay & Service Now)

akf scale cube with explanations of each axis

The original scale cube has stood the test of time and we have used the same three-dimensional model with security, people development, and many other crucial organizational areas needing to rapidly expand with high availability.

At the heart of the AKF Scale Cube are three simple axes, each with an associated rule for scalability.  The cube is a great way to represent the path from minimal scale (lower left front of the cube) to near-infinite scalability (upper right back corner of the cube).  Sometimes, it’s easier to see these three axes without the confined space of the cube.

AKF Scale Cube Simplified

X Axis – Horizontal Duplication

akf scale cube x axis infographic Scalability Rules 7

The X Axis allows transaction volumes to increase easily and quickly.  If data is starting to become unwieldy on databases, distributed architecture allows for reducing the degree of multi-tenancy (Z Axis) or split discrete services off (Y Axis) onto similarly sized hardware.

A simple example of X Axis splits is cloning web servers and application servers and placing them behind a load balancer.  This cloning allows the distribution of transactions across systems evenly for horizontal scale.  Cloning of application or web services tends to be relatively easy to perform and allows us to scale the number of transactions processed.  Unfortunately, it doesn’t really help us when trying to scale the data we must manipulate to perform these transactions as memory caching of data unique to several customers or unique to disparate functions might create a bottleneck that keeps us from scaling these services without significant impact on customer response time.  To solve these memory constraints we’ll look to the Y and Z Axes of our scale cube.

Y Axis – Split by Function, Service, or Resource

akf scale cube y axis infographic Scalability Rules 8

Looking at a relatively simple e-commerce site, Y Axis splits resources by the verbs of signup, login, search, browse, view, add to cart, and purchase/buy.  The data necessary to perform any one of these transactions can vary significantly from the data necessary for the other transactions.

In terms of security, using the Y Axis to segregate and encrypt Personally Identifiable Information (PII) to a separate database provides the required security without requiring all other services to go through a firewall and encryption.  This decreases cost, puts less load on your firewall, and ensures greater availability and uptime.

Y Axis splits also apply to a noun approach.  Within a simple e-commerce site data can be split by product catalog, product inventory, user account information, marketing information, and so on.

While Y axis splits are most useful in scaling data sets, they are also useful in scaling code bases.  Because services or resources are now split, the actions performed and the code necessary to perform them are split up as well.  This works very well for small Agile development teams as each team can become experts in subsets of larger systems and don’t need to worry about or become experts on every other part of the system.

Z Axis – Separate Similar Things

akf scale cube z axis infographic Scalability Rules 9

Z Axis splits are effective at helping you to scale customer bases but can also be applied to other very large data sets that can’t be pulled apart using the Y Axis methodology.  Z Axis separation is useful for containerizing customers or a geographical replication of data.  If Y Axis splits are the layers in a cake with each verb or noun having their own separate layer, a Z Axis split is having a separate cake (sharding) for each customer, geography, or other subset of data.

This means that each larger customer or geography could have its own dedicated Web, application, and database servers.  Given that we also want to leverage the cost efficiencies enabled by multitenancy, we also want to have multiple small customers exist within a single shard which can later be isolated when one of the customers grows to a predetermined size that makes financial or contractual sense.

For hyper-growth companies the speed with which any request can be answered to is at least partially determined by the cache hit ratio of near and distant caches.  This speed in turn indicates how many transactions any given system can process, which in turn determines how many systems are needed to process a number of requests.

Splitting up data by geography or customer allows each segment higher availability, scalability, and reliability as problems within one subset will not affect other subsets.  In continuous deployment environments, it also allows fragmented code rollout and testing of new features a little at a time instead of an all-or-nothing approach.


This is a quick and dirty breakdown of Scalability Rules that have been applied at thousands of successful companies and provided near infinite scalability when properly implemented.  We love helping companies of all shapes and sizes (we have experience with development teams of 2-3 engineers to thousands).  Contact us to explore how we can help guide your company to scale your organization, processes, and technology for hyper growth!

Subscribe to the AKF Newsletter

Contact Us

Architecture Principles: Messaging Systems – Smart End Points, Dumb Pipes

July 29, 2019  |  Posted By: Marty Abbott

Asynchronous messaging systems are a critical component of many highly scalable and highly available architectures.  But, as with any other architectural component, these solutions also need attention to ensure availability and scalability.  The solution should scale along one of the scale cube axes, either X, Y or Z.  The solution should also both include and enable the principle of fault isolation.  Finally, it should scale cost both gracefully and cost effectively while enabling high levels of organizational scale. These requirements bring us to the principle of Smart End Points and Dumb Pipes. 

Fast time to market within software development teams is best enabled when we align architectures and organizations such that coordination between teams is reduced (see Conway’s Law and our white paper on durable cross functional product teams).  When services within an architecture communicate, especially in the case of one service “publishing” information for the consumption of multiple services, the communication often needs to be modified or “transformed” for the benefit of the consumers.  This transformation can happen at the producer, the transport mechanism or the consumer.  Transformation by the producer for the sake of the consumer makes little sense, as the producer service and its associated team have low knowledge of the consumer needs and it creates an unnecessary coordination task between producer and consumer.  Transformation “in flight” by the service similarly implies a team of engineers who must be both knowledgeable about all producers and consumers and an unnecessary coordination activity.  Transformation by the consumer makes most sense, as the consumer has the most knowledge of what they need from the message and eliminates reliance upon and coordination with other teams.  The principle of smart end points and dumb pipes then creates the lowest coordination between teams, the highest level of organizational scale and the best time to market option.

To be successful achieving a dumb pipe, we introduce the notion of a pipe contract.  Such a contract explains the format of messages produced on and consumed from the pipe.  It may indicate that the message will be in a tag delimited format (XML, YAML, etc), abide by certain start and end delimiters, and for the sake of extensibility allow for custom tags for new information or attributes.  The contract may also require that consumption not be predicated on strict order of elements (e.g. title is always first) but rather by strict adherence to tag and value regardless of where each tag is in the message. 

Smart End Points Dumb Pipes Message Contract

By ensuring that the pipe remains dumb, the pipe can now scale both more predictably and cost effectively.  As no transformation compute happens within the pipe, its sole purpose becomes the delivery of the message conforming to the contract.  Large messages do not go through computationally complex transformation, meaning low compute requirements and therefore low cost.  The lack of computation also means no odd “spikes” as transforms start to stall delivery and eat up valuable resources.  Messages are delivered faster (lower latency).  An additional unintended benefit is that because transforms aren’t part of message transit, a type of failure (computational/logical) does not hinder message service availability.

The 2x2 matrix below summarizes the options here, clearly indicating smart end points and dumb pipes as the best choice.

Smart End Points Dumb Pipes Comparison 2x2 Matrix

One important callout here is that “streams processing”, which is off-message platform evaluation of message content, is not a violation of the smart end points, dumb pipes concept.  The solutions performing streams processing are simply consumers and producers of messages, subscribing to the contract and transport of the pipe.

Summarizing all of the above, the benefits of smart end points and dumb pipes are:

  1. Lower cost of messaging infrastructure - pushes the cost of goods sold closer to the producer and consumer.  Allows messaging infrastructure to scale by number of messages instead of computational complexity of messages.  License cost is reduced as fewer compute nodes are needed for message transit.
  2. Organization Scalability – teams aren’t reliant on transforms created by a centralized team.
  3. Low Latency – because computation is limited, messages are delivered more quickly and predictably to end consumers.
  4. Capacity and scalability of messaging infrastructure – increased significantly as compute is not part of the scale of the platform.
  6. Availability of messaging infrastructure – because compute is removed, so is a type of failure.  As such, availability increases.

Two critical requirements for achieving smart end points and dumb pipes:

  • Message contracts – all messages need to be of defined form.  Producers must adhere to that form as must consumers.
  • Team behaviors – must assure adherence to contracts.

AKF Partners helps companies build scalable, highly available, cost effective, low-latency, fast time to market products.  Call us – we can help!

Subscribe to the AKF Newsletter

Contact Us

Implementing Scalable, Highly Available Messaging Services

July 19, 2019  |  Posted By: Marty Abbott

When AKF Partners uses the term asynchronous, we use it in the logical rather than the physical (transport mechanism) sense.  Solutions that communicate asynchronously do not suspend execution and wait for a return – they move off to some other activity and resume execution should a response arrive. 

Asynchronous, non-blocking communications between service components help create resilient, fault isolated (limited blast radius) solutions. Unfortunately, while many teams spend a great deal of time ensuring that their services and associated data stores are scalable and highly available, they often overlook the solutions that tend to be the mechanism by which asynchronous communications are passed.  As such, these message systems often suffer from single points of failure (physical and logical), capacity constraints and may themselves represent significant failure domains if upon their failure no messages can be passed.

The AKF Scale Cube can help resolve these concerns.  The same axes that guide how we think about applications, servers, services, databases and data stores can also be applied to messaging solutions.

AKF Scale Cube for Messaging Services

X Axis

Cloning or duplication of messaging services means that anytime we have a logical service, we should have more than one available to process the same messages.  This goes beyond ensuring high availability of the service infrastructure for any given message queue, bus or service – it means that where one mechanism by which we send messages exist, another should be there capable of handling traffic should the first fail. 

As with all uses of the X axis, N messaging services (where N>1) can allow the passage of all similar messages.  Messages aren’t replicated across the instances, as doing so would eliminate the benefit of scalability.  Rather, messages are sent to one instance, but all producers and consumers send or consume to each of the N instances.  When an instance fails, it is taken out of rotation for production and when it returns its messages are consumed and producers can resume sending messages through it.  Ideally the solution is active-active with producers and consumers capable of interacting with all N copies as necessary.

Y Axis

The Y axis is segmentation by a noun (resource or message type) or verb (service or action).  There is very often a strong correlation between these.

Just as messaging services often have channels or types of communication, so might you segment messaging infrastructure by the message type or channel (nouns).  Monitoring messages may be directed to one implementation, analytics to a second, commerce to a third and so on.  In doing so, physical and logical failures can be isolated to a message type.  Unanticipated spikes in demand on one system, would not slow down the processing of messages on other systems.  Scale is increased through the “sharding” by message type, and messaging infrastructure can be increased cost effectively relative to the volume of each message type.

Alternatively, messaging solutions can be split consistent with the affinity between services.  Service A, B and C may communicate together but not need communication with D, E and F.  This affinity creates natural fault isolation zones and can be leveraged in the messaging infrastructure to isolate A, B and C from D, E and F.  Doing so provides similar benefits to the noun/resource approach above – allowing the solutions to scale independently and cost effectively.

Z Axis

Whereas the Y axis splits different types of things (nouns or verbs), the Z axis splits “similar” things.  Very often this is along a customer and geography boundary.  You may for instance implement a geographically distributed solution in multiple countries, each country having its own processing center.  Large countries by be subdivided, allowing solutions to exist close to the customer and be fault isolated from other geographic partitions.

Your messaging solution should follow your customer-geography partitions.  Why would you conveniently partition customers for fault isolation, low latency and scalability but rely on a common messaging solution between all segments?  A more elegant solution is to have each boundary have its own messaging solution to increase fault tolerance and significantly reduce latency.  Even monitoring related would ideally be handled locally and then forwarded if necessary, to a common hub.

We have held hundreds of on-site and remote architectural 2 and 3-day reviews for companies of all sizes in addition to thousands of due diligence reviews for investors. Contact us to see how we can help!

Subscribe to the AKF Newsletter

Contact Us

Architectural Principle: Use Commodity Hardware (and Cloud tools)

July 18, 2019  |  Posted By: Pete Ferguson

use Commodity Hardware - Cheaper is better most of the time. Focus on the need, not the frills. Why? Excess capabilities beyond the need add cost for little additional value

The Tail (pun intended) of Dorothy-Boy the Goldfish

When my now-adult son was 5, he was constantly enamored in the pet aisle of the local superstore of the vast variety of fish of many sizes and colors and eventually convinced us to buy a goldfish.

We paid under $20, bowl, food, rocks, props and all, and “Dorothy-boy” came home with us (my son’s idea for a name, not mine). Of course, there were several mornings that Dorothy-boy was found upside down and a quick trip to the store and good scrubbing of the bowl remedied potential heartbreak before my son even knew anything was wrong. 

Contrast that to my grandmother’s beloved Yorkie Terrier, Sergeant. When Sarge got sick, my grandmother spent thousands on doctor’s office visits, specialized food, and several surgeries. The upfront cost of a well-bred dog was significant enough, the annual upkeep for poor little Sarge was astronomical, but he lived a good, spoiled, and well-loved life.

That is why at AKF we often use the analogy of “goldfish, not thoroughbreds” with our clients to help them make decisions on hardware and software solutions.

Implement only what you need, when you need it, avoiding extraneous features and capabilities. Why? Repeatable, incremental systems combine cost effectiveness with smaller impact of failures and easier additions to scale

If a “pizzabox” 1U Dell or HP (or pick your brand) server dies, no biggie, probably have a few others laying around or can purchase and spin up new ones in days, not months or quarters of a year. Also allows for quickly adding additional web servers, application servers, test servers, etc. The cost per compute cycle is very low and can be scaled very quickly and affordably.

“Cattle not pets” is another way to think about hardware and software selection. When it comes to your next meal (assuming you are not a vegetarian), what is easier to eat with little thought? A nameless cow or your favorite pet?

If your vendor is sending you on annual vacations (err, I mean business conferences) and providing your entire team with tons of swag, you are likely paying way too much in upfront costs and ongoing maintenance fees, licensing, and service agreements. Sorry, your sales rep doesn’t care that much about you; they like their commissions based on high markups better.

Having an emotional attachment to your vendors is dangerous as it removes objectivity in evaluating what is best for your company’s customers and future.

Untapped Capacity at a Great Cost

It is not uncommon for monolithic databases and mainframes to be overbuilt given the upfront cost resulting in only being utilized at 10-20% of capacity. This means there is a lot of untapped potential that is being paid for – but not utilized year-over-year.

Trying to replace large, propriety systems is very difficult due to the lump sum of capital investment required. It is placed on the CapEx budget SWAG year after year and struck early on in the budgeting process as a CFO either dictates or asks, “can you live without the upgrade for one more year?”

Graphic depicting unused compute/storage for large proprietary systems

We have one client that finally got budget approval for a major upgrade to a large system, and in addition to the substantial costs for hardware and licensing of software, they also have over 100 third-party consultants on-site for 18 months sitting in their cubicles (while they are short on space for their employees) to help with the transition. The direct and in-direct costs are massive, including the innovation that is not happening while resources are focused on an incremental upgrade to keep up, not get ahead.

The bloat is amazing and it is easy to see why startups and smaller companies build in the cloud and use opensource databases and in the process, erode market share from the industry behemoths with a fraction of the investment.

Commodities Defined

The goal of commodity systems and solutions is to get as much value for as minimal of an investment as possible. This allows us to build highly available and scalable solutions.

Focus on getting the maximum performance for the least amount of cost for:

  • Compute
  • Storage
  • Network

We often see an interesting dichotomy of architectural principles within aging companies – teams report there is “no money” for new servers to provide customers with a more stable platform, but hundreds of thousands of dollars sunk into massive databases and mainframes.

Vendor lock and budget lock are two reasons why going with highly customized and proprietary systems shackles a company’s growth.

Forget the initial costs for specialized systems – which are substantial – usually the ongoing costs for licensing, service agreements, software upgrade support, etc. required to keep a vendor happy would likely be more than enough to provide a moderately-sized company with plenty of financial headroom to build out many new redundant, and highly available, commodity servers and networks.

Properly implementing along all three axes of the AKF Scale Cube requires a lot of hardware and software - not easily accomplished if providing a DR instance of your database also means giving your first-born and second-born children to Oracle.

Does this principle apply to cloud?

With the majority of startups never racking a single server themselves, and many larger companies migrating to AWS/Azure/Google, etc. – you might think this principle does not apply in the new digital age.

But where there is a will (or rather, profit), there is a way … and as the race for who can catch up to Amazon for hosting market share continues, vendor-specific tools that drive up costs are just as much of a concern as proprietary hardware is in the self-hosting world.

Often our venture capitalists and investor clients ask us about their startup’s hosting fees and if they should be concerned with the cost outpacing financial growth or if it is usual to see costs rise so quickly. Amazon and others have a lot to gain for providing discounted or free trials of proprietary monitoring, database, and other enhancements in hopes that they can ensure better vendor lock – and fair enough – a service that you can’t get with the competition.

We are just as concerned with vendor lock-in the cloud as we are with vendor lock-in for self-hosted solutions during due diligence and architectural reviews of our clients.


  • Commodity hardware allows companies faster time to market, scalability, and availability
  • The ROI on larger systems can rarely compete as the costs are such a large barrier to entry and often compute cycles are underutilized
  • The same principles apply to hosted solutions – beware of vendor specific tools that make moving your platform elsewhere more difficult over time

We have held hundreds of on-site and remote architectural 2 and 3-day reviews for companies of all sizes in addition to thousands of due diligence reviews for investors. Contact us to see how we can help!

Subscribe to the AKF Newsletter

Contact Us

The Circuit Breaker Pattern - Dos and Don'ts

July 8, 2019  |  Posted By: Marty Abbott

Circuit Breaker Pattern Overview

The microservice Circuit Breaker pattern is an automated switch capable of detecting extremely long response times or failures when calling remote services or resources.  The circuit breaker pattern proxies or encapsulates service A making a call to remote service or resource B.  When error rates or response times exceed a desired threshold, the breaker “pops” and returns an appropriate error or message regarding the interface status.  Doing so allows calls to complete more quickly, without tying up TCP ports or waiting for traditional timeouts.  Ideally the breaker is “healing” and senses the recovery of B thereby resetting itself.


The circuit breaker analogy works well in that it protects a given circuit for calls in series.  Unfortunately, it misses the true analogy of tripping to protect the propagation of a failure to other components on other circuits.  We often use the term circuit breaker in our practice to refer to either the technique of fault isolation or the microservice pattern of handling service to service faults.  In this article, we use the circuit breaker consistent with the microservice meaning.
Microservice Circuit Breaker Overview

Problems the Circuit Breaker Fixes

Generally speaking, we consider service to service calls to be an anti-pattern to be avoided whenever possible due to the multiplicative effect of failure and the resulting lowering of availability.  There are, however, sometimes that you just can't get around making distant calls.  Examples are:

  1. Resource (e.g. database) Calls: Necessary to interact with ACID or NoSQL Solutions.
  2. Third Party Integrations: Necessary to interact with any third party.  While we prefer these to be asynchronous, sometimes they must be synchronous.

In these cases, it makes sense to add a component, such as the circuit breaker, to help make the service more resilient.  While the breaker won't necessarily increase the availability of the service in question, it may help reduce other secondary and tertiary problems such as the inability to access a service for troubleshooting or restoration upon failure.

Principles to Apply

  1. Avoid the need for circuit breakers whenever possible by treating calls in series as an anti-pattern.
  2. When calls must be made in series, attempt to use an asynchronous and non-blocking approach.
  3. Use the circuit breaker to help speed recovery and identification of failure, and free up communication sockets more quickly.

When to use the Circuit Breaker Pattern

  • Useful for calls to resources such as databases (ACID or BASE).
  • Useful for third party synchronous calls over any distance.
  • When internal synchronous calls can't otherwise be avoided architecturally, useful for service to service calls under your control.

Key Takeaways

The circuit breaker won't fix availability problems resulting from a failed service or resource.  It will make the effects of that failure more rapid which will hopefully:

  • Free up communication resources (like TCP sockets) and keep them from backing up.
  • Help keep shared upstream components (e.g. load balancers and firewalls) from similarly backing up and failing.
  • Help keep the failed component or service accessible for more rapid troubleshooting and alerting.
  • Always ensure to have alerts fired on breaker open situations to help aid in faster time to detect (TTD).

AKF Partners has helped hundreds of companies implement new microservice architectures and migrate existing monolithic products to microservice architectures.  Give us a call – we can help!



Subscribe to the AKF Newsletter

Contact Us

Role of Architectural Principles in Software Development and Systems Development

July 1, 2019  |  Posted By: Pete Ferguson

Picture of hand reviewing a blueprint from
You wouldn’t (hopefully) think of building a house without first sitting down with an architect to come up with a good plan.

While building a house is a waterfall process, that doesn’t mean we can throw out good architecture when moving to an Agile methodology in software development. Sound architectural principles allow team autonomy to select and ensure that any new significant design meets standards for high availability and scalability of your website, product, or service.

Good architectural principles ensure stability, compatibility, and reliability. Many post mortems after a major incident with which I’ve been involved have uncovered root causes resulting from teams not following agreed-upon architectural principles – and unfortunately more often than not – seeing that teams do not have written and followed architecture standards.

Architectural Principles Are Guidelines for Success

Often in Agile software development, we confuse the desire to innovate and do things in new and differentiating ways with a similar antagonist notion that we shouldn’t be shackled with rules and procedures. Many upstarts and smaller companies offer freedom from layers of policy and red tape that stifle out speed, time to delivery, and innovation – and adding in architectural standards and review boards can sound very bureaucratic.

Architectural Principles are not meant to be restrictive. When written and executed properly, they are meant to aid growth and ensure future success. Structured architecture should keep things simple, expandable, and resilient and help teams establish autonomy rather than anarchy.

Successful companies are able to balance consistency with speed, which ensures future efforts aren’t encumbered by bugs, having to refactor excessive amounts of code, or be haunted by the sins of past shortcuts. The key is to keep things simple and dependable!

… match the effort and approach to the complexity of the problem. Not every solution has the same complexity—take the simplest approach to achieve the desired outcome.

(Abbott, Martin L. Scalability Rules. Pearson Education. Kindle Edition.)

AKF Architectural Principles

On the many technical due diligence and extended workshops I’ve attended in my tenure with AKF, I’ve seen how successful companies comply – and struggling companies avoid – the following principles:

AKF Architectural Principles Slide: Use Mature Technology, Use Commodity Hardware, Scale Out - Not Up, Isolate Faults, Design for all three Axes of the AKF Scale Cube, Design for Multiple Live and Independent Sites, N+2 Design, and Design to be Disabled

AKF Architectural Principles Slide 2: Microservices for Breadth, Libraries for Depth - decompose monoliths, Buy when non-core, Build small, release small, fail fast - crawl walk run, automate everything, use stateless systems, always design asynchronous communications, design to be monitored, and design for rollback and to be disabled

Everything we develop should be based on a set of architectural principles and standards that define and guide what we do. Successful software engineering teams employ architectural review boards to meet with teams and review existing and planned systems to ensure principles are being followed. Extremely successful companies have a culture that constantly looks at ways to better implement their agreed-upon architectural principles so that review boards are only a second set of eyes instead of a policing force.

With 12 years of product architecture and strategy experience, AKF Partners is uniquely positioned to be your technology partner. Let us know how we can help your organization.

This is the first of a series of articles that will go into greater depth of each of the above principles:

Image Credit -

Subscribe to the AKF Newsletter

Contact Us

Architectural Principles - Fault Isolation & Swimlanes

July 1, 2019  |  Posted By: Pete Ferguson

top down picture of swimming competition highlighting the distinct swimlanes in a swimming pool

This is one of several articles on recommended architectural principles and goes into deeper depth to our post on the AKF Scale Cube made reference to a concept that we call “Fault Isolation” or more commonly – “Swim lanes” or “Swim-laned Architectures”.  We sometimes also call “swim lanes” fault isolation zones or fault isolated architecture.

Fault Isolation Defined

A “swim lane” or fault isolation zone is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of the said boundary. Think of this as the “blast radius” of failure meant to answer the question of “What gets impacted should any service fail?” The benefit of fault isolation is twofold:

  1. Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain. Once something breaks, because the failure is limited in scope, it can be more rapidly identified and fixed. Recovery time objectives (RTO) are subsequently decreased which increases overall availability.
  2. Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. The “blast radius” of failure is contained. As such, and depending upon approach, only a portion of users or a portion of the functionality of the product is affected. This is akin to circuit breakers in your house – the breaker exists to limit the fault zone for any load that exceeds a limit imposed by the breaker. Failure propagation is contained by the breaker tripping, preserving power to devices which are not affected.

Architecting Fault Isolation

A fault isolated architecture is one in which each failure domain is completely isolated. We use the term “swim lanes” to depict the separations, similar to how a floating line of buoys keeps each swimmer in his or her lane during a race. In order to achieve this in systems architecture, ideally there are no synchronous calls between swimlanes or failure domains made pursuant to a user request.

User-initiated synchronous calls between failure domains are absolutely forbidden in this type of architecture as any user-initiated synchronous call between fault isolation zones, even with an appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a synchronous call to any other service in another domain, to any service outside of the domain, or if the domain receives synchronous calls from other domains or services.

detailed view of a swim lane architecture showing how asynchronous calls should be fault isolated

It is acceptable, but not advisable, to have asynchronous calls between domains and to have non-user initiated synchronous calls between domains (as in the case of a batch job collecting data for the purposes of reporting in another failure domain). If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not call port overloads on any services.

As previously indicated, a swim lane should have all of its services located within the failure domain. For instance, if database [read/writes] are necessary, the database with all appropriate information for that swim lane should exist within the same failure domain as all of the application and web servers necessary to perform the function or functions of the swim lane. Furthermore, that database should not be used for other requests of service from other swim lanes. Our rule is one production database on one host.

The figure below demonstrates the components of software and infrastructure that are typically fault isolated:
Diagram of infrastructure showing how to establish swim lane fault isolation zones

Rarely are shared higher level network components isolated (e.g. border systems and core routers).
Sometimes, if practical, firewalls and load balancers are isolated. These are especially the case under very high demand situations where a single pair of devices simply wouldn’t meet the demand.

The remainder of solutions are always isolated, with web-servers, top of rack switches (in non IaaS implementations), compute (app servers) and storage all being properly isolated.

Applying Fault Isolation with AKF’s Scale Cube

As we have indicated with the AKF Scale Cube in the past, there are many ways in which to think about swimlaned architectures. Swim lanes can be isolated along the axes of the Scale Cube as shown below with AKF’s circuit breaker analogy to fault isolation. 

Diagram showing how each data center should be isolated from each other to limit a failure to that data center only
Fault isolation in X-Axis would mean replicating everything for high availability – and performing the replication asynchronously and in an eventually consistent (rather than a consistent) fashion. For example, when a data center fails the fault will be isolated to the one failed data center or multiple availability zones. This is common with traditional disaster recovery approaches, though we do not often advise it as there are better and more cost-effective solutions for recovering from a disaster.

Diagram showing how different portions of a website should be isolated so that a failure with favorites item doesn't also fail the shopping cart
Fault Isolation in the Y-Axis can be thought in terms of a separation of services e.g. “login” and “shopping cart” (two separate swim lanes) with each having the web and app servers as well as all data stores located within the swim lane and answering only to systems within that swim lane. Each portion of a page is delivered from a separate service reducing the blast radius of a potential fault to its swim lane. 

The example above of a commerce site shows different components of the page broken down into sections for login, buy again, promotions, shopping cart, and checkout. Each component would reside within separate applications, hosted on different servers with properly isolated services.

diagram showing how customers should be segmented across each data center so that one data center failure would not take down all customers
Another approach would be to perform a separation of your customer base or separation of your order numbers or product catalog. Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z-Axis swimlane along customer, order number, or product ID lines. More beneficially, if we are interested in the fastest possible response times to customers, we may split along geographic boundaries with each pointing to the closest data center within that region. Besides contributing to faster customer response times, these implementations can also help ensure we are compliant with data sovereignty laws (GDPR for example) unique to different countries or even states within the US.

Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform. AKF has helped achieve high availability through fault isolation.  Contact us to see how we can help you achieve the same fault tolerance.

AKF Partners helps companies create highly available, fault-isolated swim lane solutions.  Send us a note - we’d love to help you!

Subscribe to the AKF Newsletter

Contact Us

Microservice Bulkhead Pattern - Dos and Don'ts

June 27, 2019  |  Posted By: Marty Abbott

Bulkhead Pattern Overview

Bulkheads in ships separate components or sections of a ship such that if one portion of a ship is breached, flooding can be contained to that section.  Once contained, the ship can continue operations without risk of sinking.  In this fashion, ship bulkheads perform a similar function to physical building firewalls, where the firewall is meant to contain a fire to a specific section of the building.

The microservice bulkhead pattern is analogous to the bulkhead on a ship.  By separating both functionality and data, failures in some component of a solution do not propagate to other components.  This is most commonly employed to help scale what might be otherwise monolithic datastores.  The bulkhead is then a pattern for implementing the AKF principle of “swimlanes” or fault isolation.

Bulkhead pattern usage

Problems the Bulkhead Pattern Fixes

The bulkhead pattern helps to fix a number of different quality of service related issues.

  • Propagation of Failure:  Because solutions are contained and do not share resources (storage, synchronous service-to-service calls, etc), their associated failures are contained and do not propagate.  When a service suffers a programmatic (software) or infrastructure failure, no other service is disrupted.
  • Noisy Neighbors:  If implemented properly, network, storage and compute segmentation ensure that abnormally large resource utilization by a service does not affect other services outside of the bulkhead (fault isolation zone).
  • Unusual Demand:  The bulkhead protects other resources from services experiencing unpredicted or unusual demand.  Other resources do not suffer from TCP port saturation, resulting database deterioration, etc.

Principles to Apply

  1. Share Nearly Nothing:  As much as possible, services that are fault isolated or placed within a bulkhead should not share databases, firewalls, storage, load balancers, etc.  Budgetary constraints may limit the application of unique infrastructure to these services.  The following diagram helps explain what should never be shared, and what may be shared for cost purposes.  The same principles apply, to the extent that they can be managed, within IaaS or PaaS implementations.
  2. Bulkhead pattern usage
  3. Avoid synchronous calls to other services:  Service to service calls extend the failure domain of a bulkhead.  Failures and slowness transit blocking synchronous calls and therefore violate the protection offered by a bulkhead.

Put another way, the dimensions of a bulkhead or failure domain is the largest boundary across which no critical infrastructure is shared, and no synchronous inter-service calls exist. 

Anti-Patterns to Avoid

The following anti-patterns each rely on either synchronous service to service communication or sharing of data solutions.  As such, they represent solutions that should not be present within a bulkhead.

When to use the Bulkhead Pattern

  • Apply the bulkhead pattern whenever you want to scale a service independent of other services.
  • Apply the bulkhead pattern to fault isolate components of varying risk or availability requirements.
  • Apply the bulkhead pattern to isolate geographies for the purposes of increased speed/reduced latency such that distant solutions do not share or communicate and thereby slow response times.

AKF Partners has helped hundreds of companies implement new microservice architectures and migrate existing monolithic products to microservice architectures.  Give us a call – we can help!

Subscribe to the AKF Newsletter

Contact Us

 1 2 3 >  Last ›


Most Popular: