AKF Partners

Abbott, Keeven & Fisher Partners Partners in Technology

Growth Blog

Scalability and Technology Consulting Advice for SaaS and Technology Companies

Monitoring the Good, the Bad, the Ugly for Improved Fault Detection

July 8, 2018  |  Posted By: Robin McGlothin

AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident.  Business metric monitors will not tell you where or what the problem is, rather – and most importantly – they tell you something appears to be abnormal and should be investigated, that something has affected your customer experience.




A significant part of recovery time (and therefore availability) is the time required to detect and localize service incidents.  A 2013 study by Business Internet Group of San Francisco found that of the 40 top-performing websites (as identified by KeyNote Systems), 72% had suffered user-visible failures in common functionality, such as items not being added to a shopping cart or an error message being displayed.

Our conversations with clients confirm that detecting these failures is a significant problem.  AKF Partners estimates that 75% of the time spent recovering from application-level failures is time spent detecting them!  Application-level failures can sometimes take days to detect, though they are repaired quickly once found.  Fast detection of these failures (Time to Detect – TTD) is, therefore, a key problem in improving service availability.
 
                The duration of a product impairment is TTR.

To improve TTR, implement a good notification system that first, based on business metrics, tells you that an error affecting your users is happening.  Then, rely upon application and system monitoring to inform you of where and what has failed.  Make sure to have good, easy-to-view logs for all errors, warnings and other critical data your application creates.  We already have many technologies in this space and we just need to employ them in an effective manner with the focus on safeguarding the client experience.

Two relatively simple methods, both forms of Statistical Process Control (SPC – defined below), can improve TTD:

  1. Business KPI Monitors (Monitor Real User Behavior): Passively monitor critical user transactions such as logins, queries, reports, etc.  Use math to determine abnormal behavior.  This is the first line of defense.
  2. Synthetic Transactions (Simulate User Behavior):  Synthetic transactions are scripted actions that attempt to mimic real customer behavior. Examples might be sign-ons, add to cart, etc. They provide a more meaningful view of your customers’ experiences vs. just looking at page load times, error rates, and similar. Do this with Keynote or a similar product and expand it to an internal systems scope.  Alerts from a passive monitor can be confirmed or denied and escalated as appropriate.  This is the second line of defense.
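As a rough illustration of how the two lines of defense fit together, here is a minimal sketch in which a passive business-metric alert is confirmed or denied by a scripted synthetic run. The step names, the `triage` function and its return strings are hypothetical; a real setup would drive a browser or API client through a product such as Keynote rather than local callables.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical step type: a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_synthetic_transaction(steps: Sequence[Step]) -> Tuple[bool, str]:
    """Run scripted steps in order and stop at the first failure.

    Returns (ok, failed_step_name); failed_step_name is "" when all steps pass.
    """
    for name, action in steps:
        try:
            if not action():
                return False, name
        except Exception:
            # Any exception raised by a step counts as a failed step.
            return False, name
    return True, ""


def triage(passive_alert: bool, steps: Sequence[Step]) -> str:
    """Confirm or deny a passive business-metric alert with a synthetic run."""
    if not passive_alert:
        return "ok"
    ok, failed_step = run_synthetic_transaction(steps)
    # A passing synthetic run denies the alert; a failing one confirms it.
    return "unconfirmed" if ok else f"escalate: {failed_step}"
```

With steps such as `[("sign_on", ...), ("add_to_cart", ...)]`, a failed add-to-cart step would return `"escalate: add_to_cart"`, giving the on-call team a starting point.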

Monitor the Bad – potential and actual bad things (alert before they happen) – then tune and continuously improve (iterate!)

If you can’t identify all problem areas, identify as many as possible.  The best monitoring starts before there’s a problem and extends beyond the crisis.
Because once the crisis hits, that’s when things get ugly!  That’s when things start falling apart and people point fingers.



At times, failures do not disable the whole site, but instead cause brown-outs, where part of a site’s functionality is disabled or only some users are unable to access the site.  Many of these failures are application-level failures that change the user-visible functionality of a service but do not cause obvious lower-level failures detectable by service operators.  Effective monitoring will detect these faults as well. 



The more proactive you can be about identifying the issues, the easier it will be to resolve and prevent them.

In fault detection, the aim is to determine whether an abnormal event happened or when an application being monitored is out of control.  The early detection of a fault condition is important in avoiding quality issues or system breakdown, and this can be achieved through the proper design of effective statistical process control with upper & lower limits identified.  If the values of the monitoring statistics exceed the control limits of the corresponding statistics, a fault is detected.  Once a fault condition has been positively detected, the next step is to determine the root cause of the out-of-control status.
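The control-limit check described above can be sketched in a few lines. This is only an illustration: the three-standard-deviation limits follow the common control-chart convention rather than anything specific to our clients, and the baseline numbers are made up.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def control_limits(baseline: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) control limits: mean +/- k sample standard deviations.

    k = 3.0 is an assumed, conventional choice; tune it for your own metric.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread


def is_fault(value: float, limits: Tuple[float, float]) -> bool:
    """A fault is detected when the monitored value falls outside the limits."""
    lower, upper = limits
    return value < lower or value > upper


# Hypothetical logins-per-minute samples taken during normal operation.
baseline = [100, 102, 98, 101, 99]
limits = control_limits(baseline)
```

Here `is_fault(97, limits)` stays quiet while `is_fault(60, limits)` fires. The check only says that something is abnormal; finding the root cause is still a separate step.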


One downside of the SPC method is that significant changes in amplitude (natural increases in your business metrics) can cause problems.  An alternative to SPC is First and Second Derivative testing.  These tests tell if the actual and expected curve forms are the same.
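One way to sketch a derivative test is to compare discrete first and second differences of the actual and expected curves. The normalization by the series mean (so that natural growth in amplitude does not trigger an alert) and the `tolerance` value are our assumptions for this example, not a prescribed method.

```python
from typing import List, Sequence


def diff(series: Sequence[float]) -> List[float]:
    """Discrete derivative: the change between consecutive samples."""
    return [b - a for a, b in zip(series, series[1:])]


def normalize(series: Sequence[float]) -> List[float]:
    """Scale a series by its mean (assumed positive) to remove amplitude."""
    average = sum(series) / len(series)
    return [x / average for x in series]


def same_shape(actual: Sequence[float], expected: Sequence[float],
               tolerance: float = 0.1) -> bool:
    """True when first and second differences of both curves agree within tolerance."""
    a, e = normalize(actual), normalize(expected)
    first_a, first_e = diff(a), diff(e)
    second_a, second_e = diff(first_a), diff(first_e)
    first_ok = all(abs(x - y) <= tolerance for x, y in zip(first_a, first_e))
    second_ok = all(abs(x - y) <= tolerance for x, y in zip(second_a, second_e))
    return first_ok and second_ok
```

A week in which traffic doubled but kept the same daily curve passes this check, whereas a sudden nosedive partway through the day does not.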



Here’s a real-world example of where business metrics help us determine changes in normal usage at eBay. 

We had near real-time graphs of user metrics such as bids, listings, logins, and new user registrations.  The data was graphed week over week.  Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys.  These graphs were displayed in the Network Operations Center, which was staffed 24x7.  Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.

Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nosedive.  The NOC quickly initiated the SEV1 process and technical resources checked their areas.  The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower.  Roughly 20 minutes into the SEV1 process, the root cause was identified.  The finale episode of American Idol was being broadcast.  Our site was fine – but our customers had other things on their mind.  The business metric monitors worked – they gave warning of an aberrant usage pattern.

How would your company react to this critical change in normal usage patterns?  Use business metric monitors to detect workload shifts.



Eight Reasons To Avoid Stored Procedures

June 18, 2018  |  Posted By: Pete Ferguson

In my short tenure at AKF, I have found the topic of Stored Procedures (SPROCs) to be provocatively polarizing.  As we conduct a technical due diligence with a fairly new upstart for an investment firm and ask if they use stored procedures on their database, we often get a puzzled look as though we just accused them of dating their sister and their answer is a resounding “NO!”

However, when conducting assessments of companies that have been around awhile and are struggling to quickly scale, move to a SaaS model, and/or migrate from hosted servers to the cloud, we find “server huggers” who love to keep their stored procedures on their database.

At two different clients earlier this year, we found companies who have thousands of stored procedures in their database.  What was once seen as a time-saving efficiency is now one of several major obstacles to SaaS and cloud migration.

AKF Scalability Rules Why Stored Procedures Shouldn't be Saved on the Database

In our book, Scalability Rules: Principles for Scaling Web Sites (Abbott, Martin L. Scalability Rules: Principles for Scaling Web Sites), Marty outlines many reasons why stored procedures should not be kept in the database.  Here are the top 8:

  1. Cost: Databases tend to be one of the most expensive systems or services within the system architecture.  Each transaction cost increases with each additional SPROC.  Increasing the cost of scale by making a synchronous call to the ERP system for each transaction – while also reducing the availability of the product platform by adding yet another system in series – doesn’t make good business sense.
  2. Creates a Monolith: SPROCs on a database create a monolithic system which cannot be easily scaled.
  3. Limits Scalability: The database is a governor of scale; SPROCs steal capacity by running other than relational transactions on the database.
  4. Limits Automated Testing: SPROCs limit the automation of code testing (in many cases it is not as easy to test stored procedures as it is the other code that developers write), slowing time to market and increasing cost while decreasing quality.
  5. Creates Lock-in: Changing to an open-source or a NoSQL solution requires a plan to migrate SPROCs or replace the logic in the application.  It also makes it more difficult to switch to new and compelling technologies, negotiate better pricing, etc.
  6. Adds Unneeded Complexity to Shard Databases: Using SPROCs and business logic on the database makes sharding and replacement of the underlying database much more challenging.
  7. Limits Speed To The Weakest Link: Systems should scale independently relative to individual needs. When business logic is tied to the database, each of them needs to scale at the same rate as the system making requests of them - which means growth is tied to the slowest system.
  8. More Team Composition Flexibility: By separating product and business intelligence in your platform, you can also separate the teams that build and support those systems.  If a product team is required to understand how their changes impact all related business intelligence systems, it will slow down their pace of innovation as it significantly broadens the scope when implementing and testing product changes and enhancements.
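To make the testing and lock-in points concrete, here is a small sketch of a business rule kept in application code instead of a stored procedure. The discount rule, the `orders` table and the function names are all hypothetical, and SQLite stands in for whatever database you run.

```python
import sqlite3


def order_total(unit_price_cents: int, quantity: int) -> int:
    """Hypothetical business rule kept in the application tier:
    10% off orders of 10 or more units. It can be unit tested with no database."""
    subtotal = unit_price_cents * quantity
    if quantity >= 10:
        subtotal = subtotal * 90 // 100
    return subtotal


def save_order(conn: sqlite3.Connection, customer_id: int,
               unit_price_cents: int, quantity: int) -> int:
    """The database only stores the result; it runs a plain relational insert."""
    total = order_total(unit_price_cents, quantity)
    conn.execute(
        "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
        (customer_id, total),
    )
    return total
```

Because the rule lives in `order_total`, swapping the underlying database means rewriting one insert statement, not migrating the logic.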

Per the AKF Scale Cube, we desire to separate dissimilar services - having stored procedures on the database means it cannot be split easily.

Need help migrating from hosted hardware to the cloud or migrating your installed software to a SaaS solution?  We have helped hundreds of companies from small startups to well-established Fortune 50 companies better architect, scale, and deliver their products.  We offer a host of services from technical due diligences, onsite workshops, and provide mentoring and interim staffing for your company.


Multi-Tenant Defined

June 11, 2018  |  Posted By: Marty Abbott

Of the many SaaS operating principles, perhaps one of the most misunderstood is the principle of tenancy.

Most people have a definition in their mind for the term “multi-tenant”.  Unfortunately, because the term has so many valid interpretations its usage can sometimes be confusing.  Does multi-tenant refer to the physical or logical implementation of our product?  What does multi-tenant mean when it comes to an implementation in a database?

This article first covers the goals of increasing tenancy within solutions, then delves into the various meanings of tenancy.

Multi-Tenant (Multitenant) Solutions and Cost

One of the primary reasons why companies that present products as a service strive for higher levels of tenancy is the cost reduction it affords the company in presenting a service.  With multiple customers sharing applications and infrastructure, system utilization goes up:  We get more value production out of each server that we use, or alternatively we get greater asset utilization.  Because most companies view the cost of serving customers as a “Cost of Goods Sold”, multitenant solutions have better gross margins than single-tenant solutions.  The X Axis of the figure below shows the effect of increasing tenancy on the cost of goods sold on a per customer basis:

On Prem vs ASP vs SaaS models and cost implications

Interestingly, multitenant solutions often “force” another SaaS principle to be true:  No more than 1 to 3 versions of software for the entire customer base.  This is especially true if the database is shared at a logical (row-level) basis (more on that later).  Lowering the number of versions of the product decreases the operating expense necessary to maintain multiple versions and therefore also increases operating margins.

Single Tenant, Multi-Tenant and All-Tenant

An important point to keep in mind is that “tenancy” occurs along a spectrum moving from single-tenant to all-tenant.  Multitenant is any solution where the number of tenants from a physical or logical perspective is greater than one, including all-tenant implementations.  As tenancy increases, Cost of Goods Sold (COGS in the figure above) decreases and Gross Margins increase.

The problem with All-Tenant solutions, while attractive from a cost perspective, is that they create a single failure domain (see https://akfpartners.com/growth-blog/fault-isolation), thereby decreasing overall availability.  When something goes poorly with our product, everything is offline.  For that reason, we differentiate between solutions that enable multi-tenancy for cost reasons and all-tenant solutions.

Multi-tenancy compared to single-tenant and all tenant

The Many Meanings and Implementations of Tenancy

Multitenant solutions can be implemented in many ways and at many tiers.

Physical and Logical

Physical multi-tenancy is having multiple customers share a number of servers.  This helps increase the overall utilization of these servers and therefore reduce costs of goods sold.  Customers need not share the application for a solution to be physically multitenant.  One could, for instance, run a webserver, application server or database per customer.  Many customers with concerns over data separation and privacy are fine with physical multitenancy as long as their data is logically separated.

Logical multi-tenancy is having multiple customers share the same application.  The same webserver instances, application server instances and database are used for any customer.  The situation becomes a bit murkier, however, when it comes to databases.

Different relational databases use different terms for similar implementations.  A SQLServer database, for instance, looks very much like an Oracle Schema.  Within databases, a solution can be logically multitenant by either implementing tenancy in a table (we call that row level multitenancy) or within a schema/database (we call that schema multitenancy).  In either case, a single instance of the relational database management system or software (RDBMS) is used, while customer transactions are separated by a customer id inside a table, or by database/schema id if separated as such.
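The difference between row level and schema multitenancy is easier to see side by side. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for a full RDBMS, with attached in-memory databases playing the role of per-tenant schemas; the table and tenant names are made up.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode; SQLite refuses
# to ATTACH a database while a transaction is open.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Row level multitenancy: one shared table, tenants separated by customer_id.
conn.execute("CREATE TABLE orders (customer_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "widget"), (2, "gadget"), (1, "gizmo")],
)

# Schema multitenancy: one schema per tenant inside the same database instance.
TENANT_SCHEMAS = ("tenant_1", "tenant_2")
for schema in TENANT_SCHEMAS:
    conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")
    conn.execute(f"CREATE TABLE {schema}.orders (item TEXT)")
conn.execute("INSERT INTO tenant_1.orders VALUES ('widget')")
conn.execute("INSERT INTO tenant_2.orders VALUES ('gadget')")


def row_level_orders(customer_id: int) -> list:
    """Every query must filter on the tenant column."""
    rows = conn.execute(
        "SELECT item FROM orders WHERE customer_id = ? ORDER BY item",
        (customer_id,),
    )
    return [row[0] for row in rows]


def schema_level_orders(schema: str) -> list:
    """The tenant is selected by schema name, checked against a known list."""
    if schema not in TENANT_SCHEMAS:
        raise ValueError(f"unknown tenant schema: {schema}")
    rows = conn.execute(f"SELECT item FROM {schema}.orders ORDER BY item")
    return [row[0] for row in rows]
```

In both cases a single database instance serves every tenant; only the separation boundary moves from a column value to a schema name.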

While physical multitenancy provides cost benefits, logical multitenancy often provides significantly greater cost benefits.  Because applications are shared, we need less system overhead to run an application for each customer and thus can get even greater throughput and efficiencies out of our physical or virtualized servers.

Depth of Multi-Tenancy

The diagram below helps to illustrate that every layer in our service architecture has an impact to multi-tenancy.  We can be physically or logically multi-tenant at the network layer, the web server layer, the application layer and the persistence or database layer.

The deeper into the stack our tenancy goes, the greater the beneficial impact (cost savings) to costs of goods sold and the higher our gross margins.

Review of tenancy options in the traditional deployment stack

The AKF Multi-Tenant Cube

To further the understanding of tenancy, we introduce the AKF Multi-Tenant Cube. 
Multi-tenancy and cost implications mapped by degree, mode and type of multi-tenancy

The X axis describes the “mode” of tenancy, moving from shared nothing, to physical, to logical.  As we progress from sharing nothing to sharing everything, utilization goes up and cost of goods sold goes down.

The Y axis describes the depth of tenancy from shared nothing, through network, web, app and finally persistence or database tier.  Again, as the depth of tenancy increases, so do Gross Margins.

The Z axis describes the degree of tenancy, or the number of tenants.  Higher levels of tenancy decrease costs of goods sold, but architecturally we never want a failure domain that encompasses all tenants. 

When running a XaaS (SaaS, etc) business, we are best off implementing logical multitenancy through every layer of our architecture.  While we want tenancy to be high per instance, we also do not want all tenants to be in a single implementation.

AKF Partners helps companies of all sizes achieve their availability, time to market, scalability, cost and business goals.


4 Landmines When Using Serverless Architecture

May 20, 2018  |  Posted By: Dave Berardi

Physical Bare Metal, Virtualization, Cloud Compute, Containers, and now Serverless in your SaaS? We are starting to hear more and more about Serverless computing. Sometimes you will hear it called function as a service. In this next iteration of Infrastructure-as-a-Service, users can execute a task or function without having to provision a server, virtual machine, or any other underlying resource. The word Serverless is a misnomer: provisioning of the underlying resources is abstracted away from the user, but those resources still exist underneath the covers. It’s just that Amazon, Microsoft, and Google manage it for you with their code. AWS Lambda, Azure Functions, and Google Cloud Functions are becoming more common in the architecture of a SaaS product. As technology leaders responsible for architectural decisions for scale and availability, we must understand its pros and cons and take the right actions to apply it.

Several advantages of serverless computing include:

• Software engineers can deploy and run code without having to manage any underlying infrastructure, effectively creating a No-Ops environment.
• Auto-scaling is easier and requires less orchestration as compared to a containerized environment running services.
• True On-Demand capacity – no orphaned containers or other resources that might be idling.
• They are cost effective IF we are running the right size workloads.

Disadvantages and potential landmines to watch out for:

• Landmine #1 - No control over the execution environment, meaning you are unable to isolate your operational environment. Compute and networking resources are virtualized with no visibility into either of them. Availability is in the hands of our cloud provider and uptime is not guaranteed.
• Landmine #2 - SLAs cannot guarantee uptime. Start-up time can take a second causing latency that might not be acceptable.
• Landmine #3 - It’s going to become much easier for engineers to create code, host it rapidly, and forget about it leading to unnecessary compute and additional attack vectors creating a security risk.
• Landmine #4 - You will create vendor lock-in with your cloud provider as you set up your event driven functions to trigger from other AWS or Azure Services or your own services running on compute instances.
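For readers who have not written one, a function in this model is little more than a handler. The sketch below follows the AWS Lambda Python convention of a `handler(event, context)` entry point; the `cold_start` flag and the `name` field in the event are hypothetical, and whether an execution environment is reused between calls is decided by the provider, not by this code.

```python
# Module-level code runs once per execution environment, which is where
# start-up latency (Landmine #2) comes from.
_cold = True


def handler(event, context):
    """Hypothetical function that reports whether this invocation was the
    first one served by the current execution environment."""
    global _cold
    was_cold = _cold
    _cold = False
    return {"cold_start": was_cold, "greeting": f"hello {event.get('name', 'world')}"}
```

Logging a flag like `cold_start` is one inexpensive way to see how often users pay the start-up penalty before deciding whether the latency is acceptable.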

AKF is often asked about our position on serverless computing. There are 4 key rules considering the advantages and the landmines that we outlined:

1) Gradually introduce it into your architecture and use it for the right use cases
2) Establish architectural principles that guide its use in your organization that will minimize availability impact for Serverless. You will tie your availability to the FaaS in your cloud provider.
3) Watch out for a false sense of security among your engineering teams. Understand how serverless works before you use it so that you can monitor it for performance and availability.
4) Manage how and what it’s used for - monitor it (e.g., AWS CloudWatch) to avoid neglect and misuse along with cost inefficiencies.

AWS, Azure, or Google Cloud Serverless platforms could provide an effective computing abstraction in your architecture if they are used for the right use cases, good monitoring is in place, and architectural principles are established.

AKF Partners has helped many companies create highly available and scalable systems that are designed to be monitored. Contact us for a free consultation.


Fault Isolation in Services Architectures

May 2, 2018  |  Posted By: AKF

Our post on the AKF Scale Cube made reference to a concept that we call “Fault Isolation” and sometimes “Swim lanes” or “Swim-laned Architectures”.  We sometimes also call “swim lanes” fault isolation zones or fault isolated architecture.


Fault Isolation Defined
A “Swim lane” or fault isolation zone is a failure domain.  A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of said boundary.  Think of this as the “blast radius” of failure meant to answer the question of “What gets impacted should any service fail?” The benefit of fault isolation is twofold:

1) Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced.  This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain.  Once something breaks, because the failure is limited in scope, it can be more rapidly identified and fixed.  Recovery time objectives (RTO) are subsequently decreased which increases overall availability.

2) Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform.  The “blast radius” of a failure is contained.  As such, and depending upon approach, only a portion of users or a portion of functionality of the product is affected.  This is akin to circuit breakers in your house - the breaker exists to limit the fault zone for any load that exceeds a limit imposed by the breaker.  Failure propagation is contained by the breaker popping and other devices are not affected. 

Architecting Fault Isolation
A fault isolated architecture is one in which each failure domain is completely isolated.  We use the term “swim lanes” to depict the separations. In order to achieve this, ideally there are no synchronous calls between swim lanes or failure domains made pursuant to a user request.  User initiated synchronous calls between failure domains are absolutely forbidden in this type of architecture as any user-initiated synchronous call between fault isolation zones, even with appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures across other domains.  Strictly speaking, you do not have a failure domain if that domain is connected via a synchronous call to any other service in another domain, to any service outside of the domain, or if the domain receives synchronous calls from other domains or services.  Again, “synchronous” is meant to identify a synchronous call (call, wait for a response) pursuant to any user request.

It is acceptable, but not advisable, to have asynchronous calls between domains and to have non-user initiated synchronous calls between domains (as in the case of a batch job collecting data for the purposes of reporting in another failure domain).  If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not cause port overloads on any services. Here is an interesting blog post about runaway scripts and their impact on Apache, PHP, and MySQL.
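A minimal sketch of what "timeouts and failure detection" can look like for a cross-domain call follows. The function name, the default limits and the exponential backoff are our assumptions for illustration; `operation` is any callable that accepts a timeout in seconds and passes it to the underlying client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_bounded_retries(
    operation: Callable[[float], T],
    timeout_s: float = 1.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(timeout_s) at most max_attempts times.

    The delay doubles after each failure so that a struggling service is not
    flooded with immediate retries.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation(timeout_s)
        except Exception as error:
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    # Give up instead of retrying forever; the caller decides what to do next.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The cap on attempts matters as much as the timeout: an unbounded retry loop is exactly the kind of runaway behavior that exhausts ports and connections in the called domain.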

As previously indicated, a swim lane should have all of its services located within the failure domain.  For instance, if database [read/writes] are necessary, the database with all appropriate information for that swim lane should exist within the same failure domain as all of the application and webservers necessary to perform the function or functions of the swim lane.  Furthermore, that database should not be used for other requests of service from other swim lanes.  Our rule is one production database on one host.

The figure below demonstrates the components of software and infrastructure that are typically fault isolated:
Fault Isolation in Micro-Services Architectures

Rarely are shared higher level network components isolated (e.g. border systems and core routers).
Sometimes, if practical, firewalls and load balancers are isolated.  These are especially the case under very high demand situations where a single pair of devices simply wouldn’t meet the demand.

The remainder of solutions are always isolated, with web-servers, top of rack switches (in non IaaS implementations), compute (app servers) and storage all being properly isolated.

Applying Fault Isolation with AKF’s Scale Cube
As we have indicated with our Scale Cube in the past, there are many ways in which to think about swim laned architectures.  Swim lanes can be isolated along the axes of the Scale Cube as shown below with AKF’s circuit breaker analogy to fault isolation. 

AKF Fault Isolation in the X-axis
Fault isolation in X-axis would mean replicating everything for high availability - and performing the replication asynchronously and in an eventually consistent (rather than a consistent) fashion.  For example, when a data center fails the fault will be isolated to the one failed data center or multiple availability zones. This is common with traditional disaster recovery approaches, though we do not often advise it as there are better and more cost effective solutions for recovering from disaster.

AKF Fault Isolation in the Y-axis
Fault Isolation in the Y-axis can be thought of in terms of a separation of services e.g. “login” and “shopping cart” (two separate swim lanes) each having the web and app servers as well as all data stores located within the swim lane and answering only to systems within that swim lane.  Each portion of a page is delivered from a separate service, reducing the blast radius of a potential fault to its swim lane.

While purposely not legible (fuzzy) the fake example above shows different components of a fictional business account from a fictional bank.  Components of the page are separated with one component showing a summary, another component displaying more detailed information and still other components showing dynamic or static links - each derived from properly isolated services.

AKF Fault Isolation in the Z-axis
Another approach would be to perform a separation of your customer base or a separation of your order numbers or product catalog.  Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z axis swim lane along customer, order number or product id lines.  More beneficially, if we are interested in fastest possible response times to customers, we may split along geographic boundaries.  We may have data centers (or IaaS regions) serving the West and East Coasts of the US respectively, the “Fly-Over States” of the US, and regions serving the EU, Canada, Asia, etc.  Besides contributing to faster perceived customer response times, these implementations can also help ensure we are compliant with data sovereignty laws unique to different countries or even states within the US.
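The modulus split mentioned above, and the way it bounds the blast radius, can be sketched as follows. The function names and the lane count are hypothetical.

```python
from typing import Iterable, List


def swim_lane_for(customer_id: int, lane_count: int) -> int:
    """Indiscriminate Z axis split: assign a customer to a lane by modulus of id."""
    return customer_id % lane_count


def affected_customers(customer_ids: Iterable[int], failed_lane: int,
                       lane_count: int) -> List[int]:
    """Only customers assigned to the failed lane are impacted by its failure."""
    return [c for c in customer_ids if swim_lane_for(c, lane_count) == failed_lane]
```

With four lanes and customers 1 through 8, a failure of lane 0 touches only customers 4 and 8; the other six never notice, provided no lane makes synchronous calls into another.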


Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform.  AKF has helped clients achieve high availability through fault isolation.  Contact us to see how we can help you achieve the same fault tolerance.

AKF Partners helps companies create highly available, fault isolated solutions.  Send us a note - we’d love to help you!


The Scale Cube

April 25, 2018  |  Posted By: Robin McGlothin

The Scale Cube - Architecting for Scale

The Scale Cube is a model for building resilient and scalable architectures using patterns and practices that apply broadly to any industry and all solutions. AKF Partners invented the Scale Cube in 2007, publishing it on our blog that year and subsequently in our first book, The Art of Scalability, and our second book, Scalability Rules.


The Scale Cube (sometimes known as the “AKF Scale Cube” or “AKF Cube”) comprises three axes: X-axis, Y-axis, and Z-axis.

    • Horizontal Duplication and Cloning (X-Axis)
    • Functional Decomposition and Segmentation - Microservices (Y-Axis)
    • Horizontal Data Partitioning - Shards (Z-Axis)

These axes and their meanings are depicted below in Figure 1.

AKF Scale Cube - X, Y and Z Axes Explained

                                    Figure 1

The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed and when existing systems are being improved. 

Most internet enabled products start their life as a single application running on an appserver or appserver/webserver combination and likely communicate with a database. This monolithic design will work fine for relatively small applications that receive low levels of client traffic. However, this monolithic architecture becomes a kiss of death for complex applications.

A large monolithic application can be difficult for developers to understand and maintain. It is also an obstacle to frequent deployments. To deploy changes to one application component you need to build and deploy the entire monolith, which can be complex, risky, time consuming, require the coordination of many developers and result in long test cycles.

Consequently, you are often stuck with the technology choices that you made at the start of the project. In other words, the monolithic architecture doesn’t scale to support large, long-lived applications.

Figure 2, below, displays how the cube may be deployed in a modern architecture decomposing services (sometimes called micro-services architecture), cloning services and data sources and segmenting similar objects like customers into “pods”.

AKF Scale Cube - Examples of X, Y and Z axis splits

                                    Figure 2



Scaling Solutions with the X Axis of the Scale Cube

The most commonly used approach to scaling a solution is to run multiple identical copies of the application behind a load balancer, also known as X-axis scaling. That’s a great way of improving the capacity and the availability of an application.

When using X-axis scaling each server runs an identical copy of the service (if disaggregated) or monolith. One benefit of the X axis is that it is typically intellectually easy to implement and it scales well from a transaction perspective.  Impediments to implementing the X axis include heavy session related information which is often difficult to distribute or requires persistence to servers – both of which can cause availability and scalability problems.  A comparative drawback of the X axis is that, while it is intellectually easy to implement, data sets have to be replicated in their entirety, which increases operational costs.  Further, caching tends to degrade at many levels as the size of data increases with transaction volumes.  Finally, the X axis doesn’t engender higher levels of organizational scale.
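In code terms, X-axis scaling amounts to spreading requests across interchangeable clones. The toy round-robin balancer below is only a sketch: the class, the clone names and the string responses are invented for illustration, and it assumes the clones hold no session state of their own.

```python
from itertools import cycle
from typing import Callable, Sequence


def make_clone(name: str) -> Callable[[str], str]:
    """Every clone runs identical code; only its name differs."""
    def handle(request: str) -> str:
        return f"{name} handled {request}"
    return handle


class RoundRobinBalancer:
    """Send each request to the next clone in turn."""

    def __init__(self, clones: Sequence[Callable[[str], str]]) -> None:
        if not clones:
            raise ValueError("at least one clone is required")
        self._next_clone = cycle(clones)

    def route(self, request: str) -> str:
        return next(self._next_clone)(request)
```

Because any clone can serve any request, adding capacity is a matter of adding another entry to the list, which is why session state pinned to a single server is such an impediment.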

Figure 3 explains the pros and cons of X axis scalability, and walks through a traditional 3 tier architecture to explain how it is implemented.

AKF Scale Cube - X Axis Splits Pros and Cons

                                    Figure 3



Scaling Solutions with the Y Axis of the Scale Cube

Y-axis scaling (think services oriented architecture, micro services or functional decomposition of a monolith) focuses on separating services and data along noun or verb boundaries.  These splits are “dissimilar” from each other.  Examples in commerce solutions may be splitting search from browse, checkout from add-to-cart, login from account status, etc.  In implementing splits,  Y-axis scaling splits a monolithic application into a set of services. Each service implements a set of related functionalities such as order management, customer management, inventory, etc.  Further, each service should have its own, non-shared data to ensure high availability and fault isolation.  Y axis scaling shares the benefit of increasing transaction scalability with all the axes of the cube.

Further, because the Y axis allows segmentation of teams and ownership of code and data, organizational scalability is increased.  Cache hit ratios should increase as data and the services are appropriately segmented and similarly sized memory spaces can be allocated to smaller data sets accessed by comparatively fewer transactions.  Operational cost often is reduced as systems can be sized down to commodity servers or smaller IaaS instances can be employed.

Figure 4 explains the pros and cons of Y axis scalability and shows a fault-isolated example of services each of which has its own data store for the purposes of fault-isolation.

AKF Scale Cube - Y Axis Services Splits Pros and Cons

                                    Figure 4



Scaling Solutions with the Z Axis of the Scale Cube

Whereas the Y axis addresses the splitting of dissimilar things (often along noun or verb boundaries), the Z-axis addresses segmentation of “similar” things.  Examples may include splitting customers along an unbiased modulus of customer_id, or along a somewhat biased (but beneficial for response time) geographic boundary.  Product catalogs may be split by SKU, and content may be split by content_id.  Z-axis scaling, like all of the axes, improves the solution’s transactional scalability and, if fault isolated, its availability. Because the software deployed to servers is essentially the same in each Z axis shard (but the data is distinct) there is no increase in organizational scalability.  Cache hit rates often go up with smaller data sets, and operational costs generally go down as commodity servers or smaller IaaS instances can be used.
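The two splits mentioned above, a modulus of customer_id and a geographic boundary, can be combined. The sketch below is illustrative only: the pod names and the `POD_BY_REGION` table are assumptions, not part of any AKF tooling.

```python
def shard_for_customer(customer_id: int, shard_count: int) -> int:
    """Unbiased split: the modulus spreads customer ids evenly across shards."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return customer_id % shard_count


# Hypothetical pods: a geographic split first, then a modulus within the region.
POD_BY_REGION = {
    "US": ["us-pod-1", "us-pod-2"],
    "EU": ["eu-pod-1", "eu-pod-2"],
}


def pod_for_customer(customer_id: int, region: str) -> str:
    pods = POD_BY_REGION[region]
    return pods[shard_for_customer(customer_id, len(pods))]


print(shard_for_customer(1001, 4))
print(pod_for_customer(1001, "EU"))
print(pod_for_customer(42, "US"))
```

Every pod runs the same software; only the slice of data it holds differs, which is what distinguishes a Z-axis split from a Y-axis one.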

Figure 5 explains the pros and cons of Z axis scalability and displays a fault-isolated pod structure with 2 unique customer pods in the US, and 2 within the EU.  Note that an additional benefit of Z axis scale is the ability to segment pods to be consistent with local privacy laws such as the EU’s GDPR.

AKF Scale Cube - Z Axis Splits Pros and Cons

                                    Figure 5


Summary

Like Goldilocks and the three bears, the goal of decomposition is not to have services that are too small, or services that are too large but to ensure that the system is “just right” along the dimensions of scale, cost, availability, time to market and response times.


AKF Partners has helped hundreds of companies, big and small, employ the AKF Scale Cube to scale their technology product solutions.  We developed the cube in 2007 to help clients scale their products and have been using it since to help some of the greatest online brands of all time thrive and succeed.  For those interested in “time travel”, here are the original 2 posts on the cube from 2007:  Application Cube, Database Cube

Subscribe to the AKF Newsletter

Contact Us

Microservices for Breadth, Libraries for Depth

April 10, 2018  |  Posted By: Marty Abbott

The decomposition of monoliths into services, or alternatively the development of new products in a services-oriented fashion (oftentimes called microservices), is one of the greatest architectural movements of the last decade.  The benefits of a services (alternatively microservices or micro-services) approach are clear:

  • Independent deployment, decreasing time to market and decreasing time to value realization– especially when continuous delivery is employed.
  • Team velocity and ownership (informed by Conway’s Law).
  • Increased fault isolation – but only when properly deployed (see below).
  • Individual scalability – and the decreasing cost of operations that entails when properly architected.
  • Freedom of implementation and technology choices – choosing the best solution for each service rather than subjecting services to the lowest common denominator implementation.

Unfortunately, without proper architectural oversight and planning, improperly architected services can also result in:

  • Lower overall availability, especially when those services are deployed in one of a handful of microservice anti-patterns like the mesh, services in depth (aka the Christmas Tree Light String) and the Fuse.
  • Higher (longer) response times to end customers.
  • Complicated fault isolation and troubleshooting that increases average recovery time for failures.
  • Service bloat:  Too many services to comprehend (see our service sizing post)

The following are patterns companies should avoid (anti-patterns) when developing services or microservices architectures:


The Mesh

Mesh architectures, where individual services both “fan out” and “share” subsequent services result in the lowest possible availability. 


Deep Series

Services that are strung together in long (deep) call trees suffer from low availability, calculated as the product of each service’s availability, and from slow page response times.


The Fuse

The Fuse is a much smaller anti-pattern than “The Mesh”.  In “The Fuse”, 2 distinct services (A and B) rely on service C.  Should service C become slow or unavailable, both service A and B suffer.


Architecture Principle:  Services – Broad, But Never Deep

Avoiding these service anti-patterns protects against a lack of fault isolation, where slowness and failures propagate along a synchronous path.  One service fails, and the others relying upon that service also suffer.

Avoiding them also guards against longer latency in call streams.  While network calls tend to be minimal relative to total customer response times, many solutions (e.g. payment solutions) need to respond as quickly as possible and service calls slow that down.

Finally, avoiding these patterns helps protect against difficult-to-diagnose failures.  The Xmas Tree pattern name is chosen because of the difficulty in finding the “failed bulb” in old tree lights wired in series.  Similarly, imagine attempting to find the fault in “The Mesh”.  The time necessary to find faults negatively affects service restoration time and therefore availability.

As such, we suggest a principle that services should never be deep but instead should be deployed in breadth along product offering boundaries defined by nouns (resources like “customer” or “sales”) or verbs (services like “search” or “add to cart”).  We often call this approach “slices instead of layers”.

How then do we accomplish the separation of software for team ownership and time to market where a single service would otherwise be too large or unwieldy?

Old School – Libraries!

When you need service-like segmentation in a deep call tree but can’t suffer the availability impact and latency associated with multiple calls, look to libraries.  Libraries eliminate the network-associated latency of a service call.  In the case of both The Fuse and The Mesh, libraries also eliminate the shared availability constraints.  Unfortunately, we still have the multiplicative effect of failure of the Xmas Tree, but overall it is a faster pattern.

“But My Teams Can’t Release Separately!”

Sure they can – they just have to change how they think about releasing.  If you need immediate effect from what you release and don’t want to release the calling services with libraries compiled or linked, consider performing releases with shared objects or dynamically loadable libraries.  While these require restarts of the calling service, simple automation will help you keep from having an outage for the purpose of deploying software.


AKF Partners helps companies architect highly available, highly scalable microservice architecture products.  We apply our aggregate experience, proprietary models, patterns, and anti-patterns to help ensure your products can meet your company’s scale and availability goals.  Contact us today - we can help!


SaaS Migration Challenges

March 12, 2018  |  Posted By: Dave Swenson

More and more companies are waking up from the 20th century, realizing that their on-premise, packaged, waterfall paradigms no longer play in today’s SaaS, agile world. SaaS (Software as a Service) has taken over, and for good reason. Companies (and investors) long for the higher valuation and increased margins that SaaS’ economies of scale provide. Many of these same companies realize that in order to fully benefit from a SaaS model, they need to release far more frequently, enhancing their products through frequent iterative cycles rather than massive upgrades occurring only 4 times a year. So, they not only perform a ‘lift and shift’ into the cloud, they also move to an Agile PDLC. Customers, tired of incurring on-premise IT costs and risks, are also pushing their software vendors towards SaaS.

But, what many of the companies migrating to SaaS don’t realize is that migrating to SaaS is not just a technology exercise.  Successful SaaS migrations require a ‘reboot’ of the entire company. Certainly, the technology organization will be most affected, but almost every department in a company will need to change. Sales teams need to pitch the product differently, selling a leased service vs. a purchased product, and must learn to address customers’ typical concerns around security. The role of professional services teams in SaaS drastically changes, and in most cases, shrinks. Customer support personnel should have far greater insight into reported problems. Product management in a SaaS world requires small, nimble enhancements vs. massive, ‘big-bang’ upgrades. Your marketing organization will potentially need to target a different type of customer for your initial SaaS releases - leveraging the Technology Adoption Lifecycle to identify early adopters of your product in order to inform a small initial release (Minimum Viable Product).

It is important to recognize the risks that will shift from your customers to you. In an on-premise (“on-prem”) product, your customer carries the burden of capacity planning, security, availability, disaster recovery. SaaS companies sell a service (we like to say an outcome), not just a bundle of software.  That service represents a shift of the risks once held by a customer to the company provisioning the service.  In most cases, understanding and properly addressing these risks are new undertakings for the company in question and not something for which they have the proper mindset or skills to be successful.

This company-wide reboot can certainly be a daunting challenge, but if approached carefully and honestly, addressing key questions up front, communicating, educating, and transparently addressing likely organizational and personnel changes along the way, it is an accomplishment that transforms, even reignites, a company.

This is the first in a series of articles that captures AKF’s observations and first-hand experiences in guiding companies through this process.


Don’t treat this as a simple rewrite of your existing product - answer these questions first…

Any company about to launch into a SaaS migration should first take a long, hard look at their current product, determining what out of the legacy product is not worth carrying forward. Is all of that existing functionality really being used, and still relevant? Prior to any move towards SaaS, the following questions and issues need to be addressed:

Customization or Configuration?
SaaS efficiencies come from many angles, but certainly one of those is having a single codebase for all customers. If your product today is highly customized, where code has been written and is in use for specific customers, you’ve got a tough question to address. Most product variances can likely be handled through configuration, a data-driven mechanism that enables/disables or otherwise shapes functionality for each customer. No customer-specific code from the legacy product should be carried forward unless it is expected to be used by multiple clients. Note that this shift has implications on how a sales force promotes the product (they can no longer promise to build whatever a potential customer wants, but must sell the current, existing functionality) as well as professional services (no customizations means less work for them).
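One way to picture configuration over customization is a single codebase that reads per-customer settings as data. The tenant names, setting keys and `export_report` function below are hypothetical and serve only to illustrate the mechanism.

```python
# Product-wide defaults shared by every customer (hypothetical settings).
DEFAULTS = {"bulk_export": False, "max_users": 50, "theme": "standard"}

# Per-customer differences live in data, not in customer-specific code.
TENANT_OVERRIDES = {
    "acme": {"bulk_export": True, "max_users": 500},
    "globex": {"theme": "dark"},
}


def settings_for(tenant_id: str) -> dict:
    """Merge the shared defaults with any overrides for this tenant."""
    return {**DEFAULTS, **TENANT_OVERRIDES.get(tenant_id, {})}


def export_report(tenant_id: str) -> str:
    # The same code path serves every tenant; configuration shapes behavior.
    if settings_for(tenant_id)["bulk_export"]:
        return "bulk export started"
    return "bulk export is not enabled for this account"


print(export_report("acme"))
print(export_report("globex"))
```

Enabling a feature for another customer then becomes a data change rather than a code change, which keeps the single codebase intact.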

Single/Multi/All-tenancy?
Many customers, even those who accept the improved security posture a cloud-hosted product provides over their own on-premise infrastructure, absolutely freak when they hear that their data will coexist with other customers’ data in a single multi-tenant instance, no matter what access management mechanisms exist. Multi-tenancy is another key to achieving economies of scale that bring greater SaaS efficiencies. Don’t let go of it easily, but if you must, price extra for it.

Who owns the data?
Many products focus only on the transactional set of functionality, leaving the analytics side to their customers. In an on-premise scenario, where the data resides in the customers’ facilities, ownership of the data is clear. Customers are free to slice & dice the data as they please. When that data is hosted, particularly in a multi-tenant scenario where multiple customers’ data lives in the same database, direct customer access presents significant challenges. Beyond the obvious related security issues is the need to keep your customers abreast of the more frequent updates that occur with SaaS product iterations. The decision is whether you replicate customer data into read-only instances, provide bulk export into their own hosted databases, or build analytics into your product.

All of these have costs - ensure you’re passing those on to your customers who need this functionality.

May I Upgrade Now?
Today, do your customers require permission for you to upgrade their installation? You’ll need to change that behavior to realize another SaaS efficiency - supporting as few versions as possible. Ideally, you’ll typically only support a single version (other than during deployment). If your customers need to ‘bless’ a release before migrating on to it, you’re doing it wrong. Your releases should be small, incremental enhancements, potentially even reaching continuous deployment. Therefore, the changes should be far easier to accept and learn than the big-bang, huge upgrades of the past. If absolutely necessary, create a sandbox for customers to access new releases, but be prepared to deal with the potentially unwanted, non-representative feedback from the select few who try it out in that sandbox.

Wait? Who Are We Targeting?
All of the questions above lead to this fundamental issue: Are tomorrow’s SaaS customers the same as today’s? The answer? Not necessarily. First, in order to migrate existing customers on to your bright, shiny new SaaS platform, you’ll need to have functional parity with the legacy product. Reaching that parity will take significant effort and lead to a big-bang approach. Instead, pick a subset or an MVP of existing functionality, and find new customers who will be satisfied with that. Then, after proving out the SaaS architecture and related processes, gradually migrate more and more functionality, and once functional parity is close, move existing customers on to your SaaS platform.

To find those new customers interested in placing their bets on your initial SaaS MVP, you’ll need to shift your current focus on the right side of the Technology Adoption Lifecycle (TALC) to the left - from your current ‘Late Majority’ or ‘Laggards’ to ‘Early Adopters’ or ‘Early Majority’. Ideally, those customers on the left side of the TALC will be slightly more forgiving of the ‘learnings’ you’ll face along the way, as well as prove to be far more valuable partners with you as you further enhance your MVP.

The key is to think out of the existing box your customers are in, to reset your TALC targeting and to consider a new breed of customer, one that doesn’t need all that you’ve built, is willing to be an early adopter, and will be a cooperative partner throughout the process.


Our next article on SaaS migration will touch on organizational approaches, particularly during the build-out of the SaaS product, and the paradigm shifts your product and engineering teams need to embrace in order to be successful.

AKF has led many companies on their journey to SaaS, often getting called in as that journey has been derailed. We’ve seen the many potholes and pitfalls and have learned how to avoid them. Let us help you move your product into the 21st century.  See our SaaS Migration service


The Top 20 Technology Blunders

January 3, 2018  |  Posted By: AKF

One of the most common questions we get is “What are the most common failures you see tech and product teams make?”.  To answer that question we queried our database consisting of 11 years of anonymous client recommendations.  Here are the top 20 most repeated failures and recommendations:


1) Failing to design for rollback

If you are developing a SaaS platform and you can only make one change to your current process make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad release in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward” for you to agree this is of the utmost importance. The one thing that is most likely to give you an opportunity to find other work (i.e. “get fired”) is to roll a product that destroys your business. In other words, if you are new to your job DO THIS BEFORE ANYTHING ELSE; if you have been in your job for a while and have not done this DO THIS TOMORROW.

2) Confusing product release with product success

Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives like a release increasing signups by 10%, or increasing checkouts by 15% or increasing the average sale price of all checkouts by 12% or increasing click-through-rates by 22%. See #10 below on incenting a culture of excellence. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholders wealthy.

3) Insular product development/engineering

How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised about some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Organize around your outcomes and “what you produce” in cross functional teams rather than around activities and “how you work”.

4) Over engineering the solution

One of our favorite company mottos is “simple solutions to complex problems”. The simpler the solution, the lower the cost and the faster the time to market. If you get blank stares from peers or within your organization when you explain a design do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in resolving the complexity.

5) Allowing history to repeat itself

Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize the value of the team. In the operations world, a failure to correlate past site incidents and find thematically related root causes is a guarantee to continue to fight the same fires over and over. The best and easiest way to improve our future performance is to track our past failures, group them into groups of causation and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues and improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.

6) Scaling through third parties

Every vendor has a quick fix for your scale issues. If you are a hyper growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. This is not to say that after you design your system to scale horizontally that you will not rely upon some technology to help you; rather, once you define how you can horizontally scale you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases (and NoSQL solutions) provide for multiple types of native replication to keep hosts in synch.

7) Relying on QA to find your mistakes

You cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem and that is where the problem should be treated. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering. Consider implementing test driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. All of this is not to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering.

8) Revolutionary or “big bang” fixes

In our experiences, complete re-writes or re-architecture efforts end up somewhere on the spectrum of not returning the desired ROI to complete and disastrous failures. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cutover to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.

9) The Multiplicative Effect of Failure

Every time you have one service call another service in a synchronous fashion you are lowering your theoretical availability. If each of your services is designed to be 99.999% available, where a service is a database, application server, application, webserver, etc., then the product of the availabilities of all of the service calls is your theoretical availability. 5 calls is (.99999)^5 or 99.995% availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.
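The arithmetic can be checked with a few lines of code. This is a minimal sketch that assumes the services fail independently; the function names and the downtime conversion are illustrative additions, not part of the original recommendation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(availabilities):
    """Availability of a synchronous chain is the product of its links."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a 365-day year."""
    return (1 - availability) * MINUTES_PER_YEAR


single = 0.99999
chain = serial_availability([single] * 5)

print(f"one service:   {single:.5%}")
print(f"five in series: {chain:.5%}")
print(f"downtime, one service:    {downtime_minutes_per_year(single):.1f} min/year")
print(f"downtime, five in series: {downtime_minutes_per_year(chain):.1f} min/year")
```

Under these assumptions, five chained calls turn roughly 5 minutes of expected downtime per year into roughly 26, and longer chains or less available services make the effect much worse.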

10) Failing to create and incent a culture of excellence

Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth.

11) Under-engineering for scale

The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now. That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.

12) “Not Built Here” Culture

We see this all the time. You may even have agreed with point (6) above because you have a “we are the smartest people in the world and we must build it ourselves” culture. The point on relying upon third parties to scale was not meant as an excuse to build everything yourselves. The real point to be made is that you have to focus on your core competencies and not dilute your engineering efforts with things that other companies or open source providers can do better than you. Unless you are building databases as a business, you are probably not the best database builder. And if you are not the best database builder, you have no business building your own databases for your SaaS platform. Focus on what you should be the best at: building functionality that maximizes your shareholder wealth and scaling your platform. Let other companies focus on the other things you need like routers, operating systems, application servers, databases, firewalls, load balancers and the like.

13) A new PDLC will fix my problems

Too often CTOs see repeated problems in their product development life cycles such as missing dates or dissatisfied customers and blame the PDLC itself.

The real problem, regardless of the lifecycle you use, is likely one of commitment and measurement. For instance in most Agile lifecycles there needs to be consistent involvement from the business or product owner. A lack of involvement leads to misunderstandings and delayed products. Another very common problem is an incomplete understanding or training on the existing PDLC. Everyone in the organization should have a working knowledge of the entire process and how their roles fit within it. Most often, the biggest problem within a PDLC is the lack of progress measurement to help understand likely dates and the lack of an appropriate “product discovery” phase to meet customer needs.

14) We cannot hire great people quickly

Often when growing an engineering team quickly the engineering managers will push back on hiring plans and state that they cannot possibly find, interview, and hire engineers that meet their high standards. We agree that hiring great people takes time and hiring decisions are some of the most important decisions managers can make. A poor hiring decision takes a lot of energy and time to fix. However, there are lots of ways to streamline the hiring process in order to recruit, interview, and make offers very quickly. A useful idea that we have seen work well in the past is interview days, where potential candidates are all invited on the same day. This should be no more than 2 - 3 weeks out from the initial phone screen, so having an interview day per month is a great way to get most of your interviewing in a single day. Because you optimize the interview process people are much more efficient and it is much less disruptive to the daily work that needs to get done the rest of the month. Post interview discussions and hiring decisions should all be made that same day so that candidates get offers or letters of regret quickly; this will increase the likelihood of offers being accepted or make a professional impression on those not getting offers. The key is to start with the right answer that “there is a way to hire great people quickly” and the myriad of ways to make it happen will be generated by a motivated leadership team.

15) It is a SPOF (Single Point of Failure) but we can recover it onto another host quickly

A SPOF is a SPOF and even if the impact to the customer is low it still takes time away from other work to fix right away in the event of a failure. And there will be a failure…because that is what hardware and software do: they work for a long time and then eventually they fail! As you should know by now, it will fail at the most inconvenient time. It will fail when you have just repurposed the host that you were saving for it or it will fail while you are releasing code. Plan for the worst case and have it run on two hosts (we actually recommend to always deploy in pools of three or more hosts) so that when it does fail you can fix it when it is most convenient for you.
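To see why a pool beats a single host, here is a rough illustration. It assumes hosts fail independently and that any one surviving host can carry the load, and the 99% per-host figure is an arbitrary example value rather than a measured one.

```python
def pool_availability(host_availability: float, hosts: int) -> float:
    """Probability that at least one of `hosts` independent hosts is up."""
    if hosts < 1:
        raise ValueError("a pool needs at least one host")
    # The pool is down only when every host is down at the same time.
    return 1 - (1 - host_availability) ** hosts


# 0.99 is an assumed per-host availability, chosen only for illustration.
for n in (1, 2, 3):
    print(f"{n} host(s): {pool_availability(0.99, n):.6f}")
```

In practice failures are often correlated (shared power, shared deploys), so real gains are smaller than this idealized calculation suggests, but the direction is the same: the failed host can be repaired at a convenient time instead of during an outage.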

16) No Business Continuity plan

No one expects a disaster but they happen and if you cannot keep up normal operations of the business you will lose revenue and customers that you might never get back. Disasters can be huge like Hurricane Katrina, where it takes weeks or months to relocate and start the business back up in a new location. Disasters can also be small like a winter snow storm that keeps everyone at home for two days or a HAZMAT spill near your office that keeps employees from coming to work. A solid business continuity plan is something that is thought through ahead of time, before you need it, and explains to everyone how they will operate in the event of an emergency. Perhaps your satellite office will pick up customer questions or your tech team will open up an IRC channel to centralize communication for everyone capable of working remotely. Do you have enough remote connections through your VPN server to allow for remote work? Spend the time now to think through what and how you will operate in the event of a major or minor disruption of your business operations and document the steps necessary for recovery.

17) No Disaster Recovery Plan

Even worse, in our opinion, than not having a BC plan is not having a disaster recovery plan. If your company is a SaaS based company, the site and services provided are the company’s sole source of revenue. Moreover, with a SaaS company, you hold all the data for your customers that allow them to operate. When you are down they are more than likely seriously impaired in attempting to conduct their own business. When your collocation facility has a power outage that takes you completely down, think 365 Main datacenter in San Francisco, how many customers of yours will leave and never return? Our preference is to provide your own disaster recovery through multiple collocation facilities but if that is not yet technically feasible or in the budget, at a minimum you need your code, executables, configurations, loads, and data offsite and an agreement in place for both collocation services as well as hosts. Lots of vendors offer such packages and they should be thought of as necessary business insurance.

18) No Product Management team or person

In a similar vein to #13 above, there needs to be someone or a team of people in the organization who have responsibility for the product lines. They need to have authority to make decisions about what features get added, which get delayed, and which get deprecated (yes, we know, nothing ever gets deprecated but we can always hope!). Ideally these people have ownership of business goals (see #10) so they feel the pressure to make great business decisions.

19) It is okay to bring the site down to roll code

Just because you call it scheduled maintenance does not mean that it does not count against your uptime. While some of your customers might be willing to endure the frustration of having the site down when they want to access it in order to get some new features, most care much more about the site being available when they want it. They are on the site because the existing features serve some purpose for them; they are not there in the hopes that you will roll out a certain feature that they have been waiting on. They might want new features, but they rely on existing features. There are ways to roll code, even with database changes, without bringing the site down. It is important to put these techniques and processes in place so that you plan for 100% availability instead of planning for much less because of planned down time.

20) Firewalls, Firewalls, Everywhere!

We often see technology teams that have put all public-facing services behind firewalls, and many go so far as to put firewalls between every tier of the application. Security is important because there are always people trying to do malicious things to your site, whether through directed attacks or random scripts port scanning your site. However, security needs to be balanced against the increased cost and the degradation in performance. It has been our experience that too often tech teams throw up firewalls instead of doing the real analysis to determine how they can mitigate risk in other ways, such as through the use of ACLs and LAN segmentation. You as the CTO ultimately have to decide on the right balance of risks and benefits for your site.

Whatever you do, don’t make the mistakes above! AKF Partners helps companies avoid costly product and technology mistakes - and we’ve seen most of them.  Give us a call or shoot us an email.  We’d love to help you achieve the success you desire.

Hosting Lessons from Harvey and Irma

September 19, 2017  |  Posted By: Greg Fennewald

Everyone was saddened to see the horrific destruction the storms caused in Houston and Florida, including deaths and extensive property damage. It seems reasonable that the impact of these hurricanes was lessened by advance notice and preparation – stockpiling supplies, evacuating the highest risk areas, and staging response resources to assist with recovery and rebuilding.

Data centers operate every day with a similar preparation mindset: diesel generators to provide power should the utility fail, batteries to keep servers running during a transition, potentially stored water or a well to replace municipal water service for cooling systems, and food and water for personnel unable to leave the location.

What happens when a “prepared” location such as a data center encounters a hurricane with strong winds, heavy rain, and extensive flooding? In some cases, the data center survives without impact, although there certainly will be outages and failures. Examples of data centers surviving Harvey in good shape can be seen here, while accounts of the service impacts caused by Hurricane Sandy can be seen here.

Data Center Points of Failure

Let’s examine what may enable a data center to survive without functional impact. Extensive risk investigation goes into site selection for data centers. Data centers are expensive to build with costs measured in the tens or even hundreds of millions of dollars. The potential business impact of a failure can be costly with liquidated damage clauses in hosting contracts. These factors lead to data centers being located outside of flood plains, away from hazardous material routes, and stoutly constructed to endure storm winds likely in the region.

Losing utility power is regarded as a “when” not an “if” in the data center industry (be that an outage or a planned maintenance activity), and diesel generators are a common solution, often with 24 hours or more of fuel on hand and multiple replenishment contracts. Data centers can survive for days or weeks without utility power, and in some cases for months. How could flooding impact power? The service entrance for a data center, where the utility power is routed, is often buried underground. Utility power is likely to be lost during flooding, either from flood damage or from an intentional shutdown of the local grid to prevent damage. A data center would operate on generator power if the data center itself is not flooded, although fuel replenishment is unlikely while the flooding lasts. If there are two feet of water in the main electrical room(s), the data center is going dark.

Many large data centers rely on evaporating water to cool the servers they host. Evaporative cooling is generally more energy efficient than other options, but introduces an additional risk to operations – water supply. In many locations, municipal water pressure is lost during an extended power outage. Data centers can mitigate this risk by using water storage tanks or water wells onsite. As with diesel generators, this lets the data center operate normally for hours or days without municipal water. Suppose a data center is outside the flood plain, able to operate without utility power or municipal water for hours or days, and structurally strong enough to handle the winds of a major storm – is there any other risk to mitigate? Network connectivity and bandwidth.

Most data centers need to communicate with other data centers to fulfill their OLAP or OLTP purpose. Without connectivity, services are not available. Stored data should remain intact, but it grows increasingly stale, and transactions and traffic stop. Like utility power, network connections are usually buried. With distance and geographic limitations involved, network pathways may get flooded, as may the facilities that aggregate and transmit the data. Telecom facilities generally have generators and other availability measures, but can be forced into less advantageous locations and may have a shorter runtime standard than a data center.

Data centers that are serious about availability generally have carrier diversity and physical pathway diversity to mitigate carrier outages and “backhoe fades”. This may help in the event of widespread flooding as well. The reality is a data center without connectivity is generally useless. All the risk mitigation going into structural design, power and cooling redundancy, and fire protection is moot if connectivity fails.

Preparing for the Inevitable

The best way to mitigate these risks is to not rely on a single data center location. One is none and two is one. Owned, colo, managed hosting, or cloud – be able to survive the loss of a single location. The RTO and RPO of the business will guide the choice of active – active, hot – cold, or data backup with an elastic compute response plan. Hurricanes can cause regional impact, such as Irma disrupting most of Florida. In years past, many companies decided to have two data centers within 20 miles of each other to support synchronous database replication, for example a primary site in one borough of New York City and the DR site in a different borough. Replication options and database management techniques have advanced sufficiently to allow far greater dispersion today. Avoid a regionally impacting event by choosing data centers in diverse regions.

Operating from 3 locations can be cheaper than 2, and can also improve customer satisfaction with reduced response times produced by serving customers from the nearest location. See Rule 12 in Scalability Rules. The ability to operate from multiple locations also enables a choice to adjust the redundancy of those locations. A combination of Tier II and III locations may be a more economical choice than a pair of Tier IV locations.
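
One common way to see why three locations can cost less than two is to count the capacity each site must hold in reserve. The sketch below is our own illustration, not the author's full cost accounting. It assumes identical active sites that share load evenly, assumes the business must still serve peak load after losing any single site, and ignores fixed per-site costs such as facilities and network. The function name is hypothetical.

```python
def total_capacity_needed(sites: int, peak_load: float = 1.0) -> float:
    """Total capacity to provision across `sites` so that peak load
    can still be served after any one site is lost."""
    if sites < 2:
        raise ValueError("need at least two sites to survive losing one")
    # The surviving (sites - 1) locations must carry the full peak load.
    per_site = peak_load / (sites - 1)
    return per_site * sites


for n in (2, 3, 4):
    print(f"{n} sites: {total_capacity_needed(n):.0%} of peak capacity")
```

Under these assumptions, two sites require 200% of peak capacity in total (each must be able to carry everything), while three sites require only 150%, because each site needs to carry just half of the load when one fails.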

Developing a hosting plan can be complicated and frustrating, particularly since the core competency of your business is likely not data centers. AKF Partners can help – not only with hosting strategy, but also with the product architecture and operational processes needed to weld infrastructure, architecture, and process into a seamless vehicle that delivers services to your clients with the availability the market demands.

Hurricanes aren’t the only disasters that can take down your data center. Solar flares, runaway SUVs, civil disruption, tornadoes, and localized power outages have all caused data centers to fail. Natural disasters of all types trail equipment failures and human error as causes of service-impacting events (source: 365DataCenters). According to FEMA, 40% of businesses that close due to a disaster don’t reopen, and of those that do, only 29% are in business two years after the disaster (source: FEMA). Don’t be a statistic. AKF Partners can help you with the product architecture and data center planning necessary to survive nearly any disaster.
