Hosting Lessons from Harvey and Irma
September 19, 2017 | Posted By: Greg Fennewald
Everyone was saddened to see the horrific destruction storms caused to Houston and Florida, including deaths and extensive property damage. It seems reasonable that the impact of these hurricanes was lessened by advanced notice and preparation – stockpiling supplies, evacuating the highest risk areas, and staging response resources to assist with recovery and rebuilding.
Data centers operate every day with a similar preparation mindset: diesel generators to provide power should the utility fail, batteries to keep servers running during a transition, potentially stored water or a well to replace municipal water service for cooling systems, and food and water for personnel unable to leave the location.
What happens when a “prepared” location such as a data center encounters a hurricane with strong winds, heavy rain, and extensive flooding? In some cases, the data center survives without impact, although there certainly will be outages and failures. Examples of data centers surviving Harvey in good shape can be seen here, while accounts of the service impacts caused by Hurricane Sandy can be seen here.
Data Center Points of Failure
Let’s examine what may enable a data center to survive without functional impact. Extensive risk investigation goes into site selection for data centers. Data centers are expensive to build with costs measured in the tens or even hundreds of millions of dollars. The potential business impact of a failure can be costly with liquidated damage clauses in hosting contracts. These factors lead to data centers being located outside of flood plains, away from hazardous material routes, and stoutly constructed to endure storm winds likely in the region.
Losing utility power is regarded as a “when” not an “if” in the data center industry (be that an outage or a planned maintenance activity), and diesel generators are a common solution, often with 24 hours or more of fuel on hand and multiple replenishment contracts. Data centers can survive for days/weeks without utility power, and in some cases for months. How could flooding impact power? The service entrance for a data center, where the utility power is routed, is often buried underground. Utility power is likely to be lost during flooding, either from damage due to flooding or intentional actions to prevent damage by shutting down the local grid. A data center would operate on generator if the data center itself is not flooded, although fuel replenishment is not likely. If there are two feet of water in the main electrical room(s), the data center is going dark.
Many large data centers rely on evaporating water to cool the servers it hosts. Evaporative cooling is generally more energy efficient than other options, but introduces an additional risk to operations – water supply. In many locations, municipal water pressure is lost during an extend power outage. Data centers can mitigate this risk by using water storage tanks or water wells onsite. Like diesel generators, the data centers can operate normally for hours or days without municipal water. A data center should be outside the flood plain, able to operate without utility power or municipal water for hours or days, is structurally strong enough to handle the winds of a major storm – is there any other risk to mitigate? Network connectivity and bandwidth.
Most data centers need to communicate with other data centers to fulfill their OLAP or OLTP purpose. Without connectivity, services are not available. Data should be fine, but it is becoming increasingly stale. Transactions and traffic are done. Like utility power, network connections are usually buried. With distance and geographic limitations involved, network pathways may get flooded as may the facilities that aggregate and transmit the data. Telecom facilities generally have generators and other availability measures, but can be forced into less advantageous locations and may have a shorter runtime standard than a data center.
Data centers that are serious about availability generally have carrier diversity and physical pathway diversity to mitigate carrier outages and “backhoe fades”. This may help in the event of widespread flooding as well. The reality is a data center without connectivity is generally useless. All the risk mitigation going into structural design, power and cooling redundancy, and fire protection is moot if connectivity fails.
Preparing for the Inevitable
The best way to mitigate these risks is to not rely on a single data center location. One is none and two is one. Owned, colo, managed hosting, or cloud – be able to survive the loss of a single location. The RTO and RPO of the business will guide the choice of active – active, hot – cold, or data backup with an elastic compute response plan. Hurricanes can cause regional impact, such as Irma disrupting most of Florida. In years past, many companies decided to have two data center within 20 miles of each other to support synchronous data base replication. A primary site in one borough of New York City, and the DR site in a different borough. Replication options and data base management techniques have advanced sufficiently to allow far greater dispersion today. Avoid a regionally impacting event by choosing data centers in diverse regions.
Operating from 3 locations can be cheaper than 2, and can also improve customer satisfaction with reduced response times produced by serving customers from the nearest location. See Rule 12 in Scalability Rules. The ability to operate from multiple locations also enables a choice to adjust the redundancy of those locations. A combination of Tier II and III locations may be a more economical choice than a pair of Tier IV locations.
Developing a hosting plan can be complicated and frustrating, particularly since the core competency of your business is likely not data centers. AKF Partners can help – not only with hosting strategy, but also the product architecture and operational processes needed to weld infrastructure, architecture, and process into a seamless vehicle that delivers services to your clients with availability the market demands.
Hurricanes aren’t the only disasters that can take down your data center. Solar flares, runaway SUVs, civil disruption, tornadoes and localized power outages have all caused data centers to fail. Natural disasters of all types trail equipment failures and human error as causes of service impacting events (source: 365DataCenters). According to FEMA, 40% of businesses that close due to a disaster don’t reopen, and of those that do only 29% are in business two years after the disaster (source: FEMA). Don’t be a statistic. AKF Partners can help you with the product architecture and data center planning necessary to survive nearly any disaster.
Reach out to AKF
How an AI bot beat the world's best gamers
September 5, 2017 | Posted By: Roger Andelin
Last month, a bot developed by OpenAI (co-founded by Elon Musk) beat the world’s best, pro Dota 2 players. This is another milestone accomplishment in the field of artificial intelligence and machine learning and more fuel for the fire of concerns surrounding the AI debate. However, before we jump into that debate, here is some background you should understand about the technology fueling this debate.
The Evolution of Traditional Programming
A lot of what computer programming is can be simplified into three steps. First step, read in some data. Second step, do something with that data. Third step, output some result.
For example, imagine you want to fly somewhere for the weekend. You may first go to your travel app and input some dates, times, number of people traveling, airports, etc. Second, the app uses that information to search its database of available flights. Third, it returns a list of available flights for you to see.
This approach to software design has been the norm since the earliest days of programming. Artificial intelligence, in particular machine learning, has changed that approach. The first step is still the same: Read in some data. The third step is the same: Output some result.
However, with artificial intelligence technologies like machine learning, the second step, doing something with the data, is very different. In the example of finding a flight, a programmer easily can read the software code to understand the sequence of steps the computer has been programmed to do to produce the output data. If the programmer wants to change or improve the program’s behavior she can do that by writing new code or by altering the existing code. For example, if you wanted to compare the prices for available flights near the dates you have selected, a programmer can easily change several lines of code in the program to do just that. The programming code identifies every step the computer takes to arrive at its output. Said another way, the program only does what it’s specifically told to do in the code, nothing more or nothing less.
By contrast, the output of today’s most common machine learning programs is not determined by instructions written in computer code. There is no code for a programmer to read or modify when a change is desired. The output is determined by the program’s neural network.
Neural Networks in Action
What is a neural network? At the core of a neural network is a neuron. Similar to a traditional computer program, a neuron takes some input data, does a mathematical calculation on that data and then outputs some data. A typical neuron in a neural network will receive as input hundreds to thousands of numbers, typically between 0 and 1. A neuron will then multiply each number by a weight and sum the result of all the numbers. Many neurons will then convert the result into a number between 0 and 1. That result is then sent to the next neuron in sequence until the final output neuron is reach.
Here is an example of the math a typical neuron will do: If “x1, x2, x3…” represents input data and “w1, w2, w3…” represent the weights stored in the neuron, the calculation done by the neuron in a neural network looks like this: x1*w1 + x2*w2 + x3*w3 and so forth.
You can think of the calculation inside the neuron in a different way: The neuron is reading in a bunch of numbers and the weights in the neuron determine the importance, “or weight” of that input in producing the output. If the input is not important the weight for the input will be near zero and the input is not passed along to the next neuron. Therefore, the weights in a neuron effectively decided what input is valuable and what input should be ignored.
In a neural network, neurons like the one I described above are connected in parallel and in series to create a matrix of neurons. The input data to a neural network will go into hundreds or thousands of neurons in parallel, all with different weights. The output of those neurons is then sent to another layer of neurons and so forth, usually multiple layers deep. This is called a deep neural network. Another way to look at this is the neurons are grouped into a matrix of rows and columns, all interconnected. The final layer of the neural network is the output layer. Therefore, the final output of a neural network is the result of millions of calculations done by the neurons of that network.
When a programmer creates a neural network in software, the weights for each neuron are initially just random numbers. In other words, the weights arbitrarily decide to either diminish, increase or leave the input data alone, and output from the network is random. However, through a process called training, the weights move from randomly assigned values to values that can produce very useful outputs.
Training is both a time consuming and complicated mathematical process. However, it is much like training you and I would do to get better at something. For example, let’s say I wanted to learn how to shoot and arrow with a bow. I might pick up the bow and arrow, point it at the target, pull back the string and release. In my case, I know the arrow would miss the target. Therefore, I would try again and again making corrections to my aim based on on how far and which direction I was off from the target.
During the training process for a neural network, the weights in each of the neurons are changed slightly to improve the output, or “aim.” The most common approach for making those changes is called backpropagation. Backpropagation is a mathematical approach for applying corrections to every weight in every neuron of the network. During training, input is fed into the network and output is generated. The output is compared to the desired target and the difference between the output and the target is the error. Using the error, backpropagation makes changes to the weights in each neuron to reduce the error. If all goes well during training and backpropagation, the output error diminishes until it reaches expert or better than expert level.
AI vs Humans
In the case of the OpenAI Dota bot that recently beat the world’s best Dota 2 player, the outputs, which were a sequence of steps, strategies and decisions, went from random moves to moves that were so good the bot was able to easily defeat the best pro players in the world. The critical information that enabled the bot to win its matches was stored in the weights of the neurons and the neural network architecture itself.
A good question at this point is to ask if a programmer looking at the Dota 2 bot’s neural network could understand the steps taken by the bot to beat the human player. The answer is no. A programmer can see areas of the neural network that influence an output but it is not possible to explain why the bot took specific steps to formulate its moves and strategies. All the programmer would see is a huge matrix of weights that would be quite overwhelming to interpret.
Another good question to ask is whether or not a program written traditionally by a programmer with step by step instructions could beat the best Dota 2 player. The answer is no. Step by step programs where the programmer specifically instructs the computer to do something would easily be defeated by a professional player. However, a neural network can learn from training things that a programmer would never have the knowledge to program, store that learning in its neurons and use that learning to do things like defeat a human pro.
What makes the Dota 2 bot special is that it learned to beat the best pro players by playing against itself whereas most machine learning programs learn from training on data given to it by a programmer. In machine learning, good training data is like gold. It’s scarce and valuable. (note: This is one reason why Google and other big tech companies want to collect so much data.) Data is used to train neural networks to do useful things like recognizing people and places in your pictures or recognizing your voice from others in your family. OpenAI built a bot that learned almost entirely by playing against itself with the exception of some coaching provided by the OpenAI team. OpenAI has shown clearly that learning can occur without having tons of training data. It’s a little like being able to make gold.
Does the development of the OpenAI dota bot mean bots can now decide to train against themselves and become super bots? No. But it does say that humans can now program two bots to train against each other to become superbots. The key enabler being us. It’s anyone’s guess what type of bot can be imagined and developed in this way, useful or harmful. Obviously to most, a gaming superbot seems pretty innocuous, except of course to the gamer who may unexpectedly run into one during a match. However, it’s not hard to imagine super bots that are not so harmless. Or, perhaps you can just imagine a time when someone trains a bot to play football against itself until the bot becomes better at calling plays and strategy than every coach in the NFL. What happens then? The answer is disruption. Are you ready for it?
AKF Partners recommends that boards and executives direct their teams to identify sources of innovation and patterns of disruption that AI techniques may represent within their respective markets Walmart is already working on facial recognition technology in their stores to determine whether or not shoppers are satisfied at checkout. Will this give them a potential advantage over Amazon? How can machine learning and AI help you prevent fraud in your payment systems or the use of your commerce system to launder money?
AKF is prepared to help answer that question and others you may be facing. We will help you craft your AI strategy, sort through the hype, help you find the opportunities, and identify the potential threats of AI technology to your business.
Reach out to AKF
When Should You Split Services?
April 3, 2017 | Posted By: AKF
The Y axis of the AKF Scale Cube indicates that growing companies should consider splitting their products along services (verb) or resources (noun) oriented boundaries. A common question we receive is “how granular should one make a services split?” A similar question to this is “how many swim lanes should our application be split into?” To help answer these questions, we’ve put together a list of considerations based on developer throughput, availability, scalability, and cost. By considering these, you can decide if your application should be grouped into a large, monolithic codebases or split up into smaller individual services and swim lanes. You must also keep in mind that splitting too aggressively can be overly costly and have little return for the effort involved. Companies with little to no growth will be better served focusing their resources on developing a marketable product than by fine tuning their service sizes using the considerations below.
Frequency of Change – Services with a high rate of change in a monolithic codebase cause competition for code resources and can create a number of time to market impacting conflicts between teams including product merge conflicts. Such high change services should be split off into small granular services and ideally placed in their own fault isolative swim lane such that the frequent updates don’t impact other services. Services with low rates of change can be grouped together as there is little value created from disaggregation and a lower level of risk of being impacted by updates.
The diagram below illustrates the relationship we recommend between functionality, frequency of updates, and relative percentage of the codebase. Your high risk, business critical services should reside in the upper right portion being frequently updated by small, dedicated teams. The lower risk functions that rarely change can be grouped together into larger, monolithic services as shown in the bottom left.
Degree of Reuse – If libraries or services have a high level of reuse throughout the product, consider separating and maintaining them apart from code that is specialized for individual features or services. A service in this regard may be something that is linked at compile time, deployed as a shared dynamically loadable library or operate as an independent runtime service.
Team Size – Small, dedicated teams can handle micro services with limited functionality and high rates of change, or large functionality (monolithic solutions) with low rates of change. This will give them a better sense of ownership, increase specialization, and allow them to work autonomously. Team size also has an impact on whether a service should be split. The larger the team, the higher the coordination overhead inherent to the team and the greater the need to consider splitting the team to reduce codebase conflict. In this scenario, we are splitting the product up primarily based on reducing the size of the team in order to reduce product conflicts. Ideally splits would be made based on evaluating the availability increases they allow, the scalability they enable or how they decrease the time to market of development.
Specialized Skills – Some services may need special skills in development that are distinct from the remainder of the team. You may for instance have the need to have some portion of your product run very fast. They in turn may require a compiled language and a great depth of knowledge in algorithms and asymptotic analysis. These engineers may have a completely different skillset than the remainder of your code base which may in turn be interpreted and mostly focused on user interaction and experience. In other cases, you may have code that requires deep domain experience in a very specific area like payments. Each of these are examples of considerations that may indicate a need to split into a service and which may inform the size of that service.
Availability and Fault Tolerance Considerations:
Desired Reliability – If other functions can afford to be impacted when the service fails, then you may be fine grouping them together into a larger service. Indeed, sometimes certain functions should NOT work if another function fails (e.g. one should not be able to trade in an equity trading platform if the solution that understands how many equities are available to trade is not available). However, if you require each function to be available independent of the others, then split them into individual services.
Criticality to the Business – Determine how important the service is to business value creation while also taking into account the service’s visibility. One way to view this is to measure the cost of one hour of downtime against a day’s total revenue. If the business can’t afford for the service to fail, split it up until the impact is more acceptable.
Risk of Failure – Determine the different failure modes for the service (e.g. a billing service charging the wrong amount), what the likelihood and severity of each failure mode occurring is, and how likely you are to detect the failure should it happen. The higher the risk, the greater the segmentation should be.
Scalability of Data – A service may be already be a small percentage of the codebase, but as the data that the service needs to operate scales up, it may make sense to split again.
Scalability of Services – What is the volume of usage relative to the rest of the services? For example, one service may need to support short bursts during peak hours while another has steady, gradual growth. If you separate them, you can address their needs independently without having to over engineer a solution to satisfy both.
Dependency on Other Service’s Data – If the dependency on another service’s data can’t be removed or handled with an asynchronous call, the benefits of disaggregating the service probably won’t outweigh the effort required to make the split.
Effort to Split the Code – If the services are so tightly bound that it will take months to split them, you’ll have to decide whether the value created is worth the time spent. You’ll also need to take into account the effort required to develop the deployment scripts for the new service.
Shared Persistent Storage Tier – If you split off the new service, but it still relies on a shared database, you may not fully realize the benefits of disaggregation. Placing a readonly DB replica in the new service’s swim lane will increase performance and availability, but it can also raise the effort and cost required.
Network Configuration – Does the service need its own subdomain? Will you need to make changes load balancer routing or firewall rules? Depending on the team’s expertise, some network changes require more effort than others. Ensure you consider these changes in the total cost of the split.
The illustration below can be used to quickly determine whether a service or function should be segmented into smaller microservices, be grouped together with similar or dependent services, or remain in a multifunctional, infrequently changing monolith.
Splitting Databases for Scale
April 3, 2017 | Posted By: AKF
The most common point of congestion and therefore barrier to scale that we see in our practice is the database. Referring back to our earlier article “Splitting Applications or Services for Scale”, it is very common for engineers to create scalability along the X axis of our cube by persisting data in a single monolithic database and having multiple “cloned” applications servers retrieve and store data within that database. For young companies this is a very good approach as if done properly it will also eliminate the need for persistence or affinity to a given application server and as a result will increase customer perceived availability.
The problem, however, with this single monolithic data structure is threefold:
- Even with clustering technology (the existence of a second physical system or database that can take the load of the first in the event of failure), failures of the primary database will result in short service outages for 100% of the user community.
- This approach ultimately relies solely on technical improvements in cpu speed, memory access speed, memory access size, mass
storage access speeds and size, etc to insure the companies needs for scale.
- Relying upon (2) above in the extreme cases is not the most cost effective solutions as the newest and fastest technologies come at
a premium to older generations of technology and do not necessarily have the same processing power per dollar as older and/or
smaller (fewer cpus etc) systems.
As we have argued in the aforementioned post, a great engineering team will think about how to scale their platform well in advance of the need to rely solely upon partner technology advances. By making small modifications to our previously presented “Scale Cube”, the same concepts applied to the problem of splitting services for scale can be useful in addressing how to split a database for scale. As with the AKF Services Scale Cube, the AKF Database Scale Cube consists of an X, Y and Z axes – each addressing a different approach to scale transactions applied to a database. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic database – a case where all data is located in a single location and all accesses go to this single database.
The X Axis of the cube represents a means of spreading load across multiple instances of a replicated representation of the data. This is the first approach most companies use in scaling databases and is often both the easiest to implement and the least costly in both engineering time and hardware. Many third party and open source databases have native properties or functions that will allow the near real time replication of data to multiple “read databases”. The engineering cost of such an approach is low as typically database calls only need to be identified as a “read” or “write” and sent to the appropriate write database or bank of read databases. The “bank” of read databases should have reads evenly split across this if possible and many companies employ simple 3d party load balancers to perform this distribution.
Included in our Xaxis split are third party and open source caching solutions that allow reads to be split across “cache” hosts before actually reading from a database upon a cache miss. Caching is another simple way to reduce the load on the database but in our experience is not sufficient for hyper growth SaaS sites. If implemented properly, this Xaxis split also can increase availability as if replication is near real time, a read server can be promoted as the singular “write server” in the event of a “write server” failure. The combination of caching and read/write splits (our X axis) is sufficient for many companies but for companies with extreme hyper growth and massive data retention needs it is often not enough.
The Y Axis of our database cube represents a split by function, service or resource just as it did with the service cube. A service might represent a set of usecases and is most often easiest to envision through thinking of it as a verb or action like “login” and a resource oriented split is easiest to envision by thinking of splits as nouns like “account information”. These splits help handle not only the split of transactions across multiple systems as did the X axis, but can also be helpful in speeding up database calls by allowing more information specific to the request to be held in memory rather than needing to make a disk access. Just as with our approach in scaling services, our recommended approach to identify the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team as very often they will require that the application be split up as well. It is possible to take a monolithic application and perform physical splits by say URL/URI to different service or resource oriented pools. While this approach will help spread transaction processing across multiple systems similar to our X axis implementation it may not offer the added benefit of reducing the amount of system memory required by service / pool / resource / application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to focus on specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as “swimlaning” an application and data set, especially when both the database and applications are split to represent a “failure domain” or fault isolative infrastructure.
The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (hash for instance). The most common way to view this is to consider splitting your resources by customer if your entity relationships allow that to happen. In the world of media, you might consider splitting it by article_id or media_id and in the world of commerce a split by product_id might be appropriate. In the case where you split customers from your products and perform splits within customers and products you would be implementing both a Y axis split (splitting by resource or call – customers and products) and a Z axis split (a
modulus of customers and products within their functional splits).
Z axis splits tend to be the most costly for an engineering team to perform as often many functions that might be performed within the database (joins for instance) now need to be performed within the application. That said, if done appropriately they represent the greatest potential for scale for most companies.
Splitting Applications or Services for Scale
April 3, 2017 | Posted By: AKF
Splitting Applications or Services for Scale
Most internet enabled products start their life as a single application running on an appserver or appserver/webserver combination and potentially communicating with a database. Many if not all of the functions are likely to exist within a monolithic application code base making use of the same physical and virtual resources of the system upon which the functions operate: memory, cpu, disk, network interfaces, etc. Potentially the engineers have the forethought to make the system highly available by positioning a second application server in the mix to be used in the event that the first application server fails.
This monolithic design will likely work fine for many sites that receive low levels of traffic. However, if the product is very successful and receives wide and fast adoption user perceived response times are likely to significantly degrade to the point that the product is almost entirely unusable. At some point, the system will likely even fail under the load as the inbound request rate is significantly greater than the processing power of the system and the resulting departure rate of responses to requests.
A great engineering team will think about how to scale their platform well in advance of such a catastrophic failure. There are many ways to approach how to think about such scalability of a platform and we present several through a representation of a three dimensional cube addressing three approaches to scale that we call the AKF Scale Cube.
The AKF Scale Cube consists of an X, Y and Z axes – each addressing a different approach to scale a service. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic service or product identified above: a product wherein all functions exist within a single code base on a single server making use of that server’s finite resources of memory, cpu speed, network ports, mass storage, etc.
The X Axis of the cube represents a means of spreading load across multiple instances of the same application and data set. This is the first approach most companies use to scale their services and it is effective in scaling from a request per second perspective. Oftentimes it is sufficient to handle the scale needs of a moderate sized business. The engineering cost of such an approach is low compared to many of the other options as no significant rearchitecting of the code base is required unless the engineering team needs to eliminate affinity to a specific server because the application maintains state. The approach is simple: clone the system and service and allow it to exist on N servers with each server handling 1/Nth the total requests. Ideally the method of distribution is a loadbalancer configured in a highly available manner with a passive peer that becomes active should the active peer fail as a result of hardware or software problems. We do not recommend leveraging roundrobin DNS as a method of load balancing. If the application does maintain state there are various ways of solving this including a centralized state service, redesigning for statelessness, or as a last resort using the load balancer to provide persistent connections. While the Xaxis approach is sufficient for many companies and distributes the processing of requests across several hosts it does not address other potential bottlenecks like memory constraints where memory is used to cache information or results.
The Y Axis of the cube represents a split by function, service or resource. A service might represent a set of usecases and is most often easiest to envision through thinking of it as a verb or action like “login” and a resource oriented split is easiest to envision by thinking of splits as nouns like “account information”. These splits help handle not only the split of transactions across multiple systems as did the X axis, but can also be helpful in reducing or distributing the amount of memory dedicated to any given application across several systems. A recommended approach to identify the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team as very often they will require that the application be split up as well. As a quick first step, a monolithic application can be placed on multiple servers and dedicate certain of those servers to specific “services” or URIs. While this approach will help spread transaction processing across multiple systems similar to our X axis implementation it may not offer the added benefit of reducing the amount of system memory required by service/pool/resource/application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to focus on specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as
“swimlaning” an application.
The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (hash for instance). As with the Y axis split, this split aids not only fault isolation, but significantly reduces the amount of memory necessary
(caching, etc) for most transactions and also reduces the amount of stabile storage to which the device/service needs attach. In this case, you might try a modulus by content id (article), or listing id, or a hash from the received IP address, etc. The Z axis split is often the most costly of all splits and we only recommend it for clients that have hypergrowth or very high rates of transaction. It should only be used after a company has implemented a very granular split along the Y axis. That said, it also can offer the greatest degree of scalability as the number of “swimlanes within swimlanes” that it creates is virtually limitless. For instance, if a company implements a Z axis split as a modulus of some transaction id and the implementation is a configurable number “N”, then N can be 10, 100, 1000, etc and each order of magnitude increase in N creates nearly an order of magnitude of greater scale for the company.
Scalability Best Practices
April 3, 2017 | Posted By: AKF
Here are a baker’s dozen of items that we feel are Best Practices for Scalability:
Use asynchronous communication when possible. Synchronous calls tie the availability of the two services together. If one has a failure or is slow the other one is affected.
2. Swim Lanes
Create fault isolated “swim lanes” of hardware by customer segmentation. This prevents problems with one customer from causing issues across all customers. This also helps with diagnosis of issues and code roll outs.
Make use of cache at multiple layers including object caches in front of databases (such as memcached), page or item caches for content (such as squid) and edge caches (such as Akamai).
Understand your application’s performance from a customer’s perspective. Monitor outside of your network and have tests that simulate a real user’s experience. Also monitor the internal working of the application in terms of query and transaction execution count and timing.
Replicate databases for recovery as well as to off load reads to multiple instances.
Split the application and databases by service and / or by customer using a modulus. While this requires slightly more complicated logic in the application it allows for massive scaling.
7. Use Few RDBMS Features
Use the OLTP database as a persistent storage device as much as possible. The more you rely on the features offered in most RDBMS for your transactions, the greater load you are putting on the hardest item in your system to scale. Remove all business logic from the database such as stored procedures and move it into the application. When significant scaling is required join in the application and not through the SQL.
8. Slow Roll
Roll out new code versions slowly, to a small subset of your servers without bringing the entire site down. This requires that all code be backwards compatible because you will have two versions of code running in production during the roll out. This method allows you to find problems that your quality and L&P testing missed while having minimal impact on customers.
9. Load & Performance Testing
Test the performance of the application version before it goes into production. This will not catch all the issues, which is why you need the ability to rollback, but it is very worthwhile.
10. Capacity Planning / Scalability Summits
Know how much capacity you have on all tiers and services in your system. Use Scalability Summits to plan for the increased capacity demands.
Always have the ability to rollback a code release.
12. Root Cause Analysis
Ensure you have a learning culture that is evident by utilizing Root Cause Analysis to find and fix the real cause of issues.
13. Quality From The Beginning
Quality can’t be tested into a product, it must be designed in from the beginning.
Fault Isolative Architectures or “Swimlaning”
April 3, 2017 | Posted By: AKF
Two of our previous articles, Splitting Databases for Scale and Splitting Applications or Services for Scale have made references to a concept that we call “Swimlaning Architectures”.
The basics of this concept are covered in our two previous posts, but we have not spent a lot of time discussing the reasons for such a split or approach in technology architecture.
In our definition, a “Swimlane” is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of said boundary. The benefit of such a failure domain is twofold:
1) Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain.
2) Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. As such, and depending upon approach only a portion of users or a portion of functionality of the product is affected.
A “swimlaned” architecture is one in which each failure domain is completely isolated. In order to achieve this, ideally there are no calls between swimlanes or failure domains. Synchronous calls are absolutely forbidden in this type of architecture as any synchronous call between failure domains, even with appropriate timeout and detection mechanisms is very likely to cause a series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a call to any other service in another domain, to any service outside of the domain, or if the domain receives calls from other domains or services.
It is acceptable, but not advisable, to have asynchronous calls between domains. If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not call port overloads on any services. Here is an interesting blog post about runaway scripts and their impact on Apache, PHP, and MySQL.
As we have previously indicated, a swimlane should have all of its services located within the failure domain. For instance, if database accesses are necessary the database with all appropriate information for that swimlane should exist within the same failure domain as all of the application and webservers necessary to perform the function or functions of the swimlane. Furthermore, that database should not be used for other requests of service from other swimlanes. Our rule is one production database on one host.
As we have indicated with our Scale Cube in the past, there are many ways in which to think about swimlaned architectures. You can think about them in terms of a separation of services e.g. “login” and “shopping cart” (two separate swimlanes) each having the web and app servers as well as all data stores located within the swimlane and answering only to systems within that swimlane. Corresponding to the Scale Cube we have previously introduced this would be a “Y” axis swimlane.
Another approach would be to perform a separation of your customer base or a separation of your order numbers or product catalog.
Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z axis swimlane along customer, order number or product id lines.
Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform.
Build v. Buy
April 3, 2017 | Posted By: AKF
In many of our engagements, we find ourselves helping our clients understand when it’s appropriate to build and when they should buy.
If you perform a simple web search for “build v. buy” you will find hundreds of articles, process flows and decision trees on when to build and when to buy. Many of these are costcentric decisions including discounted cash flows for maintenance of internal development and others are focused on strategy. Some of the articles blend the two.
Here is a simple set of questions that we often ask our customers to help them with the build v. buy decision:
1. Does this “thing” (product / architectural component / function) create strategic differentiation in our business?
Here we are talking about whether you are creating switching costs, lowering barriers to exit, increasing barriers to entry, etc that would give you a competitive advantage relative to your competition. See Porter’s Five Forces for more information about this topic. If the answer to this question is “No – it does not create competitive differentiation” then 99% of the time you should just stop there and attempt to find a packaged product, open source solution, or outsourcing vendor to build what you need. If the answer is “Yes”, proceed to question 2.
2. Are we the best company to create this “thing”?
This question helps inform whether you can effectively build it and achieve the value you need. This is a “core v. context” question; it asks both whether your business model supports building the item in question and also if you have the appropriate skills to build it better than anyone else. For instance, if you are a social networking site, you *probably* don’t have any business building relational databases for your own use. Go to question number (3) if you can answer “Yes” to this question and stop here and find an outside solution if the answer is “No”. And please, don’t fool yourselves – if you answer “Yes” because you believe you have the smartest people in the world (and you may), do you really need to dilute their efforts by focusing on more than just the things that will guarantee your success?
3. Are there few or no competing products to this “thing” that you want to create?
We know the question is awkwardly worded – but the intent is to be able to exit these four questions by answering “yes” everywhere in order to get to a “build” decision. If there are many providers of the “thing” to be created, it is a potential indication that the space might become a commodity. Commodity products differ little in feature sets over time and ultimately compete on price which in turn also lowers over time. As a result, a “build” decision today will look bad tomorrow as features converge and pricing declines. If you answer “Yes” (i.e. “Yes, there are few or no competing products”), proceed to question (4).
4. Can we build this “thing” cost effectively?
Is it cheaper to build than buy when considering the total lifecycle (implementation through endoflife)
of the “thing” in question? Many companies use cost as a justification, but all too often they miss the key points of how much it costs to maintain a proprietary “thing”, “widget”, “function”, etc. If your business REALLY grows and is extremely successful, do you really want to be continuing to support internally developed load balancers, databases, etc. through the life of your product? Don’t fool yourself into answering this affirmatively just because you want to work on something neat. Your job is to create shareholder value – not work on “neat things” – unless your “neat thing” creates shareholder value.
There are many more complex questions that can be asked and may justify the building rather than purchasing of your “thing”, but we feel these four questions are sufficient for most cases.
A “build” decision is indicated when the answers to all 4 questions are “Yes”.
We suggest seriously considering buying or outsourcing (with appropriate contractual protection when intellectual property is a
concern) anytime you answer “No” to any question above.