Sidecar Pattern: The Dos and Don’ts
June 5, 2019 | Posted By: Marty Abbott
Sidecar Pattern Overview
The Sidecar Pattern is meant to allow the deployment of components of an application into separate, isolated, and encapsulated processes or containers. This pattern is especially useful when there is a benefit to sharing common components between microservices, as in the case of logging utilities, monitoring utilities, configuration routines, etc.
The Sidecar Pattern name is an analogy indicating the one-seat cars sometimes bolted alongside a motorcycle.
Benefits of Sidecar
Sidecar comes with many benefits:
- Use of multiple languages (polyglot) or technologies for each component. This is especially useful if a language is especially strong in a necessary area (e.g. Python for Machine Learning, or R for statistical work) or if an opensource solution can be leveraged to eliminate in-house specialization (e.g. the use of NGINX for certain network-related functions).
- Separation of what would otherwise be a monolith, and if used properly, fault isolation of associated services.
- Conceptually easy interactions between components similar to those provided by libraries, or service calls between microservices.
- Lower latency than traditional service calls to “other” services as the Sidecar lives in the same processing environment (VM or physical server) – albeit typically in a separate container.
- Similar to the use of libraries, allows for ownership by individual teams and organizational scalability of a larger team.
- Similar to the use of dynamically-loadable libraries, allows for independent release by teams of various shared usage components.
Drawbacks to Sidecar
Regardless of implementation (poly- or mono- glot), Sidecar has some drawbacks compared to the use of libraries:
- Higher inter-process communication latency – Because most implementations are service calls, the loopback interface on a system (127.0.0.1) will increase latency compared to the transition of call flow through memory.
- Size – especially in polyglot implementations but even in monoglot implementations – Containerization leads to multiple copies of similar libraries and increased memory utilization for comparable operations relative to the use of libraries.
- Environments – it is difficult to create any notion of fault isolation with Sidecar without containerization technologies. VM technologies (Sidecar in a VM separate from the host or calling solution) is not an option as it is then a Fan Out or Mesh anti-pattern rather than a local call.
When to Use Sidecar
Sidecar is a compelling alternative to libraries for cases where the increase in latency associated with local service messaging does not impact end-user response times. Examples of these are asynchronous logging, out of band monitoring, and asynchronous messaging capabilities. Circuit breakers (time-based request/response timeouts) are also a good example of a Sidecar implementation.
When to Avoid Sidecar
Never use a Sidecar Pattern for synchronous activities that must complete prior to generating a user response. Doing so will add some delay to end-user response times.
AKF also advises staying away from Sidecar for synchronous communications between services where doing so requires Sidecar to know all endpoints for each service. A specific example we advise against is having every endpoint (instance) of Service A (e.g. add-to-cart) know of every endpoint (instance) of Service B (e.g. decrement-SKU). A graphic of this example is given below:
The above graphic indicates the coordination between just two services and the instances that comprise that service. Imagine a case where all services may communicate to each other (as in the broader Mesh anti-pattern). Attempting to isolate faults becomes nearly impossible.
If Service A sometimes fails while calling Service B, how do you know which component is failing? Is it a failure in Service A, Service A’s Sidecar Proxy or Service B? Easier is to have a fewer number of proxies (albeit at a higher cost of latency given non-local communication) handle the transactions allowing for easier fault identification.
AKF Partners has helped hundreds of companies move from monolithic solutions to services and microservice architectures. Give us a call, we can help you with your transition.
Subscribe to the AKF Newsletter
What's the difference between VMs & Containers?
May 29, 2019 | Posted By: Robin McGlothin
VMs vs Containers
Inefficiency and down time have traditionally kept CTO’s and IT decision makers up at night. Now, new challenges are emerging driven by infrastructure inflexibility and vendor lock-in, limiting Technology more than ever and making strategic decisions more complex than ever. Both VMs and containers can help get the most out of available hardware and software resources while easing the risk of vendor lock-in limitation.
Containers are the new kids on the block, but VMs have been, and continue to be, tremendously popular in data centers of all sizes. Having said that, the first lesson to learn, is containers are not virtual machines. When I was first introduced to containers, I thought of them as light weight or trimmed down virtual instances. This comparison made sense since most advertising material leaned on the concepts that containers use less memory and start much faster than virtual machines – basically marketing themselves as VMs. Everywhere I looked, Docker was comparing themselves to VMs. No wonder I was a bit confused when I started to dig into the benefits and differences between the two.
As containers evolved, they are bringing forth abstraction capabilities that are now being broadly applied to make enterprise IT more flexible. Thanks to the rise of Docker containers it’s now possible to more easily move workloads between different versions of Linux as well as orchestrate containers to create microservices. Much like containers, a microservice is not a new idea either. The concept harkens back to service-oriented architectures (SOA). What is different is that microservices based on containers are more granular and simpler to manage. More on this topic in a blog post for another day!
If you’re looking for the best solution for running your own services in the cloud, you need to understand these virtualization technologies, how they compare to each other, and what are the best uses for each. Here’s our quick read.
VM’s vs. Containers – What’s the real scoop?
One way to think of containers vs. VMs is that while VMs run several different operating systems on one server, container technology offers the opportunity to virtualize the operating system itself.
Figure 1 – Virtual Machine Figure 2 - Container
VMs help reduce expenses. Instead of running an application on a single server, a virtual machine enables utilizing one physical resource to do the job of many. Therefore, you do not have to buy, maintain and service several servers. Because there is one host machine, it allows you to efficiently manage all the virtual environments with a centralized tool – the hypervisor. The decision to use VMs is typically made by DevOps/Infrastructure Team. Containers help reduce expenses as well and they are remarkably lightweight and fast to launch. Because of their small size, you can quickly scale in and out of containers and add identical containers as needed.
Containers are excellent for Continuous Integration and Continuous Deployment (CI/CD) implementation. They foster collaborative development by distributing and merging images among developers. Therefore, developers tend to favor Containers over VMs. Most importantly, if the two teams work together (DevOps & Development) the decision on which technology to apply (VMs or Containers) can be made collaboratively with the best overall benefit to the product, client and company.
What are VMs?
The operating systems and their applications share hardware resources from a single host server, or from a pool of host servers. Each VM requires its own underlying OS, and the hardware is virtualized. A hypervisor, or a virtual machine monitor, is software, firmware, or hardware that creates and runs VMs. It sits between the hardware and the virtual machine and is necessary to virtualize the server.
IT departments, both large and small, have embraced virtual machines to lower costs and increase efficiencies. However, VMs can take up a lot of system resources because each VM needs a full copy of an operating system AND a virtual copy of all the hardware that the OS needs to run. This quickly adds up to a lot of RAM and CPU cycles. And while this is still more economical than bare metal for some applications this is still overkill and thus, containers enter the scene.
Benefits of VMs
• Reduced hardware costs from server virtualization
• Multiple OS environments can exist simultaneously on the same machine, isolated from each other.
• Easy maintenance, application provisioning, availability and convenient recovery.
• Perhaps the greatest benefit of server virtualization is the capability to move a virtual machine from one server to another quickly and safely. Backing up critical data is done quickly and effectively because you can effortlessly create a replication site.
Popular VM Providers
• VMware vSphere ESXi, VMware has been active in the virtual space since 1998 and is an industry leader setting standards for reliability, performance, and support.
• Oracle VM VirtualBox - Not sure what operating systems you are likely to use? Then VirtualBox is a good choice because it supports an amazingly wide selection of host and client combinations. VirtualBox is powerful, comes with terrific features and, best of all, it’s free.
• Xen - Xen is the open source hypervisor included in the Linux kernel and, as such, it is available in all Linux distributions. The Xen Project is one of the many open source projects managed by the Linux Foundation.
• Hyper-V - is Microsoft’s virtualization platform, or ‘hypervisor’, which enables administrators to make better use of their hardware by virtualizing multiple operating systems to run off the same physical server simultaneously.
• KVM - Kernel-based Virtual Machine (KVM) is an open source virtualization technology built into Linux. Specifically, KVM lets you turn Linux into a hypervisor that allows a host machine to run multiple, isolated virtual environments called guests or virtual machines (VMs).
What are Containers?
Containers are a way to wrap up an application into its own isolated ”box”. For the application in its container, it has no knowledge of any other applications or processes that exist outside of its box. Everything the application depends on to run successfully also lives inside this container. Wherever the box may move, the application will always be satisfied because it is bundled up with everything it needs to run.
Containers virtualize the OS instead of virtualizing the underlying computer like a virtual machine. They sit on top of a physical server and its host OS — typically Linux or Windows. Each container shares the host OS kernel and, usually, the binaries and libraries, too. Shared components are read-only. Sharing OS resources such as libraries significantly reduces the need to reproduce the operating system code and means that a server can run multiple workloads with a single operating system installation. Containers are thus exceptionally light — they are only megabytes in size and take just seconds to start. Compared to containers, VMs take minutes to run and are an order of magnitude larger than an equivalent container.
In contrast to VMs, all that a container requires is enough of an operating system, supporting programs and libraries, and system resources to run a specific program. This means you can put two to three times as many applications on a single server with containers than you can with VMs. In addition, with containers, you can create a portable, consistent operating environment for development, testing, and deployment. This is a huge benefit to keep the environments consistent.
Containers help isolate processes through differentiation in the operating system namespace and storage. Leveraging operating system native capabilities, the container isolates process space, may create temporary file systems and relocate process “root” file system, etc.
Benefits of Containers
One of the biggest advantages to a container is the fact you can set aside less resources per container than you might per virtual machine. Keep in mind, containers are essentially for a single application while virtual machines need resources to run an entire operating system. For example, if you need to run multiple instances of MySQL, NGINX, or other services, using containers makes a lot of sense. If, however you need a full web server (LAMP) stack running on its own server, there is a lot to be said for running a virtual machine. A virtual machine gives you greater flexibility to choose your operating system and upgrading it as you see fit. A container by contrast, means that the container running the configured application is isolated in terms of OS upgrades from the host.
Popular Container Providers
1. Docker - Nearly synonymous with containerization, Docker is the name of both the world’s leading containerization platform and the company that is the primary sponsor of the Docker open source project.
2. Kubernetes - Google’s most significant contribution to the containerization trend is the open source containerization orchestration platform it created.
3. Although much of early work on containers was done on the Linux platform, Microsoft has fully embraced both Docker and Kubernetes containerization in general. Azure offers two container orchestrators Azure Kubernetes Service (AKS) and Azure Service Fabric. Service Fabric represents the next-generation platform for building and managing these enterprise-class, tier-1, applications running in containers.
4. Of course, Microsoft and Google aren’t the only vendors offering a cloud-based container service. Amazon Web Services (AWS) has its own EC2 Container Service (ECS).
5. Like the other major public cloud vendors, IBM Bluemix also offers a Docker-based container service.
6. One of the early proponents of container technology, Red Hat claims to be “the second largest contributor to the Docker and Kubernetes codebases,” and it is also part of the Open Container Initiative and the Cloud Native Computing Foundation. Its flagship container product is its OpenShift platform as a service (PaaS), which is based on Docker and Kubernetes.
Uses for VMs vs Uses for Containers
Both containers and VMs have benefits and drawbacks, and the ultimate decision will depend on your specific needs, but there are some general rules of thumb.
• VMs are a better choice for running applications that require all of the operating system’s resources and functionality when you need to run multiple applications on servers or have a wide variety of operating systems to manage.
• Containers are a better choice when your biggest priority is maximizing the number of applications running on a minimal number of servers.
Because of their small size and application orientation, containers are well suited for agile delivery environments and microservice-based architectures. When you use containers and microservices, however, you can easily have hundreds or thousands of components in your environment. You may be able to manually manage a few dozen virtual machines or physical servers, but there is no way you can manage a production-scale container environment without automation. The task of automating and managing a large number of containers and how they interact is known as orchestration.
Scalability of containerized workloads is a completely different process from VM workloads. Modern containers include only the basic services their functions require, but one of them can be a web server, such as NGINX, which also acts as a load balancer. An orchestration system, such as Google Kubernetes is capable of determining, based upon traffic patterns, when the quantity of containers needs to scale out; can replicate container images automatically; and can then remove them from the system.
For most, the ideal setup is likely to include both. With the current state of virtualization technology, the flexibility of VMs and the minimal resource requirements of containers work together to provide environments with maximum functionality.
If your organization is running many instances of the same operating system, then you should look into whether containers are a good fit. They just might save you significant time and money over VMs.
Subscribe to the AKF Newsletter
The Scale Cube: Achieve Security Through Scalability
May 14, 2019 | Posted By: James Fritz
If AKF Partners had to be known for one thing and one thing only it would be the Scale Cube. An ingenious little model designed for companies to identify how scalable they are and set goals along any of the three axes to make their product more scalable. Based upon the amount of times I have said scalable, or a derivative of the word scale, it should lead you to the conclusion that the AKF Scale Cube is about scale. And you would be right. However, the beauty of the cube is that is also applicable to Security.
The X-Axis is usually the first axis that companies look at for scalability purposes. The concept of horizontal duplication is usually the easiest reach from a technological standpoint, however it tends to be fairly costly. This replication across various tiers (web, application or database) also insulates companies when the inevitable breach does occur. Planning for only security without also bracing for a data breach is a naive approach. With replication across the tiers, and even delayed replication to protect against data corruption, not only are you able to accommodate more customers, you now potentially have a clean copy replicated elsewhere if one of your systems gets compromised, assuming you are able to identify the breach early enough.
One of the costliest issues with a breach is recovery to a secure copy. Your company may take a hit publicity wise, but if you are able to bring your system back up to a clean state, identify the compromise and fix it, then you are can be back on your way to fully operational. The reluctant acceptance that breaches occur is making its way into the minds of people. If you are just open and forthright with them, the publicity issue around a breach tends to be lessened. Showing them that your system is back up, running and now more secure will help drive business in the right direction.
Splitting across services (the Y-Axis) has many benefits beyond just scalability. It provides ownership, accountability and segregation. Although difficult to implement, especially if coming from a monolithic base, the benefits of these micro-services help with security as well. Code bases that communicate via asynchronous calls not only allow a service to fail without a major impact to other services, it creates another layer for a potential intruder to traverse.
Steps that can be implemented to provide a defense in depth of your environment help slow/mitigate attackers. If asynchronous calls are used between micro-services each lateral or vertical movement is another opportunity to be stopped or detected. If services are small enough, then once access is gained threats have less access to data than may be ideal for what they are trying to accomplish.
Segmenting customers based upon similar characteristics (be it geography, spending habits, or even just a random selection) helps to achieve Z-Axis scalability. These pods of customers provide protection from a full data breach as well. Ideally no customer data would ever be exposed, but if you have 4 pods, 25% of customer data is better than 100%. And just like the Y-Axis, these splits aid with isolating attackers into only a subset of your environment. Various governing boards also have different procedures that need to be followed depending upon the nationality of the customer data exposed. If segmented based upon that (eg. EU vs USA) then how you respond to breaches can be managed differently.
Now I Know My X, Y, Z’s
Sometimes security can take a back seat to product development and other functions within a company. It tends to be an afterthought until that fateful day when something truly bad happens and someone gains unauthorized access to your network. Implementing a scalable environment via the AKF Scale Cube achieves a better overall product as well as a more secure one.
If you need assistance in reaching a more scalable and secure environment AKF is capable of helping.
Subscribe to the AKF Newsletter
Microservice Anti-Pattern: The Service Mesh
May 8, 2019 | Posted By: Marty Abbott
This article is the sixth in a multi-part series on microservices (micro-services) anti-patterns. The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture), many of the mistakes or failure points teams create in services splits. Articles two and three cover anti-patterns for service and data fan out respectively. The fourth article covers an anti-pattern for disparate services sharing a common service deployment using the fuse metaphor. The fifth article expands the fuse metaphor from service fuses to data fuses.
Howard Anton, the author of my college Calculus textbook, was fond of the following phrase: “It should be intuitively obvious to the casual observer….”. The clause immediately following that phrase was almost inevitably something that was not obvious to anyone – probably not even the author. Nevertheless, the phrase stuck with me, and I think I finally found a place where it can live up to its promise. The Service Mesh, the topic of this microservice anti-pattern, is the amalgamation of all the anti-patterns to date. It contains elements of calls in series, fuses and fan out. As such, it follows the rules and availability problems of each of those patterns and should be avoided at all costs.
This is where I need to be very clear, as I’m aware that the Service Mesh has a very large following. This article refers to a mesh as a grouping of services with request/reply relationships. Or, put another way, a “Mesh” is any solution that violates repeatedly the anti-patterns of “tree lights”, “fuses” or “fan out”. If you use “mesh” to mean a grouping of services that never call each other, you are not violating this anti-pattern.
The reason mesh patterns are a bad idea are many-fold:
1) Availability: At the extreme, the mesh is subject to the equation: [N∗(N−1)]/2. This equation represents the number of edges in a fully connected graph with N vertices or nodes. Asymptotically, this reduces to N2. To make availability calculations simple, the availability of a complete mesh can be calculated as the service with the lowest availability (A)^(N*N). If the lowest availability of a service with appropriate X-axis cloning (multiple instances) is 99.9, and the service mesh has 10 different services, the availability of your service mesh will approximate 99.910. That’s roughly a 99% availability – perhaps good enough for some solutions but horrible by most modern standards.
2) Troubleshooting: When every node can communicate with every other node, or when the “connectiveness” of a solution isn’t completely understood, how does one go about finding the ailing service causing a disruption? Because failures and slowness transit synchronous links, a failure or slowness in one or more services will manifest itself as failures and slowness in all services. Troubleshooting becomes very difficult. Good luck in isolating the bad actor.
3) Hygiene: I recall sitting through computer science classes 30 years ago and hearing the term “spaghetti code”. These days we’d probably just call it “crap”, but it refers to the meandering paths of poorly constructed code. Generally, it leads to difficulty in understanding, higher rates of defects, etc. Somewhere along the line, some idiot has brought this same approach to deployments. Again, borrowing from our friend Anton, it should be intuitively obvious to the casual observer that if it’s a bad practice in code it’s also a bad practice in deployment architectures.
4) Cost to Fix: If points 1 through 3 above aren’t enough to keep you away from connected service meshes, point 4 will hopefully help tip the scales. If you implement a connected mesh in an environment in which you require high availability, you will spend a significant amount of time and money refactoring it to relieve the symptoms it will cause. This amount may approximate your initial development effort as you remove each dependent anti-pattern (series, fuse, fan-out) with an appropriate pattern.
Fixing a mesh is not an easy task. One solution is to ensure that no service blocks waiting for a request to complete of any other service. Unfortunately, this pattern is not always easy or appropriate to implement.
Another solution is to deploy each service as service when it is responding to an end user request, and as a library for another service wherever needed.
Finally, you can traverse each service node and determine where services can be collapsed or any of the other patterns identified within the tree light, fuse, or fanout anti-patterns.
AKF Partners helps companies create scalable, fault tolerant, highly available and cost effective architectures to meet their product needs. Give us a call, we can help
Subscribe to the AKF Newsletter
Microservice Anti-Pattern: Data Fuse
May 8, 2019 | Posted By: Marty Abbott
This article is the fifth in a multi-part series on microservices (micro-services) anti-patterns. The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture), many of the mistakes or failure points teams create in services splits. Articles two and three cover anti-patterns for service and data fan out respectively. The fourth article covers an anti-pattern for disparate services sharing a common service deployment using the fuse metaphor.
The Data Fuse, the topic of this microservice anti-pattern, exists when two or more unique services share a commonly deployed data store. This data store can be any persistence solution from physical file services, to a common storage area network, to relational (ACID) or NoSQL (BASE) databases. When the shared data solution “C” fails, service A and B fail as well. Similarly, when data solution “C” becomes slow, slowness under high demand propagates to services A and B.
As is the case with any group of services connected in series, Service A’s theoretical availability is the product of its individual availability combined with the availability of data service C. Service B’s theoretical availability is calculated similarly. Problems with service A can propagate to service B through the “fused” data element. For instance, if service A experiences a runaway scenario that completely consumes the capacity of data store C, service B will suffer either severe slowness or will become unavailable.
The easiest pattern solution for the data fuse is simply to merge the separate services. This makes the most sense if the services can be owned by the same team. While availability doesn’t significantly increase (service A can still affect service B, and the data store C still affects both), we don’t have the confusion of two services interacting through a fuse. But if the rate of change for each service indicates that it needs separate teams, we need to evaluate other options (see ”when to split services” for a discussion on drivers of services splits.
Another way to fix the anti-pattern is to use the X axis of the Scale Cube as it relates to databases. An easy example of this is the sharing of account data between a sign-up service and a sign-in (AUTHN and AUTHZ) service. In this example, given that sign-up is a write-based service and sign-in is a read based service we can use the X axis of the Scale Cube and split the services on a read and write basis. To the extent that B must also log activity, it can have separate tables or a separate schema that allows that logging. Note that the services supporting this split need not be unique - they can in fact be the exact same service - but the traffic they serve is properly segmented such that the read deployment receives only read traffic and the write deployment receives only write traffic.
If reads and writes aren’t an easily created X axis split, or if we need the organizational scale engendered by a Y-axis split, we need to be a bit more creative. An example pattern comes from the differences between add-to-cart and checkout in a commerce solution. Some functionality is shared between the components, including the notion of showing calculated sales tax and estimated shipping. Other functionality may be unique, such as heavy computation in add-to-cart for related and recommended items, and up-sale opportunities such as gift wrapping or expedited shipping in checkout. We also want to keep carts (session data) around in order to reach out to customers who have abandoned carts, but we don’t want this ephemeral clutter clogging the data of checkout. This argues for separation of data for temporal (response time) reasons. It also allows us to limit PCI compliance boundaries, removing services (add to cart) from the PCI evaluation landscape.
Transition from add-to-cart to checkout may be accomplished through the client browser, or done as an asynchronous back end transfer of state with the browser polling for completion so as to allow for good fault isolation. We refactor the datastore to separate data to services along the Y axis of the scale cube.
AKF Partners helps companies create scalable, fault tolerant, highly available and cost-effective architectures to meet their product needs. Give us a call, we can help.
Subscribe to the AKF Newsletter
Microservice Anti-Pattern: Service Fuse
April 27, 2019 | Posted By: Marty Abbott
This article is the fourth in a multi-part series on microservices (micro-services) anti-patterns. The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture). Many of the mistakes or failure points teams create in services splits. Articles two and three cover anti-patterns for service and data fan out respectively.
The Service Fuse, the topic of this microservice anti-pattern, exists when two or more unique services share a commonly deployed service pool. When the shared service “C” fails, service A and B fail as well. Similarly, when service “C” becomes slow, slowness under high demand propagates to services A and B.
As is the case with any group of services connected in series, Service A’s theoretical availability is the product of its individual availability combined with the availability of service C. Service B’s theoretical availability is calculated similarly. Under unusual conditions, the availability of A could also impact B similar to the way in which service fan out works. Such would be the case if A somehow holds threads for C, thereby starving it of threads to serve B.
Because overall availability is negatively impacted, we consider the Service Fuse to be a microservice anti-pattern.
The easiest and most common method to fault isolate the failure and response time propagation of Service C is to deploy it separately (in separate pools) for both Service A and B. In doing so, we ensure that C does not fail for one service as a result of unusual demand from the other. We also isolate failures due to unique requests that might be made by either A or B. In doing so, we do incur some additional operational costs and additional coordination and overhead in releases. But assuming proper automation, the availability and response time improvements are often worth the minor effort.
As with many of our other anti-patterns we can also employ dynamically loadable libraries rather than separate service deployments. While this approach has some of the slight overhead (again assuming proper automation) of the above separate service deployments, it often also benefits from significant server-side response time decreases associated with network transit.
We often see teams over emphasizing the cost of additional deployments. But the separate service deployment or dynamically loadable library deployment seldom results in significantly greater effort. Splitting the capacity of a shared pool relative to the demand split between services A and B (e.g. 50/50, 90/10, etc) and adding a small number of additional services for capacity is the real implication of such a split. Is 5 to 10% additional operational cost and seconds of additional deployment time worth the significant increase in availability? Our experience is that most of the time it is.
Subscribe to the AKF Newsletter
Microservice Anti-Pattern: Data Fan Out
April 21, 2019 | Posted By: Marty Abbott
This article is the third in a multi-part series on microservices (micro-services) anti-patterns. The introduction of the first article, Service Calls In Series, covers the benefits of splitting services, many of the mistakes or failure points teams create in services splits and the first anti pattern. The second article, Service Fan Out discusses the anti-pattern of a single service acting as a proxy or aggregator of mulitple services.
Data Fan Out, the topic of this microservice anti-pattern, exists when a service relies on two or more persistence engines with categorically unique data, or categorically similar data that is not meant to be processed in parallel. “Categorically Unique” means that the data is in no way related. Examples of categorical uniqueness would be a database that stores customer data and a separate database that stores catalog data. Instances of the same data, such as two separate databases each storing half of product catalog, are not categorically unique. Splitting of similar data is often known as sharding. Such “sharded” instances only violate the Data Fan Out pattern if:
1) They are accessed in series (database 1 is accessed and subsequently database 2 is accessed) –or-
2) A failure or slowness in either database, even if accessed in parallel, will result in a very slow or unavailable service.
Persistence engine means anything that stores data as in the case of a relational database, a NoSQL database, a persistent off-system cache, etc.
Anytime a service relies on more than one persistence engine to perform a task, it is subject to lower availability and a response time equivalent to the slower of the N data stores to which it is connected. Like the Service Fan Out anti-pattern, the availability of the resulting service (“Service A”) is the product of the availability of the service and its constituent infrastructure multiplied by the availability of each N data store to which it is connected.
Further, the response of the services may be tied to the slowest of the runtime of Service A added to the slowest of the connected solutions. If any of the N databases become slow enough, Service A may not respond at all.
Because overall availability is negatively impacted, we consider Data Fan Out to be a microservice anti-pattern.
One clear exception to the Data Fan Out anti-pattern is the highly parallelized querying done of multiple shards for the purpose of getting near linear response times out of large data sets (similar to one component of the MapReduce algorithm). In a highly parallelized case such as this, we propose that each of the connections have a time-out set to disregard results from slowly responding data sets. For this to work, the result set must be impervious to missing data. As an example of an impervious result set, having most shards return for any internet search query is “good enough”. A search for “plumber near me” returns 19/20ths of the “complete data”, where one shard out of 20 is either unavailable or very slow. But having some transactions not present in an account query of transactions for a checking account may be a problem and therefore is not an example of a resilient data set.
Our preferred approach to resolve the Data Fan Out anti-pattern is to dedicate services to each unique data set. This is possible whenever the two data sets do not need to be merged and when the service is performing two separate and otherwise isolatable functions (e.g. “Customer_Lookup” and “Catalog_Lookup”).
When data sets are split for scale reasons, as is the case with data sets that have both an incredibly high volume of requests and a large amount of data, one can attempt to merge the queried data sets in the client. The browser or mobile client can request each dataset in parallel and merge if successful. This works when computational complexity of the merge is relatively low.
When client-side merging is not possible, we turn to the X Axis of the Scale Cube for resolution. Merge the data sets within the data store/persistence engine and rely on a split of reads and writes. All writes occur to a single merged data store, and read replicas are employed for all reads. The write and read services should be split accordingly and our infrastructure needs to correctly route writes to the write service and reads to the read service. This is a valuable approach when we have high read to right ratios – fortunately the case in many solutions. Note that we prefer to use asynchronous replication and allow the “slave” solutions to be “eventually consistent” - but ideally still within a tolerable time frame of milliseconds or a handful of seconds.
What about the case where a solution may have a high write to read ratio (exceptionally high writes), and data needs to be aggregated? This rather unique case may be best solved by the Z axis of the AKF Scale Cube, splitting transactions along customer boundaries but ensuring the unification of the database for each customer (or region, or whatever “shard key” makes sense). As with all Z axis shards, this not only allows faster response times (smaller data segments) but engenders high scalability and availability while also allowing us to put data “closer to the customer” using the service.
AKF Partners helps companies create highly available, highly scalable, easily maintained and easily developed microservice architectures. Give us a call - we can help!
Subscribe to the AKF Newsletter
Microservice Anti-Pattern: Service Fan Out
April 8, 2019 | Posted By: Marty Abbott
This article is the second in a multi-part series on microservices (micro-services) anti-patterns. The introduction of the first article, Service Calls In Series, covers the benefits of splitting services, many of the mistakes or failure points teams create in services splits and the first anti pattern.
Fan Out, the topic of this microservice anti-pattern, exists when one service either serves as a proxy to two or more downstream services, or serves as an integration of two subsequent service calls. Any of the services (the proxy/integration service “A”, or constituent services “B” and “C”) can cause a failure of all services. When service A fails, service B and C clearly can’t be called. If either service B or C fails or becomes slow, they can affect service A by tying up communication ports. Ultimately, under high call volume, service A may become unavailable due to problems with either B or C.
Further, the response of the services may be tied to the slowest responding service. If A needs both B and C to respond to a request (as in the case of integration), then the speed at which A responds is tied to the slowest response times of B and C. If service A merely proxies B or C, then extreme slowness in either may cause slowness in A and therefore slowness in all calls.
Because overall availability is negatively impacted, we consider Service Fan Out to be a microservice anti-pattern.
One approach to resolve the above anti-pattern is to employ true asynchronous messaging between services. For this to be successful, the requesting service A must be capable of responding to a request without receiving any constituent service responses. Unfortunately, this solution only works in some cases such as the case where service B is returning data that adds value to service A. One such example is a recommendation engine that returns other items a user might like to purchase. The absence of service B responding to A’s request for recommendations is unfortunate but doesn’t eliminate the value of A’s response completely.
As was the case with the Calls In Series Anti-Pattern, we may also be able to solve this anti-pattern with ”Libraries for Depth” pattern.
Of course, each of the libraries also represents a constituent part that may fail for any call – but the number of moving parts for each constituent part decreases significantly relative to a separately deployed service call. For instance, no network interface is required, no additional host and virtual VM is employed during the call, etc. Additionally, call latency goes down without network interfaces.
The most common complaint about this pattern is that development teams cannot release independently. But, as we all know, this problem has been fixed for quite some time with Unix, Linux and Windows dynamically loadable libraries (dlls, dls) and the like.
AKF Partners has helped to architect some of the most scalable, highly available, fault-tolerant and fastest response time solutions on the internet. Give us a call - we can help.
Subscribe to the AKF Newsletter
Microservice Anti-Pattern: Calls in Series (The Xmas Tree Light Anti-Pattern)
March 25, 2019 | Posted By: Marty Abbott
This article is the first in a multi-part series on microservices (micro-services) anti-patterns.
There are several benefits to carving up very large applications into service-oriented architectures. These benefits can include many of the following:
- Higher availability through fault isolation
- Higher organizational scalability through lower coordination
- Lower cost of development through lower overhead (coordination)
- Faster time to market achieved again through lower overhead of coordination
- Higher scalability through the ability to independently scale services
- Lower cost of operations (cost of goods sold) through independent scalability
- Lower latency/response time through better cacheability
The above should be considered only a partial list. See our articles on the AKF Scale Cube, and when you should split services for more information.
In order to achieve any of the above benefits, you must be very careful to avoid common mistakes.
Most of the failures that we see in microservices stem from a lack of understanding of the multiplicative effect of failure or “MEF”. Put simply, MEF indicates that the availability of any solution in series is a product of the availability of all components in that series.
Service A has an availability calculated by the product of its constituent parts. Those parts include all of the software and infrastructure necessary to run service A. The server availability, the application availability, associated library and runtime environment availabilities, operating system availability, virtualization software availability, etc. Let’s say those availabilities somehow achieve a “service” availability of “Five 9s” or 99.999 as measured by duration of outages. To achieve 99.999 we are assuming that we have made the service “highly available” through multiple copies, each being “stateless” in its operation.
Service B has a similar availability calculated in a similar fashion. Again, let’s assume 99.999.
If, for a request from any customer to Service A, Service B must also be called, the two availabilities are multiplied together. The new calculated availability is by definition lower than any service in isolation. We move our availability from 99.999 to 99.998.
When calls in series between services become long, availability starts to decline swiftly and by definition is always much smaller than the lowest availability of any service or the constituent part of any service (e.g. hardware, OS, app, etc).
This creates our first anti-pattern. Just as bulbs in the old serially wired Christmas Tree lights would cause an entire string to fail, so does any service failure cause the entire call stream to fail. Hence multiple names for this first anti-pattern: Christmas Tree Light Anti-Pattern, Microservice Calls in Series Anti-Pattern, etc.
The multiplicative effect of failure sometimes is worse with slowly responding solutions than with failures themselves. We can easily respond from failures through “heartbeat” transactions. But slow responses are more difficult. While we can use circuit breaker constructs such as hystrix switches – these assume that we know the threshold under which our call string will break. Unfortunately, under intense flash load situations (unforeseen high demand), small spikes in demand can cause failure scenarios.
One pattern to resolve the above issue is to employ true asynchronous messaging between services. To make this effective, the requesting service must not care whether it receives a response. This service must be capable of responding to a request without receiving any downstream response. Unfortunately, this solution only works in some cases such as the case where service B is returning data that adds value to service A. One such example is a recommendation engine that returns other items a user might like to purchase. The absence of service B responding to A’s request for recommendations is unfortunate, but doesn’t eliminate the value of A’s response completely.
While the above pattern can resolve some use-cases, it doesn’t resolve most of them. Most often downstream services are doing more than “modifying” value for the calling service: they are providing specific necessary functions. These functions may be mail services, print services, data access services, or even component parts of a value stream such as “add to cart” and “compute tax” during checkout.
In these cases, we believe in employing the Libraries for Depth pattern.
Of course, each of the libraries also represents a constituent part that may fail for any call – but the number of moving parts for each constituent part decreases significantly relative to another service call. For instance, no network interface is required, no additional host and virtual VM is employed during the call, etc. Additionally, call latency goes down without network interfaces.
The most common complaint about this pattern is that development teams cannot release independently. But, as we all know, this problem has been fixed for quite some time with Unix, Linux and Windows dynamically loadable libraries (dlls, dls) and the like.
Subscribe to the AKF Newsletter
The AKF Partners Session State Cube
March 19, 2019 | Posted By: Marty Abbott
Tim Berners-Lee and his colleagues at CERN, the IETF and the W3C consortium all understood the value of being stateless when they developed the Hyper Text Transfer Protocol. Stateless systems are more resilient to multiple failure types, as no transaction needs to have information regarding the previous transaction. It’s as if each transaction is the first (and last) of its type.
First let’s quickly review three different types of state. This overview is meant to be broad and shallow. Certain state types (such as the notion of View state in .Net development) are not covered.
The Penalty (or Cost) of State
State costs us in multiple ways. State unique to a user interaction, or session state, requires memory. The larger the state, the more memory requirement, the higher cost of the server and the greater the number of servers we need. As the cost of goods sold increase, margins decrease. Further, that state either needs to be replicated for high availability, and additional cost, or we face a cost of user dissatisfaction with discrete component and ultimately session failures.
When application state is maintained, the cost of failure is high as we either need to pay the price of replication for that state or we lose it, negatively impacting customer experience. As memory associated with application state increases, so does the memory requirement and associated costs of the server upon which it runs. At high scale, that means more servers, greater costs, and lower gross margins. In many cases, we simply have no choice but to allow application state. Interpreters and java virtual machines need memory. Most applications also require information regarding their overall transactions distinct from those of users. As such, our goal here is not to eliminate application state but rather minimize it where possible.
When connection state is maintained, cost increases as more servers are required to service the same number of requests. Failures become more common as the failure probability increases with the duration of any connection over distance.
Our ideal outcome is to eliminate session state, minimize application state and eliminate connection state.
But What if I Really, Really, Really Need State?
Our experience is that simply saying “No” once or twice will force an engineer to find an innovative way to eliminate state. Another interesting approach is to challenge an engineer with a statement like “Huh, I heard the engineers at XYZ company figured out how to do this…”. Engineers hate to feel like another engineer is better than them…
We also recognize however that the complete elimination of state isn’t possible. Here are three examples (not meant to be all inclusive) of when we believe the principle of stateless systems should be violated:
Shopping carts need state to work. Information regarding a past transaction - (add_to_cart) for instance needs to be held somewhere prior to check_out. Given that we need state, now it’s just a question of where to store it. Cookies are good places. Distributed object caches are another location. Passing it through the URL in HTTP GET methods is a third. A final solution is to store it in a database.
No sane person wants to wrap debits and credits across distributed servers in a single, two-phase commit transaction. Banks have had a solution for this for years – the eventual consistent account transaction. Using a tiny workflow or state machine, debit in one transaction and eventually (ideally quickly) subsequently credit in a second transaction. That brings us to the notion of workflow and state machines in general.
What good is a state machine if it can’t maintain state? Whether application state or session state, the notion of state is critical to the success of each solution. Workflow systems are a very specific implementation of a state machine and as such require state. The trick with these is simply to ensure that the memory used for state is “just enough”. Govern against ever increasing session or application state size.
This brings us to the newest cube model in the AKF model repository:
The Session State Cube
The AKF State Cube is useful both for thinking through how to achieve the best possible state posture, and for evaluating how well we are doing against an aspiration goal (top right corner) of “Stateless”.
The X axis describes size of state. It moves from very large (XL) state size to the ideal position of zero size, or “No State”. Very large state size suffers from higher cost, higher impact upon failure, and higher probability of failure.
The Y axis describes the degree of distribution of state. The worst position, lower left, is where state is a singleton. While we prefer not to have state, having only one copy of it leaves us open to large – and difficult to recover from – failures and dissatisfied customers. Imagine nearly completing your taxes only to have a crash wipe out all of your work! Ughh!
Progressing vertically along the Y axis, the singleton state object in the lower left is replicated into N copies of that state for high availability. While resolving the recovery and failure issues, performing replication is costly both in extra memory and network transit. This is an option we hope to avoid for cost reasons.
Following replication are several methods of distribution in increasing order of value. Segmenting the data by some value “N” has increasing value as N increases. When N is 2, a failure of state impacts 50% of our customers. When N is 100, only 1% of our customers suffer from a state failure. Ideally, state is also “rebuildable” if we have properly scattered state segments by a shard key – allowing customers to only have to re-complete a portion of their past work.
Finally, of course, we hope to have “no state” (think of this as division by infinite segmentation approaching zero on this axis).
The Z Axis describes where we position state “physically”.
The worst location is “on the same server as the application”. While necessary for application state, placing session data on a server co-resident with the application using it doubles the impact of a failure upon application fault. There are better places to locate state, and better solutions than your application to maintain it.
A costly, but better solution from an impact perspective is to place state within your favorite database. To keep costs low, this could be an opensource SQL or NoSQL database. But remember to replicate it for high availability.
A less costly solution is to place state in an object cache, off server from the application. Ideally this cache is distributed per the Y axis.
The least costly solution is to have the client (browser or mobile app) maintain state. Use a cookie, pass the state through a GET method, etc.
Finally, of course the best solution is that it is kept “nowhere” because we have no state.
The AKF State Cube serves two purposes:
- Prescriptive: It helps to guide your team to the aspirational goal of “stateless”. Where stateless isn’t possible, choose the X, Y and Z axis closest to the notion of no state to achieve a low cost, highly available solution for your minimized state needs.
- Descriptive: The model helps you evaluate numerically, how you are performing with respect to stateless initiatives on a per application/service basis. Use the guide on the right side of the model to evaluate component state on a scale of 1 to 10.
AKF Partners helps companies develop world class, low cost of operations, fast time to market, stateless solutions every day. Give us a call! We can help!
Subscribe to the AKF Newsletter
Is the Co-location (colo) Industry Dying?
March 15, 2019 | Posted By: Marty Abbott
I’m no Nostradamus when it comes to predicting the future of technology, but some trends are just too blatantly obvious to ignore. Unfortunately, they are only easy to spot if you have a job where you are allowed (I might argue required) to observe broader industry trends. AKF Partners must do that on behalf of our clients as our clients are just too busy fighting the day-to-day battles of their individual businesses.
One such very concerning probability is the eventual decline – and one day potentially the elimination of – the colocation (hosting) business. Make no mistake about it – if you lease space from a colocation provider, the probability is high that your business will need to move locations, move providers, or experience a service disruption soon.
Let’s walk through the factors and trends that indicate, at least to me, that the industry is in trouble, and that your business faces considerable risks:
Sources of Demand for Colocation (Macro)
Broadly speaking, the colocation industry was built on the backs of young companies needing to lease space for compute, storage, and the like. As time progressed, more established companies started to augment privately-owned data centers with colocation facilities to avoid the burden of large assets (buildings, capital improvements and in some cases even servers) on their balance sheets.
The first source of demand, small companies, has largely dried up for colocation facilities. Small companies seek to be “asset light” and most frequently start their businesses running on Infrastructure as a Service (IaaS) providers (AWS, GCP, Azure etc.). The ease and flexibility of these providers enable faster time to market and easier operational configuration of systems. Platform as a Service (PaaS) offerings in many cases eliminate the need for specialized infrastructure and DevOps skill sets, allowing small companies to focus limited funds on software engineers that will help create differentiating experiences and capabilities. Five years ago, successful startups may have started migrating into colocation facilities to lower costs of goods sold (COGS) for their products, and in so doing increase gross margin (GM). While this is still an opportunity for many successful companies, few seem to take advantage of it. Whether due to vendor lock-in through PaaS services, or a preference for speed and flexibility over expenses, the companies tend to stay with their IaaS provider.
Larger, more established companies continue to use colocation facilities to augment privately-owned data centers. That said, in most cases technology refresh results in faster and more efficient compute. When the rate of compute increases faster than the rate of growth in transactions and revenue within these companies, they start to collapse the infrastructure assets back into wholly-owned facilities (assuming power, space, and cooling of the facilities are not constraints). Bringing assets back in-house to owned facilities lowers costs of goods sold as the company makes more efficient use of existing assets.
Simultaneously these larger firms also seek the flexibility and elasticity of IaaS services. Where they have new demand for new solutions, or as companies embark upon a digital transformation strategy, they often do so leveraging IaaS.
The result of these forces across the spectrum of small to large firms reduces overall demand. Reduced demand means a contraction in the colocation industry overall.
Minimum Efficient Scale and the Colocation Industry (Micro)
Data centers are essentially factories. To achieve optimum profitability, fixed costs such as the facility itself, and the associated taxes, must be spread across the largest possible units of production. In the case of data centers, this means achieving maximum utilization of the constraining factors (space, power, and cooling capacity) across the largest possible revenue base. Maximizing utilization against the aforementioned constraints drops the LRAC (long run average cost) as fixed costs are spread across a larger number of paying customers. This is the notion of Minimum Efficient Scale in economics.
As demand decreases, on a per data center (colocation facility) basis, fixed costs per customer increases. This is because less space is used, and the cost of the facility is allocated across fewer customers. At some point, on a per data center basis the facility becomes unprofitable. As profits dwindle across the enterprise, and as debt service on the facilities becomes more difficult, the colocation provider is forced to shut down data centers and consolidate customers. Assets are sold or leases terminated with the appropriate termination penalties.
Customers who wish to remain with a provider are forced to relocate. This in turn causes customers to reconsider colocation facilities, and somewhere between a handful to a majority on a per location basis will decide to move to IaaS instead. Thus begins a vicious cycle of data center shutdowns engendering ever-decreasing demand for colocation facilities.
Excluding other macroeconomic or secular events like another real estate collapse, smaller providers start to exit the colocation service industry. Larger providers benefit from the exit of smaller players and the remaining data centers benefit from increased demand on a dwindling supply, allowing those providers to regain MES and profitability.
Does the Trend Stop at a Smaller Industry?
We are likely to continue to see the colocation industry exist for quite some time – but it will get increasingly smaller. The consolidation of providers and dwindling supply of facilities will stop at some point, but just for a period. Those that remain in colocation facilities will either not have the means or the will to move. In some cases, a lack of skills within the remaining companies will keep them “locked into” a colocation. In other cases, competing priorities will keep an exit on the distant horizon. These “lock in” factors will give rise to an opportunity for the colocation industry to increase pricing for a time.
But make no mistake about it, customers will continue to leave – just at a decreased rate relative to today’s departures. Some companies will simply go out of business or contract in size and depart the data centers. Others will finally decide that the increasing cost of service is too high.
While it’s doubtful that the industry will go away in its entirety, it will be small and comparatively expensive. The difference between costs of colocation and costs to run in an IaaS solution will start to dwindle.
Risks to Your Firm
The risk to your firm comes in three forms, listed in increasing order of risk as measured by a function of probability of occurrence and impact upon occurrence:
- Pricing of service per facility. If you are lucky enough that your facility does not close, there is a high probability that your cost for service will increase. This in turn increases your cost of goods sold and decreases your gross margin.
- Risk of facility dissolution. There exists an increasingly high probability that the facilities in which you are located will be shut down. While you are likely to be given some advance notice, you will be required to move to another facility with the same provider, or another provider. There is both a real cost in the move, and an opportunity cost associated with service interruption and effort.
- Risk of firm as a going concern. Some providers of colocation services will simply exit the business. In some cases, you may be given very little notice as in the case of a company filing bankruptcy. Service interruption risk is high.
Strategies You Must Employ Today
In our view, you have no choice but to ensure that you are ready and able to easily move out of colocation facilities. Whether this be to existing data centers you own, IaaS providers, or a mix matters not. At the very least, we suggest your development and operations processes enable the following principles:
- Environment Agnosticism: Ensure that you can run in owned, lease, managed service, or IaaS locations. Ensuring consistency in deployment platforms, using container technologies and employing orchestration systems all aid in this endeavor.
- Hybrid Hosting: Operate out of at least two of the following three options as a course of normal business operations: owned data centers, leased/colocation facilities, IaaS.
- Dynamic Allocation of Demand: Prove on at least a weekly basis that you can operate any functionality within your product out of any location you operate – especially those that happen to be located within colocation facilities.
AKF Partners helps companies think through technology, process, organization, location, and hosting strategies. Let us help you architect a hybrid hosting solution that limits your risk to any single provider.
Subscribe to the AKF Newsletter
Don't Let the Tail Wag the Dog
February 22, 2019 | Posted By: Greg Fennewald
On multiple occasions over the years, we have heard our clients state a use case they want to avoid in product design sessions or as a reason for architectural choices made for existing products. These use cases can be given more credence than they deserve based on objective data – they become boogeyman legends, edge cases that can result in poor architectural choices.
One of our clients was debating the benefit of multiple live sites with customers pinned to the nearest site to minimize latency. The availability benefits of multiple live sites are irrefutable, but the customer experience benefit of less latency was questioned. This client had millions of clients spread across the country. The notion of pinning a client to a “home” site nearest them raised the question of “what happens when the client travels across the country?”. The answer is to direct them to that same home site. That client will experience more latency for the duration of the visit. The proportion of clients that spend 50% of their time on either coast is vanishingly small – keep it simple. Have a work around for clients that permanently move to a location served by a different site – client data resides in more than one location for DR purposes anyway, right?
This client also had hundreds of service specialists that would at times access client accounts and take actions on their behalf, and these service specialists were located near the west coast. Objections were made based on the latency a west coast service specialist would encounter when acting on the behalf of an east coast client whose data was hosted near the east coast. Millions of clients. Hundreds of service specialists. The math is not hard. The needs of the many outweigh the needs of the few.
A different client had a concern about data consistency upon new user registration for their service. To ensure a new customer could immediately transact, the team decided to deploy a single authentication server to preclude the possibility of a transaction following registration hitting an authentication server that had not yet received the registration data. Intentionally deploying a SPOF should have raised immediate objections but did not. The team deployed a passive backup server that required manual intervention to work.
The new user process flow was later revealed to be less than 3% of the overall transactions. 97% of the transactions suffered an impactful outage along with the 3% new users when the SPOF authentication server failed. Designing a workaround for the new users while employing a write master with multiple, load balanced read only slaves would provide far better availability. The needs of the many outweigh the needs of the few.
It is important to remain open minded during early design sessions. It is also important to follow architectural principles in the face of such use cases. How can one balance potentially conflicting concepts?
• Ask questions best answered with objective data.
• Strive for simplicity, shave with Occam’s Razor
• Validate whether the edge case is a deal breaker for the product owner
• Propose a work around that addresses the edge case while optimizing the architecture for the majority use case and sound principles.
Catering to the needs of the business while adhering to architectural standards is a delicate balancing act and compromises will be made. Everyone looks at the technologist when a product encounters a failure. Know when to hold the line on sound architectural principles that safeguard product availability and user experience. The product owner must understand and acknowledge the architectural risks resulting from product design decisions. The technologist must communicate these risks to the product owner along with objective data and options. A failure to communicate effectively can lead to the tail wagging the dog – do not let that happen.
With 12 years of product architecture and strategy experience, AKF Partners is uniquely positioned to be your technology partner. Learn more here.
Subscribe to the AKF Newsletter
The AKF Difference
December 4, 2018 | Posted By: Marty Abbott
During the last 12 years, many prospective clients have asked us some variation of the following questions: “What makes you different?”, “Why should we consider hiring you?”, or “How are you differentiated as a firm?”.
The answer has many components. Sometimes our answers are clear indications that we are NOT the right firm for you. Here are the reasons you should, or should not, hire AKF Partners:
Operators and Executives – Not Consultants
Most technology consulting firms are largely comprised of employees who have only been consultants or have only run consulting companies. We’ve been in your shoes as engineers, managers and executives. We make decisions and provide advice based on practical experience with living with the decisions we’ve made in the past.
Engineers – Not Technicians
Educational institutions haven’t graduated enough engineers to keep up with demand within the United States for at least forty years. To make up for the delta between supply and demand, technical training services have sprung up throughout the US to teach people technical skills in a handful of weeks or months. These technicians understand how to put building blocks together, but they are not especially skilled in how to architect highly available, low latency, low cost to develop and operate solutions.
The largest technology consulting companies are built around programs that hire employees with non-technical college degrees. These companies then teach these employees internally using “boot camps” – creating their own technicians.
Our company is comprised almost entirely of “engineers”; employees with highly technical backgrounds who understand both how and why the “building blocks” work as well as how to put those blocks together.
Product – Not “IT”
Most technology consulting firms are comprised of consultants who have a deep understanding of employee-facing “Information Technology” solutions. These companies are great at helping you implement packaged software solutions or SaaS solutions such as Enterprise Resource Management systems, Customer Relationship Management Systems and the like. Put bluntly, these companies help you with solutions that you see as a cost center in your business. While we’ve helped some partners who refuse to use anyone else with these systems, it’s not our focus and not where we consider ourselves to be differentiated.
Very few firms have experience building complex product (revenue generating) services and platforms online. Products (not IT) represent nearly all of AKF’s work and most of AKF’s collective experience as engineers, managers and executives within companies. If you want back-office IT consulting help focused on employee productivity there are likely better firms with which you can work. If you are building a product, you do not want to hire the firms that specialize in back office IT work.
Business First – Not Technology First
Products only exist to further the needs of customers and through that relationship, further the needs of the business. We take a business-first approach in all our engagements, seeking to answer the questions of: Can we help a way to build it faster, better, or cheaper? Can we find ways to make it respond to customers faster, be more highly available or be more scalable? We are technology agnostic and believe that of the several “right” solutions for a company, a small handful will emerge displaying comparatively low cost, fast time to market, appropriate availability, scalability, appropriate quality, and low cost of operations.
Cure the Disease – Don’t Just Treat the Symptoms
Most consulting firms will gladly help you with your technology needs but stop short of solving the underlying causes creating your needs: the skill, focus, processes, or organizational construction of your product team. The reason for this is obvious, most consulting companies are betting that if the causes aren’t fixed, you will need them back again in the future.
At AKF Partners, we approach things differently. We believe that we have failed if we haven’t helped you solve the reasons why you called us in the first place. To that end, we try to find the source of any problem you may have. Whether that be missing skillsets, the need for additional leadership, organization related work impediments, or processes that stand in the way of your success – we will bring these causes to your attention in a clear and concise manner. Moreover, we will help you understand how to fix them. If necessary, we will stay until they are fixed.
We recognize that in taking the above approach, you may not need us back. Our hope is that you will instead refer us to other clients in the future.
Are We “Right” for You?
That’s a question for you, not for us, to answer. We don’t employ sales people who help “close deals” or “shape demand”. We won’t pressure you into making a decision or hound you with multiple calls. We want to work with clients who “want” us to partner with them – partners with whom we can join forces to create an even better product solution.
Subscribe to the AKF Newsletter
The Importance of QA
November 20, 2018 | Posted By: Robin McGlothin
“Quality in a service or product is not what you put into it. It’s what the customer gets out of it.” Peter Drucker
The Importance of QA
High levels of quality are essential to achieving company business objectives. Quality can be a competitive advantage and in many cases will be table stakes for success. High quality is not just an added value, it is an essential basic requirement. With high market competition, quality has become the market differentiator for almost all products and services.
There are many methods followed by organizations to achieve and maintain the required level of quality. So, let’s review how world-class product organizations make the most out of their QA roles. But first, let’s define QA.
According to Wikipedia, quality assurance is “a way of preventing mistakes or defects in products and avoiding problems when delivering solutions or services to customers. But there’s much more to quality assurance.”
There are numerous benefits of having a QA team in place:
- Helps increase productivity while decreasing costs (QA HC typically costs less)
- Effective for saving costs by detecting and fixing issues and flaws before they reach the client
- Shifts focus from detecting issues to issue prevention
Teams and organizations looking to get serious about (or to further improve) their software testing efforts can learn something from looking at how the industry leaders organize their testing and quality assurance activities. It stands to reason that companies such as Google, Microsoft, and Amazon would not be as successful as they are without paying proper attention to the quality of the products they’re releasing into the world. Taking a look at these software giants reveals that there is no one single recipe for success. Here is how five of the world’s best-known product companies organize their QA and what we can learn from them.
Google: Searching for best practices
How does the company responsible for the world’s most widely used search engine organize its testing efforts? It depends on the product. The team responsible for the Google search engine, for example, maintains a large and rigorous testing framework. Since search is Google’s core business, the team wants to make sure that it keeps delivering the highest possible quality, and that it doesn’t screw it up.
To that end, Google employs a four-stage testing process for changes to the search engine, consisting of:
- Testing by dedicated, internal testers (Google employees)
- Further testing on a crowdtesting platform
- “Dogfooding,” which involves having Google employees use the product in their daily work
- Beta testing, which involves releasing the product to a small group of Google product end users
Even though this seems like a solid testing process, there is room for improvement, if only because communication between the different stages and the people responsible for them is suboptimal (leading to things being tested either twice over or not at all).
But the teams responsible for Google products that are further away from the company’s core business employ a much less strict QA process. In some cases, the only testing done by the developer responsible for a specific product, with no dedicated testers providing a safety net.
In any case, Google takes testing very seriously. In fact, testers’ and developers’ salaries are equal, something you don’t see very often in the industry.
Facebook: Developer-driven testing
Like Google, Facebook uses dogfooding to make sure its software is usable. Furthermore, it is somewhat notorious for shaming developers who mess things up (breaking a build or causing the site to go down by accident, for example) by posting a picture of the culprit wearing a clown nose on an internal Facebook group. No one wants to be seen on the wall-of-shame!
Facebook recognizes that there are significant flaws in its testing process, but rather than going to great lengths to improve, it simply accepts the flaws, since, as they say, “social media is nonessential.” Also, focusing less on testing means that more resources are available to focus on other, more valuable things.
Rather than testing its software through and through, Facebook tends to use “canary” releases and an incremental rollout strategy to test fixes, updates, and new features in production. For example, a new feature might first be made available only to a small percentage of the total number of users.
Canary Incremental Rollout
By tracking the usage of the feature and the feedback received, the company decides either to increase the rollout or to disable the feature, either improving it or discarding it altogether.
Amazon: Deployment comes first
Like Facebook, Amazon does not have a large QA infrastructure in place. It has even been suggested (at least in the past) that Amazon does not value the QA profession. Its ratio of about one test engineer to every seven developers also suggests that testing is not considered an essential activity at Amazon.
The company itself, though, takes a different view of this. To Amazon, the ratio of testers to developers is an output variable, not an input variable. In other words, as soon as it notices that revenue is decreasing or customers are moving away due to anomalies on the website, Amazon increases its testing efforts.
The feeling at Amazon is that its development and deployment processes are so mature (the company famously deploys software every 11.6 seconds!) that there is no need for elaborate and extensive testing efforts. It is all about making software easy to deploy, and, equally if not more important, easy to roll back in case of a failure.
Spotify: Squads, tribes and chapters
Spotify does employ dedicated testers. They are part of cross-functional teams, each with a specific mission. At Spotify, employees are organized according to what’s become known as the Spotify model, constructed of:
- Squads. A squad is basically the Spotify take on a Scrum team, with less focus on practices and more on principles. A Spotify dictum says, “Rules are a good start, but break them when needed.” Some squads might have one or more testers, and others might have no testers at all, depending on the mission.
- Tribes are groups of squads that belong together based on their business domain. Any tester that’s part of a squad automatically belongs to the overarching tribe of that squad.
- Chapters. Across different squads and tribes, Spotify also uses chapters to group people that have the same skillset, in order to promote learning and sharing experiences. For example, all testers from different squads are grouped together in a testing chapter.
- Guilds. Finally, there is the concept of a guild. A guild is a community of members with shared interests. These are a group of people across the organization who want to share knowledge, tools, code and practices.
Spotify Team Structure
Testing at Spotify is taken very seriously. Just like programming, testing is considered a creative process, and something that cannot be (fully) automated. Contrary to most other companies mentioned, Spotify heavily relies on dedicated testers that explore and evaluate the product, instead of trying to automate as much as possible. One final fact: In order to minimize the efforts and costs associated with spinning up and maintaining test environments, Spotify does a lot of testing in its production environment.
Microsoft: Engineers and testers are one
Microsoft’s ratio of testers to developers is currently around 2:3, and like Google, Microsoft pays testers and developers equally—except they aren’t called testers; they’re software development engineers in test (or SDETs).
The high ratio of testers to developers at Microsoft is explained by the fact that a very large chunk of the company’s revenue comes from shippable products that are installed on client computers & desktops, rather than websites and online services. Since it’s much harder (or at least much more annoying) to update these products in case of bugs or new features, Microsoft invests a lot of time, effort, and money in making sure that the quality of its products is of a high standard before shipping.
What you can learn from world-class product organizations? If the culture, views, and processes around testing and QA can vary so greatly at five of the biggest tech companies, then it may be true that there is no one right way of organizing testing efforts. All five have crafted their testing processes, choosing what fits best for them, and all five are highly successful. They must be doing something right, right?
Still, there are a few takeaways that can be derived from the stories above to apply to your testing strategy:
- There’s a “testing responsibility spectrum,” ranging from “We have dedicated testers that are primarily responsible for executing tests” to “Everybody is responsible for performing testing activities.” You should choose the one that best fits the skillset of your team.
- There is also a “testing importance spectrum,” ranging from “Nothing goes to production untested” to “We put everything in production, and then we test there, if at all.” Where your product and organization belong on this spectrum depends on the risks that will come with failure and how easy it is for you to roll back and fix problems when they emerge.
- Test automation has a significant presence in all five companies. The extent to which it is implemented differs, but all five employ tools to optimize their testing efforts. You probably should too.
Bottom line, QA is relevant and critical to the success of your product strategy. If you’d tried to implement a new QA process but failed, we can help.
Subscribe to the AKF Newsletter
September 10, 2018 | Posted By: Robin McGlothin
The Scalability Cube – Your Guide to Evaluating Scalability
Perhaps the most common question we get at AKF Partners when performing technical due diligence on a company is, “Will this thing scale?” After all, investors want to see a return on their investment in a company, and a common way to achieve that is to grow the number of users on an application or platform. How do they ensure that the technology can support that growth? By evaluating scalability.
Let’s start by defining scalability from the technical perspective. The Wikipedia definition of “scalability” is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. That definition is accurate when applied to common investment objectives. The question is, what are the key attributes of software that allow it to scale, along with the anti-patterns that prevent scaling? Or, in other words, what do we look for at AKF Partners when determining scalability?
While an exhaustive list is beyond the scope of this blog post, we can quickly use the Scalability Cube and apply the analytical methodology that helps us quickly determine where the application will experience issues.
AKF Partners introduced the scalability cube, a scale design model for building resilience application architectures using patterns and practices that apply broadly to any application. This is a best practices model that describes all scale dimensions from “The Art of Scalability” book (AKF Partners – Abbot, Keeven & Fisher Partners).
The “Scale Cube” is composed of an X-Axis, Y-Axis, and Z-Axis:
1. Technical Architectural Layering (X-Axis ) – No single points of failure. Duplicate everything.
2. Functional Decomposition Segmentation – Componentization to Modules & Microservices (Y-Axis). Split Report, Message, Locate, Forms, Calendar into fault isolated swim lanes.
3. Horizontal Data Partitioning - Shards (Z-Axis). Beginning with pilot users, start with “podding” users for scalability and availability.
The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed. Scalability is all about the capability of a design to support ever growing client traffic without compromising performance. It is important to understand there are no “silver bullets” in designing scalable solutions.
An architecture is scalable if each layer in the multi-layered architecture is scalable. For example, a well-designed application should be able to scale seamlessly as demand increases and decreases and be resilient enough to withstand the loss of one or more computer resources.
Let’s start by looking at the typical monolithic application. A large system that must be deployed holistically is difficult to scale. In the case where your application was designed to be stateless, scale is possible by adding more machines, virtual or physical. However, adding instances requires powerful machines that are not cost-effective to scale. Additionally, you have the added risk of extensive regression testing because you cannot update small components on their own. Instead, we recommend a microservices-based architecture using containers (e.g. Docker) that allows for independent deployment of small pieces and the scale of individual services instead of one big application.
Monolithic applications have other negative effects, such as development complexity. What is “development complexity”? As more developers are added to the team, be aware of the effects suffering from Brooks’ Law. Brooks’ law states that adding more software developers to a late project makes the project even later. For example, one large solution loaded in the development environment can slow down a developer and gets worse as more developers add components. This causes slower and slower load times on development machines, and developers stomping on each other with changes (or creating complex merges) as they modify the same files.
Another example of development complexity issue is large outdated pieces of the architecture or database where one person is an expert. That person becomes a bottleneck to changes in a specific part of the system. As well, they are now a SPOF (single point of failure) if they are the only resource that understands the monolithic beast. The monolithic complexity and the rate of code change make it hard for any developer to know all the idiosyncrasies of the system, hence more defects are introduced. A decoupled system with small components helps prevents this problem.
When validating database design for appropriate scale, there are some key anti-patterns to check. For example:
• Do synchronous database accesses block other connections to the database when retrieving or writing data? This design can end up blocking queries and holding up the application.
• Are queries written efficiently? Large data footprints, with significant locking, can quickly slow database performance to a crawl.
• Is there a heavy report function in the application that relies on a single transactional database? Report generation can severely hamper the performance of critical user scenarios. Separating out read-only data from read-write data can positively improve scale.
• Can the data be partitioned across different load databases and/or database servers (sharding)? For example, Customers in different geographies may be partitioned to various servers more compatible with their locations. In turn, separating out the data allows for enhanced scale since requests can be split out.
• Is the right database technology being used for the problem? Storing BLOBs in a relational database has negative effects – instead, use the right technology for the job, such as a NoSQL document store. Forcing less structured data into a relational database can also lead to waste and performance issues, and here, a NoSQL solution may be more suitable.
We also look for mixed presentation and business logic. A software anti-pattern that can be prevalent in legacy code is not separating out the UI code from the underlying logic. This practice makes it impossible to scale individual layers of the application and takes away the capability to easily do A/B testing to validate different UI changes. Layer separation allows putting just enough hardware against each layer for more minimal resource usage and overall cost efficiency. The separation of the business logic from SPROCs (stored procedures) also improves the maintainability and scalability of the system.
Another key area we dig for is stateful application servers. Designing an application that stores state on an individual server is problematic for scalability. For example, if some business logic runs on one server and stores user session information (or other data) in a cache on only one server, all user requests must use that same server instead of a generic machine in a cluster. This prevents adding new machine instances that can field any request that a load balancer passes its way. Caching is a great practice for performance, but it cannot interfere with horizontal scale.
Finally, long-running jobs and/or synchronous dependencies are key areas for scalability issues. Actions on the system that trigger processing times of minutes or more can affect scalability (e.g. execution of a report that requires large amounts of data to generate). Continuing to add machines to the set doesn’t help the problem as the system can never keep up in the presence of many requests. Blocking operations exasperate the problem. Look for solutions that queue up long-running requests, execute them in the background, send events when they are complete (asynchronous communication) and do not tie up key application and database servers. Communication with dependent systems for long-running requests using synchronous methods also affects performance, scale, and reliability. Common solutions for intersystem communication and asynchronous messaging include RabbitMQ and Kafka.
Again, the list above is not exhaustive but outlines some key areas that AKF Partners look for when evaluating an architecture for scalability. If you’re looking for a checklist to help you perform your own diligence, feel free to use ours. If you’re wondering more about our diligence practice, you may be interested in our thoughts on best practices, or our beliefs around diligence and how to get it right. We’ve performed technical diligence for seed rounds, A-series and beyond, carve-outs, strategic investments and taking public companies private. From $5 million invested to over $1 billion. No matter the size of company or size of the investment, we can help.
Subscribe to the AKF Newsletter
Scaling Your Systems in the Cloud - AKF Scale Cube Explained
September 5, 2018 | Posted By: Pete Ferguson
Scalability doesn’t somehow magically appear when you trust a cloud provider to host your systems. While Amazon, Google, Microsoft, and others likely will be able to provide a lot more redundancy in power, network, cooling, and expertise in infrastructure than hosting yourself – how you are set up using their tools is still very much up to your budget and which tools you choose to utilize. Additionally, how well your code is written to take advantage of additional resources will affect scalability and availability.
We see more and more new startups in AWS or Azure – in addition to assisting well-established companies make the transition to the cloud. Regardless of the hosting platform, in our technical due diligence reviews we often see the same scalability gaps common to hosted solutions written about in our first edition of “Scalability Rules.” (Abbott, Martin L.. Scalability Rules: Principles for Scaling Web Sites. Pearson Education.)
This blog is a summary recap of the AKF Scale Cube (much of the content contains direct quotes from the original text), an explanation of each axis, and how you can be better prepared to scale within the cloud.
Scalability Rules – Chapter 2: Distribute Your Work
Using ServiceNow as an early example of designing, implementing, and deploying for scale early in its life, we outlined how building in fault tolerance helped scale in early development – and a decade + later the once little known company has been able to keep up with fast growth with over $2B in revenue and some forecasts expecting that number to climb to $15B in the coming years.
So how did they do it? ServiceNow contracted with AKF Partners over a number of engagements to help them think through their future architectural needs and ultimately hired one of the founding partners to augment their already-talented engineering staff.
“The AKF Scale Cube was helpful in offsetting both the increasing size of our customers and the increased demands of rapid functionality extensions and value creation.”
~ Tom Keevan (Founding Partner, AKF Partners)
The original scale cube has stood the test of time and we have used the same three-dimensional model with security, people development, and many other crucial organizational areas needing to rapidly expand with high availability.
At the heart of the AKF Scale Cube are three simple axes, each with an associated rule for scalability. The cube is a great way to represent the path from minimal scale (lower left front of the cube) to near-infinite scalability (upper right back corner of the cube). Sometimes, it’s easier to see these three axes without the confined space of the cube.
X Axis – Horizontal Duplication
The X Axis allows transaction volumes to increase easily and quickly. If data is starting to become unwieldy on databases, distributed architecture allows for reducing the degree of multi-tenancy (Z Axis) or split discrete services off (Y Axis) onto similarly sized hardware.
A simple example of X Axis splits is cloning web servers and application servers and placing them behind a load balancer. This cloning allows the distribution of transactions across systems evenly for horizontal scale. Cloning of application or web services tends to be relatively easy to perform and allows us to scale the number of transactions processed. Unfortunately, it doesn’t really help us when trying to scale the data we must manipulate to perform these transactions as memory caching of data unique to several customers or unique to disparate functions might create a bottleneck that keeps us from scaling these services without significant impact on customer response time. To solve these memory constraints we’ll look to the Y and Z Axes of our scale cube.
Y Axis – Split by Function, Service, or Resource
Looking at a relatively simple e-commerce site, Y Axis splits resources by the verbs of signup, login, search, browse, view, add to cart, and purchase/buy. The data necessary to perform any one of these transactions can vary significantly from the data necessary for the other transactions.
In terms of security, using the Y Axis to segregate and encrypt Personally Identifiable Information (PII) to a separate database provides the required security without requiring all other services to go through a firewall and encryption. This decreases cost, puts less load on your firewall, and ensures greater availability and uptime.
Y Axis splits also apply to a noun approach. Within a simple e-commerce site data can be split by product catalog, product inventory, user account information, marketing information, and so on.
While Y axis splits are most useful in scaling data sets, they are also useful in scaling code bases. Because services or resources are now split, the actions performed and the code necessary to perform them are split up as well. This works very well for small Agile development teams as each team can become experts in subsets of larger systems and don’t need to worry about or become experts on every other part of the system.
Z Axis – Separate Similar Things
Z Axis splits are effective at helping you to scale customer bases but can also be applied to other very large data sets that can’t be pulled apart using the Y Axis methodology. Z Axis separation is useful for containerizing customers or a geographical replication of data. If Y Axis splits are the layers in a cake with each verb or noun having their own separate layer, a Z Axis split is having a separate cake (sharding) for each customer, geography, or other subset of data.
This means that each larger customer or geography could have its own dedicated Web, application, and database servers. Given that we also want to leverage the cost efficiencies enabled by multitenancy, we also want to have multiple small customers exist within a single shard which can later be isolated when one of the customers grows to a predetermined size that makes financial or contractual sense.
For hyper-growth companies the speed with which any request can be answered to is at least partially determined by the cache hit ratio of near and distant caches. This speed in turn indicates how many transactions any given system can process, which in turn determines how many systems are needed to process a number of requests.
Splitting up data by geography or customer allows each segment higher availability, scalability, and reliability as problems within one subset will not affect other subsets. In continuous deployment environments, it also allows fragmented code rollout and testing of new features a little at a time instead of an all-or-nothing approach.
This is a quick and dirty breakdown of Scalability Rules that have been applied at thousands of successful companies and provided near infinite scalability when properly implemented. We love helping companies of all shapes and sizes (we have experience with development teams of 2-3 engineers to thousands). Contact us to explore how we can help guide your company to scale your organization, processes, and technology for hyper growth!
Subscribe to the AKF Newsletter
The Top 20 Technology Blunders
July 20, 2018 | Posted By: Pete Ferguson
One of the most common questions we get is “What are the most common failures you see tech and product teams make?” To answer that question we queried our database consisting of 11 years of anonymous client recommendations. Here are the top 20 most repeated failures and recommendations:
1) Failing to Design for Rollback
If you are developing a SaaS platform and you can only make one change to your current process make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad release in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward” for you to agree this is of the utmost importance. The one thing that is most likely to give you an opportunity to find other work (i.e. “get fired”) is to roll a product that destroys your business. In other words, if you are new to your job DO THIS BEFORE ANYTHING ELSE; if you have been in your job for awhile and have not done this DO THIS TOMORROW. (Related Content: Monitoring for Improved Fault Detection)
2) Confusing Product Release with Product Success
Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives like a release increasing signups by 10%, or increasing checkouts by 15% or increasing the average sale price of a all checkouts by 12% or increasing click-through-rates by 22%. See #10 below on incenting a culture of excellence. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholder’s wealthy! (Related Content: Agile and the Cone of Uncertainty)
3) Insular Product Development / Engineering
How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised about some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Organize around your outcomes and “what you produce” in cross functional teams rather than around activities and “how you work.” (Related Content: The No Surprises Rule)
4) Over Engineering the Solution
One of our favorite company mottos is “simple solutions to complex problems”. The simpler the solution, the lower the cost and the faster the time to market. If you get blank stares from peers or within your organization when you explain a design do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in resolving the complexity.
Image Source: Hackernoon.com
5) Allowing History to Repeat itself
Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize the value of the team. In the operations world, a failure to correlate past site incidents and find thematically related root causes is a guarantee to continue to fight the same fires over and over. The best and easiest way to improve our future performance is to track our past failures, group them into groups of causation and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues and improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.
6) Vendor Lock
Every vendor has a quick fix for your scale issues. If you are a hyper growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. This is not to say that after you design your system to scale horizontally that you will not rely upon some technology to help you; rather, once you define how you can horizontally scale you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases (and NoSQL solutions) provide for multiple types of native replication to keep hosts in synch.
7) Relying on QA to Find Your Mistakes
You cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem and that is where the problem should be treated. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering! Consider implementing test driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. All of this is not to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering.
8) Revolutionary or “Big Bang” Fixes
In our experience, complete re-writes or re-architecture efforts end up somewhere on the spectrum of not returning the desired ROI to complete and disastrous failures. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cutover to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.
9) The Multiplicative Effect of Failure – Eliminate Synchronous Calls
Every time you have one service call another service in a synchronous fashion you are lowering your theoretical availability. If each of your services are designed to be 99.999% available, where a service is a database, application server, application, webserver, etc. then the product of all of the service calls is your theoretical availability. Five calls is (.99999)^5 or 99.995 availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.
10) Failing to Create and Incentivize a Culture of Excellence
Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth. (Related Content: Three Reasons Your Software Engineers May Not Be Successful)
11) Under-Engineer for Scale
The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now! That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.
12) “Not Built Here” Culture
We see this all the time. You may even have agreed with point (6) above because you have a “we are the smartest people in the world and we must build it ourselves” culture. The point of relying upon third parties to scale was not meant as an excuse to build everything yourselves. The real point to be made is that you have to focus on your core competencies and not dilute your engineering efforts with things that other companies or open source providers can do better than you. Unless you are building databases as a business, you are probably not the best database builder. And if you are not the best database builder, you have no business building your own databases for your SaaS platform. Focus on what you should be the best at: building functionality that maximizes your shareholder wealth and scaling your platform. Let other companies focus on the other things you need like routers, operating systems, application servers, databases, firewalls, load balancers and the like.
13) A New PDLC will Fix My Problems
Too often CTO’s see repeated problems in their product development life cycles such as missing dates or dissatisfied customers and blame the PDLC itself.
The real problem, regardless of the lifecycle you use, is likely one of commitment and measurement. For instance, in most Agile lifecycles there needs to be consistent involvement from the business or product owner. A lack of involvement leads to misunderstandings and delayed products. Another very common problem is an incomplete understanding or training on the existing PDLC. Everyone in the organization should have a working knowledge of the entire process and how their roles fit within it. Most often, the biggest problem within a PDLC is the lack of progress measurement to help understand likely dates and the lack of an appropriate “product discovery” phase to meet customer needs. (Related Content: The Top Five Most Common PDLC Failures)
14) Inability to Hire Great People Quickly
Often when growing an engineering team quickly the engineering managers will push back on hiring plans and state that they cannot possibly find, interview, and hire engineers that meet their high standards. We agree that hiring great people takes time and hiring decisions are some of the most important decisions managers can make. A poor hiring decision takes a lot of energy and time to fix. However, there are lots of ways to streamline the hiring process in order to recruit, interview, and make offers very quickly. A useful idea that we have seen work well in the past are interview days, where potential candidates are all invited on the same day. This should be no more than 2 - 3 weeks out from the initial phone screen, so having an interview day per months is a great way to get most of your interviewing in a single day. Because you optimize the interview process people are much more efficient and it is much less disruptive to the daily work that needs to get done the rest of the month. Post interview discussions and hiring decisions should all be made that same day so that candidates get offers or letters of regret quickly; this will increase the likelihood of offers being accepted or make a professional impression on those not getting offers. The key is to start with the right answer that “there is a way to hire great people quickly” and the myriad of ways to make it happen will be generated by a motivated leadership team.
15) Diminishing or Ignoring SPOFs (Single Point of Failure)
A SPOF is a SPOF and even if the impact to the customer is low it still takes time away from other work to fix right away in the event of a failure. And there will be a failure…because that is what hardware and software does, it works for a long time and then eventually it fails! As you should know by now, it will fail at the most inconvenient time. It will fail when you have just repurposed the host that you were saving for it or it will fail while you are releasing code. Plan for the worst case and have it run on two hosts (we actually recommend to always deploy in pools of three or more hosts) so that when it does fail you can fix it when it is most convenient for you.
16) No Business Continuity Plan
No one expects a disaster but they happen and if you cannot keep up normal operations of the business you will lose revenue and customers that you might never get back. Disasters can be huge, like Hurricane Katrina, where it take weeks or months to relocate and start the business back up in a new location. Disasters can also be small like a winter snow storm that keeps everyone at home for two days or a HAZMAT spill near your office that keeps employees from coming to work. A solid business continuity plan is something that is thought through ahead of time, before you need it, and explains to everyone how they will operate in the event of an emergency. Perhaps your satellite office will pick up customer questions or your tech team will open up an IRC channel to centralize communication for everyone capable of working remotely. Do you have enough remote connections through your VPN server to allow for remote work? Spend the time now to think through what and how you will operate in the event of a major or minor disruption of your business operations and document the steps necessary for recovery.
17) No Disaster Recovery Plan
Even worse, in our opinion, than not having a BC plan is not having a disaster recovery plan. If your company is a SaaS-based company, the site and services provided is the company’s sole source of revenue! Moreover, with a SaaS company, you hold all the data for your customers that allow them to operate. When you are down they are more than likely seriously impaired in attempting to conduct their own business. When your collocation facility has a power outage that takes you completely down, think 365 Main datacenter in San Francisco, how many customers of yours will leave and never return? Our preference is to provide your own disaster recovery through multiple collocation facilities but if that is not yet technically feasible nor in the budget, at a minimum you need your code, executables, configurations, loads, and data offsite and an agreement in place for both collocation services as well as hosts. Lots of vendors offer such packages and they should be thought of as necessary business insurance.
If you are cloud hosted, this still applies to you! We often find in technical due diligence reviews that small companies who are rapidly growing haven’t yet initiated a second active tech stack in a different availability zone or with a second cloud provider. Just because AWS, Azure and others have a fairly reliable track record doesn’t mean they always will. You can outsource services, but you still own the liability!
Image Source: Kaibizzen.com.au
18) No Product Management Team or Person
In a similar vein to #13 above, there needs to be someone or a team of people in the organization who have responsibility for the product lines. They need to have authority to make decisions about what features get added, which get delayed, and which get deprecated (yes, we know, nothing ever gets deprecated but we can always hope!). Ideally these people have ownership of business goals (see #10) so they feel the pressure to make great business decisions.
19) Failing to Implement Continuously
Just because you call it scheduled maintenance does not mean that it does not count against your uptime. While some of your customers might be willing to endure the frustration of having the site down when they want to access it in order to get some new features, most care much more about the site being available when they want it. They are on the site because the existing features serve some purpose for them; they are not there in the hopes that you will rollout a certain feature that they have been waiting on. They might want new features, but they rely on existing features. There are ways to roll code, even with database changes, without bringing the site down (back to #17 - multiple active sites also allows for continuous implementation and the ability to roll back). It is important to put these techniques and processes in place so that you plan for 100% availability instead of planning for much less because of planned down time.
20) Firewalls, Firewalls, Everywhere!
We often see technology teams that have put all public facing services behind firewalls while many go so far as to put firewalls between every tier of the application. Security is important because there are always people trying to do malicious things to your site, whether through directed attacks or random scripts port scanning your site. However, security needs to be balanced with the increased cost as well as the degradation in performance. It has been our experience that too often tech teams throw up firewalls instead of doing the real analysis to determine how they can mitigate risk in other ways such as through the use of ACLs and LAN segmentation. You as the CTO ultimately have to make the decision about what are the best risks and benefits for your site.
Whatever you do, don’t make the mistakes above! AKF Partners helps companies avoid costly product and technology mistakes - and we’ve seen most of them. Give us a call or shoot us an email. We’d love to help you achieve the success you desire.
Subscribe to the AKF Newsletter
Monitoring the Good, the Bad, the Ugly for Improved Fault Detection
July 8, 2018 | Posted By: Robin McGlothin
AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident. Business metric monitors will not tell you where or what the problem is, rather – and most importantly – they tell you something appears to be abnormal and should be investigated, that something has affected your customer experience.
A significant part of recovery time (and therefore availability) is the time required to detect and localize service incidents. A 2013 study by Business Internet Group of San Francisco found that of the 40 top-performing websites (as identified by KeyNote Systems), 72% had suffered user-visible failures in common functionality, such as items not being added to a shopping cart or an error message being displayed.
Our conversations with clients confirm that detecting these failures is a significant problem. AKF Partners estimates that 75% of the time spent recovering from application-level failures is time spent detecting them! Application-level failures can sometimes take days to detect, though they are repaired quickly once found. Fast detection of these failures (Time to Detect – TTD) is, therefore, a key problem in improving service availability.
The duration of a product impairment is TTR.
To improve TTR, implement a good notification system that first, based on business metrics, tells you that an error affecting your users is happening. Then, rely upon application and system monitoring to inform you on where and what has failed. Make sure to have good and easy view logs for all errors, warnings and other critical data your application creates. We already have many technologies in this space and we just need to employ them in an effective manner with the focus on safeguarding the client experience.
In the form of Statistical Process Control (SPC – defined below) two relatively simple methods to improve TTD:
- Business KPI Monitors (Monitor Real User Behavior): Passively monitor critical user transactions such as logins, queries, reports, etc. Use math to determine abnormal behavior. This is the first line of defense.
- Synthetic Transactions (Simulate User Behavior): Synthetic transactions are scripted actions that attempt to mimic real customer behavior. Examples might be sign-ons, add to cart, etc. They provide a more meaningful view of your customers’ experiences vs. just looking at page load times, error rates, and similar. Do this with Keynote or a similar product and expand it to an internal systems scope. Alerts from a passive monitor can be confirmed or denied and escalated as appropriate. This is the second line of defense.
Monitor the Bad – potential, & actual bad things (alert before they happen), and tune and continuously improve (Iterate!)
If you can’t identify all problem areas, identify as many as possible. The best monitoring starts before there’s a problem and extends beyond the crisis.
Because once the crisis hits, that’s when things get ugly! That’s when things start falling apart and people point fingers.
At times, failures do not disable the whole site, but instead cause brown-outs, where part of a site’s functionality is disabled or only some users are unable to access the site. Many of these failures are application-level failures that change the user-visible functionality of a service but do not cause obvious lower-level failures detectable by service operators. Effective monitoring will detect these faults as well.
The more proactive you can be about identifying the issues, the easier it will be to resolve and prevent them.
In fault detection, the aim is to determine whether an abnormal event happened or when an application being monitored is out of control. The early detection of a fault condition is important in avoiding quality issues or system breakdown, and this can be achieved through the proper design of effective statistical process control with upper & lower limits identified. If the values of the monitoring statistics exceed the control limits of the corresponding statistics, a fault is detected. Once a fault condition has been positively detected, the next step is to determine the root cause of the out-of-control status.
One downside of the SPC method is that significant changes in amplitude (natural increases in your business metrics) can cause problems. An alternative to SPC is First and Second Derivative testing. These tests tell if the actual and expected curve forms are the same.
Here’s a real-world example of where business metrics help us determine changes in normal usage at eBay.
We had near real-time graphs of user metrics such as bids, listings, logins, and new user registrations. The data was graphed week over week. Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys. These graphs were displayed in the Network Operations Center, which was staffed 24x7. Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.
Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nosedive. The NOC quickly initiated the SEV1 process and technical resources checked their areas. The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower. Roughly 20 minutes into the SEV1 process, the root cause was identified. The finale episode of American Idol was being broadcast. Our site was fine – but our customers had other things on their mind. The business metric monitors worked – they gave warning of an aberrant usage pattern.
How would your company react to this critical change in normal usage patterns? Use business metric monitors to detect workload shifts.
Subscribe to the AKF Newsletter
Eight Reasons To Avoid Stored Procedures
June 18, 2018 | Posted By: Pete Ferguson
In my short tenure at AKF, I have found the topic of Stored Procedures (SPROCs) to be provocatively polarizing. As we conduct a technical due diligence with a fairly new upstart for an investment firm and ask if they use stored procedures on their database, we often get a puzzled look as though we just accused them of dating their sister and their answer is a resounding “NO!”
However, when conducting assessments of companies that have been around awhile and are struggling to quickly scale, move to a SaaS model, and/or migrate from hosted servers to the cloud, we find “server huggers” who love to keep their stored procedures on their database.
At two different clients earlier this year, we found companies who have thousands of stored procedures in their database. What was once seen as a time-saving efficiency is now one of several major obstacles to SaaS and cloud migration.
In our book, Scalability Rules: Principles for Scaling Web Sites, (Abbott, Martin L.. Scalability Rules: Principles for Scaling Web Sites) Marty outlines many reasons why stored procedures should not be kept in the database, here are the top 8:
- Cost: Databases tend to be one of the most expensive systems or services within the system architecture. Each transaction cost increases with each additional SPROC. Increase cost of scale by making a synchronous call to the ERP system for each transaction – while also reducing the availability of the product platform by adding yet another system in series – doesn’t make good business sense.
- Creates a Monolith: SPROCs on a database create a monolithic system which cannot be easily scaled.
- Limits Scalability: The database is a governor of scale, SPROCS steal capacity by running other than relational transactions on the database.
- Limits Automated Testing: SPROCs limit the automation of code testing (in many cases it is not as easy to test stored procedures as it is the other code that developers write), slowing time to market and increasing cost while decreasing quality.
- Creates Lockin: Changing to an open-source or a NoSQL solution requires the need to develop a plan to migrate SPROCs or replace the logic in the application. It also makes it more difficult to switch to new and compelling technologies, negotiate better pricing, etc.
- Adds Unneeded Complexity to Shard Databases: Using SPROCs and business logic on the database makes sharding and replacement of the underlying database much more challenging.
- Limits Speed To The Weakest Link: Systems should scale independently relative to individual needs. When business logic is tied to the database, each of them needs to scale at the same rate as the system making requests of them - which means growth is tied to the slowest system.
- More Team Composition Flexibility: By separating product and business intelligence in your platform, you can also separate the teams that build and support those systems. If a product team is required to understand how their changes impact all related business intelligence systems, it will slow down their pace of innovation as it significantly broadens the scope when implementing and testing product changes and enhancements.
Per the AKF Scale Cube, we desire to separate dissimilar services - having stored procedures on the database means it cannot be split easily.
Need help migrating from hosted hardware to the cloud or migrating your installed software to a SaaS solution? We have helped hundreds of companies from small startups to well-established Fortune 50 companies better architect, scale, and deliver their products. We offer a host of services from technical due diligences, onsite workshops, and provide mentoring and interim staffing for your company.
Subscribe to the AKF Newsletter
June 11, 2018 | Posted By: Marty Abbott
Of the many SaaS operating principles, perhaps one of the most misunderstood is the principle of tenancy.
Most people have a definition in their mind for the term “multi-tenant”. Unfortunately, because the term has so many valid interpretations its usage can sometimes be confusing. Does multi-tenant refer to the physical or logical implementation of our product? What does multi-tenant mean when it comes to an implementation in a database?
This article first covers the goals of increasing tenancy within solutions, then delves into the various meanings of tenancy.
Multi-Tenant (Multitenant) Solutions and Cost
One of the primary reasons why companies that present products as a service strive for higher levels of tenancy is the cost reduction it affords the company in presenting a service. With multiple customers sharing applications and infrastructure, system utilization goes up: We get more value production out of each server that we use, or alternatively, we get greater asset utilization.
Because most companies view the cost of serving customers as a “Cost of Goods Sold’, multitenant solutions have better gross margins than single-tenant solutions. The X-Axis of the figure below shows the effect of increasing tenancy on the cost of goods sold on a per customer basis:
Interestingly, multitenant solutions often “force” another SaaS principle to be true: No more than 1 to 3 versions of software for the entire customer base. This is especially true if the database is shared at a logical (row-level) basis (more on that later). Lowering the number of versions of the product decreases the operating expense necessary to maintain multiple versions and therefore also increases operating margins.
Single Tenant, Multi-Tenant and All-Tenant
An important point to keep in mind is that “tenancy” occurs along a spectrum moving from single-tenant to all-tenant. Multitenant is any solution where the number of tenants from a physical or logical perspective is greater than one, including all-tenant implementations. As tenancy increases, so does Cost of Goods Sold (COGS from the above figure) decrease and Gross Margins increase.
The problem with All-Tenant solutions, while attractive from a cost perspective, is that they create a single failure domain [insert https://akfpartners.com/growth-blog/fault-isolation], thereby decreasing overall availability. When something goes poorly with our product, everything is offline. For that reason, we differentiate between solutions that enable multi-tenancy for cost reasons and all-tenant solutions.
The Many Meanings and Implementations of Tenancy
Multitenant solutions can be implemented in many ways and at many tiers.
Physical and Logical
Physical multi-tenancy is having multiple customers share a number of servers. This helps increase the overall utilization of these servers and therefore reduce costs of goods sold. Customers need not share the application for a solution to be physically multitenant. One could, for instance, run a web server, application server or database per customer. Many customers with concerns over data separation and privacy are fine with physical multitenancy as long as their data is logically separated.
Logical multi-tenancy is having data share the same application. The same web server instances, application server instances and database is used for any customer. The situation becomes a bit murkier however when it comes to databases.
Different relational databases use different terms for similar implementations. A SQLServer database, for instance, looks very much like an Oracle Schema. Within databases, a solution can be logically multitenant by either implementing tenancy in a table (we call that row level multitenancy) or within a schema/database (we call that schema multitenancy).
In either case, a single instance of the relational database management system or software (RDBMS) is used, while customer transactions are separated by a customer id inside a table, or by database/schema id if separated as such.
While physical multitenancy provides cost benefits, logical multitenancy often provides significantly greater cost benefits. Because applications are shared, we need less system overhead to run an application for each customer and thusly can get even greater throughput and efficiencies out of our physical or virtualized servers.
Depth of Multi-Tenancy
The diagram below helps to illustrate that every layer in our service architecture has an impact to multi-tenancy. We can be physically or logically multi-tenant at the network layer, the web server layer, the application layer and the persistence or database layer.
The deeper into the stack our tenancy goes, the greater the beneficial impact (cost savings) to costs of goods sold and the higher our gross margins.
The AKF Multi-Tenant Cube
To further the understanding of tenancy, we introduce the AKF Multi-Tenant Cube:
The X-Axis describes the “mode’ of tenancy, moving from shared nothing, to physical, to logical. As we progress from sharing nothing to sharing everything, utilization goes up and cost of goods sold goes down.
The Y-Axis describes the depth of tenancy from shared nothing, through network, web, app and finally persistence or database tier. Again, as the depth of tenancy increase, so do Gross Margins.
The Z-Axis describes the degree of tenancy, or the number of tenants. Higher levels of tenancy decrease costs of goods sold, but architecturally we never want a failure domain that encompasses all tenants.
When running a XaaS (SaaS, etc) business, we are best off implementing logical multitenancy through every layer of our architecture. While we want tenancy to be high per instance, we also do not want all tenants to be in a single implementation.
AKF Partners helps companies of all sizes achieve their availability, time to market, scalability, cost and business goals. Contact us today to see how we can help your organization in scalability and availability.
Subscribe to the AKF Newsletter
4 Landmines When Using Serverless Architecture
May 20, 2018 | Posted By: Dave Berardi
Physical Bare Metal, Virtualization, Cloud Compute, Containers, and now Serverless in your SaaS? We are starting to hear more and more about Serverless computing. Sometimes you will hear it called function as a service. In this next iteration of Infrastructure-as-a-Service, users can execute a task or function without having to provision a server, virtual machine, or any other underlying resource. The word Serverless is a misnomer as provisioning the underlying resources are abstracted away from the user, but they still exist underneath the covers. It’s just that Amazon, Microsoft, and Google manage it for you with their code. AWS Lambda, Azure Functions, and Google Cloud Functions are becoming more common in the architecture of a SaaS product. As technology leaders responsible for architectural decisions for scale and availability, we must understand its pros and cons and take the right actions to apply it.
Several advantages of serverless computing include:
• Software engineers can deploy and run code without having to manage any underlying infrastructure effectively creating a No-Ops environment.
• Auto-scaling is easier and requires less orchestration as compared to a containerized environment running services.
• True On-Demand capacity – no orphaned containers or other resources that might be idling.
• They are cost effective IF we are running the right size workloads.
Disadvantages and potential landmines to watch out for:
• Landmine #1 - No control over the execution environment meaning you are unable to isolate your operational environment. Compute and networking resources are virtualized with no visibility into either of them. Availability is the hands of our cloud provider and uptime is not guaranteed.
• Landmine #2 - SLAs cannot guarantee uptime. Start-up time can take a second causing latency that might not be acceptable.
• Landmine #3 - It’s going to become much easier for engineers to create code, host it rapidly, and forget about it leading to unnecessary compute and additional attack vectors creating a security risk.
• Landmine #4 - You will create vendor lock-in with your cloud provider as you set up your event driven functions to trigger from other AWS or Azure Services or your own services running on compute instances.
AKF is often asked about our position on serverless computing. There are 4 key rules considering the advantages and the landmines that we outlined:
1) Gradually introduce it into your architecture and use it for the right use cases
2) Establish architectural principles that guide its use in your organization that will minimize availability impact for Serverless. You will tie your availability to the FaaS in your cloud provider.
3) Watch out for a false sense of security among your engineering teams. Understand how serverless works before you use it and so you can monitor it for performance and availability.
4) Manage how and what it’s used for - monitor it (eg. AWS Cloud Watch) to avoid neglect and misuse along with cost inefficiencies.
AWS, Azure, or Google Cloud Serverless platforms could provide an affective computing abstraction in your architecture if it’s used for the right use cases, good monitoring is in place, and architectural principles are established.
AKF Partners has helped many companies create highly available and scalable systems that are designed to be monitored. Contact us for a free consultation.
Subscribe to the AKF Newsletter
Fault Isolation in Services Architectures
May 2, 2018 | Posted By: AKF
Our post on the AKF Scale Cube made reference to a concept that we call “Fault Isolation” and sometimes “Swim lanes” or “Swim-laned Architectures”. We sometimes also call “swim lanes” fault isolation zones or fault isolated architecture.
Fault Isolation Defined
A “Swim lane” or fault isolation zone is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of said boundary. Think of this as the “blast radius” of failure meant to answer the question of “What gets impacted should any service fail?” The benefit of fault isolation is twofold:
1) Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain. Once something breaks, because the failure is limited in scope, it can be more rapidly identified and fixed. Recovery time objectives (RTO) are subsequently decreased which increases overall availability.
2) Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. The “blast radius” of a failure is contained. As such, and depending upon approach, only a portion of users or a portion of functionality of the product is affected. This is akin to circuit breakers in your house - the breaker exists to limit the fault zone for any load that exceeds a limit imposed by the breaker. Failure propagation is contained by the breaker popping and other devices are not affected.
Architecting Fault Isolation
A fault isolated architecture is one in which each failure domain is completely isolated. We use the term “swim lanes” to depict the separations. In order to achieve this, ideally there are no synchronous calls between swim lanes or failure domains made pursuant to a user request. User initiated synchronous calls between failure domains are absolutely forbidden in this type of architecture as any user-initiated synchronous call between fault isolation zones, even with appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a synchronous call to any other service in another domain, to any service outside of the domain, or if the domain receives synchronous calls from other domains or services. Again, “synchronous” is meant to identify a synchronous call (call, wait for a response) pursuant to any user request.
It is acceptable, but not advisable, to have asynchronouss calls between domains and to have non-user initiated synchronous calls between domains (as in the case of a batch job collecting data for the purposes of reporting in another failure domain). If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not call port overloads on any services. Here is an interesting blog post about runaway scripts and their impact on Apache, PHP, and MySQL.
As previously indicated, a swim lane should have all of its services located within the failure domain. For instance, if database [read/writes] are necessary, the database with all appropriate information for that swim lane should exist within the same failure domain as all of the application and webservers necessary to perform the function or functions of the swim lane. Furthermore, that database should not be used for other requests of service from other swim lanes. Our rule is one production database on one host.
The figure below demonstrates the components of software and infrastructure that are typically fault isolated:
Rarely are shared higher level network components isolated (e.g. border systems and core routers).
Sometimes, if practical, firewalls and load balancers are isolated. These are especially the case under very high demand situations where a single pair of devices simply wouldn’t meet the demand.
The remainder of solutions are always isolated, with web-servers, top of rack switches (in non IaaS implementations), compute (app servers) and storage all being properly isolated.
Applying Fault Isolation with AKF’s Scale Cube
As we have indicated with our Scale Cube in the past, there are many ways in which to think about swim laned architectures. Swim lanes can be isolated along the axes of the Scale Cube as shown below with AKF’s circuit breaker analogy to fault isolation.
Fault isolation in X-axis would mean replicating everything for high availability - and performing the replication asynchronously and in an eventually consistent (rather than a consistent) fashion. For example, when a data center fails the fault will be isolated to the one failed data center or multiple availability zones. This is common with traditional disaster recovery approaches, though we do not often advise it as there are better and more cost effective solutions for recovering from disaster.
Fault Isolation in the Y-axis can be thought in terms of a separation of services e.g. “login” and “shopping cart” (two separate swim lanes) each having the web and app servers as well as all data stores located within the swim lane and answering only to systems within that swim lane. Each portion of a page is delivered from a separate service reducing the blast radius of a potential fault to it’s swim lane.
While purposely not legible (fuzzy) the fake example above shows different components of a fictional business account from a fictional bank. Components of the page are separated with one component showing a summary, another component displaying more detailed information and still other components showing dynamic or static links - each derived from properly isolated services.
Another approach would be to perform a separation of your customer base or a separation of your order numbers or product catalog. Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z axis swim lane along customer, order number or product id lines. More beneficially, if we are interested in fastest possible response times to customers, we may split along geographic boundaries. We may have data centers (or IaaS regions) serving the West and East Coasts of the US respectively, the “Fly-Over States” of the US, and regions serving the EU, Canada, Asia, etc. Besides contributing to faster perceived customer response times, these implementations can also help ensure we are compliant with data sovereignty laws unique to different countries or even states within the US.
Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform. AKF has helped achieve a high availability through fault isolation. Contact us to see how we can help you achieve the same fault tolerance.
AKF Partners helps companies create highly available, fault isolated solutions. Send us a note - we’d love to help you!
Subscribe to the AKF Newsletter
The Scale Cube
April 25, 2018 | Posted By: Robin McGlothin
The Scale Cube is a model for defining microservices and scaling technology products. AKF Partners invented the Scale Cube in 2007, publishing it online in our blog in 2007 (original article here) and subsequently in our first book the Art of Scalability and our second book Scalability Rules.
The Scale Cube (sometimes known as the “AKF Scale Cube” or “AKF Cube”) is comprised of an 3 axes:
• X-Axis: Horizontal Duplication and Cloning of services and data
• Y-Axis: Functional Decomposition and Segmentation - Microservices (or micro-services)
• Z-Axis: Service and Data Partitioning along Customer Boundaries - Shards/Pods
These axes and their meanings are depicted below in Figure 1.
The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed and when existing systems are being improved.
Figure 2, below, displays how the cube may be deployed in a modern architecture decomposing services (sometimes called microservices architecture), cloning services and data sources and segmenting similar objects like customers into “pods”.
Scaling with the X Axis of the Scale Cube
The most commonly used approach of scaling an solution is by running multiple identical copies of the application behind a load balancer also known as X-axis scaling. That’s a great way of improving the capacity and the availability of an application.
When using X-axis scaling each server runs an identical copy of the service (if disaggregated) or monolith. One benefit of the X axis is that it is typically intellectually easy to implement and it scales well from a transaction perspective. Impediments to implementing the X axis include heavy session related information which is often difficult to distribute or requires persistence to servers – both of which can cause availability and scalability problems. Comparative drawbacks to the X axis is that while intellectually easy to implement, data sets have to be replicated in their entirety which increases operational costs. Further, caching tends to degrade at many levels as the size of data increases with transaction volumes. Finally, the X axis doesn’t engender higher levels of organizational scale.
Figure 3 explains the pros and cons of X axis scalability, and walks through a traditional 3 tier architecture to explain how it is implemented.
Scaling with the Y Axis of the Scale Cube
Y-axis scaling (think services oriented architecture, microservices or functional decomposition of a monolith) focuses on separating services and data along noun or verb boundaries. These splits are “dissimilar” from each other. Examples in commerce solutions may be splitting search from browse, checkout from add-to-cart, login from account status, etc. In implementing splits, Y-axis scaling splits a monolithic application into a set of services. Each service implements a set of related functionalities such as order management, customer management, inventory, etc. Further, each service should have its own, non-shared data to ensure high availability and fault isolation. Y axis scaling shares the benefit of increasing transaction scalability with all the axes of the cube.
Further, because the Y axis allows segmentation of teams and ownership of code and data, organizational scalability is increased. Cache hit ratios should increase as data and the services are appropriately segmented and similarly sized memory spaces can be allocated to smaller data sets accessed by comparatively fewer transactions. Operational cost often is reduced as systems can be sized down to commodity servers or smaller IaaS instances can be employed.
Figure 4 explains the pros and cons of Y axis scalability and shows a fault-isolated example of services each of which has its own data store for the purposes of fault-isolation.
Scaling with the Z Axis of the Scale Cube
Whereas the Y axis addresses the splitting of dissimilar things (often along noun or verb boundaries), the Z-axis addresses segmentation of “similar” things. Examples may include splitting customers along an unbiased modulus of customer_id, or along a somewhat biased (but beneficial for response time) geographic boundary. Product catalogs may be split by SKU, and content may be split by content_id. Z-axis scaling, like all of the axes, improves the solution’s transactional scalability and if fault isolated it’s availability. Because the software deployed to servers is essentially the same in each Z axis shard (but the data is distinct) there is no increase in organizational scalability. Cache hit rates often go up with smaller data sets, and operational costs generally go down as commodity servers or smaller IaaS instances can be used.
Figure 5 explains the pros and cons of Z axis scalability and displays a fault-isolated pod structure with 2 unique customer pods in the US, and 2 within the EU. Note, that an additional benefit of Z axis scale is the ability to segment pods to be consistent with local privacy laws such as the EU’s GDPR.
Like Goldilocks and the three bears, the goal of decomposition is not to have services that are too small, or services that are too large but to ensure that the system is “just right” along the dimensions of scale, cost, availability, time to market and response times.
AKF Partners has helped hundreds of companies, big and small, employ the AKF Scale Cube to scale their technology product solutions. We developed the cube in 2007 to help clients scale their products and have been using it since to help some of the greatest online brands of all time thrive and succeed. For those interested in “time travel”, here are the original 2 posts on the cube from 2007: Application Cube, Database Cube
Subscribe to the AKF Newsletter
Microservices for Breadth, Libraries for Depth
April 10, 2018 | Posted By: Marty Abbott
The decomposition of monoliths into services, or alternatively the development of new products in a services-oriented fashion (oftentimes called microservices), is one of the greatest architectural movements of the last decade. The benefits of a services (alternatively microservices or micro-services) approach are clear:
- Independent deployment, decreasing time to market and decreasing time to value realization– especially when continuous delivery is employed.
- Team velocity and ownership (informed by Conway’s Law).
- Increased fault isolation – but only when properly deployed (see below).
- Individual scalability – and the decreasing cost of operations that entails when properly architected.
- Freedom of implementation and technology choices – choosing the best solution for each service rather than subjecting services to the lowest common denominator implementation.
Unfortunately, without proper architectural oversight and planning, improperly architected services can also result in:
- Lower overall availability, especially when those services are deployed in one of a handful of microservice anti-patterns like the mesh, services in depth (aka the Christmas Tree Light String) and the Fuse.
- Higher (longer) response times to end customers.
- Complicated fault isolation and troubleshooting that increases average recovery time for failures.
- Service bloat: Too many services to comprehend (see our service sizing post)
The following are patterns companies should avoid (anti-patterns) when developing services or microservices architectures:
Mesh architectures, where individual services both “fan out” and “share” subsequent services result in the lowest possible availability.
Services that are strung together in long (deep) call trees suffer from low availability and slow page response times as calculated from the product of each service offering availability.
The Fuse is a much smaller anti-pattern than “The Mesh”. In “The Fuse”, 2 distinct services (A and B) rely on service C. Should service C become slow or unavailable, both service A and B suffer.
Architecture Principle: Services – Broad, But Never Deep
These services anti-patterns protect against a lack of fault isolation, where slowness and failures propagate along a synchronous path. One service fails, and the others relying upon that service also suffer.
They also serve to guard against longer latency in call streams. While network calls tend to be minimal relative to total customer response times, many solutions (e.g. payment solutions) need to respond as quickly as possible and service calls slow that down.
Finally, these patterns help protect against difficult to diagnose failures. The Xmas Tree pattern name is chosen because of the difficulty in finding the “failed bulb” in old tree lights wired in series. Similarly, imagine attempting to find the fault in “The Mesh”. The time necessary to find faults negatively effects service restoration time and therefore availability.
As such, we suggest a principal that services should never be deep but instead should be deployed in breadth along product offering boundaries defined by nouns (resources like “customer” or “sales”) or verbs (services like “search” or “add to cart”). We often call this approach “slices instead of layers”.
How then do we accomplish the separation of software for team ownership, and time to market where a single service would otherwise be too large or unwieldy?
Old School – Libraries!
When you need service-like segmentation in a deep call tree but can’t suffer the availability impact and latency associated with multiple calls, look to libraries. Libraries will both eliminate the network associated latency of a service call. In the case of both The Fuse and The Mesh libraries eliminate the shared availability constraints. Unfortunately, we still have the multiplicative effect of failure of the Xmas Tree, but overall it is a faster pattern.
“But My Teams Can’t Release Separately!”
Sure they can – they just have to change how they think about releasing. If you need immediate effect from what you release and don’t want to release the calling services with libraries compiled or linked, consider performing releases with shared objects or dynamically loadable libraries. While these require restarts of the calling service, simple automation will help you keep from having an outage for the purpose of deploying software.
AKF Partners helps companies architecture highly available, highly scalable microservice architecture products. We apply our aggregate experience, proprietary models, patterns, and anti-patterns to help ensure your products can meet your company’s scale and availability goals. Contact us today - we can help!
Subscribe to the AKF Newsletter
SaaS Migration Challenges
March 12, 2018 | Posted By: Dave Swenson
More and more companies are waking up from the 20th century, realizing that their on-premise, packaged, waterfall paradigms no longer play in today’s SaaS, agile world. SaaS (Software as a Service) has taken over, and for good reason. Companies (and investors) long for the higher valuation and increased margins that SaaS’ economies of scale provide. Many of these same companies realize that in order to fully benefit from a SaaS model, they need to release far more frequently, enhancing their products through frequent iterative cycles rather than massive upgrades occurring only 4 times a year. So, they not only perform a ‘lift and shift’ into the cloud, they also move to an Agile PDLC. Customers, tired of incurring on-premise IT costs and risks, are also pushing their software vendors towards SaaS.
SaaS Migration is About More Than Just Technology – It is An Organization Reboot
But, what many of the companies migrating to SaaS don’t realize is that migrating to SaaS is not just a technology exercise. Successful SaaS migrations require a ‘reboot’ of the entire company. Certainly, the technology organization will be most affected, but almost every department in a company will need to change. Sales teams need to pitch the product differently, selling a leased service vs. a purchased product, and must learn to address customers’ typical concerns around security. The role of professional services teams in SaaS drastically changes, and in most cases, shrinks. Customer support personnel should have far greater insight into reported problems. Product management in a SaaS world requires small, nimble enhancements vs. massive, ‘big-bang’ upgrades. Your marketing organization will potentially need to target a different type of customer for your initial SaaS releases - leveraging the Technology Adoption Lifecycle to identify early adopters of your product in order to inform a small initial release (Minimum Viable Product).
It is important to recognize the risks that will shift from your customers to you. In an on-premise (“on-prem”) product, your customer carries the burden of capacity planning, security, availability, disaster recovery. SaaS companies sell a service (we like to say an outcome), not just a bundle of software. That service represents a shift of the risks once held by a customer to the company provisioning the service. In most cases, understanding and properly addressing these risks are new undertakings for the company in question and not something for which they have the proper mindset or skills to be successful.
This company-wide reboot can certainly be a daunting challenge, but if approached carefully and honestly, addressing key questions up front, communicating, educating, and transparently addressing likely organizational and personnel changes along the way, it is an accomplishment that transforms, even reignites, a company.
This is the first in a series of articles that captures AKF’s observations and first-hand experiences in guiding companies through this process.
Don’t treat this as a simple rewrite of your existing product –
Answer these questions first…
Any company about to launch into a SaaS migration should first take a long, hard look at their current product, determining what out of the legacy product is not worth carrying forward. Is all of that existing functionality really being used, and still relevant? Prior to any move towards SaaS, the following questions and issues need to be addressed:
Customization or Configuration?
SaaS efficiencies come from many angles, but certainly one of those is having a single codebase for all customers. If your product today is highly customized, where code has been written and is in use for specific customers, you’ve got a tough question to address. Most product variances can likely be handled through configuration, a data-driven mechanism that enables/disables or otherwise shapes functionality for each customer. No customer-specific code from the legacy product should be carried forward unless it is expected to be used by multiple clients. Note that this shift has implications on how a sales force promotes the product (they can no longer promise to build whatever a potential customer wants, but must sell the current, existing functionality) as well as professional services (no customizations means less work for them).
Many customers, even those who accept the improved security posture a cloud-hosted product provides over their own on-premise infrastructure, absolutely freak when they hear that their data will coexist with other customers’ data in a single multi-tenant instance, no matter what access management mechanisms exist. Multi-tenancy is another key to achieving economies of scale that bring greater SaaS efficiencies. Don’t let go of it easily, but if you must, price extra for it.
Who Owns the Data?
Many products focus only on the transactional set of functionality, leaving the analytics side to their customers. In an on-premise scenario, where the data resides in the customers’ facilities, ownership of the data is clear. Customers are free to slice & dice the data as they please. When that data is hosted, particularly in a multi-tenant scenario where multiple customers’ data lives in the same database, direct customer access presents significant challenges. Beyond the obvious related security issues is the need to keep your customers abreast of the more frequent updates that occur with SaaS product iterations. The decision is whether you replicate customer data into read-only instances, provide bulk export into their own hosted databases, or build analytics into your product?
All of these have costs - ensure you’re passing those on to your customers who need this functionality.
May I Upgrade Now?
Today, do your customers require permission for you to upgrade their installation? You’ll need to change that behavior to realize another SaaS efficiency - supporting of as few versions as possible. Ideally, you’ll typically only support a single version (other than during deployment). If your customers need to ‘bless’ a release before migrating on to it, you’re doing it wrong. Your releases should be small, incremental enhancements, potentially even reaching continuous deployment. Therefore, the changes should be far easier to accept and learn than the prior big-bang, huge upgrades of the past. If absolutely necessary, create a sandbox for customers to access new releases, but be prepared to deal with the potentially unwanted, non-representative feedback from the select few who try it out in that sandbox.
Wait? Who Are We Targeting?
All of the questions above lead to this fundamental issue: Are tomorrow’s SaaS customers the same as today’s? The answer? Not necessarily. First, in order to migrate existing customers on to your bright, shiny new SaaS platform, you’ll need to have functional parity with the legacy product. Reaching that parity will take significant effort and lead to a big-bang approach. Instead, pick a subset or an MVP of existing functionality, and find new customers who will be satisfied with that. Then, after proving out the SaaS architecture and related processes, gradually migrate more and more functionality, and once functional parity is close, move existing customers on to your SaaS platform.
To find those new customers interested in placing their bets on your initial SaaS MVP, you’ll need to shift your current focus on the right side of the Technology Adoption Lifecycle (TALC) to the left - from your current ‘Late Majority’ or ‘Laggards’ to ‘Early Adopters’ or ‘Early Majority’. Ideally, those customers on the left side of the TALC will be slightly more forgiving of the ‘learnings’ you’ll face along the way, as well as prove to be far more valuable partners with you as you further enhance your MVP.
The key is to think out of the existing box your customers are in, to reset your TALC targeting and to consider a new breed of customer, one that doesn’t need all that you’ve built, is willing to be an early adopter, and will be a cooperative partner throughout the process.
Our next article on SaaS migration will touch on organizational approaches, particularly during the build-out of the SaaS product, and the paradigm shifts your product and engineering teams need to embrace in order to be successful.
AKF has led many companies on their journey to SaaS, often getting called in as that journey has been derailed. We’ve seen the many potholes and pitfalls and have learned how to avoid them. Let us help you move your product into the 21st century. See our SaaS Migration service
Subscribe to the AKF Newsletter
Hosting Lessons from Harvey and Irma
September 19, 2017 | Posted By: Greg Fennewald
Everyone was saddened to see the horrific destruction storms caused to Houston and Florida, including deaths and extensive property damage. It seems reasonable that the impact of these hurricanes was lessened by advanced notice and preparation – stockpiling supplies, evacuating the highest risk areas, and staging response resources to assist with recovery and rebuilding.
Data centers operate every day with a similar preparation mindset: diesel generators to provide power should the utility fail, batteries to keep servers running during a transition, potentially stored water or a well to replace municipal water service for cooling systems, and food and water for personnel unable to leave the location.
What happens when a “prepared” location such as a data center encounters a hurricane with strong winds, heavy rain, and extensive flooding? In some cases, the data center survives without impact, although there certainly will be outages and failures. Examples of data centers surviving Harvey in good shape can be seen here, while accounts of the service impacts caused by Hurricane Sandy can be seen here.
Data Center Points of Failure
Let’s examine what may enable a data center to survive without functional impact. Extensive risk investigation goes into site selection for data centers. Data centers are expensive to build with costs measured in the tens or even hundreds of millions of dollars. The potential business impact of a failure can be costly with liquidated damage clauses in hosting contracts. These factors lead to data centers being located outside of flood plains, away from hazardous material routes, and stoutly constructed to endure storm winds likely in the region.
Losing utility power is regarded as a “when” not an “if” in the data center industry (be that an outage or a planned maintenance activity), and diesel generators are a common solution, often with 24 hours or more of fuel on hand and multiple replenishment contracts. Data centers can survive for days/weeks without utility power, and in some cases for months. How could flooding impact power? The service entrance for a data center, where the utility power is routed, is often buried underground. Utility power is likely to be lost during flooding, either from damage due to flooding or intentional actions to prevent damage by shutting down the local grid. A data center would operate on generator if the data center itself is not flooded, although fuel replenishment is not likely. If there are two feet of water in the main electrical room(s), the data center is going dark.
Many large data centers rely on evaporating water to cool the servers it hosts. Evaporative cooling is generally more energy efficient than other options, but introduces an additional risk to operations – water supply. In many locations, municipal water pressure is lost during an extend power outage. Data centers can mitigate this risk by using water storage tanks or water wells onsite. Like diesel generators, the data centers can operate normally for hours or days without municipal water. A data center should be outside the flood plain, able to operate without utility power or municipal water for hours or days, is structurally strong enough to handle the winds of a major storm – is there any other risk to mitigate? Network connectivity and bandwidth.
Most data centers need to communicate with other data centers to fulfill their OLAP or OLTP purpose. Without connectivity, services are not available. Data should be fine, but it is becoming increasingly stale. Transactions and traffic are done. Like utility power, network connections are usually buried. With distance and geographic limitations involved, network pathways may get flooded as may the facilities that aggregate and transmit the data. Telecom facilities generally have generators and other availability measures, but can be forced into less advantageous locations and may have a shorter runtime standard than a data center.
Data centers that are serious about availability generally have carrier diversity and physical pathway diversity to mitigate carrier outages and “backhoe fades”. This may help in the event of widespread flooding as well. The reality is a data center without connectivity is generally useless. All the risk mitigation going into structural design, power and cooling redundancy, and fire protection is moot if connectivity fails.
Preparing for the Inevitable
The best way to mitigate these risks is to not rely on a single data center location. One is none and two is one. Owned, colo, managed hosting, or cloud – be able to survive the loss of a single location. The RTO and RPO of the business will guide the choice of active – active, hot – cold, or data backup with an elastic compute response plan. Hurricanes can cause regional impact, such as Irma disrupting most of Florida. In years past, many companies decided to have two data center within 20 miles of each other to support synchronous data base replication. A primary site in one borough of New York City, and the DR site in a different borough. Replication options and data base management techniques have advanced sufficiently to allow far greater dispersion today. Avoid a regionally impacting event by choosing data centers in diverse regions.
Operating from 3 locations can be cheaper than 2, and can also improve customer satisfaction with reduced response times produced by serving customers from the nearest location. See Rule 12 in Scalability Rules. The ability to operate from multiple locations also enables a choice to adjust the redundancy of those locations. A combination of Tier II and III locations may be a more economical choice than a pair of Tier IV locations.
Developing a hosting plan can be complicated and frustrating, particularly since the core competency of your business is likely not data centers. AKF Partners can help – not only with hosting strategy, but also the product architecture and operational processes needed to weld infrastructure, architecture, and process into a seamless vehicle that delivers services to your clients with availability the market demands.
Hurricanes aren’t the only disasters that can take down your data center. Solar flares, runaway SUVs, civil disruption, tornadoes and localized power outages have all caused data centers to fail. Natural disasters of all types trail equipment failures and human error as causes of service impacting events (source: 365DataCenters). According to FEMA, 40% of businesses that close due to a disaster don’t reopen, and of those that do only 29% are in business two years after the disaster (source: FEMA). Don’t be a statistic. AKF Partners can help you with the product architecture and data center planning necessary to survive nearly any disaster.
Reach out to AKF
Subscribe to the AKF Newsletter
How an AI bot beat the world's best gamers
September 5, 2017 | Posted By: Roger Andelin
Last month, a bot developed by OpenAI (co-founded by Elon Musk) beat the world’s best, pro Dota 2 players. This is another milestone accomplishment in the field of artificial intelligence and machine learning and more fuel for the fire of concerns surrounding the AI debate. However, before we jump into that debate, here is some background you should understand about the technology fueling this debate.
The Evolution of Traditional Programming
A lot of what computer programming is can be simplified into three steps. First step, read in some data. Second step, do something with that data. Third step, output some result.
For example, imagine you want to fly somewhere for the weekend. You may first go to your travel app and input some dates, times, number of people traveling, airports, etc. Second, the app uses that information to search its database of available flights. Third, it returns a list of available flights for you to see.
This approach to software design has been the norm since the earliest days of programming. Artificial intelligence, in particular machine learning, has changed that approach. The first step is still the same: Read in some data. The third step is the same: Output some result.
However, with artificial intelligence technologies like machine learning, the second step, doing something with the data, is very different. In the example of finding a flight, a programmer easily can read the software code to understand the sequence of steps the computer has been programmed to do to produce the output data. If the programmer wants to change or improve the program’s behavior she can do that by writing new code or by altering the existing code. For example, if you wanted to compare the prices for available flights near the dates you have selected, a programmer can easily change several lines of code in the program to do just that. The programming code identifies every step the computer takes to arrive at its output. Said another way, the program only does what it’s specifically told to do in the code, nothing more or nothing less.
By contrast, the output of today’s most common machine learning programs is not determined by instructions written in computer code. There is no code for a programmer to read or modify when a change is desired. The output is determined by the program’s neural network.
Neural Networks in Action
What is a neural network? At the core of a neural network is a neuron. Similar to a traditional computer program, a neuron takes some input data, does a mathematical calculation on that data and then outputs some data. A typical neuron in a neural network will receive as input hundreds to thousands of numbers, typically between 0 and 1. A neuron will then multiply each number by a weight and sum the result of all the numbers. Many neurons will then convert the result into a number between 0 and 1. That result is then sent to the next neuron in sequence until the final output neuron is reach.
Here is an example of the math a typical neuron will do: If “x1, x2, x3…” represents input data and “w1, w2, w3…” represent the weights stored in the neuron, the calculation done by the neuron in a neural network looks like this: x1*w1 + x2*w2 + x3*w3 and so forth.
You can think of the calculation inside the neuron in a different way: The neuron is reading in a bunch of numbers and the weights in the neuron determine the importance, “or weight” of that input in producing the output. If the input is not important the weight for the input will be near zero and the input is not passed along to the next neuron. Therefore, the weights in a neuron effectively decided what input is valuable and what input should be ignored.
In a neural network, neurons like the one I described above are connected in parallel and in series to create a matrix of neurons. The input data to a neural network will go into hundreds or thousands of neurons in parallel, all with different weights. The output of those neurons is then sent to another layer of neurons and so forth, usually multiple layers deep. This is called a deep neural network. Another way to look at this is the neurons are grouped into a matrix of rows and columns, all interconnected. The final layer of the neural network is the output layer. Therefore, the final output of a neural network is the result of millions of calculations done by the neurons of that network.
When a programmer creates a neural network in software, the weights for each neuron are initially just random numbers. In other words, the weights arbitrarily decide to either diminish, increase or leave the input data alone, and output from the network is random. However, through a process called training, the weights move from randomly assigned values to values that can produce very useful outputs.
Training is both a time consuming and complicated mathematical process. However, it is much like training you and I would do to get better at something. For example, let’s say I wanted to learn how to shoot and arrow with a bow. I might pick up the bow and arrow, point it at the target, pull back the string and release. In my case, I know the arrow would miss the target. Therefore, I would try again and again making corrections to my aim based on on how far and which direction I was off from the target.
During the training process for a neural network, the weights in each of the neurons are changed slightly to improve the output, or “aim.” The most common approach for making those changes is called backpropagation. Backpropagation is a mathematical approach for applying corrections to every weight in every neuron of the network. During training, input is fed into the network and output is generated. The output is compared to the desired target and the difference between the output and the target is the error. Using the error, backpropagation makes changes to the weights in each neuron to reduce the error. If all goes well during training and backpropagation, the output error diminishes until it reaches expert or better than expert level.
AI vs Humans
In the case of the OpenAI Dota bot that recently beat the world’s best Dota 2 player, the outputs, which were a sequence of steps, strategies and decisions, went from random moves to moves that were so good the bot was able to easily defeat the best pro players in the world. The critical information that enabled the bot to win its matches was stored in the weights of the neurons and the neural network architecture itself.
A good question at this point is to ask if a programmer looking at the Dota 2 bot’s neural network could understand the steps taken by the bot to beat the human player. The answer is no. A programmer can see areas of the neural network that influence an output but it is not possible to explain why the bot took specific steps to formulate its moves and strategies. All the programmer would see is a huge matrix of weights that would be quite overwhelming to interpret.
Another good question to ask is whether or not a program written traditionally by a programmer with step by step instructions could beat the best Dota 2 player. The answer is no. Step by step programs where the programmer specifically instructs the computer to do something would easily be defeated by a professional player. However, a neural network can learn from training things that a programmer would never have the knowledge to program, store that learning in its neurons and use that learning to do things like defeat a human pro.
What makes the Dota 2 bot special is that it learned to beat the best pro players by playing against itself whereas most machine learning programs learn from training on data given to it by a programmer. In machine learning, good training data is like gold. It’s scarce and valuable. (note: This is one reason why Google and other big tech companies want to collect so much data.) Data is used to train neural networks to do useful things like recognizing people and places in your pictures or recognizing your voice from others in your family. OpenAI built a bot that learned almost entirely by playing against itself with the exception of some coaching provided by the OpenAI team. OpenAI has shown clearly that learning can occur without having tons of training data. It’s a little like being able to make gold.
Does the development of the OpenAI dota bot mean bots can now decide to train against themselves and become super bots? No. But it does say that humans can now program two bots to train against each other to become superbots. The key enabler being us. It’s anyone’s guess what type of bot can be imagined and developed in this way, useful or harmful. Obviously to most, a gaming superbot seems pretty innocuous, except of course to the gamer who may unexpectedly run into one during a match. However, it’s not hard to imagine super bots that are not so harmless. Or, perhaps you can just imagine a time when someone trains a bot to play football against itself until the bot becomes better at calling plays and strategy than every coach in the NFL. What happens then? The answer is disruption. Are you ready for it?
AKF Partners recommends that boards and executives direct their teams to identify sources of innovation and patterns of disruption that AI techniques may represent within their respective markets Walmart is already working on facial recognition technology in their stores to determine whether or not shoppers are satisfied at checkout. Will this give them a potential advantage over Amazon? How can machine learning and AI help you prevent fraud in your payment systems or the use of your commerce system to launder money?
AKF is prepared to help answer that question and others you may be facing. We will help you craft your AI strategy, sort through the hype, help you find the opportunities, and identify the potential threats of AI technology to your business.
Reach out to AKF
Subscribe to the AKF Newsletter
When Should You Split Services?
April 3, 2017 | Posted By: AKF
It seems that everyone is on the microservice architecture bus these days (splits by the Y axis on the Scale Cube). One question we commonly receive as companies create their first microservice solution, or as they transition from a monolithic architecture to a microservice architecture is “How large should any of my services be?”.
TTo help answer these questions, we’ve put together a list of considerations based on developer throughput, availability, scalability, and cost. By considering these, you can decide if your application should be grouped into a large, monolithic codebases or split up into smaller individual services and swim lanes. You must also keep in mind that splitting too aggressively can be overly costly and have little return for the effort involved. Companies with little to no growth will be better served focusing their resources on developing a marketable product than by fine tuning their service sizes using the considerations below.
Frequency of Change – Services with a high rate of change in a monolithic codebase cause competition for code resources and can create a number of time to market impacting conflicts between teams including product merge conflicts. Such high change services should be split off into small granular services and ideally placed in their own fault isolative swim lane such that the frequent updates don’t impact other services. Services with low rates of change can be grouped together as there is little value created from disaggregation and a lower level of risk of being impacted by updates.
The diagram below illustrates the relationship we recommend between functionality, frequency of updates, and relative percentage of the codebase. Your high risk, business critical services should reside in the upper right portion being frequently updated by small, dedicated teams. The lower risk functions that rarely change can be grouped together into larger, monolithic services as shown in the bottom left.
Degree of Reuse – If libraries or services have a high level of reuse throughout the product, consider separating and maintaining them apart from code that is specialized for individual features or services. A service in this regard may be something that is linked at compile time, deployed as a shared dynamically loadable library or operate as an independent runtime service.
Team Size – Small, dedicated teams can handle micro services with limited functionality and high rates of change, or large functionality (monolithic solutions) with low rates of change. This will give them a better sense of ownership, increase specialization, and allow them to work autonomously. Team size also has an impact on whether a service should be split. The larger the team, the higher the coordination overhead inherent to the team and the greater the need to consider splitting the team to reduce codebase conflict. In this scenario, we are splitting the product up primarily based on reducing the size of the team in order to reduce product conflicts. Ideally splits would be made based on evaluating the availability increases they allow, the scalability they enable or how they decrease the time to market of development.
Specialized Skills – Some services may need special skills in development that are distinct from the remainder of the team. You may for instance have the need to have some portion of your product run very fast. They in turn may require a compiled language and a great depth of knowledge in algorithms and asymptotic analysis. These engineers may have a completely different skillset than the remainder of your code base which may in turn be interpreted and mostly focused on user interaction and experience. In other cases, you may have code that requires deep domain experience in a very specific area like payments. Each of these are examples of considerations that may indicate a need to split into a service and which may inform the size of that service.
Availability and Fault Tolerance Considerations:
Desired Reliability – If other functions can afford to be impacted when the service fails, then you may be fine grouping them together into a larger service. Indeed, sometimes certain functions should NOT work if another function fails (e.g. one should not be able to trade in an equity trading platform if the solution that understands how many equities are available to trade is not available). However, if you require each function to be available independent of the others, then split them into individual services.
Criticality to the Business – Determine how important the service is to business value creation while also taking into account the service’s visibility. One way to view this is to measure the cost of one hour of downtime against a day’s total revenue. If the business can’t afford for the service to fail, split it up until the impact is more acceptable.
Risk of Failure – Determine the different failure modes for the service (e.g. a billing service charging the wrong amount), what the likelihood and severity of each failure mode occurring is, and how likely you are to detect the failure should it happen. The higher the risk, the greater the segmentation should be.
Scalability of Data – A service may be already be a small percentage of the codebase, but as the data that the service needs to operate scales up, it may make sense to split again.
Scalability of Services – What is the volume of usage relative to the rest of the services? For example, one service may need to support short bursts during peak hours while another has steady, gradual growth. If you separate them, you can address their needs independently without having to over engineer a solution to satisfy both.
Dependency on Other Service’s Data – If the dependency on another service’s data can’t be removed or handled with an asynchronous call, the benefits of disaggregating the service probably won’t outweigh the effort required to make the split.
Effort to Split the Code – If the services are so tightly bound that it will take months to split them, you’ll have to decide whether the value created is worth the time spent. You’ll also need to take into account the effort required to develop the deployment scripts for the new service.
Shared Persistent Storage Tier – If you split off the new service, but it still relies on a shared database, you may not fully realize the benefits of disaggregation. Placing a readonly DB replica in the new service’s swim lane will increase performance and availability, but it can also raise the effort and cost required.
Network Configuration – Does the service need its own subdomain? Will you need to make changes load balancer routing or firewall rules? Depending on the team’s expertise, some network changes require more effort than others. Ensure you consider these changes in the total cost of the split.
The illustration below can be used to quickly determine whether a service or function should be segmented into smaller microservices, be grouped together with similar or dependent services, or remain in a multifunctional, infrequently changing monolith.
Subscribe to the AKF Newsletter
Splitting Databases for Scale
April 3, 2017 | Posted By: AKF
The most common point of congestion and therefore barrier to scale that we see in our practice is the database. Referring back to our earlier article “Splitting Applications or Services for Scale”, it is very common for engineers to create scalability along the X axis of our cube by persisting data in a single monolithic database and having multiple “cloned” applications servers retrieve and store data within that database. For young companies this is a very good approach as if done properly it will also eliminate the need for persistence or affinity to a given application server and as a result will increase customer perceived availability.
The problem, however, with this single monolithic data structure is threefold:
- Even with clustering technology (the existence of a second physical system or database that can take the load of the first in the event of failure), failures of the primary database will result in short service outages for 100% of the user community.
- This approach ultimately relies solely on technical improvements in cpu speed, memory access speed, memory access size, mass
storage access speeds and size, etc to insure the companies needs for scale.
- Relying upon (2) above in the extreme cases is not the most cost effective solutions as the newest and fastest technologies come at
a premium to older generations of technology and do not necessarily have the same processing power per dollar as older and/or
smaller (fewer cpus etc) systems.
As we have argued in the aforementioned post, a great engineering team will think about how to scale their platform well in advance of the need to rely solely upon partner technology advances. By making small modifications to our previously presented “Scale Cube”, the same concepts applied to the problem of splitting services for scale can be useful in addressing how to split a database for scale. As with the AKF Services Scale Cube, the AKF Database Scale Cube consists of an X, Y and Z axes – each addressing a different approach to scale transactions applied to a database. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic database – a case where all data is located in a single location and all accesses go to this single database.
The X Axis of the cube represents a means of spreading load across multiple instances of a replicated representation of the data. This is the first approach most companies use in scaling databases and is often both the easiest to implement and the least costly in both engineering time and hardware. Many third party and open source databases have native properties or functions that will allow the near real time replication of data to multiple “read databases”. The engineering cost of such an approach is low as typically database calls only need to be identified as a “read” or “write” and sent to the appropriate write database or bank of read databases. The “bank” of read databases should have reads evenly split across this if possible and many companies employ simple 3d party load balancers to perform this distribution.
Included in our Xaxis split are third party and open source caching solutions that allow reads to be split across “cache” hosts before actually reading from a database upon a cache miss. Caching is another simple way to reduce the load on the database but in our experience is not sufficient for hyper growth SaaS sites. If implemented properly, this Xaxis split also can increase availability as if replication is near real time, a read server can be promoted as the singular “write server” in the event of a “write server” failure. The combination of caching and read/write splits (our X axis) is sufficient for many companies but for companies with extreme hyper growth and massive data retention needs it is often not enough.
The Y Axis of our database cube represents a split by function, service or resource just as it did with the service cube. A service might represent a set of usecases and is most often easiest to envision through thinking of it as a verb or action like “login” and a resource oriented split is easiest to envision by thinking of splits as nouns like “account information”. These splits help handle not only the split of transactions across multiple systems as did the X axis, but can also be helpful in speeding up database calls by allowing more information specific to the request to be held in memory rather than needing to make a disk access. Just as with our approach in scaling services, our recommended approach to identify the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team as very often they will require that the application be split up as well. It is possible to take a monolithic application and perform physical splits by say URL/URI to different service or resource oriented pools. While this approach will help spread transaction processing across multiple systems similar to our X axis implementation it may not offer the added benefit of reducing the amount of system memory required by service / pool / resource / application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to focus on specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as “swimlaning” an application and data set, especially when both the database and applications are split to represent a “failure domain” or fault isolative infrastructure.
The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (hash for instance). The most common way to view this is to consider splitting your resources by customer if your entity relationships allow that to happen. In the world of media, you might consider splitting it by article_id or media_id and in the world of commerce a split by product_id might be appropriate. In the case where you split customers from your products and perform splits within customers and products you would be implementing both a Y axis split (splitting by resource or call – customers and products) and a Z axis split (a
modulus of customers and products within their functional splits).
Z axis splits tend to be the most costly for an engineering team to perform as often many functions that might be performed within the database (joins for instance) now need to be performed within the application. That said, if done appropriately they represent the greatest potential for scale for most companies.
Subscribe to the AKF Newsletter
Splitting Applications or Services for Scale
April 3, 2017 | Posted By: AKF
Splitting Applications or Services for Scale
Most internet enabled products start their life as a single application running on an appserver or appserver/webserver combination and potentially communicating with a database. Many if not all of the functions are likely to exist within a monolithic application code base making use of the same physical and virtual resources of the system upon which the functions operate: memory, cpu, disk, network interfaces, etc. Potentially the engineers have the forethought to make the system highly available by positioning a second application server in the mix to be used in the event that the first application server fails.
This monolithic design will likely work fine for many sites that receive low levels of traffic. However, if the product is very successful and receives wide and fast adoption user perceived response times are likely to significantly degrade to the point that the product is almost entirely unusable. At some point, the system will likely even fail under the load as the inbound request rate is significantly greater than the processing power of the system and the resulting departure rate of responses to requests.
A great engineering team will think about how to scale their platform well in advance of such a catastrophic failure. There are many ways to approach how to think about such scalability of a platform and we present several through a representation of a three dimensional cube addressing three approaches to scale that we call the AKF Scale Cube.
The AKF Scale Cube (aka Scale Cube and AKF Cube) consists of an X, Y and Z axes – each addressing a different approach to scale a service. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic service or product identified above: a product wherein all functions exist within a single code base on a single server making use of that server’s finite resources of memory, cpu speed, network ports, mass storage, etc.
The X Axis of the cube represents a means of spreading load across multiple instances of the same application and data set. This is the first approach most companies use to scale their services and it is effective in scaling from a request per second perspective. Oftentimes it is sufficient to handle the scale needs of a moderate sized business. The engineering cost of such an approach is low compared to many of the other options as no significant rearchitecting of the code base is required unless the engineering team needs to eliminate affinity to a specific server because the application maintains state. The approach is simple: clone the system and service and allow it to exist on N servers with each server handling 1/Nth the total requests. Ideally the method of distribution is a loadbalancer configured in a highly available manner with a passive peer that becomes active should the active peer fail as a result of hardware or software problems. We do not recommend leveraging roundrobin DNS as a method of load balancing. If the application does maintain state there are various ways of solving this including a centralized state service, redesigning for statelessness, or as a last resort using the load balancer to provide persistent connections. While the Xaxis approach is sufficient for many companies and distributes the processing of requests across several hosts it does not address other potential bottlenecks like memory constraints where memory is used to cache information or results.
The Y Axis of the cube represents a split by function, service or resource. A service might represent a set of usecases and is most often easiest to envision through thinking of it as a verb or action like “login” and a resource oriented split is easiest to envision by thinking of splits as nouns like “account information”. These splits help handle not only the split of transactions across multiple systems as did the X axis, but can also be helpful in reducing or distributing the amount of memory dedicated to any given application across several systems. A recommended approach to identify the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team as very often they will require that the application be split up as well. As a quick first step, a monolithic application can be placed on multiple servers and dedicate certain of those servers to specific “services” or URIs. While this approach will help spread transaction processing across multiple systems similar to our X axis implementation it may not offer the added benefit of reducing the amount of system memory required by service/pool/resource/application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to focus on specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as
“swimlaning” an application.
The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (hash for instance). As with the Y axis split, this split aids not only fault isolation, but significantly reduces the amount of memory necessary
(caching, etc) for most transactions and also reduces the amount of stabile storage to which the device/service needs attach. In this case, you might try a modulus by content id (article), or listing id, or a hash from the received IP address, etc. The Z axis split is often the most costly of all splits and we only recommend it for clients that have hypergrowth or very high rates of transaction. It should only be used after a company has implemented a very granular split along the Y axis. That said, it also can offer the greatest degree of scalability as the number of “swimlanes within swimlanes” that it creates is virtually limitless. For instance, if a company implements a Z axis split as a modulus of some transaction id and the implementation is a configurable number “N”, then N can be 10, 100, 1000, etc and each order of magnitude increase in N creates nearly an order of magnitude of greater scale for the company.
Subscribe to the AKF Newsletter
1 2 >