September 10, 2018 | Posted By: Robin McGlothin
The Scalability Cube – Your Guide to Evaluating Scalability
Perhaps the most common question we get at AKF Partners when performing technical due diligence on a company is, “Will this thing scale?” After all, investors want to see a return on their investment in a company, and a common way to achieve that is to grow the number of users on an application or platform. How do they ensure that the technology can support that growth? By evaluating scalability.
Let’s start by defining scalability from the technical perspective. The Wikipedia definition of “scalability” is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. That definition is accurate when applied to common investment objectives. The question is, what are the key attributes of software that allow it to scale, along with the anti-patterns that prevent scaling? Or, in other words, what do we look for at AKF Partners when determining scalability?
While an exhaustive list is beyond the scope of this blog post, we can quickly use the Scalability Cube and apply the analytical methodology that helps us quickly determine where the application will experience issues.
AKF Partners introduced the scalability cube, a scale design model for building resilience application architectures using patterns and practices that apply broadly to any application. This is a best practices model that describes all scale dimensions from “The Art of Scalability” book (AKF Partners – Abbot, Keeven & Fisher Partners).
The “Scale Cube” is composed of an X-axis, Y-axis, and Z-axis:
1. Technical Architectural Layering (X-Axis ) – No single points of failure. Duplicate everything.
2. Functional Decomposition Segmentation – Componentization to Modules & Microservices (Y-Axis). Split Report, Message, Locate, Forms, Calendar into fault isolated swim lanes.
3. Horizontal Data Partitioning - Shards (Z-Axis). Beginning with pilot users, start with “podding” users for scalability and availability.
The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed. Scalability is all about the capability of a design to support ever growing client traffic without compromising performance. It is important to understand there are no “silver bullets” in designing scalable solutions.
An architecture is scalable if each layer in the multi-layered architecture is scalable. For example, a well-designed application should be able to scale seamlessly as demand increases and decreases and be resilient enough to withstand the loss of one or more computer resources.
Let’s start by looking at the typical monolithic application. A large system that must be deployed holistically is difficult to scale. In the case where your application was designed to be stateless, scale is possible by adding more machines, virtual or physical. However, adding instances requires powerful machines that are not cost-effective to scale. Additionally, you have the added risk of extensive regression testing because you cannot update small components on their own. Instead, we recommend a microservices-based architecture using containers (e.g. Docker) that allows for independent deployment of small pieces and the scale of individual services instead of one big application.
Monolithic applications have other negative effects, such as development complexity. What is “development complexity”? As more developers are added to the team, be aware of the effects suffering from Brooks’ Law. Brooks’ law states that adding more software developers to a late project makes the project even later. For example, one large solution loaded in the development environment can slow down a developer and gets worse as more developers add components. This causes slower and slower load times on development machines, and developers stomping on each other with changes (or creating complex merges) as they modify the same files.
Another example of development complexity issue is large outdated pieces of the architecture or database where one person is an expert. That person becomes a bottleneck to changes in a specific part of the system. As well, they are now a SPOF (single point of failure) if they are the only resource that understands the monolithic beast. The monolithic complexity and the rate of code change make it hard for any developer to know all the idiosyncrasies of the system, hence more defects are introduced. A decoupled system with small components helps prevents this problem.
When validating database design for appropriate scale, there are some key anti-patterns to check. For example:
• Do synchronous database accesses block other connections to the database when retrieving or writing data? This design can end up blocking queries and holding up the application.
• Are queries written efficiently? Large data footprints, with significant locking, can quickly slow database performance to a crawl.
• Is there a heavy report function in the application that relies on a single transactional database? Report generation can severely hamper the performance of critical user scenarios. Separating out read-only data from read-write data can positively improve scale.
• Can the data be partitioned across different load databases and/or database servers (sharding)? For example, Customers in different geographies may be partitioned to various servers more compatible with their locations. In turn, separating out the data allows for enhanced scale since requests can be split out.
• Is the right database technology being used for the problem? Storing BLOBs in a relational database has negative effects – instead, use the right technology for the job, such as a NoSQL document store. Forcing less structured data into a relational database can also lead to waste and performance issues, and here, a NoSQL solution may be more suitable.
We also look for mixed presentation and business logic. A software anti-pattern that can be prevalent in legacy code is not separating out the UI code from the underlying logic. This practice makes it impossible to scale individual layers of the application and takes away the capability to easily do A/B testing to validate different UI changes. Layer separation allows putting just enough hardware against each layer for more minimal resource usage and overall cost efficiency. The separation of the business logic from SPROCs (stored procedures) also improves the maintainability and scalability of the system.
Another key area we dig for is stateful application servers. Designing an application that stores state on an individual server is problematic for scalability. For example, if some business logic runs on one server and stores user session information (or other data) in a cache on only one server, all user requests must use that same server instead of a generic machine in a cluster. This prevents adding new machine instances that can field any request that a load balancer passes its way. Caching is a great practice for performance, but it cannot interfere with horizontal scale.
Finally, long-running jobs and/or synchronous dependencies are key areas for scalability issues. Actions on the system that trigger processing times of minutes or more can affect scalability (e.g. execution of a report that requires large amounts of data to generate). Continuing to add machines to the set doesn’t help the problem as the system can never keep up in the presence of many requests. Blocking operations exasperate the problem. Look for solutions that queue up long-running requests, execute them in the background, send events when they are complete (asynchronous communication) and do not tie up key application and database servers. Communication with dependent systems for long-running requests using synchronous methods also affects performance, scale, and reliability. Common solutions for intersystem communication and asynchronous messaging include RabbitMQ and Kafka.
Again, the list above is not exhaustive but outlines some key areas that AKF Partners look for when evaluating an architecture for scalability. If you’re looking for a checklist to help you perform your own diligence, feel free to use ours. If you’re wondering more about our diligence practice, you may be interested in our thoughts on best practices, or our beliefs around diligence and how to get it right. We’ve performed technical diligence for seed rounds, A-series and beyond, carve-outs, strategic investments and taking public companies private. From $5 million invested to over $1 billion. No matter the size of company or size of the investment, we can help.
Scaling Your Systems in the Cloud - AKF Scale Cube Explained
September 5, 2018 | Posted By: Pete Ferguson
Scalability doesn’t somehow magically appear when you trust a cloud provider to host your systems. While Amazon, Google, Microsoft, and others likely will be able to provide a lot more redundancy in power, network, cooling, and expertise in infrastructure than hosting yourself – how you are set up using their tools is still very much up to your budget and which tools you choose to utilize. Additionally, how well your code is written to take advantage of additional resources will affect scalability and availability.
We see more and more new startups in AWS or Azure – in addition to assisting well-established companies make the transition to the cloud. Regardless of the hosting platform, in our technical due diligence reviews we often see the same scalability gaps common to hosted solutions written about in our first edition of “Scalability Rules.” (Abbott, Martin L.. Scalability Rules: Principles for Scaling Web Sites. Pearson Education.)
This blog is a summary recap of the AKF Scale Cube (much of the content contains direct quotes from the original text), an explanation of each axis, and how you can be better prepared to scale within the cloud.
Scalability Rules – Chapter 2: Distribute Your Work
Using ServiceNow as an early example of designing, implementing, and deploying for scale early in its life, we outlined how building in fault tolerance helped scale in early development – and a decade + later the once little known company has been able to keep up with fast growth with over $2B in revenue and some forecasts expecting that number to climb to $15B in the coming years.
So how did they do it? ServiceNow contracted with AKF Partners over a number of engagements to help them think through their future architectural needs and ultimately hired one of the founding partners to augment their already-talented engineering staff.
“The AKF Scale Cube was helpful in offsetting both the increasing size of our customers and the increased demands of rapid functionality extensions and value creation.”
~ Tom Keevan (Founding Partner, AKF Partners)
The original scale cube has stood the test of time and we have used the same three-dimensional model with security, people development, and many other crucial organizational areas needing to rapidly expand with high availability.
At the heart of the AKF Scale Cube are three simple axes, each with an associated rule for scalability. The cube is a great way to represent the path from minimal scale (lower left front of the cube) to near-infinite scalability (upper right back corner of the cube). Sometimes, it’s easier to see these three axes without the confined space of the cube.
X Axis – Horizontal Duplication
The X Axis allows transaction volumes to increase easily and quickly. If data is starting to become unwieldy on databases, distributed architecture allows for reducing the degree of multi-tenancy (Z Axis) or split discrete services off (Y Axis) onto similarly sized hardware.
A simple example of X Axis splits is cloning web servers and application servers and placing them behind a load balancer. This cloning allows the distribution of transactions across systems evenly for horizontal scale. Cloning of application or web services tends to be relatively easy to perform and allows us to scale the number of transactions processed. Unfortunately, it doesn’t really help us when trying to scale the data we must manipulate to perform these transactions as memory caching of data unique to several customers or unique to disparate functions might create a bottleneck that keeps us from scaling these services without significant impact on customer response time. To solve these memory constraints we’ll look to the Y and Z Axes of our scale cube.
Y Axis – Split by Function, Service, or Resource
Looking at a relatively simple e-commerce site, Y Axis splits resources by the verbs of signup, login, search, browse, view, add to cart, and purchase/buy. The data necessary to perform any one of these transactions can vary significantly from the data necessary for the other transactions.
In terms of security, using the Y Axis to segregate and encrypt Personally Identifiable Information (PII) to a separate database provides the required security without requiring all other services to go through a firewall and encryption. This decreases cost, puts less load on your firewall, and ensures greater availability and uptime.
Y Axis splits also apply to a noun approach. Within a simple e-commerce site data can be split by product catalog, product inventory, user account information, marketing information, and so on.
While Y axis splits are most useful in scaling data sets, they are also useful in scaling code bases. Because services or resources are now split, the actions performed and the code necessary to perform them are split up as well. This works very well for small Agile development teams as each team can become experts in subsets of larger systems and don’t need to worry about or become experts on every other part of the system.
Z Axis – Separate Similar Things
Z Axis splits are effective at helping you to scale customer bases but can also be applied to other very large data sets that can’t be pulled apart using the Y Axis methodology. Z Axis separation is useful for containerizing customers or a geographical replication of data. If Y Axis splits are the layers in a cake with each verb or noun having their own separate layer, a Z Axis split is having a separate cake (sharding) for each customer, geograpy, or other subset of data.
This means that each larger customer or geography could have its own dedicated Web, application, and database servers. Given that we also want to leverage the cost efficiencies enabled by multitenancy, we also want to have multiple small customers exist within a single shard which can later be isolated when one of the customers grows to a predetermined size that makes financial or contractual sense.
For hyper-growth companies the speed with which any request can be answered to is at least partially determined by the cache hit ratio of near and distant caches. This speed in turn indicates how many transactions any given system can process, which in turn determines how many systems are needed to process a number of requests.
Splitting up data by geography or customer allows each segment higher availability, scalability, and reliability as problems within one subset will not affect other subsets. In continuous deployment environments, it also allows fragmented code rollout and testing of new features a little at a time instead of an all-or-nothing approach.
This is a quick and dirty breakdown of Scalability Rules that have been applied at thousands of successful companies and provided near infinite scalability when properly implemented. We love helping companies of all shapes and sizes (we have experience with development teams of 2-3 engineers to thousands). Contact us to explore how we can help guide your company to scale your organization, processes, and technology for hyper growth!
Subscribe to the AKF Newsletter
The Top 20 Technology Blunders
July 20, 2018 | Posted By: Pete Ferguson
One of the most common questions we get is “What are the most common failures you see tech and product teams make?” To answer that question we queried our database consisting of 11 years of anonymous client recommendations. Here are the top 20 most repeated failures and recommendations:
1) Failing to Design for Rollback
If you are developing a SaaS platform and you can only make one change to your current process make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad release in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward” for you to agree this is of the utmost importance. The one thing that is most likely to give you an opportunity to find other work (i.e. “get fired”) is to roll a product that destroys your business. In other words, if you are new to your job DO THIS BEFORE ANYTHING ELSE; if you have been in your job for awhile and have not done this DO THIS TOMORROW. (Related Content: Monitoring for Improved Fault Detection)
2) Confusing Product Release with Product Success
Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives like a release increasing signups by 10%, or increasing checkouts by 15% or increasing the average sale price of a all checkouts by 12% or increasing click-through-rates by 22%. See #10 below on incenting a culture of excellence. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholder’s wealthy! (Related Content: Agile and the Cone of Uncertainty)
3) Insular Product Development / Engineering
How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised about some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Organize around your outcomes and “what you produce” in cross functional teams rather than around activities and “how you work.” (Related Content: The No Surprises Rule)
4) Over Engineering the Solution
One of our favorite company mottos is “simple solutions to complex problems”. The simpler the solution, the lower the cost and the faster the time to market. If you get blank stares from peers or within your organization when you explain a design do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in resolving the complexity.
Image Source: Hackernoon.com
5) Allowing History to Repeat itself
Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize the value of the team. In the operations world, a failure to correlate past site incidents and find thematically related root causes is a guarantee to continue to fight the same fires over and over. The best and easiest way to improve our future performance is to track our past failures, group them into groups of causation and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues and improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.
6) Vendor Lock
Every vendor has a quick fix for your scale issues. If you are a hyper growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. This is not to say that after you design your system to scale horizontally that you will not rely upon some technology to help you; rather, once you define how you can horizontally scale you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases (and NoSQL solutions) provide for multiple types of native replication to keep hosts in synch.
7) Relying on QA to Find Your Mistakes
You cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem and that is where the problem should be treated. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering! Consider implementing test driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. All of this is not to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering.
8) Revolutionary or “Big Bang” Fixes
In our experience, complete re-writes or re-architecture efforts end up somewhere on the spectrum of not returning the desired ROI to complete and disastrous failures. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cutover to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.
9) The Multiplicative Effect of Failure – Eliminate Synchronous Calls
Every time you have one service call another service in a synchronous fashion you are lowering your theoretical availability. If each of your services are designed to be 99.999% available, where a service is a database, application server, application, webserver, etc. then the product of all of the service calls is your theoretical availability. Five calls is (.99999)^5 or 99.995 availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.
10) Failing to Create and Incentivize a Culture of Excellence
Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth. (Related Content: Three Reasons Your Software Engineers May Not Be Successful)
11) Under-Engineer for Scale
The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now! That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.
12) “Not Built Here” Culture
We see this all the time. You may even have agreed with point (6) above because you have a “we are the smartest people in the world and we must build it ourselves” culture. The point of relying upon third parties to scale was not meant as an excuse to build everything yourselves. The real point to be made is that you have to focus on your core competencies and not dilute your engineering efforts with things that other companies or open source providers can do better than you. Unless you are building databases as a business, you are probably not the best database builder. And if you are not the best database builder, you have no business building your own databases for your SaaS platform. Focus on what you should be the best at: building functionality that maximizes your shareholder wealth and scaling your platform. Let other companies focus on the other things you need like routers, operating systems, application servers, databases, firewalls, load balancers and the like.
13) A New PDLC will Fix My Problems
Too often CTO’s see repeated problems in their product development life cycles such as missing dates or dissatisfied customers and blame the PDLC itself.
The real problem, regardless of the lifecycle you use, is likely one of commitment and measurement. For instance, in most Agile lifecycles there needs to be consistent involvement from the business or product owner. A lack of involvement leads to misunderstandings and delayed products. Another very common problem is an incomplete understanding or training on the existing PDLC. Everyone in the organization should have a working knowledge of the entire process and how their roles fit within it. Most often, the biggest problem within a PDLC is the lack of progress measurement to help understand likely dates and the lack of an appropriate “product discovery” phase to meet customer needs. (Related Content: The Top Five Most Common PDLC Failures)
14) Inability to Hire Great People Quickly
Often when growing an engineering team quickly the engineering managers will push back on hiring plans and state that they cannot possibly find, interview, and hire engineers that meet their high standards. We agree that hiring great people takes time and hiring decisions are some of the most important decisions managers can make. A poor hiring decision takes a lot of energy and time to fix. However, there are lots of ways to streamline the hiring process in order to recruit, interview, and make offers very quickly. A useful idea that we have seen work well in the past are interview days, where potential candidates are all invited on the same day. This should be no more than 2 - 3 weeks out from the initial phone screen, so having an interview day per months is a great way to get most of your interviewing in a single day. Because you optimize the interview process people are much more efficient and it is much less disruptive to the daily work that needs to get done the rest of the month. Post interview discussions and hiring decisions should all be made that same day so that candidates get offers or letters of regret quickly; this will increase the likelihood of offers being accepted or make a professional impression on those not getting offers. The key is to start with the right answer that “there is a way to hire great people quickly” and the myriad of ways to make it happen will be generated by a motivated leadership team.
15) Diminishing or Ignoring SPOFs (Single Point of Failure)
A SPOF is a SPOF and even if the impact to the customer is low it still takes time away from other work to fix right away in the event of a failure. And there will be a failure…because that is what hardware and software does, it works for a long time and then eventually it fails! As you should know by now, it will fail at the most inconvenient time. It will fail when you have just repurposed the host that you were saving for it or it will fail while you are releasing code. Plan for the worst case and have it run on two hosts (we actually recommend to always deploy in pools of three or more hosts) so that when it does fail you can fix it when it is most convenient for you.
16) No Business Continuity Plan
No one expects a disaster but they happen and if you cannot keep up normal operations of the business you will lose revenue and customers that you might never get back. Disasters can be huge, like Hurricane Katrina, where it take weeks or months to relocate and start the business back up in a new location. Disasters can also be small like a winter snow storm that keeps everyone at home for two days or a HAZMAT spill near your office that keeps employees from coming to work. A solid business continuity plan is something that is thought through ahead of time, before you need it, and explains to everyone how they will operate in the event of an emergency. Perhaps your satellite office will pick up customer questions or your tech team will open up an IRC channel to centralize communication for everyone capable of working remotely. Do you have enough remote connections through your VPN server to allow for remote work? Spend the time now to think through what and how you will operate in the event of a major or minor disruption of your business operations and document the steps necessary for recovery.
17) No Disaster Recovery Plan
Even worse, in our opinion, than not having a BC plan is not having a disaster recovery plan. If your company is a SaaS-based company, the site and services provided is the company’s sole source of revenue! Moreover, with a SaaS company, you hold all the data for your customers that allow them to operate. When you are down they are more than likely seriously impaired in attempting to conduct their own business. When your collocation facility has a power outage that takes you completely down, think 365 Main datacenter in San Francisco, how many customers of yours will leave and never return? Our preference is to provide your own disaster recovery through multiple collocation facilities but if that is not yet technically feasible nor in the budget, at a minimum you need your code, executables, configurations, loads, and data offsite and an agreement in place for both collocation services as well as hosts. Lots of vendors offer such packages and they should be thought of as necessary business insurance.
If you are cloud hosted, this still applies to you! We often find in technical due diligence reviews that small companies who are rapidly growing haven’t yet initiated a second active tech stack in a different availability zone or with a second cloud provider. Just because AWS, Azure and others have a fairly reliable track record doesn’t mean they always will. You can outsource services, but you still own the liability!
Image Source: Kaibizzen.com.au
18) No Product Management Team or Person
In a similar vein to #13 above, there needs to be someone or a team of people in the organization who have responsibility for the product lines. They need to have authority to make decisions about what features get added, which get delayed, and which get deprecated (yes, we know, nothing ever gets deprecated but we can always hope!). Ideally these people have ownership of business goals (see #10) so they feel the pressure to make great business decisions.
19) Failing to Implement Continuously
Just because you call it scheduled maintenance does not mean that it does not count against your uptime. While some of your customers might be willing to endure the frustration of having the site down when they want to access it in order to get some new features, most care much more about the site being available when they want it. They are on the site because the existing features serve some purpose for them; they are not there in the hopes that you will rollout a certain feature that they have been waiting on. They might want new features, but they rely on existing features. There are ways to roll code, even with database changes, without bringing the site down (back to #17 - multiple active sites also allows for continuous implementation and the ability to roll back). It is important to put these techniques and processes in place so that you plan for 100% availability instead of planning for much less because of planned down time.
20) Firewalls, Firewalls, Everywhere!
We often see technology teams that have put all public facing services behind firewalls while many go so far as to put firewalls between every tier of the application. Security is important because there are always people trying to do malicious things to your site, whether through directed attacks or random scripts port scanning your site. However, security needs to be balanced with the increased cost as well as the degradation in performance. It has been our experience that too often tech teams throw up firewalls instead of doing the real analysis to determine how they can mitigate risk in other ways such as through the use of ACLs and LAN segmentation. You as the CTO ultimately have to make the decision about what are the best risks and benefits for your site.
Whatever you do, don’t make the mistakes above! AKF Partners helps companies avoid costly product and technology mistakes - and we’ve seen most of them. Give us a call or shoot us an email. We’d love to help you achieve the success you desire.
Subscribe to the AKF Newsletter
Monitoring the Good, the Bad, the Ugly for Improved Fault Detection
July 8, 2018 | Posted By: Robin McGlothin
AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident. Business metric monitors will not tell you where or what the problem is, rather – and most importantly – they tell you something appears to be abnormal and should be investigated, that something has affected your customer experience.
A significant part of recovery time (and therefore availability) is the time required to detect and localize service incidents. A 2013 study by Business Internet Group of San Francisco found that of the 40 top-performing websites (as identified by KeyNote Systems), 72% had suffered user-visible failures in common functionality, such as items not being added to a shopping cart or an error message being displayed.
Our conversations with clients confirm that detecting these failures is a significant problem. AKF Partners estimates that 75% of the time spent recovering from application-level failures is time spent detecting them! Application-level failures can sometimes take days to detect, though they are repaired quickly once found. Fast detection of these failures (Time to Detect – TTD) is, therefore, a key problem in improving service availability.
The duration of a product impairment is TTR.
To improve TTR, implement a good notification system that first, based on business metrics, tells you that an error affecting your users is happening. Then, rely upon application and system monitoring to inform you on where and what has failed. Make sure to have good and easy view logs for all errors, warnings and other critical data your application creates. We already have many technologies in this space and we just need to employ them in an effective manner with the focus on safeguarding the client experience.
In the form of Statistical Process Control (SPC – defined below) two relatively simple methods to improve TTD:
- Business KPI Monitors (Monitor Real User Behavior): Passively monitor critical user transactions such as logins, queries, reports, etc. Use math to determine abnormal behavior. This is the first line of defense.
- Synthetic Transactions (Simulate User Behavior): Synthetic transactions are scripted actions that attempt to mimic real customer behavior. Examples might be sign-ons, add to cart, etc. They provide a more meaningful view of your customers’ experiences vs. just looking at page load times, error rates, and similar. Do this with Keynote or a similar product and expand it to an internal systems scope. Alerts from a passive monitor can be confirmed or denied and escalated as appropriate. This is the second line of defense.
Monitor the Bad – potential, & actual bad things (alert before they happen), and tune and continuously improve (Iterate!)
If you can’t identify all problem areas, identify as many as possible. The best monitoring starts before there’s a problem and extends beyond the crisis.
Because once the crisis hits, that’s when things get ugly! That’s when things start falling apart and people point fingers.
At times, failures do not disable the whole site, but instead cause brown-outs, where part of a site’s functionality is disabled or only some users are unable to access the site. Many of these failures are application-level failures that change the user-visible functionality of a service but do not cause obvious lower-level failures detectable by service operators. Effective monitoring will detect these faults as well.
The more proactive you can be about identifying the issues, the easier it will be to resolve and prevent them.
In fault detection, the aim is to determine whether an abnormal event happened or when an application being monitored is out of control. The early detection of a fault condition is important in avoiding quality issues or system breakdown, and this can be achieved through the proper design of effective statistical process control with upper & lower limits identified. If the values of the monitoring statistics exceed the control limits of the corresponding statistics, a fault is detected. Once a fault condition has been positively detected, the next step is to determine the root cause of the out-of-control status.
One downside of the SPC method is that significant changes in amplitude (natural increases in your business metrics) can cause problems. An alternative to SPC is First and Second Derivative testing. These tests tell if the actual and expected curve forms are the same.
Here’s a real-world example of where business metrics help us determine changes in normal usage at eBay.
We had near real-time graphs of user metrics such as bids, listings, logins, and new user registrations. The data was graphed week over week. Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys. These graphs were displayed in the Network Operations Center, which was staffed 24x7. Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.
Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nosedive. The NOC quickly initiated the SEV1 process and technical resources checked their areas. The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower. Roughly 20 minutes into the SEV1 process, the root cause was identified. The finale episode of American Idol was being broadcast. Our site was fine – but our customers had other things on their mind. The business metric monitors worked – they gave warning of an aberrant usage pattern.
How would your company react to this critical change in normal usage patterns? Use business metric monitors to detect workload shifts.
Subscribe to the AKF Newsletter
Eight Reasons To Avoid Stored Procedures
June 18, 2018 | Posted By: Pete Ferguson
In my short tenure at AKF, I have found the topic of Stored Procedures (SPROCs) to be provocatively polarizing. As we conduct a technical due diligence with a fairly new upstart for an investment firm and ask if they use stored procedures on their database, we often get a puzzled look as though we just accused them of dating their sister and their answer is a resounding “NO!”
However, when conducting assessments of companies that have been around awhile and are struggling to quickly scale, move to a SaaS model, and/or migrate from hosted servers to the cloud, we find “server huggers” who love to keep their stored procedures on their database.
At two different clients earlier this year, we found companies who have thousands of stored procedures in their database. What was once seen as a time-saving efficiency is now one of several major obstacles to SaaS and cloud migration.
In our book, Scalability Rules: Principles for Scaling Web Sites, (Abbott, Martin L.. Scalability Rules: Principles for Scaling Web Sites) Marty outlines many reasons why stored procedures should not be kept in the database, here are the top 8:
- Cost: Databases tend to be one of the most expensive systems or services within the system architecture. Each transaction cost increases with each additional SPROC. Increase cost of scale by making a synchronous call to the ERP system for each transaction – while also reducing the availability of the product platform by adding yet another system in series – doesn’t make good business sense.
- Creates a Monolith: SPROCs on a database create a monolithic system which cannot be easily scaled.
- Limits Scalability: The database is a governor of scale, SPROCS steal capacity by running other than relational transactions on the database.
- Limits Automated Testing: SPROCs limit the automation of code testing (in many cases it is not as easy to test stored procedures as it is the other code that developers write), slowing time to market and increasing cost while decreasing quality.
- Creates Lockin: Changing to an open-source or a NoSQL solution requires the need to develop a plan to migrate SPROCs or replace the logic in the application. It also makes it more difficult to switch to new and compelling technologies, negotiate better pricing, etc.
- Adds Unneeded Complexity to Shard Databases: Using SPROCs and business logic on the database makes sharding and replacement of the underlying database much more challenging.
- Limits Speed To The Weakest Link: Systems should scale independently relative to individual needs. When business logic is tied to the database, each of them needs to scale at the same rate as the system making requests of them - which means growth is tied to the slowest system.
- More Team Composition Flexibility: By separating product and business intelligence in your platform, you can also separate the teams that build and support those systems. If a product team is required to understand how their changes impact all related business intelligence systems, it will slow down their pace of innovation as it significantly broadens the scope when implementing and testing product changes and enhancements.
Per the AKF Scale Cube, we desire to separate dissimilar services - having stored procedures on the database means it cannot be split easily.
Need help migrating from hosted hardware to the cloud or migrating your installed software to a SaaS solution? We have helped hundreds of companies from small startups to well-established Fortune 50 companies better architect, scale, and deliver their products. We offer a host of services from technical due diligences, onsite workshops, and provide mentoring and interim staffing for your company.
Subscribe to the AKF Newsletter
June 11, 2018 | Posted By: Marty Abbott
Of the many SaaS operating principles, perhaps one of the most misunderstood is the principle of tenancy.
Most people have a definition in their mind for the term “multi-tenant”. Unfortunately, because the term has so many valid interpretations its usage can sometimes be confusing. Does multi-tenant refer to the physical or logical implementation of our product? What does multi-tenant mean when it comes to an implementation in a database?
This article first covers the goals of increasing tenancy within solutions, then delves into the various meanings of tenancy.
Multi-Tenant (Multitenant) Solutions and Cost
One of the primary reasons why companies that present products as a service strive for higher levels of tenancy is the cost reduction it affords the company in presenting a service. With multiple customers sharing applications and infrastructure, system utilization goes up: We get more value production out of each server that we use, or alternatively we get greater asset utilization. Because most companies view the cost of serving customers as a “Cost of Goods Sold’, multitenant solutions have better gross margins than single-tenant solutions. The X Axis of the figure below shows the effect of increasing tenancy on the cost of goods sold on a per customer basis:
Interestingly, multitenant solutions often “force” another SaaS principle to be true: No more than 1 to 3 versions of software for the entire customer base. This is especially true if the database is shared at a logical (row-level) basis (more on that later). Lowering the number of versions of the product, decreases the operating expense necessary to maintain multiple versions and therefore also increases operating margins.
Single Tenant, Multi-Tenant and All-Tenant
An important point to keep in mind is that “tenancy” occurs along a spectrum moving from single-tenant to all-tenant. Multitenant is any solution where the number of tenants from a physical or logical perspective is greater than one, including all-tenant implementations. As tenancy increases, so does Cost of Goods Sold (COGS from the above figure) decrease and Gross Margins increase.
The problem with All-Tenant solutions, while attractive from a cost perspective, is that they create a single failure domain [insert https://akfpartners.com/growth-blog/fault-isolation], thereby decreasing overall availability. When something goes poorly with our product, everything is off line. For that reason, we differentiate between solutions that enable multi-tenancy for cost reasons and all-tenant solutions.
The Many Meanings and Implementations of Tenancy
Multitenant solutions can be implemented in many ways and at many tiers.
Physical and Logical
Physical multi-tenancy is having multiple customers share a number of servers. This helps increase the overall utilization of these servers and therefore reduce costs of goods sold. Customers need not share the application for a solution to be physically multitenant. One could, for instance, run a webserver, application server or database per customer. Many customers with concerns over data separation and privacy are fine with physical multitenancy as long as their data is logically separated.
Logical multi-tenancy is having data share the same application. The same webserver instances, application server instances and database is used for any customer. The situation becomes a bit murkier however when it comes to databases.
Different relational databases use different terms for similar implementations. A SQLServer database, for instance, looks very much like an Oracle Schema. Within databases, a solution can be logically multitenant by either implementing tenancy in a table (we call that row level multitenancy) or within a schema/database (we call that schema multitenancy). In either case, a single instance of the relational database management system or software (RDBMS) is used, while customer transactions are separated by a customer id inside a table, or by database/schema id if separated as such.
While physical multitenancy provides cost benefits, logical multitenancy often provides significantly greater cost benefits. Because applications are shared, we need less system overhead to run an application for each customer and thusly can get even greater throughput and efficiencies out of our physical or virtualized servers.
Depth of Multi-Tenancy
The diagram below helps to illustrate that every layer in our service architecture has an impact to multi-tenancy. We can be physically or logically multi-tenant at the network layer, the web server layer, the application layer and the persistence or database layer.
The deeper into the stack our tenancy goes, the greater the beneficial impact (cost savings) to costs of goods sold and the higher our gross margins.
The AKF Multi-Tenant Cube
To further the understanding of tenancy, we introduce the AKF Multi-Tenant Cube.
The X axis describes the “mode’ of tenancy, moving from shared nothing, to physical, to logical. As we progress from sharing nothing to sharing everything, utilization goes up and cost of goods sold goes down.
The Y axis describes the depth of tenancy from shared nothing, through network, web, app and finally persistence or database tier. Again, as the depth of tenancy increase, so do Gross Margins.
The Z axis describes the degree of tenancy, or the number of tenants. Higher levels of tenancy decrease costs of goods sold, but architecturally we never want a failure domain that encompasses all tenants.
When running a XaaS (SaaS, etc) business, we are best off implementing logical multitenancy through every layer of our architecture. While we want tenancy to be high per instance, we also do not want all tenants to be in a single implementation.
AKF Partners helps companies of all sizes achieve their availability, time to market, scalability, cost and business goals.
Subscribe to the AKF Newsletter
4 Landmines When Using Serverless Architecture
May 20, 2018 | Posted By: Dave Berardi
Physical Bare Metal, Virtualization, Cloud Compute, Containers, and now Serverless in your SaaS? We are starting to hear more and more about Serverless computing. Sometimes you will hear it called function as a service. In this next iteration of Infrastructure-as-a-Service, users can execute a task or function without having to provision a server, virtual machine, or any other underlying resource. The word Serverless is a misnomer as provisioning the underlying resources are abstracted away from the user, but they still exist underneath the covers. It’s just that Amazon, Microsoft, and Google manage it for you with their code. AWS Lambda, Azure Functions, and Google Cloud Functions are becoming more common in the architecture of a SaaS product. As technology leaders responsible for architectural decisions for scale and availability, we must understand its pros and cons and take the right actions to apply it.
Several advantages of serverless computing include:
• Software engineers can deploy and run code without having to manage any underlying infrastructure effectively creating a No-Ops environment.
• Auto-scaling is easier and requires less orchestration as compared to a containerized environment running services.
• True On-Demand capacity – no orphaned containers or other resources that might be idling.
• They are cost effective IF we are running the right size workloads.
Disadvantages and potential landmines to watch out for:
• Landmine #1 - No control over the execution environment meaning you are unable to isolate your operational environment. Compute and networking resources are virtualized with no visibility into either of them. Availability is the hands of our cloud provider and uptime is not guaranteed.
• Landmine #2 - SLAs cannot guarantee uptime. Start-up time can take a second causing latency that might not be acceptable.
• Landmine #3 - It’s going to become much easier for engineers to create code, host it rapidly, and forget about it leading to unnecessary compute and additional attack vectors creating a security risk.
• Landmine #4 - You will create vendor lock-in with your cloud provider as you set up your event driven functions to trigger from other AWS or Azure Services or your own services running on compute instances.
AKF is often asked about our position on serverless computing. There are 4 key rules considering the advantages and the landmines that we outlined:
1) Gradually introduce it into your architecture and use it for the right use cases
2) Establish architectural principles that guide its use in your organization that will minimize availability impact for Serverless. You will tie your availability to the FaaS in your cloud provider.
3) Watch out for a false sense of security among your engineering teams. Understand how serverless works before you use it and so you can monitor it for performance and availability.
4) Manage how and what it’s used for - monitor it (eg. AWS Cloud Watch) to avoid neglect and misuse along with cost inefficiencies.
AWS, Azure, or Google Cloud Serverless platforms could provide an affective computing abstraction in your architecture if it’s used for the right use cases, good monitoring is in place, and architectural principles are established.
AKF Partners has helped many companies create highly available and scalable systems that are designed to be monitored. Contact us for a free consultation.
Subscribe to the AKF Newsletter
Fault Isolation in Services Architectures
May 2, 2018 | Posted By: AKF
Our post on the AKF Scale Cube made reference to a concept that we call “Fault Isolation” and sometimes “Swim lanes” or “Swim-laned Architectures”. We sometimes also call “swim lanes” fault isolation zones or fault isolated architecture.
Fault Isolation Defined
A “Swim lane” or fault isolation zone is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of said boundary. Think of this as the “blast radius” of failure meant to answer the question of “What gets impacted should any service fail?” The benefit of fault isolation is twofold:
1) Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain. Once something breaks, because the failure is limited in scope, it can be more rapidly identified and fixed. Recovery time objectives (RTO) are subsequently decreased which increases overall availability.
2) Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. The “blast radius” of a failure is contained. As such, and depending upon approach, only a portion of users or a portion of functionality of the product is affected. This is akin to circuit breakers in your house - the breaker exists to limit the fault zone for any load that exceeds a limit imposed by the breaker. Failure propagation is contained by the breaker popping and other devices are not affected.
Architecting Fault Isolation
A fault isolated architecture is one in which each failure domain is completely isolated. We use the term “swim lanes” to depict the separations. In order to achieve this, ideally there are no synchronous calls between swim lanes or failure domains made pursuant to a user request. User initiated synchronous calls between failure domains are absolutely forbidden in this type of architecture as any user-initiated synchronous call between fault isolation zones, even with appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a synchronous call to any other service in another domain, to any service outside of the domain, or if the domain receives synchronous calls from other domains or services. Again, “synchronous” is meant to identify a synchronous call (call, wait for a response) pursuant to any user request.
It is acceptable, but not advisable, to have asynchronouss calls between domains and to have non-user initiated synchronous calls between domains (as in the case of a batch job collecting data for the purposes of reporting in another failure domain). If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not call port overloads on any services. Here is an interesting blog post about runaway scripts and their impact on Apache, PHP, and MySQL.
As previously indicated, a swim lane should have all of its services located within the failure domain. For instance, if database [read/writes] are necessary, the database with all appropriate information for that swim lane should exist within the same failure domain as all of the application and webservers necessary to perform the function or functions of the swim lane. Furthermore, that database should not be used for other requests of service from other swim lanes. Our rule is one production database on one host.
The figure below demonstrates the components of software and infrastructure that are typically fault isolated:
Rarely are shared higher level network components isolated (e.g. border systems and core routers).
Sometimes, if practical, firewalls and load balancers are isolated. These are especially the case under very high demand situations where a single pair of devices simply wouldn’t meet the demand.
The remainder of solutions are always isolated, with web-servers, top of rack switches (in non IaaS implementations), compute (app servers) and storage all being properly isolated.
Applying Fault Isolation with AKF’s Scale Cube
As we have indicated with our Scale Cube in the past, there are many ways in which to think about swim laned architectures. Swim lanes can be isolated along the axes of the Scale Cube as shown below with AKF’s circuit breaker analogy to fault isolation.
Fault isolation in X-axis would mean replicating everything for high availability - and performing the replication asynchronously and in an eventually consistent (rather than a consistent) fashion. For example, when a data center fails the fault will be isolated to the one failed data center or multiple availability zones. This is common with traditional disaster recovery approaches, though we do not often advise it as there are better and more cost effective solutions for recovering from disaster.
Fault Isolation in the Y-axis can be thought in terms of a separation of services e.g. “login” and “shopping cart” (two separate swim lanes) each having the web and app servers as well as all data stores located within the swim lane and answering only to systems within that swim lane. Each portion of a page is delivered from a separate service reducing the blast radius of a potential fault to it’s swim lane.
While purposely not legible (fuzzy) the fake example above shows different components of a fictional business account from a fictional bank. Components of the page are separated with one component showing a summary, another component displaying more detailed information and still other components showing dynamic or static links - each derived from properly isolated services.
Another approach would be to perform a separation of your customer base or a separation of your order numbers or product catalog. Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z axis swim lane along customer, order number or product id lines. More beneficially, if we are interested in fastest possible response times to customers, we may split along geographic boundaries. We may have data centers (or IaaS regions) serving the West and East Coasts of the US respectively, the “Fly-Over States” of the US, and regions serving the EU, Canada, Asia, etc. Besides contributing to faster perceived customer response times, these implementations can also help ensure we are compliant with data sovereignty laws unique to different countries or even states within the US.
Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform. AKF has helped achieve a high availability through fault isolation. Contact us to see how we can help you achieve the same fault tolerance.
AKF Partners helps companies create highly available, fault isolated solutions. Send us a note - we’d love to help you!
Subscribe to the AKF Newsletter
The Scale Cube
April 25, 2018 | Posted By: Robin McGlothin
The Scale Cube is a model for building resilient and scalable architectures using patterns and practices that apply broadly to any industry and all solutions. AKF Partners invented the Scale Cube in 2007, publishing it online in our blog in 2007 (original article here) and subsequently in our first book the Art of Scalability and our second book Scalability Rules.
The Scale Cube (sometimes known as the “AKF Scale Cube” or “AKF Cube”) is comprised of an 3 axes: X-axis, Y-axis, and Z-axis.
• Horizontal Duplication and Cloning (X-Axis )
• Functional Decomposition and Segmentation - Microservices (Y-Axis)
• Horizontal Data Partitioning - Shards (Z-Axis)
These axes and their meanings are depicted below in Figure 1.
The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed and when existing systems are being improved.
Most internet enabled products start their life as a single application running on an appserver or appserver/webserver combination and likely communicate with a database. This monolithic design will work fine for relatively small applications that receive low levels of client traffic. However, this monolithic architecture becomes a kiss of death for complex applications.
A large monolithic application can be difficult for developers to understand and maintain. It is also an obstacle to frequent deployments. To deploy changes to one application component you need to build and deploy the entire monolith, which can be complex, risky, time consuming, require the coordination of many developers and result in long test cycles.
Consequently, you are often stuck with the technology choices that you made at the start of the project. In other words, the monolithic architecture doesn’t scale to support large, long-lived applications.
Figure 2, below, displays how the cube may be deployed in a modern architecture decomposing services (sometimes called micro-services architecture), cloning services and data sources and segmenting similar objects like customers into “pods”.
Scaling Solutions with the X Axis of the Scale Cube
The most commonly used approach of scaling an solution is by running multiple identical copies of the application behind a load balancer also known as X-axis scaling. That’s a great way of improving the capacity and the availability of an application.
When using X-axis scaling each server runs an identical copy of the service (if disaggregated) or monolith. One benefit of the X axis is that it is typically intellectually easy to implement and it scales well from a transaction perspective. Impediments to implementing the X axis include heavy session related information which is often difficult to distribute or requires persistence to servers – both of which can cause availability and scalability problems. Comparative drawbacks to the X axis is that while intellectually easy to implement, data sets have to be replicated in their entirety which increases operational costs. Further, caching tends to degrade at many levels as the size of data increases with transaction volumes. Finally, the X axis doesn’t engender higher levels of organizational scale.
Figure 3 explains the pros and cons of X axis scalability, and walks through a traditional 3 tier architecture to explain how it is implemented.
Scaling Solutions with the Y Axis of the Scale Cube
Y-axis scaling (think services oriented architecture, micro services or functional decomposition of a monolith) focuses on separating services and data along noun or verb boundaries. These splits are “dissimilar” from each other. Examples in commerce solutions may be splitting search from browse, checkout from add-to-cart, login from account status, etc. In implementing splits, Y-axis scaling splits a monolithic application into a set of services. Each service implements a set of related functionalities such as order management, customer management, inventory, etc. Further, each service should have its own, non-shared data to ensure high availability and fault isolation. Y axis scaling shares the benefit of increasing transaction scalability with all the axes of the cube.
Further, because the Y axis allows segmentation of teams and ownership of code and data, organizational scalability is increased. Cache hit ratios should increase as data and the services are appropriately segmented and similarly sized memory spaces can be allocated to smaller data sets accessed by comparatively fewer transactions. Operational cost often is reduced as systems can be sized down to commodity servers or smaller IaaS instances can be employed.
Figure 4 explains the pros and cons of Y axis scalability and shows a fault-isolated example of services each of which has its own data store for the purposes of fault-isolation.
Scaling Solutions with the Z Axis of the Scale Cube
Whereas the Y axis addresses the splitting of dissimilar things (often along noun or verb boundaries), the Z-axis addresses segmentation of “similar” things. Examples may include splitting customers along an unbiased modulus of customer_id, or along a somewhat biased (but beneficial for response time) geographic boundary. Product catalogs may be split by SKU, and content may be split by content_id. Z-axis scaling, like all of the axes, improves the solution’s transactional scalability and if fault isolated it’s availability. Because the software deployed to servers is essentially the same in each Z axis shard (but the data is distinct) there is no increase in organizational scalability. Cache hit rates often go up with smaller data sets, and operational costs generally go down as commodity servers or smaller IaaS instances can be used.
Figure 5 explains the pros and cons of Z axis scalability and displays a fault-isolated pod structure with 2 unique customer pods in the US, and 2 within the EU. Note, that an additional benefit of Z axis scale is the ability to segment pods to be consistent with local privacy laws such as the EU’s GDPR.
Like Goldilocks and the three bears, the goal of decomposition is not to have services that are too small, or services that are too large but to ensure that the system is “just right” along the dimensions of scale, cost, availability, time to market and response times.
AKF Partners has helped hundreds of companies, big and small, employ the AKF Scale Cube to scale their technology product solutions. We developed the cube in 2007 to help clients scale their products and have been using it since to help some of the greatest online brands of all time thrive and succeed. For those interested in “time travel”, here are the original 2 posts on the cube from 2007: Application Cube, Database Cube
Subscribe to the AKF Newsletter
Microservices for Breadth, Libraries for Depth
April 10, 2018 | Posted By: Marty Abbott
The decomposition of monoliths into services, or alternatively the development of new products in a services-oriented fashion (oftentimes called microservices), is one of the greatest architectural movements of the last decade. The benefits of a services (alternatively microservices or micro-services) approach are clear:
- Independent deployment, decreasing time to market and decreasing time to value realization– especially when continuous delivery is employed.
- Team velocity and ownership (informed by Conway’s Law).
- Increased fault isolation – but only when properly deployed (see below).
- Individual scalability – and the decreasing cost of operations that entails when properly architected.
- Freedom of implementation and technology choices – choosing the best solution for each service rather than subjecting services to the lowest common denominator implementation.
Unfortunately, without proper architectural oversight and planning, improperly architected services can also result in:
- Lower overall availability, especially when those services are deployed in one of a handful of microservice anti-patterns like the mesh, services in depth (aka the Christmas Tree Light String) and the Fuse.
- Higher (longer) response times to end customers.
- Complicated fault isolation and troubleshooting that increases average recovery time for failures.
- Service bloat: Too many services to comprehend (see our service sizing post)
The following are patterns companies should avoid (anti-patterns) when developing services or microservices architectures:
Mesh architectures, where individual services both “fan out” and “share” subsequent services result in the lowest possible availability.
Services that are strung together in long (deep) call trees suffer from low availability and slow page response times as calculated from the product of each service offering availability.
The Fuse is a much smaller anti-pattern than “The Mesh”. In “The Fuse”, 2 distinct services (A and B) rely on service C. Should service C become slow or unavailable, both service A and B suffer.
Architecture Principle: Services – Broad, But Never Deep
These services anti-patterns protect against a lack of fault isolation, where slowness and failures propagate along a synchronous path. One service fails, and the others relying upon that service also suffer.
They also serve to guard against longer latency in call streams. While network calls tend to be minimal relative to total customer response times, many solutions (e.g. payment solutions) need to respond as quickly as possible and service calls slow that down.
Finally, these patterns help protect against difficult to diagnose failures. The Xmas Tree pattern name is chosen because of the difficulty in finding the “failed bulb” in old tree lights wired in series. Similarly, imagine attempting to find the fault in “The Mesh”. The time necessary to find faults negatively effects service restoration time and therefore availability.
As such, we suggest a principal that services should never be deep but instead should be deployed in breadth along product offering boundaries defined by nouns (resources like “customer” or “sales”) or verbs (services like “search” or “add to cart”). We often call this approach “slices instead of layers”.
How then do we accomplish the separation of software for team ownership, and time to market where a single service would otherwise be too large or unwieldy?
Old School – Libraries!
When you need service-like segmentation in a deep call tree but can’t suffer the availability impact and latency associated with multiple calls, look to libraries. Libraries will both eliminate the network associated latency of a service call. In the case of both The Fuse and The Mesh libraries eliminate the shared availability constraints. Unfortunately, we still have the multiplicative effect of failure of the Xmas Tree, but overall it is a faster pattern.
“But My Teams Can’t Release Separately!”
Sure they can – they just have to change how they think about releasing. If you need immediate effect from what you release and don’t want to release the calling services with libraries compiled or linked, consider performing releases with shared objects or dynamically loadable libraries. While these require restarts of the calling service, simple automation will help you keep from having an outage for the purpose of deploying software.
AKF Partners helps companies architecture highly available, highly scalable microservice architecture products. We apply our aggregate experience, proprietary models, patterns, and anti-patterns to help ensure your products can meet your company’s scale and availability goals. Contact us today - we can help!
Subscribe to the AKF Newsletter
1 2 >