March 25, 2019 | Posted By: Marty Abbott
This article is the first in a multi-part series on microservices (micro-services) anti-patterns.
There are several benefits to carving up very large applications into service-oriented architectures. These benefits can include many of the following:
- Higher availability through fault isolation
- Higher organizational scalability through lower coordination
- Lower cost of development through lower overhead (coordination)
- Faster time to market achieved again through lower overhead of coordination
- Higher scalability through the ability to independently scale services
- Lower cost of operations (cost of goods sold) through independent scalability
- Lower latency/response time through better cacheability
The above should be considered only a partial list. See our articles on the AKF Scale Cube, and when you should split services for more information.
In order to achieve any of the above benefits, you must be very careful to avoid common mistakes.
Most of the failures that we see in microservices stem from a lack of understanding of the multiplicative effect of failure or “MEF”. Put simply, MEF indicates that the availability of any solution in series is a product of the availability of all components in that series.
Service A has an availability calculated by the product of its constituent parts. Those parts include all of the software and infrastructure necessary to run service A: the server availability, the application availability, associated library and runtime environment availabilities, operating system availability, virtualization software availability, etc. Let’s say those availabilities somehow achieve a “service” availability of “Five 9s” or 99.999% as measured by duration of outages. To achieve 99.999% we are assuming that we have made the service “highly available” through multiple copies, each being “stateless” in its operation.
Service B has a similar availability calculated in a similar fashion. Again, let’s assume 99.999%.
If, for a request from any customer to Service A, Service B must also be called, the two availabilities are multiplied together. The new calculated availability is by definition lower than that of either service in isolation. We move our availability from 99.999% to roughly 99.998%.
When chains of calls in series between services become long, availability starts to decline swiftly and by definition is always lower than the lowest availability of any service or the constituent part of any service (e.g. hardware, OS, app, etc.).
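The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the multiplication; the function names and the ten-service chain are assumptions made for the example, not figures from a real system.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def series_availability(availabilities):
    """Availability of a call chain in which every component must succeed."""
    return prod(availabilities)


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


FIVE_NINES = 0.99999

# Service A calling Service B, as in the text.
two_services = series_availability([FIVE_NINES, FIVE_NINES])

# A hypothetical chain of ten services, each at five nines.
ten_services = series_availability([FIVE_NINES] * 10)

for label, value in (("1 service", FIVE_NINES),
                     ("2 in series", two_services),
                     ("10 in series", ten_services)):
    print(f"{label:>12}: {value:.5%} available, "
          f"{downtime_minutes_per_year(value):.1f} min/year down")
```

Each service added to the chain contributes its own expected downtime, so the chain can never be better than its weakest member.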
This creates our first anti-pattern. Just as a single failed bulb in the old serially wired Christmas tree lights would cause an entire string to fail, so does any service failure cause the entire call stream to fail. Hence multiple names for this first anti-pattern: Christmas Tree Light Anti-Pattern, Microservice Calls in Series Anti-Pattern, etc.
The multiplicative effect of failure sometimes is worse with slowly responding solutions than with failures themselves. We can easily respond to failures through “heartbeat” transactions. But slow responses are more difficult. While we can use circuit breaker constructs such as Hystrix switches, these assume that we know the threshold under which our call string will break. Unfortunately, under intense flash load situations (unforeseen high demand), small spikes in demand can cause failure scenarios.
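To make the threshold problem concrete, here is a minimal circuit breaker sketch. It is not Hystrix or its API; the class, its parameter names, and the default values are all hypothetical. Note that someone has to choose `max_failures` and `slow_call_seconds` in advance, which is exactly the assumption questioned above.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to place a downstream call."""


class CircuitBreaker:
    """Illustrative breaker that counts errors and slow calls as failures."""

    def __init__(self, max_failures=3, slow_call_seconds=0.5,
                 reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures            # assumed threshold
        self.slow_call_seconds = slow_call_seconds  # assumed threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise CircuitOpenError("downstream call skipped")
            # Cool-down has elapsed: allow one trial call through.
        started = self.clock()
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        if self.clock() - started > self.slow_call_seconds:
            # A slow success still counts against the dependency.
            self._record_failure()
        else:
            self.failures = 0
            self.opened_at = None
        return result

    def _record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A caller would wrap each downstream request, for example `breaker.call(fetch_price, sku)`, and handle `CircuitOpenError` by failing fast or returning a default.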
One pattern to resolve the above issue is to employ true asynchronous messaging between services. To make this effective, the requesting service must not care whether it receives a response. This service must be capable of responding to a request without receiving any downstream response. Unfortunately, this solution only works in some cases, such as when service B returns data that adds value to service A’s response but is not required for it. One such example is a recommendation engine that returns other items a user might like to purchase. The absence of service B responding to A’s request for recommendations is unfortunate, but doesn’t eliminate the value of A’s response completely.
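The recommendation example can be approximated with a short sketch. This is not full asynchronous messaging. It is a simpler stand-in in which Service A waits briefly for Service B and answers either way. The function names, the page fields, and the 0.1 second wait are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool for optional downstream calls (size is arbitrary here).
_pool = ThreadPoolExecutor(max_workers=4)


def fetch_recommendations(user_id):
    """Hypothetical stand-in for a slow call to Service B."""
    time.sleep(0.5)
    return ["item-1", "item-2"]


def product_page(user_id, recommend=fetch_recommendations, wait_seconds=0.1):
    """Build Service A's response; recommendations are optional."""
    page = {"user": user_id, "product": "widget", "recommendations": []}
    future = _pool.submit(recommend, user_id)
    try:
        page["recommendations"] = future.result(timeout=wait_seconds)
    except Exception:
        # Timeout or downstream error: respond without recommendations.
        pass
    return page


print(product_page("user-42"))
```

The abandoned call keeps running in the pool after the timeout, so a production version would also need to bound or cancel that work.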
While the above pattern can resolve some use-cases, it doesn’t resolve most of them. Most often downstream services are doing more than “modifying” value for the calling service: they are providing specific necessary functions. These functions may be mail services, print services, data access services, or even component parts of a value stream such as “add to cart” and “compute tax” during checkout.
In these cases, we believe in employing the Libraries for Depth pattern.
Of course, each of the libraries also represents a constituent part that may fail for any call – but the number of moving parts for each constituent part decreases significantly relative to another service call. For instance, no network interface is required, no additional host or virtual machine is employed during the call, etc. Additionally, call latency goes down without network interfaces.
The most common complaint about this pattern is that development teams cannot release independently. But, as we all know, this problem has been fixed for quite some time with Unix, Linux and Windows dynamically loadable libraries (DLLs, shared objects) and the like.
March 19, 2019 | Posted By: Marty Abbott
Tim Berners-Lee and his colleagues at CERN, the IETF and the W3C all understood the value of being stateless when they developed the Hypertext Transfer Protocol. Stateless systems are more resilient to multiple failure types, as no transaction needs to have information regarding the previous transaction. It’s as if each transaction is the first (and last) of its type.
First let’s quickly review three different types of state: session state, application state, and connection state. This overview is meant to be broad and shallow. Certain state types (such as the notion of View state in .Net development) are not covered.
The Penalty (or Cost) of State
State costs us in multiple ways. State unique to a user interaction, or session state, requires memory. The larger the state, the greater the memory requirement, the higher the cost of the server and the greater the number of servers we need. As the cost of goods sold increases, margins decrease. Further, that state either needs to be replicated for high availability, at additional cost, or we face a cost of user dissatisfaction with discrete component and ultimately session failures.
When application state is maintained, the cost of failure is high as we either need to pay the price of replication for that state or we lose it, negatively impacting customer experience. As memory associated with application state increases, so does the memory requirement and associated costs of the server upon which it runs. At high scale, that means more servers, greater costs, and lower gross margins. In many cases, we simply have no choice but to allow application state. Interpreters and Java virtual machines need memory. Most applications also require information regarding their overall transactions distinct from those of users. As such, our goal here is not to eliminate application state but rather minimize it where possible.
When connection state is maintained, cost increases as more servers are required to service the same number of requests. Failures become more common as the failure probability increases with the duration of any connection over distance.
Our ideal outcome is to eliminate session state, minimize application state and eliminate connection state.
But What if I Really, Really, Really Need State?
Our experience is that simply saying “No” once or twice will force an engineer to find an innovative way to eliminate state. Another interesting approach is to challenge an engineer with a statement like “Huh, I heard the engineers at XYZ company figured out how to do this…”. Engineers hate to feel like another engineer is better than them…
We also recognize however that the complete elimination of state isn’t possible. Here are three examples (not meant to be all inclusive) of when we believe the principle of stateless systems should be violated:
Shopping carts need state to work. Information regarding a past transaction (add_to_cart, for instance) needs to be held somewhere prior to check_out. Given that we need state, now it’s just a question of where to store it. Cookies are good places. Distributed object caches are another location. Passing it through the URL in HTTP GET methods is a third. A final solution is to store it in a database.
No sane person wants to wrap debits and credits across distributed servers in a single, two-phase commit transaction. Banks have had a solution for this for years – the eventually consistent account transaction. Using a tiny workflow or state machine, debit in one transaction and then (ideally quickly) credit in a second transaction. That brings us to the notion of workflow and state machines in general.
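A tiny state machine of the kind described above might look like the following sketch. It keeps everything in memory and is meant only to show the two separate steps and the small amount of state that links them. The class, state names, and account dictionary are hypothetical, and a real system would persist the transfer record durably.

```python
from dataclasses import dataclass
from enum import Enum


class TransferState(Enum):
    PENDING = "pending"
    DEBITED = "debited"
    COMPLETED = "completed"


@dataclass
class Transfer:
    source: str
    target: str
    amount: int  # smallest currency unit, e.g. cents
    state: TransferState = TransferState.PENDING


def debit(accounts, transfer):
    """First transaction: take the money out of the source account."""
    if transfer.state is not TransferState.PENDING:
        return  # already debited; safe to call again
    if accounts[transfer.source] < transfer.amount:
        raise ValueError("insufficient funds")
    accounts[transfer.source] -= transfer.amount
    transfer.state = TransferState.DEBITED


def credit(accounts, transfer):
    """Second, later transaction: put the money into the target account."""
    if transfer.state is not TransferState.DEBITED:
        return  # not yet debited, or already credited
    accounts[transfer.target] += transfer.amount
    transfer.state = TransferState.COMPLETED


accounts = {"alice": 100, "bob": 0}
transfer = Transfer("alice", "bob", 30)
debit(accounts, transfer)   # balances are briefly out of step here
credit(accounts, transfer)  # and consistent again after this step
```

Because each step checks the recorded state before acting, a retry after a crash does not debit or credit twice.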
What good is a state machine if it can’t maintain state? Whether application state or session state, the notion of state is critical to the success of each solution. Workflow systems are a very specific implementation of a state machine and as such require state. The trick with these is simply to ensure that the memory used for state is “just enough”. Govern against ever increasing session or application state size.
This brings us to the newest cube model in the AKF model repository:
The Session State Cube
The AKF State Cube is useful both for thinking through how to achieve the best possible state posture, and for evaluating how well we are doing against an aspirational goal (top right corner) of “Stateless”.
The X axis describes size of state. It moves from very large (XL) state size to the ideal position of zero size, or “No State”. Very large state size suffers from higher cost, higher impact upon failure, and higher probability of failure.
The Y axis describes the degree of distribution of state. The worst position, lower left, is where state is a singleton. While we prefer not to have state, having only one copy of it leaves us open to large – and difficult to recover from – failures and dissatisfied customers. Imagine nearly completing your taxes only to have a crash wipe out all of your work! Ughh!
Progressing vertically along the Y axis, the singleton state object in the lower left is replicated into N copies of that state for high availability. While resolving the recovery and failure issues, performing replication is costly both in extra memory and network transit. This is an option we hope to avoid for cost reasons.
Following replication are several methods of distribution in increasing order of value. Segmenting the data by some value “N” has increasing value as N increases. When N is 2, a failure of state impacts 50% of our customers. When N is 100, only 1% of our customers suffer from a state failure. Ideally, state is also “rebuildable” if we have properly scattered state segments by a shard key – allowing customers to only have to re-complete a portion of their past work.
Finally, of course, we hope to have “no state” (think of this as segmentation carried toward infinity, so that the state held in any one segment approaches zero on this axis).
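The segmentation arithmetic above can be sketched as follows. The hashing scheme and function names are assumptions chosen for illustration. Any stable mapping from a shard key to a segment would serve, and the fractions assume customers are spread evenly across segments.

```python
import hashlib


def segment_for(customer_id: str, segments: int) -> int:
    """Map a customer (the shard key) to one of N state segments."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % segments


def share_affected_by_one_failure(segments: int) -> float:
    """Fraction of customers hit when a single segment loses its state."""
    return 1 / segments


for n in (1, 2, 100):
    print(f"N={n:>3}: {share_affected_by_one_failure(n):.0%} of customers affected")

print(segment_for("customer-1234", 100))
```

With N = 1 the function describes the singleton case from the lower left of the cube, where one failure affects everyone.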
The Z Axis describes where we position state “physically”.
The worst location is “on the same server as the application”. While necessary for application state, placing session data on a server co-resident with the application using it doubles the impact of a failure upon application fault. There are better places to locate state, and better solutions than your application to maintain it.
A costly, but better solution from an impact perspective is to place state within your favorite database. To keep costs low, this could be an open source SQL or NoSQL database. But remember to replicate it for high availability.
A less costly solution is to place state in an object cache, off server from the application. Ideally this cache is distributed per the Y axis.
The least costly solution is to have the client (browser or mobile app) maintain state. Use a cookie, pass the state through a GET method, etc.
Finally, of course the best solution is that it is kept “nowhere” because we have no state.
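For the client-maintained option described above, one common approach is to sign the state before handing it to the browser so the server can detect tampering when it comes back. The sketch below uses only the Python standard library. The key, function names, and cart fields are hypothetical, and the state is signed but not encrypted, so it should hold nothing secret.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical key; a real deployment would load this from secure config.
SECRET_KEY = b"replace-with-a-real-secret"


def encode_state(state: dict) -> str:
    """Serialize and sign state for storage in a cookie or URL parameter."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_state(token: str) -> dict:
    """Verify the signature and return the state the client sent back."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode("ascii"),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("state was modified or signed with a different key")
    return json.loads(base64.urlsafe_b64decode(payload))


cart_token = encode_state({"cart": ["sku-1", "sku-7"], "currency": "USD"})
print(decode_state(cart_token))
```

Browsers limit cookie size (commonly around 4 KB), which is one more reason to keep this state small.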
The AKF State Cube serves two purposes:
- Prescriptive: It helps to guide your team to the aspirational goal of “stateless”. Where stateless isn’t possible, choose the positions on the X, Y and Z axes closest to the notion of no state to achieve a low cost, highly available solution for your minimized state needs.
- Descriptive: The model helps you evaluate numerically how you are performing with respect to stateless initiatives on a per application/service basis. Use the guide on the right side of the model to evaluate component state on a scale of 1 to 10.
AKF Partners helps companies develop world class, low cost of operations, fast time to market, stateless solutions every day. Give us a call! We can help!
March 15, 2019 | Posted By: Marty Abbott
I’m no Nostradamus when it comes to predicting the future of technology, but some trends are just too blatantly obvious to ignore. Unfortunately, they are only easy to spot if you have a job where you are allowed (I might argue required) to observe broader industry trends. AKF Partners must do that on behalf of our clients as our clients are just too busy fighting the day-to-day battles of their individual businesses.
One such very concerning probability is the eventual decline – and one day potentially the elimination of – the colocation (hosting) business. Make no mistake about it – if you lease space from a colocation provider, the probability is high that your business will need to move locations, move providers, or experience a service disruption soon.
Let’s walk through the factors and trends that indicate, at least to me, that the industry is in trouble, and that your business faces considerable risks:
Sources of Demand for Colocation (Macro)
Broadly speaking, the colocation industry was built on the backs of young companies needing to lease space for compute, storage, and the like. As time progressed, more established companies started to augment privately-owned data centers with colocation facilities to avoid the burden of large assets (buildings, capital improvements and in some cases even servers) on their balance sheets.
The first source of demand, small companies, has largely dried up for colocation facilities. Small companies seek to be “asset light” and most frequently start their businesses running on Infrastructure as a Service (IaaS) providers (AWS, GCP, Azure etc.). The ease and flexibility of these providers enable faster time to market and easier operational configuration of systems. Platform as a Service (PaaS) offerings in many cases eliminate the need for specialized infrastructure and DevOps skill sets, allowing small companies to focus limited funds on software engineers that will help create differentiating experiences and capabilities. Five years ago, successful startups may have started migrating into colocation facilities to lower costs of goods sold (COGS) for their products, and in so doing increase gross margin (GM). While this is still an opportunity for many successful companies, few seem to take advantage of it. Whether due to vendor lock-in through PaaS services, or a preference for speed and flexibility over expenses, the companies tend to stay with their IaaS provider.
Larger, more established companies continue to use colocation facilities to augment privately-owned data centers. That said, in most cases technology refresh results in faster and more efficient compute. When the rate of compute increases faster than the rate of growth in transactions and revenue within these companies, they start to collapse the infrastructure assets back into wholly-owned facilities (assuming power, space, and cooling of the facilities are not constraints). Bringing assets back in-house to owned facilities lowers costs of goods sold as the company makes more efficient use of existing assets.
Simultaneously these larger firms also seek the flexibility and elasticity of IaaS services. Where they have new demand for new solutions, or as companies embark upon a digital transformation strategy, they often do so leveraging IaaS.
The result of these forces across the spectrum of small to large firms reduces overall demand. Reduced demand means a contraction in the colocation industry overall.
Minimum Efficient Scale and the Colocation Industry (Micro)
Data centers are essentially factories. To achieve optimum profitability, fixed costs such as the facility itself, and the associated taxes, must be spread across the largest possible units of production. In the case of data centers, this means achieving maximum utilization of the constraining factors (space, power, and cooling capacity) across the largest possible revenue base. Maximizing utilization against the aforementioned constraints drops the LRAC (long run average cost) as fixed costs are spread across a larger number of paying customers. This is the notion of Minimum Efficient Scale in economics.
As demand decreases, on a per data center (colocation facility) basis, fixed cost per customer increases. This is because less space is used, and the cost of the facility is allocated across fewer customers. At some point, on a per data center basis the facility becomes unprofitable. As profits dwindle across the enterprise, and as debt service on the facilities becomes more difficult, the colocation provider is forced to shut down data centers and consolidate customers. Assets are sold or leases terminated with the appropriate termination penalties.
Customers who wish to remain with a provider are forced to relocate. This in turn causes customers to reconsider colocation facilities, and somewhere between a handful and a majority on a per location basis will decide to move to IaaS instead. Thus begins a vicious cycle of data center shutdowns engendering ever-decreasing demand for colocation facilities.
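The effect of falling demand on per-customer cost can be illustrated with a short calculation. Every figure below is invented for the example and is not drawn from any real colocation provider.

```python
def cost_per_customer(fixed_cost: float, variable_cost: float,
                      customers: int) -> float:
    """Average monthly cost to serve one customer in a single facility."""
    if customers <= 0:
        raise ValueError("customers must be positive")
    return fixed_cost / customers + variable_cost


# Hypothetical monthly figures for one facility.
FIXED = 1_000_000.0   # building, taxes, debt service
VARIABLE = 500.0      # power and support per customer
PRICE = 5_000.0       # revenue per customer

for customers in (400, 250, 200):
    cost = cost_per_customer(FIXED, VARIABLE, customers)
    status = "profitable" if cost < PRICE else "unprofitable"
    print(f"{customers} customers: cost {cost:,.0f} per customer, {status}")
```

Under these assumed numbers the facility stops covering its costs somewhere between 250 and 200 customers, even though nothing about the building itself has changed.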
Excluding other macroeconomic or secular events like another real estate collapse, smaller providers start to exit the colocation service industry. Larger providers benefit from the exit of smaller players and the remaining data centers benefit from increased demand on a dwindling supply, allowing those providers to regain MES and profitability.
Does the Trend Stop at a Smaller Industry?
We are likely to continue to see the colocation industry exist for quite some time – but it will get increasingly smaller. The consolidation of providers and dwindling supply of facilities will stop at some point, but just for a period. Those that remain in colocation facilities will either not have the means or the will to move. In some cases, a lack of skills within the remaining companies will keep them “locked into” a colocation. In other cases, competing priorities will keep an exit on the distant horizon. These “lock in” factors will give rise to an opportunity for the colocation industry to increase pricing for a time.
But make no mistake about it, customers will continue to leave – just at a decreased rate relative to today’s departures. Some companies will simply go out of business or contract in size and depart the data centers. Others will finally decide that the increasing cost of service is too high.
While it’s doubtful that the industry will go away in its entirety, it will be small and comparatively expensive. The difference between costs of colocation and costs to run in an IaaS solution will start to dwindle.
Risks to Your Firm
The risk to your firm comes in three forms, listed in increasing order of risk as measured by a function of probability of occurrence and impact upon occurrence:
- Pricing of service per facility. If you are lucky enough that your facility does not close, there is a high probability that your cost for service will increase. This in turn increases your cost of goods sold and decreases your gross margin.
- Risk of facility dissolution. There exists an increasingly high probability that the facilities in which you are located will be shut down. While you are likely to be given some advance notice, you will be required to move to another facility with the same provider, or another provider. There is both a real cost in the move, and an opportunity cost associated with service interruption and effort.
- Risk of firm as a going concern. Some providers of colocation services will simply exit the business. In some cases, you may be given very little notice as in the case of a company filing bankruptcy. Service interruption risk is high.
Strategies You Must Employ Today
In our view, you have no choice but to ensure that you are ready and able to easily move out of colocation facilities. Whether this be to existing data centers you own, IaaS providers, or a mix matters not. At the very least, we suggest your development and operations processes enable the following principles:
- Environment Agnosticism: Ensure that you can run in owned, leased, managed service, or IaaS locations. Ensuring consistency in deployment platforms, using container technologies and employing orchestration systems all aid in this endeavor.
- Hybrid Hosting: Operate out of at least two of the following three options as a course of normal business operations: owned data centers, leased/colocation facilities, IaaS.
- Dynamic Allocation of Demand: Prove on at least a weekly basis that you can operate any functionality within your product out of any location you operate – especially those that happen to be located within colocation facilities.
AKF Partners helps companies think through technology, process, organization, location, and hosting strategies. Let us help you architect a hybrid hosting solution that limits your risk to any single provider.
February 22, 2019 | Posted By: Greg Fennewald
On multiple occasions over the years, we have heard our clients state a use case they want to avoid in product design sessions or as a reason for architectural choices made for existing products. These use cases can be given more credence than they deserve based on objective data – they become boogeyman legends, edge cases that can result in poor architectural choices.
One of our clients was debating the benefit of multiple live sites with customers pinned to the nearest site to minimize latency. The availability benefits of multiple live sites are irrefutable, but the customer experience benefit of less latency was questioned. This client had millions of clients spread across the country. The notion of pinning a client to a “home” site nearest them raised the question of “what happens when the client travels across the country?”. The answer is to direct them to that same home site. That client will experience more latency for the duration of the visit. The proportion of clients that spend 50% of their time on either coast is vanishingly small – keep it simple. Have a workaround for clients that permanently move to a location served by a different site – client data resides in more than one location for DR purposes anyway, right?
This client also had hundreds of service specialists that would at times access client accounts and take actions on their behalf, and these service specialists were located near the west coast. Objections were made based on the latency a west coast service specialist would encounter when acting on the behalf of an east coast client whose data was hosted near the east coast. Millions of clients. Hundreds of service specialists. The math is not hard. The needs of the many outweigh the needs of the few.
A different client had a concern about data consistency upon new user registration for their service. To ensure a new customer could immediately transact, the team decided to deploy a single authentication server to preclude the possibility of a transaction following registration hitting an authentication server that had not yet received the registration data. Intentionally deploying a SPOF should have raised immediate objections but did not. The team deployed a passive backup server that required manual intervention to work.
The new user process flow was later revealed to be less than 3% of the overall transactions. 97% of the transactions suffered an impactful outage along with the 3% new users when the SPOF authentication server failed. Designing a workaround for the new users while employing a write master with multiple load-balanced, read-only slaves would provide far better availability. The needs of the many outweigh the needs of the few.
It is important to remain open minded during early design sessions. It is also important to follow architectural principles in the face of such use cases. How can one balance potentially conflicting concepts?
• Ask questions best answered with objective data.
• Strive for simplicity; shave with Occam’s Razor.
• Validate whether the edge case is a deal breaker for the product owner.
• Propose a workaround that addresses the edge case while optimizing the architecture for the majority use case and sound principles.
Catering to the needs of the business while adhering to architectural standards is a delicate balancing act and compromises will be made. Everyone looks at the technologist when a product encounters a failure. Know when to hold the line on sound architectural principles that safeguard product availability and user experience. The product owner must understand and acknowledge the architectural risks resulting from product design decisions. The technologist must communicate these risks to the product owner along with objective data and options. A failure to communicate effectively can lead to the tail wagging the dog – do not let that happen.
With 12 years of product architecture and strategy experience, AKF Partners is uniquely positioned to be your technology partner. Learn more here.
December 4, 2018 | Posted By: Marty Abbott
During the last 12 years, many prospective clients have asked us some variation of the following questions: “What makes you different?”, “Why should we consider hiring you?”, or “How are you differentiated as a firm?”.
The answer has many components. Sometimes our answers are clear indications that we are NOT the right firm for you. Here are the reasons you should, or should not, hire AKF Partners:
Operators and Executives – Not Consultants
Most technology consulting firms are largely comprised of employees who have only been consultants or have only run consulting companies. We’ve been in your shoes as engineers, managers and executives. We make decisions and provide advice based on practical experience with living with the decisions we’ve made in the past.
Engineers – Not Technicians
Educational institutions haven’t graduated enough engineers to keep up with demand within the United States for at least forty years. To make up for the delta between supply and demand, technical training services have sprung up throughout the US to teach people technical skills in a handful of weeks or months. These technicians understand how to put building blocks together, but they are not especially skilled in how to architect highly available, low latency, low cost to develop and operate solutions.
The largest technology consulting companies are built around programs that hire employees with non-technical college degrees. These companies then teach these employees internally using “boot camps” – creating their own technicians.
Our company is comprised almost entirely of “engineers”; employees with highly technical backgrounds who understand both how and why the “building blocks” work as well as how to put those blocks together.
Product – Not “IT”
Most technology consulting firms are comprised of consultants who have a deep understanding of employee-facing “Information Technology” solutions. These companies are great at helping you implement packaged software solutions or SaaS solutions such as Enterprise Resource Management systems, Customer Relationship Management Systems and the like. Put bluntly, these companies help you with solutions that you see as a cost center in your business. While we’ve helped some partners who refuse to use anyone else with these systems, it’s not our focus and not where we consider ourselves to be differentiated.
Very few firms have experience building complex product (revenue generating) services and platforms online. Products (not IT) represent nearly all of AKF’s work and most of AKF’s collective experience as engineers, managers and executives within companies. If you want back-office IT consulting help focused on employee productivity there are likely better firms with which you can work. If you are building a product, you do not want to hire the firms that specialize in back office IT work.
Business First – Not Technology First
Products only exist to further the needs of customers and through that relationship, further the needs of the business. We take a business-first approach in all our engagements, seeking to answer the questions of: Can we help find a way to build it faster, better, or cheaper? Can we find ways to make it respond to customers faster, be more highly available or be more scalable? We are technology agnostic and believe that of the several “right” solutions for a company, a small handful will emerge displaying comparatively low cost, fast time to market, appropriate availability, scalability, appropriate quality, and low cost of operations.
Cure the Disease – Don’t Just Treat the Symptoms
Most consulting firms will gladly help you with your technology needs but stop short of solving the underlying causes creating your needs: the skill, focus, processes, or organizational construction of your product team. The reason for this is obvious: most consulting companies are betting that if the causes aren’t fixed, you will need them back again in the future.
At AKF Partners, we approach things differently. We believe that we have failed if we haven’t helped you solve the reasons why you called us in the first place. To that end, we try to find the source of any problem you may have. Whether that be missing skillsets, the need for additional leadership, organization related work impediments, or processes that stand in the way of your success – we will bring these causes to your attention in a clear and concise manner. Moreover, we will help you understand how to fix them. If necessary, we will stay until they are fixed.
We recognize that in taking the above approach, you may not need us back. Our hope is that you will instead refer us to other clients in the future.
Are We “Right” for You?
That’s a question for you, not for us, to answer. We don’t employ sales people who help “close deals” or “shape demand”. We won’t pressure you into making a decision or hound you with multiple calls. We want to work with clients who “want” us to partner with them – partners with whom we can join forces to create an even better product solution.
November 20, 2018 | Posted By: Robin McGlothin
“Quality in a service or product is not what you put into it. It’s what the customer gets out of it.” Peter Drucker
The Importance of QA
High levels of quality are essential to achieving company business objectives. Quality can be a competitive advantage and in many cases will be table stakes for success. High quality is not just an added value, it is an essential basic requirement. With high market competition, quality has become the market differentiator for almost all products and services.
There are many methods followed by organizations to achieve and maintain the required level of quality. So, let’s review how world-class product organizations make the most out of their QA roles. But first, let’s define QA.
According to Wikipedia, quality assurance is “a way of preventing mistakes or defects in products and avoiding problems when delivering solutions or services to customers.” But there’s much more to quality assurance.
There are numerous benefits of having a QA team in place:
- Helps increase productivity while decreasing costs (QA headcount typically costs less)
- Effective for saving costs by detecting and fixing issues and flaws before they reach the client
- Shifts focus from detecting issues to issue prevention
Teams and organizations looking to get serious about (or to further improve) their software testing efforts can learn something from looking at how the industry leaders organize their testing and quality assurance activities. It stands to reason that companies such as Google, Microsoft, and Amazon would not be as successful as they are without paying proper attention to the quality of the products they’re releasing into the world. Taking a look at these software giants reveals that there is no one single recipe for success. Here is how five of the world’s best-known product companies organize their QA and what we can learn from them.
Google: Searching for best practices
How does the company responsible for the world’s most widely used search engine organize its testing efforts? It depends on the product. The team responsible for the Google search engine, for example, maintains a large and rigorous testing framework. Since search is Google’s core business, the team wants to make sure that it keeps delivering the highest possible quality, and that it doesn’t screw it up.
To that end, Google employs a four-stage testing process for changes to the search engine, consisting of:
- Testing by dedicated, internal testers (Google employees)
- Further testing on a crowdtesting platform
- “Dogfooding,” which involves having Google employees use the product in their daily work
- Beta testing, which involves releasing the product to a small group of Google product end users
Even though this seems like a solid testing process, there is room for improvement, if only because communication between the different stages and the people responsible for them is suboptimal (leading to things being tested either twice over or not at all).
But the teams responsible for Google products that are further away from the company’s core business employ a much less strict QA process. In some cases, the only testing is done by the developer responsible for a specific product, with no dedicated testers providing a safety net.
In any case, Google takes testing very seriously. In fact, testers’ and developers’ salaries are equal, something you don’t see very often in the industry.
Facebook: Developer-driven testing
Like Google, Facebook uses dogfooding to make sure its software is usable. Furthermore, it is somewhat notorious for shaming developers who mess things up (breaking a build or causing the site to go down by accident, for example) by posting a picture of the culprit wearing a clown nose on an internal Facebook group. No one wants to be seen on the wall-of-shame!
Facebook recognizes that there are significant flaws in its testing process, but rather than going to great lengths to improve, it simply accepts the flaws, since, as they say, “social media is nonessential.” Also, focusing less on testing means that more resources are available to focus on other, more valuable things.
Rather than testing its software through and through, Facebook tends to use “canary” releases and an incremental rollout strategy to test fixes, updates, and new features in production. For example, a new feature might first be made available only to a small percentage of the total number of users.
Canary Incremental Rollout
By tracking the usage of the feature and the feedback received, the company decides either to increase the rollout or to disable the feature, either improving it or discarding it altogether.
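The incremental rollout described above can be sketched in a few lines. This is a minimal illustration, not a description of Facebook’s actual tooling: the function name `in_canary`, the feature name `new-feed`, and the hashing scheme are all assumptions made for the example. Each user is hashed into a stable bucket so that the same user keeps seeing the feature as the percentage grows.

```python
import hashlib


def in_canary(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Return True if this user falls inside the rollout bucket for the feature.

    Hypothetical helper: hashing feature + user gives each user a stable
    bucket in the range 0..9999, independent of any other feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < rollout_percent * 100


# Simulate widening the rollout from 5% to 20% of a made-up user base.
users = [f"user-{i}" for i in range(10_000)]
exposed_at_5 = {u for u in users if in_canary(u, "new-feed", 5)}
exposed_at_20 = {u for u in users if in_canary(u, "new-feed", 20)}

# Everyone who saw the feature at 5% still sees it at 20%.
print(exposed_at_5 <= exposed_at_20)
```

Setting the percentage to 0 disables the feature for everyone, which is the "disable" branch of the decision described above.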
Amazon: Deployment comes first
Like Facebook, Amazon does not have a large QA infrastructure in place. It has even been suggested (at least in the past) that Amazon does not value the QA profession. Its ratio of about one test engineer to every seven developers also suggests that testing is not considered an essential activity at Amazon.
The company itself, though, takes a different view of this. To Amazon, the ratio of testers to developers is an output variable, not an input variable. In other words, as soon as it notices that revenue is decreasing or customers are moving away due to anomalies on the website, Amazon increases its testing efforts.
The feeling at Amazon is that its development and deployment processes are so mature (the company famously deploys software every 11.6 seconds!) that there is no need for elaborate and extensive testing efforts. It is all about making software easy to deploy, and, equally if not more important, easy to roll back in case of a failure.
Spotify: Squads, tribes and chapters
Spotify does employ dedicated testers. They are part of cross-functional teams, each with a specific mission. At Spotify, employees are organized according to what’s become known as the Spotify model, constructed of:
- Squads. A squad is basically the Spotify take on a Scrum team, with less focus on practices and more on principles. A Spotify dictum says, “Rules are a good start, but break them when needed.” Some squads might have one or more testers, and others might have no testers at all, depending on the mission.
- Tribes. Tribes are groups of squads that belong together based on their business domain. Any tester that’s part of a squad automatically belongs to the overarching tribe of that squad.
- Chapters. Across different squads and tribes, Spotify also uses chapters to group people that have the same skillset, in order to promote learning and sharing experiences. For example, all testers from different squads are grouped together in a testing chapter.
- Guilds. Finally, there is the concept of a guild. A guild is a community of members with shared interests. These are a group of people across the organization who want to share knowledge, tools, code and practices.
Spotify Team Structure
Testing at Spotify is taken very seriously. Just like programming, testing is considered a creative process, and something that cannot be (fully) automated. Contrary to most other companies mentioned, Spotify heavily relies on dedicated testers that explore and evaluate the product, instead of trying to automate as much as possible. One final fact: In order to minimize the efforts and costs associated with spinning up and maintaining test environments, Spotify does a lot of testing in its production environment.
Microsoft: Engineers and testers are one
Microsoft’s ratio of testers to developers is currently around 2:3, and like Google, Microsoft pays testers and developers equally—except they aren’t called testers; they’re software development engineers in test (or SDETs).
The high ratio of testers to developers at Microsoft is explained by the fact that a very large chunk of the company’s revenue comes from shippable products that are installed on client computers & desktops, rather than websites and online services. Since it’s much harder (or at least much more annoying) to update these products in case of bugs or new features, Microsoft invests a lot of time, effort, and money in making sure that the quality of its products is of a high standard before shipping.
What can you learn from world-class product organizations?
If the culture, views, and processes around testing and QA can vary so greatly at five of the biggest tech companies, then it may be true that there is no one right way of organizing testing efforts. All five have crafted their testing processes, choosing what fits best for them, and all five are highly successful. They must be doing something right, right?
Still, there are a few takeaways that can be derived from the stories above to apply to your testing strategy:
- There’s a “testing responsibility spectrum,” ranging from “We have dedicated testers that are primarily responsible for executing tests” to “Everybody is responsible for performing testing activities.” You should choose the one that best fits the skillset of your team.
- There is also a “testing importance spectrum,” ranging from “Nothing goes to production untested” to “We put everything in production, and then we test there, if at all.” Where your product and organization belong on this spectrum depends on the risks that will come with failure and how easy it is for you to roll back and fix problems when they emerge.
- Test automation has a significant presence in all five companies. The extent to which it is implemented differs, but all five employ tools to optimize their testing efforts. You probably should too.
Bottom line, QA is relevant and critical to the success of your product strategy. If you’ve tried to implement a new QA process but failed, we can help.
September 10, 2018 | Posted By: Robin McGlothin
The Scalability Cube – Your Guide to Evaluating Scalability
Perhaps the most common question we get at AKF Partners when performing technical due diligence on a company is, “Will this thing scale?” After all, investors want to see a return on their investment in a company, and a common way to achieve that is to grow the number of users on an application or platform. How do they ensure that the technology can support that growth? By evaluating scalability.
Let’s start by defining scalability from the technical perspective. The Wikipedia definition of “scalability” is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. That definition is accurate when applied to common investment objectives. The question is, what are the key attributes of software that allow it to scale, along with the anti-patterns that prevent scaling? Or, in other words, what do we look for at AKF Partners when determining scalability?
While an exhaustive list is beyond the scope of this blog post, we can quickly use the Scalability Cube and apply the analytical methodology that helps us quickly determine where the application will experience issues.
AKF Partners introduced the scalability cube, a scale design model for building resilient application architectures using patterns and practices that apply broadly to any application. This is a best practices model that describes all scale dimensions from “The Art of Scalability” book (AKF Partners – Abbott, Keeven & Fisher Partners).
The “Scale Cube” is composed of an X-Axis, Y-Axis, and Z-Axis:
1. X-Axis: Technical Architectural Layering. No single points of failure. Duplicate everything.
2. Y-Axis: Functional Decomposition Segmentation. Componentization to Modules & Microservices. Split Report, Message, Locate, Forms, Calendar into fault isolated swim lanes.
3. Z-Axis: Horizontal Data Partitioning (Shards). Beginning with pilot users, start with “podding” users for scalability and availability.
The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed. Scalability is all about the capability of a design to support ever growing client traffic without compromising performance. It is important to understand there are no “silver bullets” in designing scalable solutions.
An architecture is scalable if each layer in the multi-layered architecture is scalable. For example, a well-designed application should be able to scale seamlessly as demand increases and decreases and be resilient enough to withstand the loss of one or more computer resources.
Let’s start by looking at the typical monolithic application. A large system that must be deployed holistically is difficult to scale. In the case where your application was designed to be stateless, scale is possible by adding more machines, virtual or physical. However, adding instances requires powerful machines that are not cost-effective to scale. Additionally, you have the added risk of extensive regression testing because you cannot update small components on their own. Instead, we recommend a microservices-based architecture using containers (e.g. Docker) that allows for independent deployment of small pieces and the scale of individual services instead of one big application.
Monolithic applications have other negative effects, such as development complexity. What is “development complexity”? As more developers are added to the team, be aware of the effects of Brooks’ Law. Brooks’ Law states that adding more software developers to a late project makes the project even later. For example, one large solution loaded in the development environment can slow down a developer, and the problem gets worse as more developers add components. This causes slower and slower load times on development machines, and developers stomping on each other with changes (or creating complex merges) as they modify the same files.
Another example of a development complexity issue is a large, outdated piece of the architecture or database where one person is the expert. That person becomes a bottleneck to changes in a specific part of the system. They are also a SPOF (single point of failure) if they are the only person who understands the monolithic beast. The monolithic complexity and the rate of code change make it hard for any developer to know all the idiosyncrasies of the system, hence more defects are introduced. A decoupled system with small components helps prevent this problem.
When validating database design for appropriate scale, there are some key anti-patterns to check. For example:
• Do synchronous database accesses block other connections to the database when retrieving or writing data? This design can end up blocking queries and holding up the application.
• Are queries written efficiently? Large data footprints, with significant locking, can quickly slow database performance to a crawl.
• Is there a heavy report function in the application that relies on a single transactional database? Report generation can severely hamper the performance of critical user scenarios. Separating out read-only data from read-write data can positively improve scale.
• Can the data be partitioned across different load databases and/or database servers (sharding)? For example, Customers in different geographies may be partitioned to various servers more compatible with their locations. In turn, separating out the data allows for enhanced scale since requests can be split out.
• Is the right database technology being used for the problem? Storing BLOBs in a relational database has negative effects – instead, use the right technology for the job, such as a NoSQL document store. Forcing less structured data into a relational database can also lead to waste and performance issues, and here, a NoSQL solution may be more suitable.
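To make the partitioning question above concrete, here is a minimal sketch of routing a customer to a shard by geography first and by a hash of the customer ID second. The region names, shard names, and the `shard_for` function are hypothetical and chosen only for illustration; a real system would also need a plan for moving data when shards are added, since simple modulo hashing reassigns many keys when the pool size changes.

```python
import hashlib

# Hypothetical shard pools per geography; names are illustrative only.
REGION_SHARDS = {
    "us": ["us-db-0", "us-db-1"],
    "eu": ["eu-db-0", "eu-db-1", "eu-db-2"],
}
DEFAULT_POOL = ["global-db-0"]


def shard_for(customer_id: str, region: str) -> str:
    """Pick a database shard for a customer.

    The region selects a pool close to the customer; a stable hash of the
    customer ID then spreads customers across the hosts in that pool.
    """
    pool = REGION_SHARDS.get(region, DEFAULT_POOL)
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(pool)
    return pool[index]


# The same customer always lands on the same shard.
print(shard_for("customer-1001", "eu") == shard_for("customer-1001", "eu"))
```

Because every request for a given customer resolves to one shard, load from different customers can be split across servers instead of contending for a single database.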
We also look for mixed presentation and business logic. A software anti-pattern that can be prevalent in legacy code is not separating out the UI code from the underlying logic. This practice makes it impossible to scale individual layers of the application and takes away the capability to easily do A/B testing to validate different UI changes. Layer separation allows putting just enough hardware against each layer for more minimal resource usage and overall cost efficiency. The separation of the business logic from SPROCs (stored procedures) also improves the maintainability and scalability of the system.
Another key area we dig for is stateful application servers. Designing an application that stores state on an individual server is problematic for scalability. For example, if some business logic runs on one server and stores user session information (or other data) in a cache on only one server, all user requests must use that same server instead of a generic machine in a cluster. This prevents adding new machine instances that can field any request that a load balancer passes its way. Caching is a great practice for performance, but it cannot interfere with horizontal scale.
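One common way around the stateful-server problem is to keep session data in a store that every application server can reach. The sketch below is an assumption-laden illustration: `SharedSessionStore` is an in-memory stand-in for an external cache or database, and `AppServer` is a hypothetical request handler. The point it shows is that either server can field the next request for the same session.

```python
class SharedSessionStore:
    """In-memory stand-in for an external session store (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._data.get(session_id, {}))

    def put(self, session_id, session):
        self._data[session_id] = dict(session)


class AppServer:
    """A server that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        session["cart"] = cart
        self.store.put(session_id, session)
        return {"served_by": self.name, "cart": cart}


store = SharedSessionStore()
server_a = AppServer("app-a", store)
server_b = AppServer("app-b", store)

# A load balancer could send consecutive requests to different servers.
server_a.add_to_cart("session-1", "book")
response = server_b.add_to_cart("session-1", "pen")
print(response["cart"])
```

With state held outside the servers, new instances can be added behind the load balancer without pinning users to a particular machine.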
Finally, long-running jobs and/or synchronous dependencies are key areas for scalability issues. Actions on the system that trigger processing times of minutes or more can affect scalability (e.g. execution of a report that requires large amounts of data to generate). Continuing to add machines to the set doesn’t help the problem as the system can never keep up in the presence of many requests. Blocking operations exacerbate the problem. Look for solutions that queue up long-running requests, execute them in the background, send events when they are complete (asynchronous communication) and do not tie up key application and database servers. Communication with dependent systems for long-running requests using synchronous methods also affects performance, scale, and reliability. Common solutions for intersystem communication and asynchronous messaging include RabbitMQ and Kafka.
Again, the list above is not exhaustive but outlines some key areas that AKF Partners looks for when evaluating an architecture for scalability. If you’re looking for a checklist to help you perform your own diligence, feel free to use ours. If you’re wondering more about our diligence practice, you may be interested in our thoughts on best practices, or our beliefs around diligence and how to get it right. We’ve performed technical diligence for seed rounds, A-series and beyond, carve-outs, strategic investments and taking public companies private, with investments ranging from $5 million to over $1 billion. No matter the size of company or size of the investment, we can help.
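The queue-and-background-worker pattern described above can be sketched with Python’s standard library alone. This is a single-process illustration under stated assumptions: the in-process `queue.Queue` stands in for a real message broker, and `submit_report`, the `results` dictionary, and the "rows" parameter are hypothetical names invented for the example.

```python
import queue
import threading
import uuid

jobs = queue.Queue()          # stand-in for a message broker
results = {}                  # stand-in for a job-status table
results_lock = threading.Lock()


def submit_report(params):
    """Enqueue a long-running request and return immediately with a job ID."""
    job_id = str(uuid.uuid4())
    with results_lock:
        results[job_id] = {"status": "queued"}
    jobs.put((job_id, params))
    return job_id


def worker():
    """Process queued jobs in the background until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, params = item
        total = sum(range(params["rows"]))  # placeholder for slow report work
        with results_lock:
            results[job_id] = {"status": "done", "result": total}
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

job_id = submit_report({"rows": 1000})  # the caller is not blocked here
jobs.join()                             # wait only so the demo can print
jobs.put(None)
thread.join()
print(results[job_id]["status"])
```

The request path only enqueues work and returns a job ID; the caller can poll for status or be notified by an event when the job completes, so application servers are not tied up while the report runs.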
July 20, 2018 | Posted By: Pete Ferguson
One of the most common questions we get is “What are the most common failures you see tech and product teams make?” To answer that question we queried our database consisting of 11 years of anonymous client recommendations. Here are the top 20 most repeated failures and recommendations:
1) Failing to Design for Rollback
If you are developing a SaaS platform and you can only make one change to your current process, make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad release in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward” for you to agree this is of the utmost importance. The one thing that is most likely to give you an opportunity to find other work (i.e. “get fired”) is to roll a product that destroys your business. In other words, if you are new to your job DO THIS BEFORE ANYTHING ELSE; if you have been in your job for a while and have not done this DO THIS TOMORROW. (Related Content: Monitoring for Improved Fault Detection)
2) Confusing Product Release with Product Success
Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives like a release increasing signups by 10%, or increasing checkouts by 15% or increasing the average sale price of all checkouts by 12% or increasing click-through-rates by 22%. See #10 below on incenting a culture of excellence. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholders wealthy! (Related Content: Agile and the Cone of Uncertainty)
3) Insular Product Development / Engineering
How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised about some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Organize around your outcomes and “what you produce” in cross functional teams rather than around activities and “how you work.” (Related Content: The No Surprises Rule)
4) Over Engineering the Solution
One of our favorite company mottos is “simple solutions to complex problems”. The simpler the solution, the lower the cost and the faster the time to market. If you get blank stares from peers or within your organization when you explain a design do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in resolving the complexity.
5) Allowing History to Repeat itself
Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize the value of the team. In the operations world, a failure to correlate past site incidents and find thematically related root causes is a guarantee to continue to fight the same fires over and over. The best and easiest way to improve our future performance is to track our past failures, group them into groups of causation and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues and improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.
6) Vendor Lock
Every vendor has a quick fix for your scale issues. If you are a hyper growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. This is not to say that after you design your system to scale horizontally that you will not rely upon some technology to help you; rather, once you define how you can horizontally scale you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases (and NoSQL solutions) provide for multiple types of native replication to keep hosts in synch.
7) Relying on QA to Find Your Mistakes
You cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem and that is where the problem should be treated. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering! Consider implementing test driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. All of this is not to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering.
8) Revolutionary or “Big Bang” Fixes
In our experience, complete re-writes or re-architecture efforts end up somewhere on the spectrum of not returning the desired ROI to complete and disastrous failures. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cutover to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.
9) The Multiplicative Effect of Failure – Eliminate Synchronous Calls
Every time you have one service call another service in a synchronous fashion you are lowering your theoretical availability. If each of your services is designed to be 99.999% available, where a service is a database, application server, application, webserver, etc., then the product of the availabilities of all of the service calls is your theoretical availability. Five calls is (.99999)^5 or 99.995% availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.
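The arithmetic above is easy to check. The short sketch below multiplies the availabilities of services called in series and converts the result into expected downtime per year; the function names are our own, and the calculation assumes the services fail independently.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def series_availability(availabilities):
    """Availability of a chain of synchronous calls: the product of the parts."""
    return prod(availabilities)


def downtime_minutes_per_year(availability):
    """Expected minutes of unavailability in a 365-day year."""
    return (1 - availability) * MINUTES_PER_YEAR


single = 0.99999
five_calls = series_availability([single] * 5)

print(f"one service: {single:.5%}, {downtime_minutes_per_year(single):.1f} min/year")
print(f"five in series: {five_calls:.5%}, {downtime_minutes_per_year(five_calls):.1f} min/year")
```

Under these assumptions a single “five 9s” service is down for roughly 5 minutes a year, while five such services chained synchronously are down for roughly 26 minutes a year.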
10) Failing to Create and Incentivize a Culture of Excellence
Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth. (Related Content: Three Reasons Your Software Engineers May Not Be Successful)
11) Under-Engineer for Scale
The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now! That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.
12) “Not Built Here” Culture
We see this all the time. You may even have agreed with point (6) above because you have a “we are the smartest people in the world and we must build it ourselves” culture. The point about not relying upon third parties to scale was not meant as an excuse to build everything yourselves. The real point to be made is that you have to focus on your core competencies and not dilute your engineering efforts with things that other companies or open source providers can do better than you. Unless you are building databases as a business, you are probably not the best database builder. And if you are not the best database builder, you have no business building your own databases for your SaaS platform. Focus on what you should be the best at: building functionality that maximizes your shareholder wealth and scaling your platform. Let other companies focus on the other things you need like routers, operating systems, application servers, databases, firewalls, load balancers and the like.
13) A New PDLC will Fix My Problems
Too often CTOs see repeated problems in their product development life cycles such as missing dates or dissatisfied customers and blame the PDLC itself.
The real problem, regardless of the lifecycle you use, is likely one of commitment and measurement. For instance, in most Agile lifecycles there needs to be consistent involvement from the business or product owner. A lack of involvement leads to misunderstandings and delayed products. Another very common problem is an incomplete understanding or training on the existing PDLC. Everyone in the organization should have a working knowledge of the entire process and how their roles fit within it. Most often, the biggest problem within a PDLC is the lack of progress measurement to help understand likely dates and the lack of an appropriate “product discovery” phase to meet customer needs. (Related Content: The Top Five Most Common PDLC Failures)
14) Inability to Hire Great People Quickly
Often when growing an engineering team quickly the engineering managers will push back on hiring plans and state that they cannot possibly find, interview, and hire engineers that meet their high standards. We agree that hiring great people takes time and hiring decisions are some of the most important decisions managers can make. A poor hiring decision takes a lot of energy and time to fix. However, there are lots of ways to streamline the hiring process in order to recruit, interview, and make offers very quickly. A useful idea that we have seen work well in the past is interview days, where potential candidates are all invited on the same day. This should be no more than 2 - 3 weeks out from the initial phone screen, so having an interview day per month is a great way to get most of your interviewing done in a single day. Because you optimize the interview process people are much more efficient and it is much less disruptive to the daily work that needs to get done the rest of the month. Post interview discussions and hiring decisions should all be made that same day so that candidates get offers or letters of regret quickly; this will increase the likelihood of offers being accepted or make a professional impression on those not getting offers. The key is to start with the right answer that “there is a way to hire great people quickly” and the myriad of ways to make it happen will be generated by a motivated leadership team.
15) Diminishing or Ignoring SPOFs (Single Point of Failure)
A SPOF is a SPOF and even if the impact to the customer is low it still takes time away from other work to fix right away in the event of a failure. And there will be a failure…because that is what hardware and software do: they work for a long time and then eventually they fail! As you should know by now, it will fail at the most inconvenient time. It will fail when you have just repurposed the host that you were saving for it or it will fail while you are releasing code. Plan for the worst case and have it run on two hosts (we actually recommend to always deploy in pools of three or more hosts) so that when it does fail you can fix it when it is most convenient for you.
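A rough calculation shows why a pool of hosts helps. The sketch below assumes that hosts fail independently and that any one surviving host can carry the load, which is an idealization; the 99% per-host figure and the function name are illustrative assumptions, not measured values.

```python
def pool_availability(host_availability: float, hosts: int) -> float:
    """Probability that at least one host in the pool is up.

    Assumes independent failures and that a single host can serve all traffic.
    """
    return 1 - (1 - host_availability) ** hosts


for hosts in (1, 2, 3):
    print(hosts, f"{pool_availability(0.99, hosts):.6f}")
```

Even with modest per-host availability, each added host shrinks the chance that the whole pool is down at once, which is what lets you fix a failed host at a convenient time rather than immediately.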
16) No Business Continuity Plan
No one expects a disaster but they happen and if you cannot keep up normal operations of the business you will lose revenue and customers that you might never get back. Disasters can be huge, like Hurricane Katrina, where it takes weeks or months to relocate and start the business back up in a new location. Disasters can also be small like a winter snow storm that keeps everyone at home for two days or a HAZMAT spill near your office that keeps employees from coming to work. A solid business continuity plan is something that is thought through ahead of time, before you need it, and explains to everyone how they will operate in the event of an emergency. Perhaps your satellite office will pick up customer questions or your tech team will open up an IRC channel to centralize communication for everyone capable of working remotely. Do you have enough remote connections through your VPN server to allow for remote work? Spend the time now to think through what and how you will operate in the event of a major or minor disruption of your business operations and document the steps necessary for recovery.
17) No Disaster Recovery Plan
Even worse, in our opinion, than not having a BC plan is not having a disaster recovery plan. If your company is a SaaS-based company, the site and services provided are the company’s sole source of revenue! Moreover, with a SaaS company, you hold all the data for your customers that allow them to operate. When you are down they are more than likely seriously impaired in attempting to conduct their own business. When your collocation facility has a power outage that takes you completely down, think 365 Main datacenter in San Francisco, how many customers of yours will leave and never return? Our preference is to provide your own disaster recovery through multiple collocation facilities but if that is not yet technically feasible or in the budget, at a minimum you need your code, executables, configurations, loads, and data offsite and an agreement in place for both collocation services as well as hosts. Lots of vendors offer such packages and they should be thought of as necessary business insurance.
If you are cloud hosted, this still applies to you! We often find in technical due diligence reviews that small companies who are rapidly growing haven’t yet initiated a second active tech stack in a different availability zone or with a second cloud provider. Just because AWS, Azure and others have a fairly reliable track record doesn’t mean they always will. You can outsource services, but you still own the liability!
18) No Product Management Team or Person
In a similar vein to #13 above, there needs to be someone or a team of people in the organization who have responsibility for the product lines. They need to have authority to make decisions about what features get added, which get delayed, and which get deprecated (yes, we know, nothing ever gets deprecated but we can always hope!). Ideally these people have ownership of business goals (see #10) so they feel the pressure to make great business decisions.
19) Failing to Implement Continuously
Just because you call it scheduled maintenance does not mean that it does not count against your uptime. While some of your customers might be willing to endure the frustration of having the site down when they want to access it in order to get some new features, most care much more about the site being available when they want it. They are on the site because the existing features serve some purpose for them; they are not there in the hopes that you will roll out a certain feature that they have been waiting on. They might want new features, but they rely on existing features. There are ways to roll code, even with database changes, without bringing the site down (back to #17: multiple active sites also allow for continuous implementation and the ability to roll back). It is important to put these techniques and processes in place so that you plan for 100% availability instead of planning for much less because of planned downtime.
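To make the cost of planned downtime concrete, here is a small worked calculation. The four-hour monthly maintenance window and the 30-day month are assumed figures chosen purely for illustration; they do not come from any particular site.

```python
def availability(downtime_minutes: float, period_minutes: float) -> float:
    """Fraction of the period during which the site was reachable."""
    return 1.0 - downtime_minutes / period_minutes


# Assumed figures for illustration only.
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes in a 30-day month
planned_downtime = 4 * 60          # one 4-hour "scheduled maintenance" window
unplanned_downtime = 0             # an otherwise perfect month

monthly = availability(planned_downtime + unplanned_downtime, MINUTES_PER_MONTH)
print(f"{monthly:.4%}")  # prints 99.4444%
```

Under these assumptions, a single maintenance window caps the month at roughly 99.44% before any unplanned outage is counted.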
20) Firewalls, Firewalls, Everywhere!
We often see technology teams that have put all public-facing services behind firewalls, and many go so far as to put firewalls between every tier of the application. Security is important because there are always people trying to do malicious things to your site, whether through directed attacks or random scripts port scanning your site. However, security needs to be balanced against the increased cost as well as the degradation in performance. It has been our experience that too often tech teams throw up firewalls instead of doing the real analysis to determine how they can mitigate risk in other ways, such as through the use of ACLs and LAN segmentation. You as the CTO ultimately have to decide on the right balance of risk and benefit for your site.
Whatever you do, don’t make the mistakes above! AKF Partners helps companies avoid costly product and technology mistakes - and we’ve seen most of them. Give us a call or shoot us an email. We’d love to help you achieve the success you desire.
July 8, 2018 | Posted By: Robin McGlothin
AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident. Business metric monitors will not tell you where or what the problem is, rather – and most importantly – they tell you something appears to be abnormal and should be investigated, that something has affected your customer experience.
A significant part of recovery time (and therefore availability) is the time required to detect and localize service incidents. A 2013 study by Business Internet Group of San Francisco found that of the 40 top-performing websites (as identified by KeyNote Systems), 72% had suffered user-visible failures in common functionality, such as items not being added to a shopping cart or an error message being displayed.
Our conversations with clients confirm that detecting these failures is a significant problem. AKF Partners estimates that 75% of the time spent recovering from application-level failures is time spent detecting them! Application-level failures can sometimes take days to detect, though they are repaired quickly once found. Fast detection of these failures (Time to Detect – TTD) is, therefore, a key problem in improving service availability.
The duration of a product impairment is TTR.
To improve TTR, implement a good notification system that first tells you, based on business metrics, that an error affecting your users is happening. Then rely upon application and system monitoring to tell you where and what has failed. Make sure to have good, easy-to-view logs for all errors, warnings, and other critical data your application creates. We already have many technologies in this space; we just need to employ them effectively, with a focus on safeguarding the client experience.
Two relatively simple methods, both applications of Statistical Process Control (SPC, defined below), can improve TTD:
- Business KPI Monitors (Monitor Real User Behavior): Passively monitor critical user transactions such as logins, queries, reports, etc. Use math to determine abnormal behavior. This is the first line of defense.
- Synthetic Transactions (Simulate User Behavior): Synthetic transactions are scripted actions that attempt to mimic real customer behavior. Examples might be sign-ons, add to cart, etc. They provide a more meaningful view of your customers’ experiences vs. just looking at page load times, error rates, and similar. Do this with Keynote or a similar product and expand it to an internal systems scope. Alerts from a passive monitor can be confirmed or denied and escalated as appropriate. This is the second line of defense.
Monitor the bad: both potential and actual bad things (alert before they happen). Then tune and continuously improve (iterate!).
If you can’t identify all problem areas, identify as many as possible. The best monitoring starts before there’s a problem and extends beyond the crisis.
Because once the crisis hits, that’s when things get ugly! That’s when things start falling apart and people point fingers.
At times, failures do not disable the whole site, but instead cause brown-outs, where part of a site’s functionality is disabled or only some users are unable to access the site. Many of these failures are application-level failures that change the user-visible functionality of a service but do not cause obvious lower-level failures detectable by service operators. Effective monitoring will detect these faults as well.
The more proactive you can be about identifying the issues, the easier it will be to resolve and prevent them.
In fault detection, the aim is to determine whether an abnormal event happened or when an application being monitored is out of control. The early detection of a fault condition is important in avoiding quality issues or system breakdown, and this can be achieved through the proper design of effective statistical process control with upper & lower limits identified. If the values of the monitoring statistics exceed the control limits of the corresponding statistics, a fault is detected. Once a fault condition has been positively detected, the next step is to determine the root cause of the out-of-control status.
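The control-limit check described above can be sketched in a few lines. This is a minimal illustration, not a production monitor: the three-sigma limits follow a common SPC convention rather than anything specified here, and the logins-per-minute figures and function names are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits derived from a baseline sample.

    Three-sigma limits are an assumed convention; tune for your metric.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def detect_faults(baseline, observed, sigmas=3.0):
    """Return (index, value) pairs for observations outside the limits."""
    lower, upper = control_limits(baseline, sigmas)
    return [(i, v) for i, v in enumerate(observed) if v < lower or v > upper]


# Hypothetical logins-per-minute samples.
baseline_logins = [100, 104, 98, 101, 97, 103, 99, 102, 100, 96]
observed_logins = [101, 99, 62, 100]

faults = detect_faults(baseline_logins, observed_logins)
print(faults)  # [(2, 62)]
```

A flagged point only says that the metric is out of control; finding the root cause still falls to application and system monitoring.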
One downside of the SPC method is that significant changes in amplitude (natural increases in your business metrics) can cause problems. An alternative to SPC is First and Second Derivative testing. These tests tell if the actual and expected curve forms are the same.
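One way to sketch a derivative test is to compare the discrete first and second differences of the expected and actual curves. The version below is an assumption-laden illustration: dividing each series by its own mean (so that natural growth does not register as a fault), the 0.15 tolerance, and all of the sample data are choices made for this example.

```python
def differences(series):
    """Discrete derivative: change between consecutive samples."""
    return [b - a for a, b in zip(series, series[1:])]


def normalize(series):
    """Scale a series by its mean so only its shape remains (an assumed choice)."""
    avg = sum(series) / len(series)
    return [v / avg for v in series]


def shape_mismatches(expected, actual, tolerance=0.15):
    """Indices where first or second differences of the two curves disagree.

    Each mismatch is attributed to the latest sample involved in it.
    """
    exp_d1 = differences(normalize(expected))
    act_d1 = differences(normalize(actual))
    exp_d2 = differences(exp_d1)
    act_d2 = differences(act_d1)

    flagged = set()
    for i, (e, a) in enumerate(zip(exp_d1, act_d1)):
        if abs(e - a) > tolerance:
            flagged.add(i + 1)
    for i, (e, a) in enumerate(zip(exp_d2, act_d2)):
        if abs(e - a) > tolerance:
            flagged.add(i + 2)
    return sorted(flagged)


# Hypothetical hourly bid counts.
last_week = [100, 120, 150, 170, 160, 130]
grown_20_percent = [120, 144, 180, 204, 192, 156]   # same shape, higher volume
sudden_dip = [120, 144, 180, 90, 192, 156]          # shape breaks at index 3

print(shape_mismatches(last_week, grown_20_percent))  # []
print(shape_mismatches(last_week, sudden_dip))
```

In this sketch the uniformly grown curve produces no alerts because its shape is unchanged, while the dip is flagged starting at the sample where the curve departs from the expected form.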
Here’s a real-world example of where business metrics help us determine changes in normal usage at eBay.
We had near real-time graphs of user metrics such as bids, listings, logins, and new user registrations. The data was graphed week over week. Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys. These graphs were displayed in the Network Operations Center, which was staffed 24x7. Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.
Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nosedive. The NOC quickly initiated the SEV1 process and technical resources checked their areas. The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower. Roughly 20 minutes into the SEV1 process, the root cause was identified. The finale of American Idol was being broadcast. Our site was fine – but our customers had other things on their minds. The business metric monitors worked – they gave warning of an aberrant usage pattern.
How would your company react to this critical change in normal usage patterns? Use business metric monitors to detect workload shifts.
June 18, 2018 | Posted By: Pete Ferguson
In my short tenure at AKF, I have found the topic of Stored Procedures (SPROCs) to be provocatively polarizing. As we conduct a technical due diligence with a fairly new upstart for an investment firm and ask if they use stored procedures on their database, we often get a puzzled look as though we just accused them of dating their sister and their answer is a resounding “NO!”
However, when conducting assessments of companies that have been around awhile and are struggling to quickly scale, move to a SaaS model, and/or migrate from hosted servers to the cloud, we find “server huggers” who love to keep their stored procedures on their database.
At two different clients earlier this year, we found companies who have thousands of stored procedures in their database. What was once seen as a time-saving efficiency is now one of several major obstacles to SaaS and cloud migration.
In our book, Scalability Rules: Principles for Scaling Web Sites (Abbott, Martin L.), Marty outlines many reasons why stored procedures should not be kept in the database. Here are the top 8:
- Cost: Databases tend to be one of the most expensive systems or services within the system architecture. The cost of each transaction increases with each additional SPROC. Increasing the cost of scale by making a synchronous call to the ERP system for each transaction, while also reducing the availability of the product platform by adding yet another system in series, doesn’t make good business sense.
- Creates a Monolith: SPROCs on a database create a monolithic system which cannot be easily scaled.
- Limits Scalability: The database is a governor of scale; SPROCs steal capacity by running work other than relational transactions on the database.
- Limits Automated Testing: SPROCs limit the automation of code testing (in many cases it is not as easy to test stored procedures as it is the other code that developers write), slowing time to market and increasing cost while decreasing quality.
- Creates Lock-in: Changing to an open-source or a NoSQL solution requires a plan to migrate SPROCs or replace the logic in the application. It also makes it more difficult to switch to new and compelling technologies, negotiate better pricing, etc.
- Adds Unneeded Complexity to Shard Databases: Using SPROCs and business logic on the database makes sharding and replacement of the underlying database much more challenging.
- Limits Speed To The Weakest Link: Systems should scale independently relative to individual needs. When business logic is tied to the database, each of them needs to scale at the same rate as the system making requests of them - which means growth is tied to the slowest system.
- More Team Composition Flexibility: By separating product and business intelligence in your platform, you can also separate the teams that build and support those systems. If a product team is required to understand how their changes impact all related business intelligence systems, it will slow down their pace of innovation as it significantly broadens the scope when implementing and testing product changes and enhancements.
Per the AKF Scale Cube, we desire to separate dissimilar services - having stored procedures on the database means it cannot be split easily.
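As a minimal sketch of the alternative, the example below keeps a business rule in application code and leaves the database to do only relational work. The `orders` table, the volume discount rule, and the function names are all hypothetical, and SQLite stands in for whatever database is actually in use.

```python
import sqlite3


def order_total(unit_price_cents: int, quantity: int) -> int:
    """Hypothetical pricing rule that might otherwise live in a SPROC:
    10% off orders of 10 or more units."""
    subtotal = unit_price_cents * quantity
    if quantity >= 10:
        subtotal -= subtotal // 10
    return subtotal


def place_order(conn, sku: str, unit_price_cents: int, quantity: int) -> int:
    """Apply the rule in the application, then issue a plain INSERT."""
    total = order_total(unit_price_cents, quantity)
    cur = conn.execute(
        "INSERT INTO orders (sku, quantity, total_cents) VALUES (?, ?, ?)",
        (sku, quantity, total),
    )
    conn.commit()
    return cur.lastrowid


# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, sku TEXT, quantity INTEGER, total_cents INTEGER)"
)

order_id = place_order(conn, "WIDGET-1", 250, 12)
stored_total = conn.execute(
    "SELECT total_cents FROM orders WHERE id = ?", (order_id,)
).fetchone()[0]
print(stored_total)  # 2700
```

Because `order_total` is an ordinary function, it can be unit tested without a database, and the storage engine can be swapped or sharded without rewriting the rule.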
Need help migrating from hosted hardware to the cloud or migrating your installed software to a SaaS solution? We have helped hundreds of companies, from small startups to well-established Fortune 50 companies, better architect, scale, and deliver their products. We offer a host of services, including technical due diligence, onsite workshops, mentoring, and interim staffing for your company.