April 3, 2017 | Posted By: AKF
The most common point of congestion and therefore barrier to scale that we see in our practice is the database. Referring back to our earlier article “Splitting Applications or Services for Scale”, it is very common for engineers to create scalability along the X axis of our cube by persisting data in a single monolithic database and having multiple “cloned” applications servers retrieve and store data within that database. For young companies this is a very good approach as if done properly it will also eliminate the need for persistence or affinity to a given application server and as a result will increase customer perceived availability.
The problem, however, with this single monolithic data structure is threefold:
- Even with clustering technology (the existence of a second physical system or database that can take the load of the first in the event of failure), failures of the primary database will result in short service outages for 100% of the user community.
- This approach ultimately relies solely on technical improvements in cpu speed, memory access speed, memory access size, mass
storage access speeds and size, etc to insure the companies needs for scale.
- Relying upon (2) above in the extreme cases is not the most cost effective solutions as the newest and fastest technologies come at
a premium to older generations of technology and do not necessarily have the same processing power per dollar as older and/or
smaller (fewer cpus etc) systems.
As we have argued in the aforementioned post, a great engineering team will think about how to scale their platform well in advance of the need to rely solely upon partner technology advances. By making small modifications to our previously presented “Scale Cube”, the same concepts applied to the problem of splitting services for scale can be useful in addressing how to split a database for scale. As with the AKF Services Scale Cube, the AKF Database Scale Cube consists of an X, Y and Z axes – each addressing a different approach to scale transactions applied to a database. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic database – a case where all data is located in a single location and all accesses go to this single database.
The X Axis of the cube represents a means of spreading load across multiple instances of a replicated representation of the data. This is the first approach most companies use in scaling databases and is often both the easiest to implement and the least costly in both engineering time and hardware. Many third party and open source databases have native properties or functions that will allow the near real time replication of data to multiple “read databases”. The engineering cost of such an approach is low as typically database calls only need to be identified as a “read” or “write” and sent to the appropriate write database or bank of read databases. The “bank” of read databases should have reads evenly split across this if possible and many companies employ simple 3d party load balancers to perform this distribution.
Included in our Xaxis split are third party and open source caching solutions that allow reads to be split across “cache” hosts before actually reading from a database upon a cache miss. Caching is another simple way to reduce the load on the database but in our experience is not sufficient for hyper growth SaaS sites. If implemented properly, this Xaxis split also can increase availability as if replication is near real time, a read server can be promoted as the singular “write server” in the event of a “write server” failure. The combination of caching and read/write splits (our X axis) is sufficient for many companies but for companies with extreme hyper growth and massive data retention needs it is often not enough.
The Y Axis of our database cube represents a split by function, service or resource just as it did with the service cube. A service might represent a set of usecases and is most often easiest to envision through thinking of it as a verb or action like “login” and a resource oriented split is easiest to envision by thinking of splits as nouns like “account information”. These splits help handle not only the split of transactions across multiple systems as did the X axis, but can also be helpful in speeding up database calls by allowing more information specific to the request to be held in memory rather than needing to make a disk access. Just as with our approach in scaling services, our recommended approach to identify the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team as very often they will require that the application be split up as well. It is possible to take a monolithic application and perform physical splits by say URL/URI to different service or resource oriented pools. While this approach will help spread transaction processing across multiple systems similar to our X axis implementation it may not offer the added benefit of reducing the amount of system memory required by service / pool / resource / application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to focus on specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as “swimlaning” an application and data set, especially when both the database and applications are split to represent a “failure domain” or fault isolative infrastructure.
The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (hash for instance). The most common way to view this is to consider splitting your resources by customer if your entity relationships allow that to happen. In the world of media, you might consider splitting it by article_id or media_id and in the world of commerce a split by product_id might be appropriate. In the case where you split customers from your products and perform splits within customers and products you would be implementing both a Y axis split (splitting by resource or call – customers and products) and a Z axis split (a
modulus of customers and products within their functional splits).
Z axis splits tend to be the most costly for an engineering team to perform as often many functions that might be performed within the database (joins for instance) now need to be performed within the application. That said, if done appropriately they represent the greatest potential for scale for most companies.
April 3, 2017 | Posted By: AKF
Splitting Applications or Services for Scale
Most internet enabled products start their life as a single application running on an appserver or appserver/webserver combination and potentially communicating with a database. Many if not all of the functions are likely to exist within a monolithic application code base making use of the same physical and virtual resources of the system upon which the functions operate: memory, cpu, disk, network interfaces, etc. Potentially the engineers have the forethought to make the system highly available by positioning a second application server in the mix to be used in the event that the first application server fails.
This monolithic design will likely work fine for many sites that receive low levels of traffic. However, if the product is very successful and receives wide and fast adoption user perceived response times are likely to significantly degrade to the point that the product is almost entirely unusable. At some point, the system will likely even fail under the load as the inbound request rate is significantly greater than the processing power of the system and the resulting departure rate of responses to requests.
A great engineering team will think about how to scale their platform well in advance of such a catastrophic failure. There are many ways to approach how to think about such scalability of a platform and we present several through a representation of a three dimensional cube addressing three approaches to scale that we call the AKF Scale Cube.
The AKF Scale Cube (aka Scale Cube and AKF Cube) consists of an X, Y and Z axes – each addressing a different approach to scale a service. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic service or product identified above: a product wherein all functions exist within a single code base on a single server making use of that server’s finite resources of memory, cpu speed, network ports, mass storage, etc.
The X Axis of the cube represents a means of spreading load across multiple instances of the same application and data set. This is the first approach most companies use to scale their services and it is effective in scaling from a request per second perspective. Oftentimes it is sufficient to handle the scale needs of a moderate sized business. The engineering cost of such an approach is low compared to many of the other options as no significant rearchitecting of the code base is required unless the engineering team needs to eliminate affinity to a specific server because the application maintains state. The approach is simple: clone the system and service and allow it to exist on N servers with each server handling 1/Nth the total requests. Ideally the method of distribution is a loadbalancer configured in a highly available manner with a passive peer that becomes active should the active peer fail as a result of hardware or software problems. We do not recommend leveraging roundrobin DNS as a method of load balancing. If the application does maintain state there are various ways of solving this including a centralized state service, redesigning for statelessness, or as a last resort using the load balancer to provide persistent connections. While the Xaxis approach is sufficient for many companies and distributes the processing of requests across several hosts it does not address other potential bottlenecks like memory constraints where memory is used to cache information or results.
The Y Axis of the cube represents a split by function, service or resource. A service might represent a set of usecases and is most often easiest to envision through thinking of it as a verb or action like “login” and a resource oriented split is easiest to envision by thinking of splits as nouns like “account information”. These splits help handle not only the split of transactions across multiple systems as did the X axis, but can also be helpful in reducing or distributing the amount of memory dedicated to any given application across several systems. A recommended approach to identify the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team as very often they will require that the application be split up as well. As a quick first step, a monolithic application can be placed on multiple servers and dedicate certain of those servers to specific “services” or URIs. While this approach will help spread transaction processing across multiple systems similar to our X axis implementation it may not offer the added benefit of reducing the amount of system memory required by service/pool/resource/application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to focus on specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as
“swimlaning” an application.
The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (hash for instance). As with the Y axis split, this split aids not only fault isolation, but significantly reduces the amount of memory necessary
(caching, etc) for most transactions and also reduces the amount of stabile storage to which the device/service needs attach. In this case, you might try a modulus by content id (article), or listing id, or a hash from the received IP address, etc. The Z axis split is often the most costly of all splits and we only recommend it for clients that have hypergrowth or very high rates of transaction. It should only be used after a company has implemented a very granular split along the Y axis. That said, it also can offer the greatest degree of scalability as the number of “swimlanes within swimlanes” that it creates is virtually limitless. For instance, if a company implements a Z axis split as a modulus of some transaction id and the implementation is a configurable number “N”, then N can be 10, 100, 1000, etc and each order of magnitude increase in N creates nearly an order of magnitude of greater scale for the company.
April 3, 2017 | Posted By: AKF
Here are a baker’s dozen of items that we feel are Best Practices for Scalability:
Use asynchronous communication when possible. Synchronous calls tie the availability of the two services together. If one has a failure or is slow the other one is affected.
2. Swim Lanes
Create fault isolated “swim lanes” of hardware by customer segmentation. This prevents problems with one customer from causing issues across all customers. This also helps with diagnosis of issues and code roll outs.
Make use of cache at multiple layers including object caches in front of databases (such as memcached), page or item caches for content (such as squid) and edge caches (such as Akamai).
Understand your application’s performance from a customer’s perspective. Monitor outside of your network and have tests that simulate a real user’s experience. Also monitor the internal working of the application in terms of query and transaction execution count and timing.
Replicate databases for recovery as well as to off load reads to multiple instances.
Split the application and databases by service and / or by customer using a modulus. While this requires slightly more complicated logic in the application it allows for massive scaling.
7. Use Few RDBMS Features
Use the OLTP database as a persistent storage device as much as possible. The more you rely on the features offered in most RDBMS for your transactions, the greater load you are putting on the hardest item in your system to scale. Remove all business logic from the database such as stored procedures and move it into the application. When significant scaling is required join in the application and not through the SQL.
8. Slow Roll
Roll out new code versions slowly, to a small subset of your servers without bringing the entire site down. This requires that all code be backwards compatible because you will have two versions of code running in production during the roll out. This method allows you to find problems that your quality and L&P testing missed while having minimal impact on customers.
9. Load & Performance Testing
Test the performance of the application version before it goes into production. This will not catch all the issues, which is why you need the ability to rollback, but it is very worthwhile.
10. Capacity Planning / Scalability Summits
Know how much capacity you have on all tiers and services in your system. Use Scalability Summits to plan for the increased capacity demands.
Always have the ability to rollback a code release.
12. Root Cause Analysis
Ensure you have a learning culture that is evident by utilizing Root Cause Analysis to find and fix the real cause of issues.
13. Quality From The Beginning
Quality can’t be tested into a product, it must be designed in from the beginning.
April 3, 2017 | Posted By: AKF
As a frequent technology writer I often find myself referring to the method or process that teams use to produce software. The two terms that are usually given for this are software development life cycle (SDLC) and product development life cycle (PDLC). The question that I have is are these really interchangeable? I don’t think so and here’s why.
Wikipedia, our collective intelligence, doesn’t have an entry for PDLC, but explains that the product life cycle has to do with the life of a product in the market and involves many professional disciplines. According to this definition the stages include market introduction, growth, mature, and saturation. This really isn’t the PDLC that I’m interested in. Under new product development (NDP) we find a defintion more akin to PDLC that includes the complete process of bringing a new product to market and includes the following steps: idea generation, idea screening, concept development, business analysis, beta testing, technical implementation, commercialization, and pricing.
Under SDLC, Wikipedia doesn’t let us down and explains it as a structure imposed on the development of software products. In the article are references to multiple different models including the classic waterfall as well as agile, RAD, and Scrum and others.
In my mind the PDLC is the overarching process of product development that includes the business units. The SDLC is the specific steps within the PDLC that are completed by the technical organization (product managers included). An image on HBSC’s site that doesn’t seem to have any accompanying explanation depicts this very well graphically.
Another way to explain the way I think of them is to me all professional software projects are products but not all product development includes software development. See the Venn diagram below. The upfront (bus analysis, competitive analysis, etc) and back end work (infrastructure, support, depreciation, etc) are part of the PDLC and are essential to get the software project created in the SDLC out the door successfully. There are non-software related products that still require a PDLC to develop.
Do you use them interchangeably? What do you think the differences are?
April 3, 2017 | Posted By: AKF
Two of our previous articles, Splitting Databases for Scale and Splitting Applications or Services for Scale have made references to a concept that we call “Swimlaning Architectures”. We sometimes also call “swimlanes” fault isolation zones or fault isolated architecture.
The basics of this concept are covered in our two previous posts, but we have not spent a lot of time discussing the reasons for such a split or approach in technology architecture.
In our definition, a “Swimlane” is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of said boundary. The benefit of such a failure domain is twofold:
1) Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain.
2) Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. As such, and depending upon approach only a portion of users or a portion of functionality of the product is affected.
A “swimlaned” architecture is one in which each failure domain is completely isolated. In order to achieve this, ideally there are no calls between swimlanes or failure domains. Synchronous calls are absolutely forbidden in this type of architecture as any synchronous call between failure domains, even with appropriate timeout and detection mechanisms is very likely to cause a series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a call to any other service in another domain, to any service outside of the domain, or if the domain receives calls from other domains or services.
It is acceptable, but not advisable, to have asynchronous calls between domains. If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not call port overloads on any services. Here is an interesting blog post about runaway scripts and their impact on Apache, PHP, and MySQL.
As we have previously indicated, a swimlane should have all of its services located within the failure domain. For instance, if database accesses are necessary the database with all appropriate information for that swimlane should exist within the same failure domain as all of the application and webservers necessary to perform the function or functions of the swimlane. Furthermore, that database should not be used for other requests of service from other swimlanes. Our rule is one production database on one host.
As we have indicated with our Scale Cube in the past, there are many ways in which to think about swimlaned architectures. You can think about them in terms of a separation of services e.g. “login” and “shopping cart” (two separate swimlanes) each having the web and app servers as well as all data stores located within the swimlane and answering only to systems within that swimlane. Corresponding to the Scale Cube we have previously introduced this would be a “Y” axis swimlane.
Another approach would be to perform a separation of your customer base or a separation of your order numbers or product catalog.
Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z axis swimlane along customer, order number or product id lines.
Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform.
April 3, 2017 | Posted By: AKF
We often use the term minimum viable product or MVP but do we all agree on what it means? In the Scrum spirt of Definition of Done, I believe the Definition of MVP is worth stating explicitly within your tech team. A quick search revealed these three similar yet different definitions:
- A minimum viable product (MVP) is a development technique in which a new product or website is developed with sufficient features to satisfy early adopters. The final, complete set of features is only designed and developed after considering feedback from the product’s initial users. Source: Techopedia
- In product development, the minimum viable product (MVP) is the product with the highest return on investment versus risk…A minimum viable product has just those core features that allow the product to be deployed, and no more. Source: Wikipedia
- When Eric Ries used the term for the first time he described it as: A Minimum Viable Product is that version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort.
I personally like a combination of these definitions. I would choose something along the lines of:
A minimum viable product (MVP) has sufficient features to solve a problem that exists for customers and provides the greatest return on investment reducing the risk of what we don’t know about our customers
Just like no two teams implement Agile the same way, we don’t all have to agree on the definition of MVP but all your team members should agree. Otherwise, what is an MVP to one person is a full featured product to another. Take a few minutes to discuss with your crossfunctional agile team and come to a decision on your Definition of MVP
April 3, 2017 | Posted By: AKF
In many of our engagements, we find ourselves helping our clients understand when it’s appropriate to build and when they should buy.
If you perform a simple web search for “build v. buy” you will find hundreds of articles, process flows and decision trees on when to build and when to buy. Many of these are costcentric decisions including discounted cash flows for maintenance of internal development and others are focused on strategy. Some of the articles blend the two.
Here is a simple set of questions that we often ask our customers to help them with the build v. buy decision:
1. Does this “thing” (product / architectural component / function) create strategic differentiation in our business?
Here we are talking about whether you are creating switching costs, lowering barriers to exit, increasing barriers to entry, etc that would give you a competitive advantage relative to your competition. See Porter’s Five Forces for more information about this topic. If the answer to this question is “No – it does not create competitive differentiation” then 99% of the time you should just stop there and attempt to find a packaged product, open source solution, or outsourcing vendor to build what you need. If the answer is “Yes”, proceed to question 2.
2. Are we the best company to create this “thing”?
This question helps inform whether you can effectively build it and achieve the value you need. This is a “core v. context” question; it asks both whether your business model supports building the item in question and also if you have the appropriate skills to build it better than anyone else. For instance, if you are a social networking site, you *probably* don’t have any business building relational databases for your own use. Go to question number (3) if you can answer “Yes” to this question and stop here and find an outside solution if the answer is “No”. And please, don’t fool yourselves – if you answer “Yes” because you believe you have the smartest people in the world (and you may), do you really need to dilute their efforts by focusing on more than just the things that will guarantee your success?
3. Are there few or no competing products to this “thing” that you want to create?
We know the question is awkwardly worded – but the intent is to be able to exit these four questions by answering “yes” everywhere in order to get to a “build” decision. If there are many providers of the “thing” to be created, it is a potential indication that the space might become a commodity. Commodity products differ little in feature sets over time and ultimately compete on price which in turn also lowers over time. As a result, a “build” decision today will look bad tomorrow as features converge and pricing declines. If you answer “Yes” (i.e. “Yes, there are few or no competing products”), proceed to question (4).
4. Can we build this “thing” cost effectively?
Is it cheaper to build than buy when considering the total lifecycle (implementation through endoflife)
of the “thing” in question? Many companies use cost as a justification, but all too often they miss the key points of how much it costs to maintain a proprietary “thing”, “widget”, “function”, etc. If your business REALLY grows and is extremely successful, do you really want to be continuing to support internally developed load balancers, databases, etc. through the life of your product? Don’t fool yourself into answering this affirmatively just because you want to work on something neat. Your job is to create shareholder value – not work on “neat things” – unless your “neat thing” creates shareholder value.
There are many more complex questions that can be asked and may justify the building rather than purchasing of your “thing”, but we feel these four questions are sufficient for most cases.
A “build” decision is indicated when the answers to all 4 questions are “Yes”.
We suggest seriously considering buying or outsourcing (with appropriate contractual protection when intellectual property is a
concern) anytime you answer “No” to any question above.
March 23, 2017 | Posted By: AKF
The caller ID was blocked but Marty had been expecting the call. Three “highly connected” people – donors, political advisers and “inner circle” people – had suggested AKF could help. It was October 2013 and Healthcare.gov had launched only to crash when users tried to sign up. President Obama appointed Jeffrey Zients to mop up the post launch mess. Once the crisis was over, the Government Accountability Office (GAO) released its postmortem citing inadequate capacity planning, software coding errors, and lack of functionality as root causes. AKF’s analysis was completely different – largely because we think differently than most technologists. While our findings indicated the bottlenecks that kept the site from scaling, we also identified failures in leadership and a dysfunctional organization structure. These latter, and more important, problems prevented the team from identifying and preventing recurring issues.
We haven’t always thought differently. Our early focus in 2007 was to help companies overcome architectural problems related to scale and availability. We’ve helped our clients solve some of the largest and challenging problems ever encountered – cyber Monday ecommerce purchasing, Christmas day gift card redemption, and April 15th tax filings. But shortly after starting our firm, we realized there was something common to our early engagements that created and sometimes turbocharged the technology failures. This realization, that people and processes – NOT TECHNOLOGY– are the causes of most failures led us to think differently. Too often we see technology leaders focusing too much on the technology and not enough on leading, growing, and scaling their teams.
We challenge the notion that technology leaders should be selected and promoted based on their technical acumen. We don’t accept that a technical leader should spend most of her time making the biggest technical decisions. We believe that technical executives, to be successful, must first be a business executive with great technical and business acumen. We teach teams how to analyze and successfully choose the appropriate architecture, organization, and processes to achieve a business outcome. Product effort is meaningless without a measurable and meaningful business outcome and we always put outcomes, not technical “religion” first.
If we can teach a team the “AKF way” the chance of project and business success increases dramatically. This may sound like marketing crap (did we mention we are also irreverent?), but our clients attest to it. This is what Terry Chabrowe, CEO eMarketer, said about us:
AKF served as our CTO for about 8 months and helped us make huge improvements in virtually every area related to IT and engineering. Just as important, they helped us identify the people on our team who could move into leadership positions. The entire AKF team was terrific. We’d never have been able to grow our user base tenfold without them.
A recent post claimed that 93% of successful companies abandon their original strategy. This is certainly true for AKF. Over the past 10 years we’ve massively changed our strategy of how we “help” companies. We’ve also quadrupled our team size, worked with over 350 companies, written three books, and most importantly made some great friendships. Whether you’ve read our books, engaged with our company, or connected with us on social media, thanks for an amazing 10 years. We look forward to the next 10 years, learning, teaching, and changing strategies with you.
March 21, 2017 | Posted By: AKF
Read about one of our success stories in which we filled an interim CTO role at a marketing subscription company in New York.
AKF Long-term Case Study
< 1 2 3