April 27, 2019 | Posted By: Marty Abbott
This article is the fourth in a multi-part series on microservices (micro-services) anti-patterns. The introduction of the first article, Service Calls In Series, covers the benefits of splitting services (as in the case of creating a microservice architecture) and many of the mistakes or failure points teams create in service splits. Articles two and three cover anti-patterns for service and data fan out respectively.
The Service Fuse, the topic of this microservice anti-pattern, exists when two or more unique services share a commonly deployed service pool. When the shared service “C” fails, service A and B fail as well. Similarly, when service “C” becomes slow, slowness under high demand propagates to services A and B.
As is the case with any group of services connected in series, Service A’s theoretical availability is the product of its individual availability and the availability of service C. Service B’s theoretical availability is calculated similarly. Under unusual conditions, the availability of A could also impact B, similar to the way in which service fan out works. Such would be the case if A somehow holds threads for C, thereby starving C of threads to serve B.
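As a worked illustration of the series relationship described above (the 99.9% figures are assumed for the example and are not measurements of any real system):

```latex
\begin{aligned}
A_{A}^{\text{eff}} &= A_A \times A_C, \qquad A_{B}^{\text{eff}} = A_B \times A_C \\
A_{A}^{\text{eff}} &= 0.999 \times 0.999 = 0.998001 \approx 99.8\% \quad \text{(assumed values)}
\end{aligned}
```

In this sketch, sharing service C costs Service A roughly a tenth of a percentage point of theoretical availability before any thread starvation effects are considered.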
Because overall availability is negatively impacted, we consider the Service Fuse to be a microservice anti-pattern.
The easiest and most common method to fault isolate the failure and response time propagation of Service C is to deploy it separately (in separate pools) for both Service A and B. In doing so, we ensure that C does not fail for one service as a result of unusual demand from the other. We also isolate failures due to unique requests that might be made by either A or B. In doing so, we do incur some additional operational costs and additional coordination and overhead in releases. But assuming proper automation, the availability and response time improvements are often worth the minor effort.
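A minimal sketch of the separate-pool idea, using in-process thread pools as a stand-in for separately deployed pools of service C. The names `service_c` and `DedicatedPools` are hypothetical and exist only for this illustration; a real deployment would isolate hosts or containers rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def service_c(request: str) -> str:
    # Hypothetical stand-in for the work that shared service C performs.
    return f"C handled {request}"


class DedicatedPools:
    """One isolated worker pool of service C per calling service."""

    def __init__(self, sizes: dict[str, int]) -> None:
        self._pools = {
            caller: ThreadPoolExecutor(
                max_workers=size, thread_name_prefix=f"C-for-{caller}"
            )
            for caller, size in sizes.items()
        }

    def call(self, caller: str, request: str) -> str:
        # A caller can only occupy workers in its own pool, so unusual
        # demand from A cannot starve B of capacity (and vice versa).
        return self._pools[caller].submit(service_c, request).result()

    def shutdown(self) -> None:
        for pool in self._pools.values():
            pool.shutdown()


pools = DedicatedPools({"A": 4, "B": 4})
reply_a = pools.call("A", "req-1")
reply_b = pools.call("B", "req-2")
pools.shutdown()
```

The design choice is simply that the pool is selected by caller, which is what keeps a failure or slowdown triggered by one service from propagating to the other.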
As with many of our other anti-patterns, we can also employ dynamically loadable libraries rather than separate service deployments. While this approach carries some of the same slight overhead (again assuming proper automation) as the separate service deployments above, it often also benefits from significantly lower server-side response times because calls no longer incur network transit.
We often see teams overemphasizing the cost of additional deployments. But the separate service deployment or dynamically loadable library deployment seldom results in significantly greater effort. Splitting the capacity of a shared pool relative to the demand split between services A and B (e.g. 50/50, 90/10, etc.) and adding a small number of additional services for capacity is the real implication of such a split. Is 5 to 10% additional operational cost and seconds of additional deployment time worth the significant increase in availability? Our experience is that most of the time it is.
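The capacity arithmetic above can be sketched as follows. The function name `split_pool`, the pool size of 20, and the headroom of one extra instance per pool are assumptions chosen for illustration, not recommendations.

```python
def split_pool(
    total_instances: int, demand_percent: dict[str, int], headroom: int = 1
) -> dict[str, int]:
    """Size one dedicated pool per caller in proportion to its demand share."""
    if sum(demand_percent.values()) != 100:
        raise ValueError("demand percentages must sum to 100")
    return {
        # Integer ceiling of total * pct / 100, plus assumed spare capacity.
        caller: (total_instances * pct + 99) // 100 + headroom
        for caller, pct in demand_percent.items()
    }


pools_90_10 = split_pool(20, {"A": 90, "B": 10})
extra_instances = sum(pools_90_10.values()) - 20
```

With these assumed inputs, a shared pool of 20 becomes pools of 19 and 3, which is 2 additional instances, or 10% more capacity than the shared pool.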
April 21, 2019 | Posted By: Pete Ferguson
Results = Results
Apple, Google, and Amazon don’t exist based on a Utopian promise of what is to come – though certainly those promises keep their customers engaged and hopeful for the future. These companies exist because of the value they have delivered to date and created expectations for us as consumers for a consistent result.
I’m amazed at how simple of a concept Results = Results is – yet constantly we see companies struggle with the concept and we see it as a recurring theme in our 2-3 day workshops with our clients and something we look for in our technical due diligence reviews.
As a corporate survivor of 18 years, looking back I can see where I was distracted by day-to-day meetings, firefighting, and getting hijacked by initiatives that seemed urgent to some senior leader somewhere – but were not really all that important.
Suddenly the quarter or half was over and it was time to do a self-evaluation and realize all the effort, all the stress, all the work, wasn’t getting the desired results I’d committed to earlier in the year and I’d have to quickly shuffle and focus on getting stuff done.
While keeping the lights on is important, it diminishes in importance when doing so comes at the expense of innovating and adding value for our customers – not just struggling to maintain the status quo.
Outcomes and Key Results (OKRs)
Adapted from John Doerr’s “Objectives” and key results – at AKF we find it more to the point to focus on “outcomes.” Objectives (definition: a thing aimed at or sought) are a path, whereas “outcomes” are a destination defined clearly enough that you know when you have arrived.
Outcomes are the only things that matter to our customers. Hearing about a desired Utopian state is great and may excite customers to stick around for a while and put up with current limitations or lack of functionality – but being able to clearly define that you have delivered an outcome and the value to your customers is money in the bank and puts us ahead of our competition.
Yet the majority of our clients have teams who are so focused on cost-cutting for many years that they leave a wide open berth for young startups and their competition to move in and start delivering better outcomes for the customer.
How to Focus on Results and Outcomes
It is easy to become distracted in the day-to-day meetings, incident escalations, post mortems, etc. As an outside third party, however, it is blatantly obvious to us, usually within the first hour of meeting with a new team, whether or not they are properly focused.
Here are some of the common themes and questions to ask:
- Is there effective monitoring to discover issues before our customers do?
- Do we monitor business metrics and weigh the success (and failure) of initiatives based not on pushing out a new platform or product but whether or not there was significant ROI?
- How much time is spent limping along to keep a legacy application up and running vs. innovating?
- Do we continually push off hardware/software upgrades until we are held hostage by compliance and/or end-of-life serviceability by the vendor?
Hopefully the common theme here is obvious – what is the customer experience and how focused are we on them vs internal castle building or day-to-day distractions?
Recently, in a team interview, the IT “keep the lights on” team told us they were working to be strategic and innovative by hiring new interns. While the younger generations are definitely less prone to accepting the status quo, the older generation is conceding that it doesn’t want to be part of the future. Unfortunately, its members may find themselves left out of that future sooner than planned if they don’t grasp their role in driving innovation and the importance of applying their institutional knowledge.
Not focusing on customer/shareholder related outcomes means that shareholders and customers are negatively impacted. Here are a few problems with the associated outcomes I’ve seen in my short tenure with AKF and previously as a corporate crusader:
Monolithic applications to save costs: Why organizations do it? Short term cost savings focus development on one application. Allows teams to only focus on development of their one area.
- One failure means everyone fails.
- Organizations are unable to scale vis-a-vis Conway’s Law (organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations).
- Often the teams who develop the monolith don’t have to support it, so they don’t understand why it is a problem.
- Teams become very focused on solving the problems caused by the monolith just long enough to get it back up and running but fail to see the long-term recurrent loss to the business and wasted hours that could have been spent on innovating new products and services.
- Catastrophic failure - Intuit pre SaaS, early renditions of iTunes and annual outages when everyone tried to redeem gift cards Christmas morning, early days of eBay, stay tuned, many more yet to come.
Ongoing cost cutting to “make the quarter.”
- Missed tech refresh results in machines and operating systems no longer supported and vulnerable to external attacks.
- Teams become hyper focused on shutting down additional spending, but never take the time to calculate how much wasted effort is spent on keeping the lights on for aging systems with a declining market share or slowed new customer adoption rate.
- Start saying no to the customer based on cost, opening the door for new upstarts and the competition to take away market share.
Focusing efforts on Sales Department’s latest contract.
- Too much investment in legacy applications instead of innovating new products.
- “A-team” developers become firefighters to keep customers happy.
- Sales team creates moral hazards for development teams (i.e. “I smoke, but you get lung cancer” - teams create problems for other teams to fix instead of owning the end-to-end lifecycle of a product)
Focus is on mergers and acquisitions instead of core strengths and products.
- Distracted organizations give way for upstarts and competition.
- Become okay or maybe even good at a lot of things but not great at one or two things.
- Company culture becomes very fragmented and silos create red tape that slows or stifles innovation.
Results = Results. And nothing else equals results.
If OKRs are not measuring the results needed to compete and win, then teams are wasting a lot of effort, time, and money and the competition is getting a free pass to innovate and outperform your ability to delight and please your customers.
Need an outside view of your organization to help drive better results and outcomes? Contact us!
Photo by rawpixel.com from Pexels
April 21, 2019 | Posted By: Marty Abbott
This article is the third in a multi-part series on microservices (micro-services) anti-patterns. The introduction of the first article, Service Calls In Series, covers the benefits of splitting services, many of the mistakes or failure points teams create in services splits and the first anti-pattern. The second article, Service Fan Out, discusses the anti-pattern of a single service acting as a proxy or aggregator of multiple services.
Data Fan Out, the topic of this microservice anti-pattern, exists when a service relies on two or more persistence engines with categorically unique data, or categorically similar data that is not meant to be processed in parallel. “Categorically Unique” means that the data is in no way related. Examples of categorical uniqueness would be a database that stores customer data and a separate database that stores catalog data. Instances of the same data, such as two separate databases each storing half of product catalog, are not categorically unique. Splitting of similar data is often known as sharding. Such “sharded” instances only violate the Data Fan Out pattern if:
1) They are accessed in series (database 1 is accessed and subsequently database 2 is accessed) –or-
2) A failure or slowness in either database, even if accessed in parallel, will result in a very slow or unavailable service.
Persistence engine means anything that stores data as in the case of a relational database, a NoSQL database, a persistent off-system cache, etc.
Anytime a service relies on more than one persistence engine to perform a task, it is subject to lower availability and a response time equivalent to the slowest of the N data stores to which it is connected. Like the Service Fan Out anti-pattern, the availability of the resulting service (“Service A”) is the product of the availability of the service and its constituent infrastructure multiplied by the availability of each of the N data stores to which it is connected.
Further, the response time of the service may be tied to the runtime of Service A added to the response time of the slowest of the connected data stores. If any of the N databases become slow enough, Service A may not respond at all.
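A rough model of the two effects described above. The function names and every number below are assumptions for illustration; real availability and latency figures should come from measurement.

```python
from math import prod


def fan_out_availability(service: float, stores: list[float]) -> float:
    """Availability of Service A when every data store is required."""
    return service * prod(stores)


def fan_out_response_ms(
    service_ms: float, store_ms: list[float], parallel: bool = True
) -> float:
    """Service runtime plus the wait on the data stores.

    Parallel access waits for the slowest store; access in series
    waits for every store in turn.
    """
    wait = max(store_ms) if parallel else sum(store_ms)
    return service_ms + wait


availability = fan_out_availability(0.999, [0.999, 0.999])
parallel_ms = fan_out_response_ms(20, [5, 80])
series_ms = fan_out_response_ms(20, [5, 80], parallel=False)
```

With the assumed inputs, three components at 99.9% each yield roughly 99.7% overall, and one slow store (80 ms) dominates the response time even when the other answers in 5 ms.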
Because overall availability is negatively impacted, we consider Data Fan Out to be a microservice anti-pattern.
One clear exception to the Data Fan Out anti-pattern is the highly parallelized querying done of multiple shards for the purpose of getting near linear response times out of large data sets (similar to one component of the MapReduce algorithm). In a highly parallelized case such as this, we propose that each of the connections have a time-out set to disregard results from slowly responding data sets. For this to work, the result set must be impervious to missing data. As an example of an impervious result set, having most shards return for any internet search query is “good enough”. A search for “plumber near me” returns 19/20ths of the “complete data”, where one shard out of 20 is either unavailable or very slow. But having some transactions not present in an account query of transactions for a checking account may be a problem and therefore is not an example of a resilient data set.
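A minimal sketch of the time-out approach proposed above, using `asyncio` from the Python standard library. `query_shard` is a hypothetical stand-in for a real shard query, and the delays and time-out value are assumed for illustration.

```python
import asyncio


async def query_shard(shard_id: int, delay_s: float) -> list[str]:
    # Hypothetical shard lookup; the sleep stands in for network and query time.
    await asyncio.sleep(delay_s)
    return [f"result-from-shard-{shard_id}"]


async def scatter_gather(
    delays: dict[int, float], timeout_s: float
) -> tuple[list[str], list[int]]:
    """Query all shards in parallel and disregard any that miss the time-out."""
    tasks = {
        shard: asyncio.create_task(query_shard(shard, delay))
        for shard, delay in delays.items()
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout_s)
    for task in pending:
        task.cancel()  # Slow shards are dropped rather than awaited.

    results: list[str] = []
    missing: list[int] = []
    for shard, task in tasks.items():
        if task in done and task.exception() is None:
            results.extend(task.result())
        else:
            missing.append(shard)
    return results, missing


results, missing = asyncio.run(
    scatter_gather({0: 0.0, 1: 0.0, 2: 5.0}, timeout_s=0.5)
)
```

Returning the list of missing shards alongside the results lets the caller decide whether a partial answer is "good enough", as in the search example, or unacceptable, as in the checking account example.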
Our preferred approach to resolve the Data Fan Out anti-pattern is to dedicate services to each unique data set. This is possible whenever the two data sets do not need to be merged and when the service is performing two separate and otherwise isolatable functions (e.g. “Customer_Lookup” and “Catalog_Lookup”).
When data sets are split for scale reasons, as is the case with data sets that have both an incredibly high volume of requests and a large amount of data, one can attempt to merge the queried data sets in the client. The browser or mobile client can request each dataset in parallel and merge if successful. This works when computational complexity of the merge is relatively low.
When client-side merging is not possible, we turn to the X Axis of the Scale Cube for resolution. Merge the data sets within the data store/persistence engine and rely on a split of reads and writes. All writes occur to a single merged data store, and read replicas are employed for all reads. The write and read services should be split accordingly and our infrastructure needs to correctly route writes to the write service and reads to the read service. This is a valuable approach when we have high read-to-write ratios – fortunately the case in many solutions. Note that we prefer to use asynchronous replication and allow the “slave” solutions to be “eventually consistent” - but ideally still within a tolerable time frame of milliseconds or a handful of seconds.
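The routing requirement above can be sketched as follows. `ReadWriteRouter` and the host names are hypothetical; in practice this logic usually lives in a proxy, driver, or service mesh rather than in application code.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the single primary and spread reads across replicas."""

    WRITE_OPERATIONS = {"INSERT", "UPDATE", "DELETE"}

    def __init__(self, primary: str, replicas: list[str]) -> None:
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, operation: str) -> str:
        if operation.upper() in self.WRITE_OPERATIONS:
            return self.primary
        return next(self._replicas)


router = ReadWriteRouter("primary-db", ["replica-1", "replica-2"])
write_target = router.route("UPDATE")
read_targets = [router.route("SELECT") for _ in range(3)]
```

Because replication is asynchronous, a read routed to a replica immediately after a write may return slightly stale data; the sketch does nothing to hide that, which is consistent with accepting eventual consistency.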
What about the case where a solution may have a high write to read ratio (exceptionally high writes), and data needs to be aggregated? This rather unique case may be best solved by the Z axis of the AKF Scale Cube, splitting transactions along customer boundaries but ensuring the unification of the database for each customer (or region, or whatever “shard key” makes sense). As with all Z axis shards, this not only allows faster response times (smaller data segments) but engenders high scalability and availability while also allowing us to put data “closer to the customer” using the service.
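One way to picture a Z axis split is a function that maps a shard key to a shard. The sketch below hashes a customer identifier; the function name `shard_for` and the use of a hash with modulo are assumptions for illustration, and a lookup table or region-based mapping may fit better in practice.

```python
import hashlib


def shard_for(customer_id: str, shard_count: int) -> int:
    """Map a customer to one shard so all of that customer's data stays together."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return int.from_bytes(digest[:8], "big") % shard_count


first_lookup = shard_for("customer-42", 4)
second_lookup = shard_for("customer-42", 4)
```

A cryptographic hash is used here only because it is stable across processes and restarts. Note that changing `shard_count` with a plain modulo scheme moves most customers to a different shard, which is one reason teams often prefer a directory of customer-to-shard assignments.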
AKF Partners helps companies create highly available, highly scalable, easily maintained and easily developed microservice architectures. Give us a call - we can help!
April 19, 2019 | Posted By: Eric Arrington
Time it takes to boil an egg: 720,000 milliseconds
Average time in line at the supermarket: 240,000 milliseconds
Time it takes to brush your teeth: 120,000 milliseconds
Time it takes to make a sandwich: 90,000 milliseconds
In our everyday lives we aren’t used to measuring things in milliseconds. In the software world our users’ expectations are different. Milliseconds matter. A lot.
The average person may wait 240,000 milliseconds to check out at the grocery store but is not as likely to wait that long to check out on an e-commerce site.
What Is Latency
Latency is how fast we get an answer back after making a request to the server.
It’s how long it takes for a request to go from the browser to the server and back to the browser.
Spoiler Alert: Faster is better.
Latency vs Bandwidth
I often see the words latency and bandwidth used together – or even interchangeably – but they have two very different meanings.
Using the metaphor of a restaurant, bandwidth is the amount of seating available. The more seating the restaurant has, the more people it can serve at one time. If a restaurant wants to be able to serve more people in a certain time period they add more seating. Similarly, bandwidth is the maximum amount of data that can be transferred in a specific measure of time.
If bandwidth is the maximum number of diners that can fit in a restaurant at one time, then latency is the amount of time it takes for food to arrive after ordering. On the Internet, latency is a measure of how long it takes for a user to get a response from an action like a click. It is the “performance lag” the user feels while using our product.
Luckily, over the past 20 years, the bandwidth and capacity of memory have increased dramatically. Unfortunately, latency hasn’t improved nearly as much over the same 20 years.
Latency is directly linked to the “experience” the end user has with our products or services. If our latency isn’t minimized then we are leaving money on the table!
- Amazon did a study that found for every 100ms of latency it cost them 1% in sales.
- Google discovered that for every 500ms they took to show search results, traffic dropped 20%.
Even more shocking is a study done by the TABB Group. The study estimated the outcome of a broker’s electronic trading platform being just 5 milliseconds behind the competition. According to their estimate, this 5 millisecond delay could cost $4 million in revenue per millisecond. Their study also concluded that if an electronic broker is 100 milliseconds behind the competition they might as well shut down and become a floor broker.
100ms can be the difference between strategic advantage and second or third place.
100ms Rule of Latency – Paul Buchheit (Gmail Creator)
How fast is 100ms? Paul Buchheit coined The 100ms Rule. The rule states that every interaction should be faster than 100ms. Why? 100ms is the threshold “where interactions feel instantaneous.”
What Causes Latency
Finding the cause of all of our latency isn’t always an easy task. There are a lot of possible causes. For the most part we can borrow the Pareto Principle (80/20 Rule) and knock out the usual suspects.
Propagation is how long it takes information to travel. In a perfect world our request travels at the speed of light. Also in a perfect world milkshakes would be good for us. For various reasons, our packet won’t travel at the speed of light.
Even if it did travel at the speed of light, the distance between our server and our web user still matters.
Packets traveling from one side of the world to the other and back would add about 250ms of latency. Unfortunately our data doesn’t travel “as the crow flies.” The paths rarely travel in a straight line (especially if using a VPN). This adds a lot more distance for the request to travel.
Remember when I said it wasn’t a perfect world? This is what I was talking about. The material data cables are made of affects the speed of propagation. Different materials have different limitations on the speed.
For example, light can travel from New York to San Francisco in about 14ms (in a vacuum). Inside a fiber cable it takes about 21ms.
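The New York to San Francisco figures can be reproduced with a quick back-of-the-envelope calculation. The distance and the refractive index below are approximate values assumed for this sketch; the speed of light is the defined constant.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in a vacuum
NY_TO_SF_KM = 4_130                # assumed approximate straight-line distance
FIBER_REFRACTIVE_INDEX = 1.5       # assumed round figure for optical fiber


def one_way_delay_ms(distance_km: float, refractive_index: float = 1.0) -> float:
    """Propagation delay over a straight path through a given medium."""
    speed_km_s = SPEED_OF_LIGHT_KM_S / refractive_index
    return distance_km / speed_km_s * 1000


vacuum_ms = one_way_delay_ms(NY_TO_SF_KM)
fiber_ms = one_way_delay_ms(NY_TO_SF_KM, FIBER_REFRACTIVE_INDEX)
```

Under these assumptions the one-way delay works out to roughly 14ms in a vacuum and roughly 21ms in fiber, before any routing detours or network hops are added.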
For the most part, data travels fast across long distances. The cabling used over longer distances is usually faster. The last mile is usually the slowest. One reason for this is that the cabling in buildings, homes, and commercial areas tends to be existing wiring like coaxial cable or copper. Another reason is explained in the next point: your data changes hands as it gets back to you (i.e. your router).
Consider yourself lucky if you have fiber installed. Copper and coaxial cables are slower. 4G can add up to 100ms to the latency. We won’t even talk about satellite.
It would be great if our data went straight from our device to the server and back, but again, probably not going to happen. As our packet travels to the server and back to the source it travels through different network devices. The request passes through routers, bridges, and gateways. Each time our data is handed off to the next device, a “network hop” occurs.
These hops can add more latency than distance does. A request that travels 100 miles but makes 5 hops can have more latency than a request that travels 2500 miles with only 2 hops.
The more hops are in the line, the more latency.
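A simple model makes the trade-off between hops and distance concrete. Everything here is an assumption for illustration: the function name, the fiber speed of about 200 km per millisecond, and especially the per-hop delay, which varies widely with equipment and congestion.

```python
def path_latency_ms(
    distance_km: float, hops: int, per_hop_ms: float, km_per_ms: float = 200.0
) -> float:
    """One-way latency as propagation time plus a fixed delay per hop."""
    return distance_km / km_per_ms + hops * per_hop_ms


# About 100 miles with 5 hops versus about 2500 miles with 2 hops.
short_many_hops = path_latency_ms(161, hops=5, per_hop_ms=10.0)
long_few_hops = path_latency_ms(4023, hops=2, per_hop_ms=10.0)
```

With an assumed 10ms per hop, the short path is the slower one (about 51ms versus about 40ms). With a much faster assumed hop of 1ms, distance dominates instead, so the comparison depends heavily on how slow each hop really is.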
How To Lower Latency?
Latency can best be described as the sum of the previously mentioned causes and a lot more. There is no magical button we can push to achieve ultra low latency. There are a few things that can make a big dent. This is by no means an exhaustive list.
Asynchronous Development Approach
Multitasking as a developer is a bad idea. Making software multitask is a great idea. Whenever possible, make calls asynchronous (multiple calls executed at the same time). This can make a huge difference in latency (and perceived latency which can be just as important, we’ll talk more about that shortly).
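A minimal sketch of making calls concurrently with `asyncio.gather`. The service names and the 100ms delays are hypothetical placeholders for real downstream calls.

```python
import asyncio
import time


async def call_service(name: str, delay_s: float) -> str:
    # The sleep stands in for network and processing time.
    await asyncio.sleep(delay_s)
    return f"{name}: ok"


async def fetch_all() -> list[str]:
    # All three calls are in flight at once, so the total wait is close
    # to the slowest call rather than the sum of all three.
    return await asyncio.gather(
        call_service("profile", 0.1),
        call_service("orders", 0.1),
        call_service("recommendations", 0.1),
    )


def timed_fetch() -> tuple[list[str], float]:
    start = time.perf_counter()
    responses = asyncio.run(fetch_all())
    return responses, time.perf_counter() - start


responses, elapsed_s = timed_fetch()
```

Run one after another, these three assumed calls would take about 300ms; issued together they finish in a little over 100ms.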
Make Fewer External Requests
If we know that the trip to and from the database adds latency, then let’s go less often. There are many ways we can do this. Here are a few:
- Use image sprites
- Eliminate images that don’t contribute to overall product
- Use inline svg code instead of images for icons and logos
- Combine and minify all HTML, CSS, and JS files
There are times we need to reference files with an external HTTP request. If we don’t control those resources then there is little we can do. We can, however, evaluate and reduce the number of external services we use.
Z Axis split geographically
One of the things AKF Partners is known for is the Scalability Cube. We have helped hundreds of clients scale along all three axes.
If our architecture makes sense to do so, splitting along the Z Axis by geography can make a huge difference in latency. Separating the data based on the geographical location hopefully places servers closer to the end user, thus shortening the round trip. If you would like an evaluation to see if this is something that could be achieved with your current architecture, don’t hesitate to contact us.
Use a CDN
A CDN is a content delivery network. Basically it is a system of distributed servers. These servers deliver pages or other content to end users based on their geographic location. Remember shorter distances mean lower latency.
A CDN can have a huge impact on latency. I stole a few figures from the KeyCDN website to show you how big of a difference. Test site was located in Dallas, TX.
[Table: latency with No CDN (ms) vs. With CDN (ms); values missing from the source]
A content delivery network can have a major impact in reducing latency.
The details of how caching works can be complicated but the basic idea is simple. If I were to ask you what the result of 8 x 7 is, you will know right away the answer is 56. You didn’t have to think about it. You didn’t calculate it in your head. You’ve done this multiplication so many times in your life that you don’t need to. You just remember the answer. That is kind of how caching works.
If we are set up to cache a page on the server, the first time someone visits our page, it loads normally because the request is received by the server, processed, and sent back to the client as an html file. If we are set up to cache the request, then the HTML file is saved and stored - usually in RAM (which is fast). The next time we make the same request, the server doesn’t need to process anything. It simply serves the HTML page from the cache.
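The server-side behavior described above can be sketched with an in-memory dictionary. `PageCache` and `render_page` are hypothetical names for this illustration; production systems typically use a dedicated cache and need an eviction and invalidation policy, which is omitted here.

```python
from typing import Callable


def render_page(path: str) -> str:
    # Hypothetical stand-in for the expensive work of building a page.
    return f"<html><body>Content for {path}</body></html>"


class PageCache:
    """Render each page once, then serve later requests from memory."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render
        self._pages: dict[str, str] = {}
        self.render_count = 0

    def get(self, path: str) -> str:
        if path not in self._pages:
            # First visit: process the request and store the HTML.
            self.render_count += 1
            self._pages[path] = self._render(path)
        # Later visits: no processing, just return the stored HTML.
        return self._pages[path]


cache = PageCache(render_page)
first_visit = cache.get("/home")
second_visit = cache.get("/home")
```

After two requests for the same path, `render_count` is still 1, which is the whole point: the second response skips the processing step.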
We can also cache at the browser level. The first time a user visits a site, the browser will receive a bunch of assets and (if set up correctly) will cache those assets. The next time the site is visited, the browser will serve the cached assets and it will load a lot quicker.
We decide when these caches expire. Be aggressive with caching of static resources. Set the expiration date for a minimum of a month in advance. I recommend setting them for 12 months if it makes sense for that resource.
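Browser cache lifetimes are commonly set with the HTTP `Cache-Control` response header. The helper below is a hypothetical sketch that only builds the header value; how it gets attached to a response depends on the web server or framework in use.

```python
SECONDS_PER_DAY = 86_400


def cache_headers(days: int) -> dict[str, str]:
    """Build a Cache-Control header for a static asset cached for `days` days."""
    return {"Cache-Control": f"public, max-age={days * SECONDS_PER_DAY}"}


one_month = cache_headers(30)
twelve_months = cache_headers(365)
```

With lifetimes this long, a common companion practice is to change the file name (for example, by embedding a version or content hash) whenever the asset changes, so returning visitors are not stuck with a stale copy.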
Render Content on the Server Side
Rendering templates on the server (rather than dynamically on the client) can also help lower latency. Remember every trip to the database results in higher total latency. Why not pre-render the pages on the server and load the static pages on the client?
This technique doesn’t work for all applications – but content publishing sites like The Washington Post or Medium can benefit greatly from pre-rendering their content on the server side and posting static rendered content on the client.
Use Pre-fetching Methods
I almost didn’t include this tip. Technically this doesn’t lower latency at all. What it does do is lower the “perceived user latency” felt by customers.
Perceived user latency is how long the response seems to take from the user’s point of view.
A normal request has three phases. First is the time it takes the server to process the request. After that is the time it takes the network to get the response back to the client. Next is the time it takes the client to process the response and load the page.
If we pre-fetch certain items or show placeholder images while the response from the server is loading (a la Facebook) then the perceived latency felt by the user is lower. The actual latency is exactly the same but the user “feels” like it loaded faster.
We can get the same benefits of a lower latency by “gaming” latency this way.
Latency is a metric we should all be tracking. Providing a great user experience with low latency makes a difference. It keeps our customers on our applications and sites longer. It fosters retention. Most importantly it will increase conversion rate.
If you aren’t currently tracking your latency as a metric, take that first step and see where you are at. If you need help, let us know and we’d be happy to schedule a call.
April 8, 2019 | Posted By: Marty Abbott
This article is the second in a multi-part series on microservices (micro-services) anti-patterns. The introduction of the first article, Service Calls In Series, covers the benefits of splitting services, many of the mistakes or failure points teams create in services splits and the first anti-pattern.
Fan Out, the topic of this microservice anti-pattern, exists when one service either serves as a proxy to two or more downstream services, or serves as an integration of two subsequent service calls. Any of the services (the proxy/integration service “A”, or constituent services “B” and “C”) can cause a failure of all services. When service A fails, service B and C clearly can’t be called. If either service B or C fails or becomes slow, they can affect service A by tying up communication ports. Ultimately, under high call volume, service A may become unavailable due to problems with either B or C.
Further, the response of the services may be tied to the slowest responding service. If A needs both B and C to respond to a request (as in the case of integration), then the speed at which A responds is tied to the slowest response times of B and C. If service A merely proxies B or C, then extreme slowness in either may cause slowness in A and therefore slowness in all calls.
Because overall availability is negatively impacted, we consider Service Fan Out to be a microservice anti-pattern.
One approach to resolve the above anti-pattern is to employ true asynchronous messaging between services. For this to be successful, the requesting service A must be capable of responding to a request without receiving any constituent service responses. Unfortunately, this solution only works in some cases such as the case where service B is returning data that adds value to service A. One such example is a recommendation engine that returns other items a user might like to purchase. The absence of service B responding to A’s request for recommendations is unfortunate but doesn’t eliminate the value of A’s response completely.
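The key requirement above is that service A can answer without service B. The sketch below illustrates that requirement with a time-out and a fallback, which is a simpler stand-in and not true asynchronous messaging (that would typically involve a queue or message bus). All names, delays, and the time-out value are assumptions.

```python
import asyncio


async def recommendations(user_id: str, delay_s: float) -> list[str]:
    # Hypothetical service B; the sleep stands in for its response time.
    await asyncio.sleep(delay_s)
    return ["item-1", "item-2"]


async def product_page(
    user_id: str, rec_delay_s: float, timeout_s: float = 0.1
) -> dict:
    """Service A: always returns the page, with recommendations if available."""
    page = {"user": user_id, "product": "widget", "recommendations": []}
    try:
        page["recommendations"] = await asyncio.wait_for(
            recommendations(user_id, rec_delay_s), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        # Service B was too slow; A still delivers its own value.
        pass
    return page


fast_page = asyncio.run(product_page("u1", rec_delay_s=0.0))
degraded_page = asyncio.run(product_page("u1", rec_delay_s=5.0))
```

In the degraded case the page still renders with an empty recommendation list, mirroring the point that a missing response from B is unfortunate but does not eliminate the value of A's response.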
As was the case with the Calls In Series Anti-Pattern, we may also be able to solve this anti-pattern with the “Libraries for Depth” pattern.
Of course, each of the libraries also represents a constituent part that may fail for any call – but the number of moving parts for each constituent part decreases significantly relative to a separately deployed service call. For instance, no network interface is required, no additional host and virtual machine (VM) is employed during the call, etc. Additionally, call latency goes down without network interfaces.
The most common complaint about this pattern is that development teams cannot release independently. But, as we all know, this problem has been fixed for quite some time with Unix, Linux and Windows dynamically loadable libraries (dlls, dls) and the like.
AKF Partners has helped to architect some of the most scalable, highly available, fault-tolerant and fastest response time solutions on the internet. Give us a call - we can help.
March 25, 2019 | Posted By: Marty Abbott
This article is the first in a multi-part series on microservices (micro-services) anti-patterns.
There are several benefits to carving up very large applications into service-oriented architectures. These benefits can include many of the following:
- Higher availability through fault isolation
- Higher organizational scalability through lower coordination
- Lower cost of development through lower overhead (coordination)
- Faster time to market achieved again through lower overhead of coordination
- Higher scalability through the ability to independently scale services
- Lower cost of operations (cost of goods sold) through independent scalability
- Lower latency/response time through better cacheability
The above should be considered only a partial list. See our articles on the AKF Scale Cube, and when you should split services for more information.
In order to achieve any of the above benefits, you must be very careful to avoid common mistakes.
Most of the failures that we see in microservices stem from a lack of understanding of the multiplicative effect of failure or “MEF”. Put simply, MEF indicates that the availability of any solution in series is a product of the availability of all components in that series.
Service A has an availability calculated by the product of its constituent parts. Those parts include all of the software and infrastructure necessary to run service A: the server availability, the application availability, associated library and runtime environment availabilities, operating system availability, virtualization software availability, etc. Let’s say those availabilities somehow achieve a “service” availability of “Five 9s” or 99.999% as measured by duration of outages. To achieve 99.999% we are assuming that we have made the service “highly available” through multiple copies, each being “stateless” in its operation.
Service B has a similar availability calculated in a similar fashion. Again, let’s assume 99.999%.
If, for a request from any customer to Service A, Service B must also be called, the two availabilities are multiplied together. The new calculated availability is by definition lower than that of either service in isolation. We move our availability from 99.999% to 99.998%.
When calls in series between services become long, availability starts to decline swiftly and by definition is always lower than the lowest availability of any service or the constituent part of any service (e.g. hardware, OS, app, etc.).
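The multiplicative effect of failure is easy to check numerically. The helper names below are hypothetical; the 99.999% figure is the one used in the example above, and the ten-service chain is an assumed extension of it.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def series_availability(availabilities: list[float]) -> float:
    """Availability of a call chain in which every service must respond."""
    return prod(availabilities)


def downtime_minutes_per_year(availability: float) -> float:
    return (1.0 - availability) * MINUTES_PER_YEAR


two_in_series = series_availability([0.99999, 0.99999])
ten_in_series = series_availability([0.99999] * 10)
```

Two "Five 9s" services in series come out at about 99.998%, and ten in series at about 99.99%, which takes expected downtime from roughly 5 minutes per year for a single service to roughly 53 minutes per year for the chain.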
This creates our first anti-pattern. Just as bulbs in the old serially wired Christmas Tree lights would cause an entire string to fail, so does any service failure cause the entire call stream to fail. Hence multiple names for this first anti-pattern: Christmas Tree Light Anti-Pattern, Microservice Calls in Series Anti-Pattern, etc.
The multiplicative effect of failure sometimes is worse with slowly responding solutions than with failures themselves. We can easily respond to failures through “heartbeat” transactions. But slow responses are more difficult. While we can use circuit breaker constructs such as Hystrix switches – these assume that we know the threshold under which our call string will break. Unfortunately, under intense flash load situations (unforeseen high demand), small spikes in demand can cause failure scenarios.
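For readers unfamiliar with the circuit breaker construct mentioned above, here is a minimal sketch of the general idea. It is not Hystrix's API; the class, its parameters, and the default thresholds are all assumptions for illustration.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Fail fast after repeated downstream failures, then retry after a pause."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock=time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open the circuit
            raise
        self._failures = 0  # a success closes the circuit again
        return result
```

The difficulty noted above shows up directly in the constructor: `failure_threshold` and `reset_after_s` have to be chosen in advance, and a sudden spike in demand can break the call string before a threshold tuned for normal load is reached.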
One pattern to resolve the above issue is to employ true asynchronous messaging between services. To make this effective, the requesting service must not care whether it receives a response. This service must be capable of responding to a request without receiving any downstream response. Unfortunately, this solution only works in some cases such as the case where service B is returning data that adds value to service A. One such example is a recommendation engine that returns other items a user might like to purchase. The absence of service B responding to A’s request for recommendations is unfortunate, but doesn’t eliminate the value of A’s response completely.
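The recommendation example can be sketched as follows. This is an illustration under stated assumptions: an in-process `queue.Queue` stands in for a real message bus, a dictionary stands in for wherever service B publishes its results, and all function names are hypothetical.

```python
import queue

# Stand-in for a message bus; a real system would use a broker (assumption).
recommendation_requests = queue.Queue()

# Results that service B published at some earlier point (assumption).
recommendation_cache = {}


def handle_product_page(user_id, product):
    """Service A: always answers, with or without B's contribution."""
    # Fire and forget: enqueue the request and do not wait for B.
    recommendation_requests.put({"user_id": user_id, "product": product})
    return {
        "product": product,
        # Empty list when B has never answered for this user.
        "recommendations": recommendation_cache.get(user_id, []),
    }


def recommendation_worker_step():
    """Service B: drain one request and publish a result for later use."""
    request = recommendation_requests.get_nowait()
    recommendation_cache[request["user_id"]] = [
        f"accessory for {request['product']}"
    ]
```

If service B never runs, `handle_product_page` still returns the product with an empty recommendation list, so A's availability no longer depends on B.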
While the above pattern can resolve some use-cases, it doesn’t resolve most of them. Most often downstream services are doing more than “modifying” value for the calling service: they are providing specific necessary functions. These functions may be mail services, print services, data access services, or even component parts of a value stream such as “add to cart” and “compute tax” during checkout.
In these cases, we believe in employing the Libraries for Depth pattern.
Of course, each of the libraries also represents a constituent part that may fail for any call – but the number of moving parts for each constituent part decreases significantly relative to another service call. For instance, no network interface is required, no additional host or virtual machine is employed during the call, etc. Additionally, call latency goes down without network interfaces.
The most common complaint about this pattern is that development teams cannot release independently. But, as we all know, this problem has been fixed for quite some time with Unix, Linux and Windows dynamically loadable libraries (DLLs, shared objects) and the like.
March 19, 2019 | Posted By: Marty Abbott
Tim Berners-Lee and his colleagues at CERN, the IETF and the W3C all understood the value of being stateless when they developed the Hypertext Transfer Protocol. Stateless systems are more resilient to multiple failure types, as no transaction needs to have information regarding the previous transaction. It’s as if each transaction is the first (and last) of its type.
First let’s quickly review three different types of state. This overview is meant to be broad and shallow. Certain state types (such as the notion of View state in .NET development) are not covered.
The Penalty (or Cost) of State
State costs us in multiple ways. State unique to a user interaction, or session state, requires memory. The larger the state, the greater the memory requirement, the higher the cost of the server and the greater the number of servers we need. As the cost of goods sold increases, margins decrease. Further, that state either needs to be replicated for high availability, at additional cost, or we face a cost of user dissatisfaction with discrete component and ultimately session failures.
When application state is maintained, the cost of failure is high as we either need to pay the price of replication for that state or we lose it, negatively impacting customer experience. As memory associated with application state increases, so does the memory requirement and associated costs of the server upon which it runs. At high scale, that means more servers, greater costs, and lower gross margins. In many cases, we simply have no choice but to allow application state. Interpreters and Java virtual machines need memory. Most applications also require information regarding their overall transactions distinct from those of users. As such, our goal here is not to eliminate application state but rather minimize it where possible.
When connection state is maintained, cost increases as more servers are required to service the same number of requests. Failures become more common as the failure probability increases with the duration of any connection over distance.
Our ideal outcome is to eliminate session state, minimize application state and eliminate connection state.
But What if I Really, Really, Really Need State?
Our experience is that simply saying “No” once or twice will force an engineer to find an innovative way to eliminate state. Another interesting approach is to challenge an engineer with a statement like “Huh, I heard the engineers at XYZ company figured out how to do this…”. Engineers hate to feel like another engineer is better than them…
We also recognize however that the complete elimination of state isn’t possible. Here are three examples (not meant to be all inclusive) of when we believe the principle of stateless systems should be violated:
Shopping carts need state to work. Information regarding a past transaction (add_to_cart, for instance) needs to be held somewhere prior to check_out. Given that we need state, now it’s just a question of where to store it. Cookies are good places. Distributed object caches are another location. Passing it through the URL in HTTP GET methods is a third. A final solution is to store it in a database.
No sane person wants to wrap debits and credits across distributed servers in a single, two-phase commit transaction. Banks have had a solution for this for years – the eventually consistent account transaction. Using a tiny workflow or state machine, debit in one transaction and eventually (ideally quickly) credit in a second transaction. That brings us to the notion of workflow and state machines in general.
What good is a state machine if it can’t maintain state? Whether application state or session state, the notion of state is critical to the success of each solution. Workflow systems are a very specific implementation of a state machine and as such require state. The trick with these is simply to ensure that the memory used for state is “just enough”. Govern against ever increasing session or application state size.
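The debit-then-credit workflow described above can be sketched as a tiny state machine that keeps “just enough” state: one enum value per transfer. This is a simplified illustration; the names are ours, and a single in-memory `balances` dictionary stands in for accounts that would really live on separate servers.

```python
from dataclasses import dataclass
from enum import Enum


class TransferState(Enum):
    PENDING = "pending"
    DEBITED = "debited"
    COMPLETED = "completed"


@dataclass
class Transfer:
    source: str
    target: str
    amount: int  # minor units (e.g. cents) to avoid float rounding
    state: TransferState = TransferState.PENDING


def debit(balances, transfer):
    """First local transaction: take the money out of the source account."""
    if transfer.state is not TransferState.PENDING:
        return  # already debited: retrying must not debit twice
    if balances[transfer.source] < transfer.amount:
        raise ValueError("insufficient funds")
    balances[transfer.source] -= transfer.amount
    transfer.state = TransferState.DEBITED


def credit(balances, transfer):
    """Second local transaction, run later and safe to retry."""
    if transfer.state is not TransferState.DEBITED:
        return  # not yet debited, or already credited: nothing to do
    balances[transfer.target] += transfer.amount
    transfer.state = TransferState.COMPLETED
```

Because each step checks the recorded state before acting, either step can be retried after a failure without double-counting, which is what lets the two halves run as separate transactions.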
This brings us to the newest cube model in the AKF model repository:
The Session State Cube
The AKF State Cube is useful both for thinking through how to achieve the best possible state posture, and for evaluating how well we are doing against an aspirational goal (top right corner) of “Stateless”.
The X axis describes size of state. It moves from very large (XL) state size to the ideal position of zero size, or “No State”. Very large state size suffers from higher cost, higher impact upon failure, and higher probability of failure.
The Y axis describes the degree of distribution of state. The worst position, lower left, is where state is a singleton. While we prefer not to have state, having only one copy of it leaves us open to large – and difficult to recover from – failures and dissatisfied customers. Imagine nearly completing your taxes only to have a crash wipe out all of your work! Ughh!
Progressing vertically along the Y axis, the singleton state object in the lower left is replicated into N copies of that state for high availability. While resolving the recovery and failure issues, performing replication is costly both in extra memory and network transit. This is an option we hope to avoid for cost reasons.
Following replication are several methods of distribution in increasing order of value. Segmenting the data by some value “N” has increasing value as N increases. When N is 2, a failure of state impacts 50% of our customers. When N is 100, only 1% of our customers suffer from a state failure. Ideally, state is also “rebuildable” if we have properly scattered state segments by a shard key – allowing customers to only have to re-complete a portion of their past work.
Finally, of course, we hope to have “no state” (think of this as division by infinite segmentation approaching zero on this axis).
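The effect of segmenting state by N can be sketched in a few lines. This is a hedged illustration: the hash-based shard key, the function names, and the assumption that customers spread evenly across segments are ours, not part of the cube model.

```python
import hashlib


def segment_for(customer_id, segment_count):
    """Map a customer to a state segment with a stable hash of the shard key."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % segment_count


def expected_impact(segment_count):
    """Share of customers affected when one of N evenly loaded segments fails."""
    return 1.0 / segment_count


def affected_customers(customer_ids, segment_count, failed_segment):
    """Customers whose state lived on the failed segment."""
    return [c for c in customer_ids
            if segment_for(c, segment_count) == failed_segment]
```

With `segment_count=2`, `expected_impact` is 0.5 (half of all customers); with `segment_count=100` it is 0.01, matching the 50% and 1% figures above.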
The Z Axis describes where we position state “physically”.
The worst location is “on the same server as the application”. While necessary for application state, placing session data on a server co-resident with the application using it doubles the impact of a failure upon application fault. There are better places to locate state, and better solutions than your application to maintain it.
A costly, but better solution from an impact perspective is to place state within your favorite database. To keep costs low, this could be an open source SQL or NoSQL database. But remember to replicate it for high availability.
A less costly solution is to place state in an object cache, off server from the application. Ideally this cache is distributed per the Y axis.
The least costly solution is to have the client (browser or mobile app) maintain state. Use a cookie, pass the state through a GET method, etc.
Finally, of course the best solution is that it is kept “nowhere” because we have no state.
The AKF State Cube serves two purposes:
- Prescriptive: It helps to guide your team to the aspirational goal of “stateless”. Where stateless isn’t possible, choose the X, Y and Z axis closest to the notion of no state to achieve a low cost, highly available solution for your minimized state needs.
- Descriptive: The model helps you evaluate numerically how you are performing with respect to stateless initiatives on a per application/service basis. Use the guide on the right side of the model to evaluate component state on a scale of 1 to 10.
AKF Partners helps companies develop world class, low cost of operations, fast time to market, stateless solutions every day. Give us a call! We can help!
March 19, 2019 | Posted By: Pete Ferguson
The Crippling Cost of Distractions
Perhaps the biggest thief of progress is the misuse of time. Not talking about laziness here – I’m referring to what we often see with our clients – losing focus on what matters most.
Young startups are able to accomplish great things in a very short amount of time because there is rarely anything else to distract them from achieving success and everyone is focused on a shared outcome. Any money they have will likely be quickly exhausted. So food, sleep, vacations, and a life are all put on hold - or at least on the back burner - until products are built, tested, released, and adopted.
As a corporate survivor of 18 years at eBay, I’ve seen a few common themes within my own experience that are also exhibited at other companies where I’ve been involved in technical due diligence, three day technical architectural workshops, and longer term engagements where I’ve been embedded as one of several technical resources. These distractions include:
- Supporting legacy applications at the cost of new innovation
Often M&A teams are highly focused on the short term – what can be accomplished this year or this quarter.
I’ve been involved in many due diligence activities as an internal consultant and as a third-party conducting an external technical due diligence. There is a lot of pressure on the M&A team to sell the deal, and in my experience I’ve seen the positives inflated and the negatives – well, negated, to get the deal done.
Often the full impact on the day-to-day operations team is either not considered or underplayed, and estimates of how much work the team can realistically get done in a week over time are overly optimistic.
Acquisition Distractions That Development Is Required to Resolve:
- Integration of Delivery (How to ship, security requirements, etc.)
- Process (Agile vs. Waterfall, what flavor of Agile, etc.)
- Product (Sharing of technologies and features)
An older example, but one that has renewed relevance with the recent focus of another activist investor, is the story of eBay’s acquisition of Skype in Q4 of 2005. For shared resources within eBay, this was a huge distraction, with already overly leveraged operations teams having to take on additional tasks. The distraction from eBay’s core mission was the true tragedy as it lost focus and opened a window for Amazon to grow 600% over the next 7 years (during one of the worst markets since the Great Depression) while eBay struggled to rebound its stock price. When eBay reemerged, Amazon had become a market leader in online commerce and has since pulled significantly ahead in many additional markets.
In our technical due diligences and three day architectural workshops, we can see how companies who were once lean and mean become quickly overwhelmed and distracted from innovation by trying to integrate a flood of acquisitions.
At winning companies, we see that M&A teams carefully evaluate:
- If the acquisition will bring immediate ROI (PayPal immediately impacted eBay’s stock price positively and was able to sustain that for several years, where Skype did the opposite)
- How much effort integrating the acquisition will require, building enough resources into the acquisition cost to accomplish a speedy induction
- How well the two companies’ cultures meld together
From an internal perspective in my time at eBay, the value Skype would bring wasn’t apparent, and externally most analysts were similarly scratching their heads. For Amazon, it was a godsend as it gave them an opportunity to bear down and focus and take a lot of market share from a sleeping and distracted giant.
Supporting Legacy Applications at the Cost of New Innovation
Legacy applications will keep every business from certain opportunities. We have seen multiple times where a company that was once a leader and innovator became too sales-focused and started squeezing out every last ounce of profitability from its legacy products. As a result, a large percentage of top developers were focused on building out new features into a declining market, allowing a window of opportunity for new startups and competition to edge in.
Top of mind examples are Workday’s stealing of market share from SAP before SAP purchased SuccessFactors, or Apple’s edging into PC sales as it created a larger ecosystem of phones, tablets, laptops, TV boxes, etc.
Google is one of the best examples of being willing to dump legacy products - even successful ones - if they see support and maintenance as not keeping up with ROI or cutting into areas where they can innovate faster (Froogle, and Google+ and their capping of support of Chromebooks and Pixel phones at 5 years).
Reasons to sunset legacy products in favor of focusing efforts on developing the “next big thing”:
- Rising Maintenance and Talent Costs – Cost and scope creep happen suddenly and often are much greater than companies are willing to measure or to admit. One of the more informative questions we always ask is: “if you were to build a new product, what would it look like?” or “if you were to go out to market today, would you develop a new product the same way?” The answers are usually very revealing and conclusive - the legacy products have run their course.
- Longer Term Profitability – It takes a bold move to suspend the current cash cow in favor of focusing “all hands on deck” to develop the next big idea and future cash cow. Even when the potential future could be a 3, 4, or 5X revenue generator, companies get short-sighted and are only focused on this quarter’s revenue.
- Lost Opportunity Cost – Put aside your current quarterly projections for sales of legacy products, what is your current and potential competition building today that could greatly diminish your company in just a few years?
To compete with young, well-funded and high octane startups, legacy companies need to create a labs environment to attract top developers and free them from daily distractions of supporting legacy software to remain focused on emerging products and markets.
To succeed, teams must be focused on the highest possible ROI.
- Ensure acquisitions have been properly evaluated for the amount of time and energy it will take to integrate their products and that the cultures of your company and that being acquired are compatible.
- Supporting legacy applications is important for ongoing profitability, but likely can also be done by teams with less experience and expertise. What is often not factored properly is the opportunity cost of keeping legacy products in play long past their shelf life.
- Keep your A-teams focused on new product development, not on keeping the lights on. Create a separate Labs organization and treat them like a startup, and free them from legacy decisions on infrastructure, coding languages, etc. Your competition is starting with a fresh slate, you should too.
We’ve helped many startups and Fortune 500 companies make the transition from legacy software and hardware to SaaS and cloud infrastructure to increase scalability while providing high availability. Let us help your organization!
(Photo Credit: Matthew Henry downloaded from Burst.Shopify.com)
March 15, 2019 | Posted By: Marty Abbott
I’m no Nostradamus when it comes to predicting the future of technology, but some trends are just too blatantly obvious to ignore. Unfortunately, they are only easy to spot if you have a job where you are allowed (I might argue required) to observe broader industry trends. AKF Partners must do that on behalf of our clients as our clients are just too busy fighting the day-to-day battles of their individual businesses.
One such very concerning probability is the eventual decline – and one day potentially the elimination of – the colocation (hosting) business. Make no mistake about it – if you lease space from a colocation provider, the probability is high that your business will need to move locations, move providers, or experience a service disruption soon.
Let’s walk through the factors and trends that indicate, at least to me, that the industry is in trouble, and that your business faces considerable risks:
Sources of Demand for Colocation (Macro)
Broadly speaking, the colocation industry was built on the backs of young companies needing to lease space for compute, storage, and the like. As time progressed, more established companies started to augment privately-owned data centers with colocation facilities to avoid the burden of large assets (buildings, capital improvements and in some cases even servers) on their balance sheets.
The first source of demand, small companies, has largely dried up for colocation facilities. Small companies seek to be “asset light” and most frequently start their businesses running on Infrastructure as a Service (IaaS) providers (AWS, GCP, Azure etc.). The ease and flexibility of these providers enable faster time to market and easier operational configuration of systems. Platform as a Service (PaaS) offerings in many cases eliminate the need for specialized infrastructure and DevOps skill sets, allowing small companies to focus limited funds on software engineers that will help create differentiating experiences and capabilities. Five years ago, successful startups may have started migrating into colocation facilities to lower costs of goods sold (COGS) for their products, and in so doing increase gross margin (GM). While this is still an opportunity for many successful companies, few seem to take advantage of it. Whether due to vendor lock-in through PaaS services, or a preference for speed and flexibility over expenses, the companies tend to stay with their IaaS provider.
Larger, more established companies continue to use colocation facilities to augment privately-owned data centers. That said, in most cases technology refresh results in faster and more efficient compute. When the rate of compute increases faster than the rate of growth in transactions and revenue within these companies, they start to collapse the infrastructure assets back into wholly-owned facilities (assuming power, space, and cooling of the facilities are not constraints). Bringing assets back in-house to owned facilities lowers costs of goods sold as the company makes more efficient use of existing assets.
Simultaneously these larger firms also seek the flexibility and elasticity of IaaS services. Where they have new demand for new solutions, or as companies embark upon a digital transformation strategy, they often do so leveraging IaaS.
The result of these forces across the spectrum of small to large firms is reduced overall demand. Reduced demand means a contraction in the colocation industry overall.
Minimum Efficient Scale and the Colocation Industry (Micro)
Data centers are essentially factories. To achieve optimum profitability, fixed costs such as the facility itself, and the associated taxes, must be spread across the largest possible number of units of production. In the case of data centers, this means achieving maximum utilization of the constraining factors (space, power, and cooling capacity) across the largest possible revenue base. Maximizing utilization against the aforementioned constraints drops the LRAC (long run average cost) as fixed costs are spread across a larger number of paying customers. This is the notion of Minimum Efficient Scale in economics.
As demand decreases, on a per data center (colocation facility) basis, fixed costs per customer increase. This is because less space is used, and the cost of the facility is allocated across fewer customers. At some point, on a per data center basis the facility becomes unprofitable. As profits dwindle across the enterprise, and as debt service on the facilities becomes more difficult, the colocation provider is forced to shut down data centers and consolidate customers. Assets are sold or leases terminated with the appropriate termination penalties.
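The mechanism can be illustrated with a simple cost model. Every number in the example is hypothetical and chosen only to show the shape of the curve; it is not data about any provider, and the function names are ours.

```python
def cost_per_customer(fixed_cost, variable_cost_per_customer, customers):
    """Average monthly cost to serve one customer in a facility."""
    if customers <= 0:
        raise ValueError("customers must be positive")
    return fixed_cost / customers + variable_cost_per_customer


def is_profitable(price, fixed_cost, variable_cost_per_customer, customers):
    """True when the price charged exceeds the average cost per customer."""
    return price > cost_per_customer(
        fixed_cost, variable_cost_per_customer, customers
    )


if __name__ == "__main__":
    # Hypothetical facility: 500,000 fixed and 2,000 variable cost per
    # customer each month, with every customer paying 8,000.
    for customers in (120, 100, 90, 80, 60):
        average = cost_per_customer(500_000, 2_000, customers)
        status = "profitable" if is_profitable(
            8_000, 500_000, 2_000, customers) else "unprofitable"
        print(f"{customers:>3} customers: {average:,.0f} per customer ({status})")
```

In this made-up facility the break-even point sits between 83 and 84 customers, so losing a fifth of a 100-customer base is enough to turn the site unprofitable even though nothing about the building has changed.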
Customers who wish to remain with a provider are forced to relocate. This in turn causes customers to reconsider colocation facilities, and somewhere between a handful and a majority on a per location basis will decide to move to IaaS instead. Thus begins a vicious cycle of data center shutdowns engendering ever-decreasing demand for colocation facilities.
Excluding other macroeconomic or secular events like another real estate collapse, smaller providers start to exit the colocation service industry. Larger providers benefit from the exit of smaller players and the remaining data centers benefit from increased demand on a dwindling supply, allowing those providers to regain MES and profitability.
Does the Trend Stop at a Smaller Industry?
We are likely to continue to see the colocation industry exist for quite some time – but it will get increasingly smaller. The consolidation of providers and dwindling supply of facilities will stop at some point, but just for a period. Those that remain in colocation facilities will either not have the means or the will to move. In some cases, a lack of skills within the remaining companies will keep them “locked into” a colocation. In other cases, competing priorities will keep an exit on the distant horizon. These “lock in” factors will give rise to an opportunity for the colocation industry to increase pricing for a time.
But make no mistake about it, customers will continue to leave – just at a decreased rate relative to today’s departures. Some companies will simply go out of business or contract in size and depart the data centers. Others will finally decide that the increasing cost of service is too high.
While it’s doubtful that the industry will go away in its entirety, it will be small and comparatively expensive. The difference between costs of colocation and costs to run in an IaaS solution will start to dwindle.
Risks to Your Firm
The risk to your firm comes in three forms, listed in increasing order of risk as measured by a function of probability of occurrence and impact upon occurrence:
- Pricing of service per facility. If you are lucky enough that your facility does not close, there is a high probability that your cost for service will increase. This in turn increases your cost of goods sold and decreases your gross margin.
- Risk of facility dissolution. There exists an increasingly high probability that the facilities in which you are located will be shut down. While you are likely to be given some advance notice, you will be required to move to another facility with the same provider, or another provider. There is both a real cost in the move, and an opportunity cost associated with service interruption and effort.
- Risk of firm as a going concern. Some providers of colocation services will simply exit the business. In some cases, you may be given very little notice as in the case of a company filing bankruptcy. Service interruption risk is high.
Strategies You Must Employ Today
In our view, you have no choice but to ensure that you are ready and able to easily move out of colocation facilities. Whether this be to existing data centers you own, IaaS providers, or a mix matters not. At the very least, we suggest your development and operations processes enable the following principles:
- Environment Agnosticism: Ensure that you can run in owned, leased, managed service, or IaaS locations. Ensuring consistency in deployment platforms, using container technologies and employing orchestration systems all aid in this endeavor.
- Hybrid Hosting: Operate out of at least two of the following three options as a course of normal business operations: owned data centers, leased/colocation facilities, IaaS.
- Dynamic Allocation of Demand: Prove on at least a weekly basis that you can operate any functionality within your product out of any location you operate – especially those that happen to be located within colocation facilities.
AKF Partners helps companies think through technology, process, organization, location, and hosting strategies. Let us help you architect a hybrid hosting solution that limits your risk to any single provider.
March 6, 2019 | Posted By: Marty Abbott
Our typical assessment goes something like this: We spend 1.75 days with an energized product team comprised of engineers and product managers. We feel the passion and engagement of the team, and we see the signs of stress the team endures in trying to meet product delivery schedules. Then we meet the security person. The person is not very stressed, does not have delivery goals and seems to steal the energy from the room.
This is an angry post. I won’t apologize for that. I’m fed up with the ridiculous way that most CISOs approach security, and you should be too. The typical approach, in more than 80% of the companies with which we work, results in slow time to market, increased response time for transactions, higher than necessary cost, lower than appropriate availability, and no demonstrable difference in the level of security related incidents. Put another way, most CISOs reduce rather than increase shareholder value.
Here is a handy tool to identify value-destroying CISOs. We’ve compiled 5 common statements uttered by CISOs out of touch with the needs of the corporation, customers and shareholders. Each of these assumes that the CISO both believes the statement and acts consistently with the statement (a high probability). Each statement is followed by why it is bad, what it is costing you, and what (besides replacing the person) you should do.
“No, we can’t do that”
Wrong answer. The purpose of security is to help move the company towards the right business outcome, as quickly as possible, with the right level of risk for the endeavor. This means that in some cases, where the probability and impact of compromise are low, we simply do not apply much “security” to the solution. In other cases, where probability and impact are high, we put measures in place to reduce probability and impact.
We never say “No” to an ethical outcome. Rather than saying “No” to a path, we attempt to ensure the path includes the right level of risk adjustment to make it successful.
The right answer: “That may work if we make a few modifications to help reduce the following probability of an incident, and reduce the impact of an incident should it occur. Here’s how my team can help you”.
“My job is to keep us out of the paper”
Incomplete, and as a result, incorrect answer. The role of security is to ethically maximize profits, by ensuring that risk is commensurate with the endeavor. A great security team helps decrease the probability of incidents and decrease the impact of an incident should one arise. This in turn helps ensure that profits achieve an appropriate level. “Keeping us out of the paper” makes no reference to the fiduciary responsibility of providing returns to shareholders. It’s further not a path to that responsibility, as there is no tie to enabling or maintaining profitability. Hell, if you want to achieve this goal, all you have to do is go out of business!
The right answer: “My job is to ensure an appropriate risk approach to stakeholder return – specifically through helping us to achieve an appropriate risk posture for our initiatives that meets our time to market, revenue and profitability objectives”.
“We have to work the process” or “Put in a request – we’ll review it”
Wrong answer. Security isn’t a “team” in and of itself because it can’t “score” and “win”. Security is part of a larger team responsible for delighting end customers such that we can ensure appropriate profitability through superior and appropriately secure offerings. To that end, security needs to adopt an agile mindset – specifically “individual interactions over processes and tools” and be embedded within the value creation teams that are the lifeblood of a company. Further, product and operational teams need to “own” and be accountable for the risk associated with the solutions they create and maintain. Software and servers need to be secure consistent with the needs of the business and end users.
The right answer: “Let’s get together immediately and make this work. Furthermore, how could I have ensured we had folks involved earlier to help you get this out faster?”
“My job is governance” or “We need the right governance model”
Wrong answer. The implication of the above statement is that the value the security team provides is in judging the work of others and ensuring compliance. The best security teams understand that compliance is best achieved through embedding themselves within product teams – not sitting in judgement of them during the process. The fastest and highest value creating teams are those that understand and have the right tools to accomplish the necessary outcomes embedded within their teams (read the related white paper here).
The right answer: “We embed people in teams to get the right answer quickly and get the product to market faster. Good governance happens in real time and in many cases can be automated within the product development lifecycle (CI/CD pipelines for instance).”
“We have to slow things down a bit”
Wrong answer. If you have a compelling growth business with a big vision, you are going to attract competitors. If you have competitors, getting the right solution to market quickly is critical to your success. No one “wins” by playing only defense or by being just “careful”. You win by making the best risk measured decisions quickly and releasing a good enough product before your competitors.
The right answer: “We have to figure out how to make the right decisions early without slowing down our delivery.”
Another way to determine if you have the “right” security team and correct security leader is to evaluate the number of security related engineers embedded within teams relative to the number of people evaluating approaches or “governing”. If the number of “governing” employees exceeds the number of embedded employees, you have a problem. Ideally, we want a very small number of “brakes” (governance) and more security product “gas pedals” (embedded). The latter results in better decisions and better product security in real time. The former results in delay, overhead, and an ivory tower.
We perform dozens of security assessments and technical due diligence reviews every year. Contact us and let us help!