July 11, 2019 | Posted By: Pete Ferguson
A few years back, a senior leader at a previous company decided that being “cutting edge” was a core value we should espouse, and so we quickly became the guinea pigs of our vendors. I don’t recall them offering much of a discount, if any, but they gladly pushed their V1 software to us and gave us a fast-track feedback channel to the developers.
After employees were unable to get needed access to our facilities and several service interruptions occurred, uptime thankfully became our top priority, and we settled instead on stable V2 releases.
Using proven mature technology means finding that sweet spot after the beta testers and before the end of a product development lifecycle (PDLC).
“How can we leverage the experiences of others and existing solutions to simplify our implementation? ... The simplest implementation is almost always one that has already been implemented and proven scalable.”
Abbott, Martin L. Scalability Rules. Pearson Education.
At the release of the first and second generation of the Apple iPad, I was one of the crazy fanboys who woke up in the middle of the night to camp out in front of the Apple Store to be one of the first in line. Because the incremental improvements in speed and features were substantial, I would quickly want to upgrade to the next version.
In contrast, my kids still use Generation 3 and 4 iPads bought used on Craigslist and eBay, and the devices work great for basic web surfing and streaming video. I waited until V2 of the iPad Pro for my own tablet, which I also bought used at a substantial discount. The incremental improvements of each new generation have become increasingly marginal for how I use the devices, and the ROI on these later devices is at least 2-3 fold greater.
In the world of Microsoft software, SP2 is usually the Goldilocks period in which there is enough benefit to upgrade without larger organizations taking on too much risk. As of this writing, NetMarketShare.com reports that Windows 7 is still used by 38% of polled users, and Windows XP trails not far behind Windows 8, the two accounting for a combined total of just under 8%. Clearly, many – including non-profits and organizations in developing countries – want the stability of a proven OS without the hassle and cost of upgrading.
The Sharpness of Cutting Edge
It is helpful to consider the following when deciding just how close to the “cutting edge” your technology should be:
- Scalability: Will the technology support growth spurts without worry of capacity?
- Availability: Reliable in the 99.9s, with a customer base large enough that if you are experiencing issues, so is a much larger segment of the market – and the vendor is incentivized to patch and resolve quickly
- Competitive Advantage: Your company is propelled to success, not hindered by the technology
If your decision to be cutting edge is not providing an easily-observable competitive advantage, then the risks are not bringing the needed rewards.
As in all decisions, we need to constantly ask: what is our desired outcome? We should then regularly review whether our priorities match our technology purchase and implementation decisions. Bells and whistles are often underutilized and provide very little competitive or strategic advantage while increasing costs and creating distractions.
TECHNOLOGY ADOPTION LIFECYCLE
Use Mature Technology – not retired or dying technology
The corollary, of course, is not to get too comfortable that your technology becomes antiquated.
As technology ages, vendors increase the cost of support in hopes of incentivizing us to upgrade to the latest and greatest or move to a subscription-based model. Perhaps the largest cost to our business, however, is the loss of competitive advantage when precious resources are used to nurse aged systems, attend P1/Sev 1 escalation calls, post mortems, and quality of service meetings instead of developing the next-gen platform that will maintain or beat the competition for market share.
Don’t Hitch Your Wagon to a Dying or Dead Horse
As an extreme example of holding on to dying technology – many traditional banks, insurance companies, airlines, and others still use massive, monolithic applications written in COBOL. Outside of cost to change to a new platform, a justification we have heard from our clients to keep it alive is that COBOL programmers can make more because there are so few of them.
The higher pay, however, does not seem to persuade university interns and graduates – who want to be part of the next Facebook, Google, or Apple – to tie their future careers to a dying language. The majority of COBOL programmers are currently retiring and will likely be out of the workforce within a decade, which means costs will continue to increase until the supply is depleted. While moving off COBOL is not an immediately dire need, it definitely needs to be on the 3-5 year roadmap: how do we transition to a more scalable, efficient, and easier-to-support language? For many organizations, it is like the ugly couch in a college apartment: no one really notices it is more than an eyesore until a family of rats and lice is found living in it.
Unfortunately, with quarterly financial cycles, the cost of moving away from servers and software that have performed reliably for decades is often a difficult proposition for operations teams. FAs – whose bonuses and annual increases rely on keeping budgets flat – aren’t going to see the advantages. CTOs must take a longer-term view of the cost of transitioning now versus down the road – and give very heavy weighting to the lost opportunity cost of not making the move sooner rather than later.
- Our technology solutions should always focus on what will lead to the most competitive take of market share
- Our technology must be both scalable and highly available
- There is a sweet spot for technology adoption that lies between early release and end of life. It should be evaluated annually, and the true cost of supporting beta and aged systems should be measured not only in man-hours for upkeep but, most importantly, in how much it distracts our key talent from innovating the next BIG THING
We help hundreds of organizations globally find the right balance in how technology is utilized, and how technology integrates with your people and processes. Contact us today to see how we can help!
July 10, 2019 | Posted By: Bill Armelin
We are surprised at how often we go into a client and find that management does not have any metrics for their teams. The managers respond that they don’t want to negatively affect the team’s autonomy or that they trust the team to do the right things. While trusting your teams is a good thing, how do you know what they are doing is right for the company? How can you compare one team to another? How do you know where to focus on improvements?
Recently, we wrote an article about team autonomy, discussing how an empowered team is autonomous within a set of constraints. The article creates an analogy to driving a car, with the driver required to reach a specific destination, but empowered to determine WHAT path to take and WHY she takes it. She has gauges, such as a speedometer to give feedback on whether she is going too fast or too slow. Imagine driving a car without a speedometer. You will never know if you are sticking to the standard (the speed limit) or when you will get to where you need to go (velocity).
As a manager, it is your responsibility to set the appropriate metrics to help your teams navigate through the path to building your product. How can you hold your teams to certain goals or standards if you can’t tell them where they are in relation to the goal or standard today? How do you know if the actions you are taking are creating or improving shareholder value?
What metrics do you set for your teams? It is an important question. Years ago, while working at a Big 6 consulting firm, I had the pleasure of working with a very astute senior manager. We were redesigning manufacturing floors into what became Lean Manufacturing. He would walk into a client and ask them what the key metrics were. He would then proceed to tell them what their key issues were. He was always right. With metrics, you get what you measure. If you align the correct metrics with key company goals, then all is great. If you misalign them, you end up with poor performance and questionable behaviors.
So, what are the right metrics for a technology team? In 2017, we published an article on what we believe are the engineering metrics by which you should measure your teams. Some of the common metrics we focused on were velocity, efficiency, and cost. At first glance, these might seem “big brother-ish.” In reality, these metrics provide your engineering teams with critical feedback on how they are doing. Velocity helps a team identify structural defects within the team (and should not be used to compare against other teams or to push them to get more done). Efficiency helps teams identify where they are losing precious development time to less valuable activities, such as meetings, interviews, and HR training. It helps them and their managers quantify the impact of non-development time and reduce such activities.
Cost helps the team identify how much they are spending on technology. We have seen this metric used particularly effectively in companies deploying to the cloud. Many companies allow cloud spending to grow significantly and uncontrollably as they scale. Looking at costs exposes things like the need for autoscaling to reduce the number of instances required during off-peak times, or the need to purge unused instances that should be shut down.
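As a sketch of the idea, a simple report like the following (instance ids, utilization numbers, and the threshold are all invented for illustration) can surface instances that autoscaling or cleanup should reclaim:

```python
def idle_instances(avg_cpu_by_instance, cpu_threshold=5.0):
    """Return ids of instances whose average CPU utilization (percent)
    falls below a threshold -- candidates to scale down or purge.

    avg_cpu_by_instance maps instance id -> average CPU percent
    over some observation window (e.g. the last 14 days).
    """
    return sorted(
        iid for iid, cpu in avg_cpu_by_instance.items() if cpu < cpu_threshold
    )


# Example: two of three instances are nearly idle and worth reviewing.
report = idle_instances({"i-a": 1.2, "i-b": 61.0, "i-c": 3.9})
```

In practice this data would come from your cloud provider's monitoring APIs; the point is simply that making cost and utilization visible turns a vague "cloud spend is growing" worry into an actionable list.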
The key to keeping metrics from being perceived as overbearing is to keep them transparent. The teams must understand the purpose of each metric and how it is calculated. Don’t use them punitively. Use them to help the teams understand how they are doing in relation to the larger goals. How do you align higher-level company goals to the work your teams are performing? We like to use Objectives and Key Results, or OKRs. This concept was created by Andy Grove at Intel and brought to Google by John Doerr. The framework aims to align higher-level “objectives” to measurable “key results.” An objective at one level has several key results. These key results become the objectives for the next level down and define another set of key results at that level. This continues all the way down to the lowest levels of the company, resulting in alignment of key results and objectives across the entire company.
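To make the cascade concrete, here is a minimal sketch in Python (all objective names are invented for illustration; real OKRs would come from your own planning process):

```python
# Toy OKR cascade: a key result at one level becomes an objective
# at the next level down, all the way to the team level.
okrs = {
    "objective": "Increase shareholder value",
    "key_results": [
        {
            "objective": "Grow revenue 20% year over year",
            "key_results": [
                {"objective": "Lift search conversion 5%", "key_results": []},
                {"objective": "Reduce checkout abandonment to 60%", "key_results": []},
            ],
        }
    ],
}


def objectives_by_level(node, level=0, out=None):
    """Flatten the cascade into (level, objective) pairs so alignment
    from company goals down to team goals can be inspected."""
    if out is None:
        out = []
    out.append((level, node["objective"]))
    for child in node["key_results"]:
        objectives_by_level(child, level + 1, out)
    return out
```

Walking the structure this way makes it easy to audit that every team-level objective traces back to a company-level one.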
Choosing the Right Metric
Metrics-driven institutions demonstrably outperform those that rely on intuition or “gut feel.” That stated, poorly chosen metrics – or simply too many of them – may hinder performance.
- A handful of carefully chosen metrics. Choose a small number of key metrics over a large volume. Ideally, each Agile team should be tasked with improving 2-3 metrics (no more than 5). (Of note, numerous psychological studies have shown that the quality of decision-making actually decreases when too much information is presented.)
- Easy to collect and/or calculate. A metric such as “Number of Customer Service Tickets per Week,” although crude, is better than “Engineer Hours Spent Fixing Service,” which requires costly time and effort to collect.
- Directly Controllable by the Team. Assigning a metric such as “Speed and Accuracy of Search” to a Search Service is preferred to “Overall Revenue” which is less directly controllable.
- Reflect the Quality of Service. The number of abandoned shopping carts reflects the quality of a Shopping Cart service, whereas number of shopping cart views is not necessarily reflective of service quality.
- Difficult to Game. The innate human tendency to game any system should be held in check by selecting the right metrics. Simple velocity measures are easily gamed while the number of Sev 1 incidents cannot be easily gamed.
- Near Real Time Feedback. Metrics that can be collected and presented over short-time intervals are most desirable. Information is most valuable when fresh — Availability week over week is better than a yearly availability measure.
Managers are responsible for the performance of their teams in relation to the company’s objectives and how they create shareholder value. Measuring how your teams perform against those goals – or their contribution to them – is only speculation if you don’t have the correct measurements and metrics in place. The bottom line is, “If you are not measuring, you are not managing.”
Are you having difficulty defining the right metrics for your teams? Are you interested in defining OKRs but don’t know where or how to get started? AKF has helped many companies identify and implement key metrics, as well as implement OKRs. We have over 200 years of combined experience helping companies ensure their organizations, processes, and architecture are aligned to the outcomes they desire. Contact us, we can help.
July 8, 2019 | Posted By: Marty Abbott
Circuit Breaker Pattern Overview
The microservice Circuit Breaker pattern is an automated switch capable of detecting extremely long response times or failures when calling remote services or resources. The circuit breaker proxies or encapsulates service A making a call to remote service or resource B. When error rates or response times exceed a desired threshold, the breaker “pops” and returns an appropriate error or message regarding the interface status. Doing so allows calls to complete more quickly, without tying up TCP ports or waiting for traditional timeouts. Ideally, the breaker is “healing”: it senses the recovery of B and resets itself.
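The mechanics above can be sketched in a few lines of Python. This is an illustrative sketch, not a production implementation: the thresholds, timeout, and state names are assumptions, and a real breaker would typically track error rates over a sliding window and enforce per-call timeouts as well.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch.

    States: "closed" (calls pass through), "open" (calls fail fast),
    "half_open" (one trial call probes whether B has recovered).
    """

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before popping
        self.reset_timeout_s = reset_timeout_s      # wait before probing recovery
        self.failure_count = 0
        self.opened_at = None
        self.state = "closed"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # "Healing": after the timeout, let one trial call through.
            if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                self.state = "half_open"
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.state == "half_open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        # A success closes the breaker and clears the failure count.
        self.failure_count = 0
        self.state = "closed"
        return result
```

Service A would wrap every call to B in `breaker.call(...)`; when B misbehaves, callers get an immediate error instead of a hung socket.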
The circuit breaker analogy works well in that it protects a given circuit for calls in series. Unfortunately, it misses part of the true analogy: a real breaker trips to keep a failure from propagating to other components on other circuits. In our practice, we use the term “circuit breaker” to refer either to the technique of fault isolation or to the microservice pattern of handling service-to-service faults. In this article, we use the term consistent with the microservice meaning.
Problems the Circuit Breaker Fixes
Generally speaking, we consider service-to-service calls an anti-pattern to be avoided whenever possible, due to the multiplicative effect of failure and the resulting lower availability. There are, however, times when you just can't avoid making distant calls. Examples are:
- Resource (e.g. database) Calls: Necessary to interact with ACID or NoSQL Solutions.
- Third Party Integrations: Necessary to interact with any third party. While we prefer these to be asynchronous, sometimes they must be synchronous.
In these cases, it makes sense to add a component, such as the circuit breaker, to help make the service more resilient. While the breaker won't necessarily increase the availability of the service in question, it may help reduce other secondary and tertiary problems such as the inability to access a service for troubleshooting or restoration upon failure.
Principles to Apply
- Avoid the need for circuit breakers whenever possible by treating calls in series as an anti-pattern.
- When calls must be made in series, attempt to use an asynchronous and non-blocking approach.
- Use the circuit breaker to help speed recovery and identification of failure, and free up communication sockets more quickly.
When to use the Circuit Breaker Pattern
- Useful for calls to resources such as databases (ACID or BASE).
- Useful for third party synchronous calls over any distance.
- When internal synchronous calls can't otherwise be avoided architecturally, useful for service to service calls under your control.
The circuit breaker won't fix availability problems resulting from a failed service or resource. It will make the effects of that failure more rapid which will hopefully:
- Free up communication resources (like TCP sockets) and keep them from backing up.
- Help keep shared upstream components (e.g. load balancers and firewalls) from similarly backing up and failing.
- Help keep the failed component or service accessible for more rapid troubleshooting and alerting.
Additionally, always ensure alerts fire on breaker-open situations to aid faster time to detect (TTD).
AKF Partners has helped hundreds of companies implement new microservice architectures and migrate existing monolithic products to microservice architectures. Give us a call – we can help!
July 3, 2019 | Posted By: Marty Abbott
AKF Partners has helped guide companies through digital transformations for over 10 years. We’ve helped traditional brick and mortar service, product, banking and retail companies create compelling sail solutions to harness the increasing power of the prevailing digital winds. This experience has made it clear that no transformation can be successful without addressing 10 key areas. These 10 areas form the foundation of any successful transformation, and a failure to address any of them is at the very least a guarantee to have a slow and painful transformation. In most cases, a failure to address even one of them is a guarantee to fail in the transformation and as a result, fail as a company.
Without further ado, we offer our list of 10 must have principles for any digital transformation.
Every successful transformation starts and ends with people; people who have the right experience, the right mindset, the right approach, and a sizable amount of humility.
- Right Skills, Right Behaviors, Right Experience, Right Time
You need people with experience in the digital world; people who understand that time is of the essence and behave appropriately. Think of it this way: if you were having a surgeon perform a procedure to save your life, do you want a surgeon who is learning on the job? At the very least, you’ll want to make sure that an experienced surgeon is alongside the inexperienced doctor. The same is true for digital transformations.
- Product not IT Mindset
Building solutions for end consumers is a very different world than building solutions for employees. As we indicate in the Art of Scalability, and as Marty Cagan agrees, you must have a product, not an IT mindset, to be successful.
The driving forces behind how one creates a product are different. Product teams look at revenues and profits instead of just costs. Funding outcomes are more important than funding projects. Product teams lead whereas IT teams take orders. Product teams think first about performance, rather than speaking nonstop about process. Governance, while important, takes a back seat to execution – especially executing against measurable outcomes. Collaboration trumps the negotiation between IT teams and business units.
Similarly, the leader of a product engineering organization is different from the leader of an IT team.
As we will discuss later under “discovery”, product teams know that they will be wrong often. We eliminate words like “requirements”, because they seem to indicate we completely understand what needs to happen. Humble teams understand the outcome, not the path. The path is initially a set of hypotheses that are tested, validated, or proven wrong and discarded in favor of new hypotheses. The only thing we know is that we will be wrong multiple times before finally landing on a result that meets the business outcomes.
- Outcomes not Projects
As stated above, digital teams look to fund outcomes – not projects. Think in terms of “$XM invested in search for a Y% increase in add-to-carts from search, resulting in $ZM revenue” – not “Implement Elasticsearch.”
The right people will ensure that you follow the processes and approaches that will make you successful – specifically:
- Discovery and Agility
Digital companies attempt to find the right solution through Discovery. Starting with an MVP (below), they iterate in an Agile fashion to find the solution. Gone are large solution specifications (software requirements specifications) in favor of epics and stories which can be clearly and easily defined in significantly fewer words.
Product teams understand what Fred Brooks meant when he said
Because the design that occurs first is almost never the best possible, the prevailing system concept may need to change. Therefore, flexibility of organization is important to effective design.
Fred is talking about the need to be agile – to sense and respond to not only the mistakes made with any solution development, but the interaction of the market with the solution you develop. You must throw away waterfall concepts in your digital endeavors to be truly successful.
When you develop solutions for employees, you have a captive audience and a natural monopoly. You are paying them, and you get to determine whether they use your solution or not. As a result, you don’t have to be great at solution usability. That isn’t the case in the new digital world. End users will churn or select another provider if a solution isn’t easy to use and intuitive.
- Gas Pedals – Not Brakes
This is a catch-all category for all the reasons traditional IT teams have for why something can’t be done: “We have to work the process ...” or “This needs to go through a review ...” or “Has it gone through the right governance processes?” or “Have you filed a ticket for this yet?” Product teams care about speed and time to market. As such, the best product teams have all the skills in them, and the correct experience within the team to be fully empowered and held accountable to achieving the right outcomes.
Everyone is using the term minimum viable product (MVP) these days; few companies truly get it. The pork-barrel political negotiation that typically happens to get an IT project launched results in bloated, overpriced, slow-to-market solutions. Digital companies know that small is fast. They know they’d rather build smaller than the initial need and work up to a true MVP than overshoot it and, as a result, be late.
The era of a company relying on the brilliance of a handful of people to predict markets is over. Companies that have completed the digital transformation sense the needs of the market and customers in real time through science – not individual brilliance.
- Learning not Knowing (Induction and Deduction)
Digital teams assume that their hypotheses may be wrong, and they know that what may be correct today for a solution will likely evolve and change soon. As such, they build solutions that help them identify evolving patterns and learn true end user need. Critical to any Digital transformation is a data ecosystem that goes well beyond the packaged “data warehouses” of yore. You need more than just a place to dump your data – you need a solution that supports a virtuous cycle of learning and exploration. Induction leads to insights to form hypotheses. Deduction proves (or disproves) those hypotheses to create new knowledge that fuels growth.
- Scientists and Engineers – not Technicians
Reports and reporting teams may provide executives with the daily pulse of their business, but they are insufficient to fuel the insights necessary to be competitive in the digital world. Report writers are at best programmers; you need people who understand how to use data to generate insights that result in knowledge and information. More than just spotting trends, digital teams need to understand how and why trends happen and, even more importantly, must be able to predict what the future brings.
Need help with your digital transformation? Contact us, we can help!
July 3, 2019 | Posted By: AKF
For many years after the introduction of the automobile, most industries benefited from somewhat static and predictable secular forces. Nascent technology forces primarily aided companies in increasing gross margins and operating margins by lowering the cost of labor, decreasing the cost of warehousing and decreasing logistical costs. Consumer behavior followed somewhat predictable and unchanged patterns, varying mostly with economic conditions and seasonal needs.
However, the advent of eCommerce and the “anything as a service” movement from the late 90’s through the early 2000’s started to significantly change the behavior of individual and business consumers. Layer on logistics integrations that allowed near-immediate gratification for non-service durable goods, and consumers started to shift to purchasing from home. Similarly, businesses enjoy comparatively easy near-term gratification with the implementation of services; gone are the days of multi-year on-premise ERP implementations, in favor of short-duration leasing of software as a service.
Late majority and laggard businesses within the technology life cycle were as late to identify these shifts in business and individual consumer behavior as they were to adopt new technical solutions. While they may have thrown up digital storefronts or offered digital downloads of their solutions, they failed to envision how the nascent forces begged for new integrations and tighter business cycles that would benefit not only the buyer but the producing company as well.
So, what is Digital Transformation? It is taking digital technologies and using them to provide new and creative ways to conduct business. This isn’t just an update to a technology stack or writing new code. It is a transformation, via digitization, of how business is done.
Recaptcha and the NY Times
In the mid-2000’s, captcha was taking the internet by storm. It was designed to weed out bots and allow only humans through to websites. It started out rather simply, with a challenge of words or letters requiring a response that, at the time, was difficult for a computer to produce. It met with great success, logging over half a million hours per day of people filling out captchas to access sites or submit information. Although not a digital transformation itself, what was coupled with captcha became one.
The NY Times had digitized all of their old issues and used computer software to recognize the scanned images and convert them to text. However, 10 to 30 percent of the text was still missed, which then required two humans to independently review the unrecognized text and come to the same conclusion about what was shown – a lengthy and expensive process. Partnering with the captcha team, the NY Times put those unrecognized words at the end of captchas. A normal captcha question (still designed to determine bot or not) would be followed by a picture of text from the NY Times. Once the picture had been verified by enough people, it would be associated with that text. Using this resource of over half a million man-hours per day, the NY Times was able to quickly close the gap on unrecognized text from older issues. This generated an increase in business for captcha and solved a manual problem digitally.
Who uses Digital Transformation and Why?
Startups are created digitally – they have no pre-existing infrastructure, code, or business model. Instead of transforming, startups are born digital. This gives them the opportunity to carve out a piece of the total addressable market not currently being serviced, or to siphon off business from companies already operating in the market. By looking at what is already being done in the market and identifying the dissatisfaction within it, smart engineers can create a new product and a new business (if done appropriately, and well-funded) that dramatically transforms a technology sector. Digitization can come easily to startups: the incumbent business already exists, and trying to compete by doing the same thing will lead to failure. Taking an antiquated method and using digital prowess to transform it is what gives startups their edge.
If the current competitors in the market are paying attention, they can use this disruption to their advantage. Just like startups were able to glean information about where and how to target from larger corporations, those same corporations can now identify how to transform from the startups. It is not always easy creating a disruptive technology that requires a transformation for how business is done. By being a close follower, large corporations can avoid a lot of the pitfalls and ensure that they can keep their market share, while also now targeting new consumers. Conversely, if unwilling to adapt to changing forces, or unwilling to see the future for what it is, these corporations will be abandoned by their consumers.
Digital Transformation is Not…
...just taking a current generation technology and updating it.
...version control or the introduction of new hardware or software.
Digital Transformation requires a complete re-tooling of how you conduct business. It will affect how you code, how you scale, how you sell, how you brand and even how you interact as a company. A caterpillar doesn’t become a butterfly simply by slapping on some wings. Inside the cocoon it completely dissolves itself and rebuilds from the ground up. That is Digital Transformation.
Digital Transformation isn’t easy. Marty Abbott has summed up 10 Principles that will assist you with your transformation. If you need further assistance with identifying Digital Transformation pitfalls and goals, AKF can help!
July 1, 2019 | Posted By: Greg Fennewald
As technology professionals, managing risk is an important part of the value we provide to the business. Risk can take many forms, including threats to availability, scalability, information security, and time to market. Physical layer risks from the data center realm can severely impact availability, as the events of the February 2019 Wells Fargo outage demonstrate.
Transitioning Away from On-Prem Hosting
Over the last decade, knowledge of data center architecture, operating principles, capabilities, and associated risks has decreased in general due to the rise of managed hosting and especially cloud hosting. This is particularly true for small and medium sized companies, which may have chosen cloud hosting early on and thus never have dealt with colocation or owned data centers. This is not necessarily a bad trend – why devote resources to learn domains that are not core to your value proposition?
While knowledge of data center geekdom may have decreased, the risks associated with data centers have not substantially changed. Even the magic pixie dust of cloud hosting is a data center at its core, albeit with a degree of operational excellence exceeding the stereotypical company-owned data center plus colo combination.
Given that technologists can mitigate data center risks by choosing cloud hosting with a major provider that has mastered data center operations, why spend any time learning about data center risks?
- Cloud hosting sites do encounter failures. The ability to ask informed questions during the vendor selection process can help optimize the availability for your business.
- Business or regulatory changes may force a company to use colocation to meet data residency or other requirements.
- A company may grow to the size where owning data centers makes business sense for a portion of their hosting need.
- A hosting provider could exit the business or face bankruptcy, forcing tenants to take over or move on short notice. Been there, done that, got the T shirt.
Data Center Lifespan Risk
For the purposes of this article, we will consider data center lifespan risk. We define this risk as the probability of an infrastructure failure causing significant, and possibly complete, business disruption and the level of difficulty in restoring functionality.
A chart of data center lifespan risk resembles a bathtub – a high level of failures as the site is first built and undergoing 5 levels of commissioning towards the left side of the chart, followed by a long period of lower threat that can extend 15 years or more. As time continues to march on, the risk rises again, creating the right-hand side of the bathtub curve.
The risk of failure increases over time as infrastructure components approach the end of their useful service life. Over a sufficiently long time span, the risk of failure approaches unity.
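To illustrate why the risk approaches unity, here is a small sketch assuming a constant failure rate (a simplification that actually understates the right-hand side of the bathtub curve, where rates rise). The 15-year MTBF figure is purely illustrative:

```python
import math

def cumulative_failure_probability(years: float, mtbf_years: float) -> float:
    """P(at least one failure by time t) under a constant failure rate model."""
    return 1.0 - math.exp(-years / mtbf_years)

# A component with an illustrative 15-year MTBF: risk climbs toward 1.0 (unity).
for t in (1, 5, 15, 30, 60):
    print(f"year {t:>2}: {cumulative_failure_probability(t, 15.0):.2f}")
```

Run long enough, every component fails; the only questions are when, and whether you planned for it.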
Service Life Examples
Below are some service life estimates, based on our experience, for critical data center components that are properly maintained:

| Component | Service Life / Notes |
| --- | --- |
| UPS batteries | 4 years VRLA, 12+ wet cell; battery string monitoring strongly recommended |
| Generator engine | 12,000+ hours before overhaul; run 100 hours or less annually |
| Main switchgear PLC | 15+ years; PLC model EOL is the risk |
| CRAH/CRAC fan motors | The magic smoke wants to escape |
| Galvanized cooling tower wet surfaces | Varies with water chemistry; stainless steel worth the cost |
| Electrical distribution board | EOL of breaker style and PLC is the risk |
| Chilled water piping | Design for continuous duty, ~7 FPS flow velocity |
All the above examples are measured in years. If you are in the early years of a data center lifespan, there’s not a lot to worry about other than batteries. Most growing companies are more concerned about adequate capacity, availability, and cost when they create their hosting strategy. Not much thought is given to an exit strategy. Such an effort is probably not worth it for a startup company, but established companies need to be thinking beyond next quarter and next year.
If your product or service can survive the loss of a single hosting site without impact (i.e. multi-active sites with validated traffic shifts), you can afford to run a bit deeper into the service life timeline. If you can't - or if, like Wells Fargo, you thought you could but learned the hard way that you couldn't - you need to plan ahead to mitigate these risks.
As mentioned before, the risks we want to mitigate are an impactful failure and a complex restoration after a failure. By complex, we mean trying to find parts and trained technicians for components that were EOL 5 years ago and end of OEM support 18 months ago. Not a fun place to be. Would you feel comfortable running your online business with switches and routers that are EOL and EOS? Hopefully not. Why would you do so for your hosting location?
Mitigating the Risks
The best way to mitigate the risk of an impactful infrastructure failure is to be able to survive the loss of a hosting site, regardless of type, with business disruption that is acceptable to the business and customers. What is acceptable will vary; your hosting solution should be tailored to the needs of the business.
Some thoughts on aging hosting sites:
- All the characteristics that make cloud hosting taste great and be less filling (containerization, automation, infrastructure as code, orchestration, etc.) can also make the effort to stand up a new site and exit an old one much less onerous.
- If you are committed to an owned data center or colo, moving to a newer site is the best choice. Could you combine a move with a tech refresh cycle? Could the aging data center fulfill a different purpose such as hosting development and QA environments? Such environments should have less business impact from a failure, and you can squeeze out the last few years of life from that site.
- You can purchase extra spare parts for components nearing EOL or EOS and send technicians to training courses. This can mitigate risk but is really analogous to convincing yourself that you can scale your DB by tuning the SQL queries. Viable only to add 6 or 12 months to a move/exit timeline.
Just about any of the components mentioned above in the useful life estimates can be replaced, especially if the data center can be shut down for weeks or months to make the replacement and test the systems. Trying to replace components while still serving traffic is extremely risky. Very few data centers have the redundancy to replace electrical components while still providing conditioned power and cooling to the server rooms. Those sites that can usually cannot do so without reducing their availability. We've had to take a dual UPS (2N) site to a single UPS source (N) for a week to correct a serious design flaw. Running single-corded is not appropriate if your DR plan checks an audit box and not much else.
The tremendous popularity of cloud hosting does not alleviate the need to understand physical layer risks, including data center lifespan risks. Understanding them enables technology leaders to mitigate the risks.
Interested in learning more? Need assistance with hosting strategy? Considering a transition to SaaS? AKF Partners can help.
July 1, 2019 | Posted By: Pete Ferguson
You wouldn’t (hopefully) think of building a house without first sitting down with an architect to come up with a good plan.
While building a house is a waterfall process, that doesn't mean we can throw out good architecture when moving to an Agile methodology in software development. Sound architectural principles give teams the autonomy to select their own solutions while ensuring that any significant new design meets standards for the high availability and scalability of your website, product, or service.
Good architectural principles ensure stability, compatibility, and reliability. Many post mortems I've been involved with after major incidents have uncovered root causes traceable to teams not following agreed-upon architectural principles – and, unfortunately more often than not, to teams not having written and followed architecture standards at all.
Architectural Principles Are Guidelines for Success
Often in Agile software development, we confuse the desire to innovate and do things in new and differentiating ways with the related notion that we shouldn't be shackled by rules and procedures. Many startups and smaller companies offer freedom from the layers of policy and red tape that stifle speed, time to delivery, and innovation – so adding architectural standards and review boards can sound very bureaucratic.
Architectural Principles are not meant to be restrictive. When written and executed properly, they are meant to aid growth and ensure future success. Structured architecture should keep things simple, expandable, and resilient and help teams establish autonomy rather than anarchy.
Successful companies are able to balance consistency with speed, ensuring future efforts aren't encumbered by bugs, refactoring excessive amounts of code, or being haunted by the sins of past shortcuts. The key is to keep things simple and dependable!
… match the effort and approach to the complexity of the problem. Not every solution has the same complexity—take the simplest approach to achieve the desired outcome.
(Abbott, Martin L. Scalability Rules. Pearson Education. Kindle Edition.)
AKF Architectural Principles
In the many technical due diligence engagements and extended workshops I've attended in my tenure with AKF, I've seen how successful companies comply with – and struggling companies avoid – the following principles:
Everything we develop should be based on a set of architectural principles and standards that define and guide what we do. Successful software engineering teams employ architectural review boards to meet with teams and review existing and planned systems to ensure principles are being followed. Extremely successful companies have a culture that constantly looks at ways to better implement their agreed-upon architectural principles so that review boards are only a second set of eyes instead of a policing force.
With 12 years of product architecture and strategy experience, AKF Partners is uniquely positioned to be your technology partner. Let us know how we can help your organization.
This is the first of a series of articles that will go into greater depth on each of the above principles:
Image Credit - Pexels.com
July 1, 2019 | Posted By: Pete Ferguson
This is one of several articles on recommended architectural principles, and it goes into greater depth on a concept referenced in our post on the AKF Scale Cube that we call “Fault Isolation” or, more commonly, “swim lanes” or “swim-laned architectures.” We sometimes also call “swim lanes” fault isolation zones or fault isolated architecture.
Fault Isolation Defined
A “swim lane” or fault isolation zone is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained and does not propagate to or affect services outside of that boundary. Think of this as the “blast radius” of failure, meant to answer the question “What gets impacted should any service fail?” The benefit of fault isolation is twofold:
- Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain. Once something breaks, because the failure is limited in scope, it can be more rapidly identified and fixed. Recovery time objectives (RTO) are subsequently decreased which increases overall availability.
- Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. The “blast radius” of failure is contained. As such, and depending upon approach, only a portion of users or a portion of the functionality of the product is affected. This is akin to circuit breakers in your house – the breaker exists to limit the fault zone for any load that exceeds a limit imposed by the breaker. Failure propagation is contained by the breaker tripping, preserving power to devices which are not affected.
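The household circuit breaker analogy can be sketched in code. This is a minimal, hypothetical breaker, not a production implementation: after a threshold of consecutive failures it "trips" and fails fast, containing the blast radius instead of letting every caller pile onto a failing service.

```python
class CircuitBreaker:
    """Minimal circuit breaker: trips open after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, fn, *args):
        if self.open:
            raise RuntimeError("circuit open: failing fast")  # contain the fault
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1  # count the failure toward tripping the breaker
            raise
        self.failures = 0  # any success resets the breaker
        return result
```

Like the breaker panel in your house, the point is not to prevent the fault but to limit how far it spreads.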
Architecting Fault Isolation
A fault isolated architecture is one in which each failure domain is completely isolated. We use the term “swim lanes” to depict the separations, similar to how a floating line of buoys keeps each swimmer in his or her lane during a race. In order to achieve this in systems architecture, ideally there are no synchronous calls between swimlanes or failure domains made pursuant to a user request.
User-initiated synchronous calls between failure domains are absolutely forbidden in this type of architecture as any user-initiated synchronous call between fault isolation zones, even with an appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a synchronous call to any other service in another domain, to any service outside of the domain, or if the domain receives synchronous calls from other domains or services.
It is acceptable, but not advisable, to have asynchronous calls between domains and to have non-user-initiated synchronous calls between domains (as in the case of a batch job collecting data for the purposes of reporting in another failure domain). If such communication is necessary, it is very important to include failure detection and timeouts, even with the asynchronous calls, to ensure that retries do not cause port overloads on any services.
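A sketch of the "timeouts even on asynchronous calls" advice using Python's asyncio (the function names and the delay/timeout values here are illustrative): the cross-domain fetch is wrapped in a timeout and a bounded retry, so a slow neighbor cannot tie up the caller's connections indefinitely.

```python
import asyncio

async def fetch_report_data(simulated_delay: float) -> str:
    await asyncio.sleep(simulated_delay)  # stand-in for a cross-domain call
    return "rows"

async def guarded_fetch(delay: float, timeout: float = 0.1, retries: int = 2) -> str:
    """Bounded retries plus a timeout so a slow failure domain cannot exhaust callers."""
    for _attempt in range(retries):
        try:
            return await asyncio.wait_for(fetch_report_data(delay), timeout)
        except asyncio.TimeoutError:
            continue  # retry a bounded number of times; never retry forever
    return "unavailable"  # degrade gracefully instead of propagating the failure

print(asyncio.run(guarded_fetch(0.0)))  # fast path succeeds
print(asyncio.run(guarded_fetch(1.0)))  # slow neighbor times out and degrades
```

The key design choice is the bounded retry count: unbounded retries against a struggling domain are exactly how port saturation spreads across a platform.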
As previously indicated, a swim lane should have all of its services located within the failure domain. For instance, if database [read/writes] are necessary, the database with all appropriate information for that swim lane should exist within the same failure domain as all of the application and web servers necessary to perform the function or functions of the swim lane. Furthermore, that database should not be used for other requests of service from other swim lanes. Our rule is one production database on one host.
The figure below demonstrates the components of software and infrastructure that are typically fault isolated:
- Rarely are shared higher-level network components isolated (e.g. border systems and core routers).
- Sometimes, if practical, firewalls and load balancers are isolated. This is especially the case under very high demand situations where a single pair of devices simply wouldn't meet the demand.
- The remainder of solutions are always isolated, with web servers, top-of-rack switches (in non-IaaS implementations), compute (app servers), and storage all being properly isolated.
Applying Fault Isolation with AKF’s Scale Cube
As we have indicated with the AKF Scale Cube in the past, there are many ways in which to think about swimlaned architectures. Swim lanes can be isolated along the axes of the Scale Cube as shown below with AKF’s circuit breaker analogy to fault isolation.
Fault isolation in the X-Axis would mean replicating everything for high availability – and performing the replication asynchronously and in an eventually consistent (rather than a consistent) fashion. For example, when a data center fails, the fault will be isolated to the one failed data center or to multiple availability zones. This is common with traditional disaster recovery approaches, though we do not often advise it as there are better and more cost-effective solutions for recovering from a disaster.
Fault Isolation in the Y-Axis can be thought in terms of a separation of services e.g. “login” and “shopping cart” (two separate swim lanes) with each having the web and app servers as well as all data stores located within the swim lane and answering only to systems within that swim lane. Each portion of a page is delivered from a separate service reducing the blast radius of a potential fault to its swim lane.
The example above of a commerce site shows different components of the page broken down into sections for login, buy again, promotions, shopping cart, and checkout. Each component would reside within separate applications, hosted on different servers with properly isolated services.
Another approach would be to perform a separation of your customer base or separation of your order numbers or product catalog. Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z-Axis swimlane along customer, order number, or product ID lines. More beneficially, if we are interested in the fastest possible response times to customers, we may split along geographic boundaries with each pointing to the closest data center within that region. Besides contributing to faster customer response times, these implementations can also help ensure we are compliant with data sovereignty laws (GDPR for example) unique to different countries or even states within the US.
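An indiscriminate Z-Axis split like the modulus-of-id function described above might look like this sketch (the shard count and shard names are hypothetical):

```python
# Each shard is a separate fault isolation zone with its own database and services.
SHARDS = ["customers-db-0", "customers-db-1", "customers-db-2", "customers-db-3"]

def shard_for_customer(customer_id: int) -> str:
    """Z-Axis split: route each customer to one swim lane by modulus of their id."""
    return SHARDS[customer_id % len(SHARDS)]

print(shard_for_customer(42))    # routed to customers-db-2
print(shard_for_customer(1001))  # routed to customers-db-1
```

A failure in one shard affects only the roughly 25% of customers routed to it; the other swim lanes continue serving traffic unaffected.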
Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform. AKF has helped achieve high availability through fault isolation. Contact us to see how we can help you achieve the same fault tolerance.
AKF Partners helps companies create highly available, fault-isolated swim lane solutions. Send us a note - we’d love to help you!
June 28, 2019 | Posted By: Roger Andelin
The two most common types of technology due diligence requests we see at AKF are
- Product Technology Due Diligence
- Information Technology, or IT, Due Diligence.
Both types are very different from each other, but often get confused. This article will explain the differences between Product Technology and Information Technology – and why understanding the difference is critical to a company's success and profitability.
An IT department is typically led by a Chief Information Officer (CIO). The focus of the CIO is on information technology that supports the ongoing operations of the business. The CIO and the IT team’s key outcomes are typically around employee productivity and efficiency, applying technology to improve productivity and to lower costs. This includes technologies that run the financial and accounting systems, sales and operations systems, customer support systems, and the networks, servers, and storage underlying these systems. The CIO is also responsible for the technologies that are used in the office such as email, chat, video conferencing systems, printers, and employees’ desktop computers.
Conversely, Product Technology (or Digital Product Technology) is typically led by a Chief Technology Officer (CTO). The focus of the CTO is building a product or service for customers out of software and running that product or service on cloud systems or company-owned systems, although the latter is becoming less common. Put another way, CTOs build and run software as revenue generating products and services.
Whereas the CIO runs a cost center and is responsible for employee productivity, the CTO is responsible for revenue and cash-flow. Sales growth, time to market, costs of goods sold, and R&D spend are some of the factors included within key outcomes for the CTO.
For example, if you were running a newspaper business, your primary product is the news. However, you also must build applications to read news like mobile apps and web apps. It is the job of your CTO to build, maintain, and run these apps for your customers. Your CTO would be accountable for business metrics – such as the number of downloads, users, and revenue. If your CTO is distracted by CIO issues of running the day-to-day business of the office, they are being taken away from their work to build and implement the revenue generating products and services your company is trying to create.
Product Technologies and IT Technologies are very different. CIOs and CTOs have very different skills and competencies to manage these differences. For example, a CIO often possesses deep knowledge of back-office applications such as accounting systems, finance systems, and warehouse management systems. In many cases they likely rose up through the technology ranks writing, maintaining, and running those systems for other departments in the company. They are excellent at business analysis, collecting requirements from company users and translating those requirements into project plans. CIOs are often proficient at waterfall development methodologies often used to implement back-office applications.
The IT team is often largely staffed with people who know how to integrate and configure third-party products, with a small amount of custom development. The opposite is true for most product technology teams typically staffed with software engineers who are building new solutions and a smaller number of engineers integrating infrastructure components.
CTOs possess entirely different technology skills needed to build and maintain software as a service (SaaS) for the company's customers. CTOs possess the skills to architect software applications that are scalable as the company grows its customer base. The AKF Scale Cube is an invaluable reference tool for the CTO building a scalable software solution based on scalable microservices. CTOs must have the skills to run product teams, including user experience design, along with software development. CTOs are more likely to be proficient in Agile development methodologies, such as Scrum. CTOs are expected to know the product development lifecycle (PDLC), mitigate technical debt liability, and know how to build software release pipelines to support continuous integration and delivery (CI/CD).
What We See In The Wild
When AKF is called in to perform technology due diligence, we often find that Product and IT are combined! This has been especially true for older, established companies with traditional IT departments, where CIOs took on the responsibility of building and running customer-facing internet and mobile software products and services rather than creating a separate Product Development Team under a CTO. The results are often not positive because the technical, product, and process skills are very different between the two.
This mistake is not exclusive to older companies. We see startup CEOs making the same mistake, often under the rationale of reducing burn. However, when a startup CEO looks to the CTO to help set up new employees' desktop computers or to fix a problem with email, it is a huge distraction for the CTO, who should be focused on building and improving the company's revenue-generating products.
AKF recommends that CEOs not combine Product Development and IT departments. Understanding this distinction and why these two very different departments need to function separately is critically important. Our primary expertise at AKF is to help successful companies become more successful at delivering digital products. AKF focuses its expertise on helping product development teams succeed. We have developed intellectual property, including the AKF Scale Cube, used to evaluate product architecture and to guide successful product development teams.
We also help CEOs, CTOs, CIOs and IT departments who are looking to improve performance and deliver more business value by creating efficient product development processes and architectures. Give us a call, we can help!
June 27, 2019 | Posted By: Marty Abbott
Bulkhead Pattern Overview
Bulkheads in ships separate components or sections of a ship such that if one portion of a ship is breached, flooding can be contained to that section. Once contained, the ship can continue operations without risk of sinking. In this fashion, ship bulkheads perform a similar function to physical building firewalls, where the firewall is meant to contain a fire to a specific section of the building.
The microservice bulkhead pattern is analogous to the bulkhead on a ship. By separating both functionality and data, failures in some component of a solution do not propagate to other components. This is most commonly employed to help scale what might be otherwise monolithic datastores. The bulkhead is then a pattern for implementing the AKF principle of “swimlanes” or fault isolation.
Problems the Bulkhead Pattern Fixes
The bulkhead pattern helps to fix a number of different quality of service related issues.
- Propagation of Failure: Because solutions are contained and do not share resources (storage, synchronous service-to-service calls, etc), their associated failures are contained and do not propagate. When a service suffers a programmatic (software) or infrastructure failure, no other service is disrupted.
- Noisy Neighbors: If implemented properly, network, storage and compute segmentation ensure that abnormally large resource utilization by a service does not affect other services outside of the bulkhead (fault isolation zone).
- Unusual Demand: The bulkhead protects other resources from services experiencing unpredicted or unusual demand. Other resources do not suffer from TCP port saturation, resultant database deterioration, etc.
Principles to Apply
- Share Nearly Nothing: As much as possible, services that are fault isolated or placed within a bulkhead should not share databases, firewalls, storage, load balancers, etc. Budgetary constraints may limit the application of unique infrastructure to these services. The following diagram helps explain what should never be shared, and what may be shared for cost purposes. The same principles apply, to the extent that they can be managed, within IaaS or PaaS implementations.
- Avoid synchronous calls to other services: Service-to-service calls extend the failure domain of a bulkhead. Failures and slowness propagate across blocking synchronous calls and therefore violate the protection offered by a bulkhead.
Put another way, the dimension of a bulkhead or failure domain is the largest boundary across which no critical infrastructure is shared and no synchronous inter-service calls exist.
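At the application level, the bulkhead pattern is often approximated by giving each downstream dependency its own bounded pool of capacity, so a misbehaving dependency can exhaust only its own slots. A minimal sketch follows; the dependency names and pool sizes are illustrative, and a production implementation would add metrics and fallbacks:

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency; shed overflow instead of queueing."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        if not self._slots.acquire(blocking=False):
            # Reject immediately rather than block: protects the caller's threads.
            raise RuntimeError(f"bulkhead {self.name} full: shedding load")
        try:
            return fn(*args)
        finally:
            self._slots.release()

# Each dependency gets its own bulkhead; saturating one cannot starve the other.
payments = Bulkhead("payments", max_concurrent=10)
search = Bulkhead("search", max_concurrent=25)
```

The non-blocking acquire is the design choice that makes this a bulkhead: callers to an overwhelmed dependency fail fast inside their own compartment instead of queueing up and dragging down unrelated work.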
Anti-Patterns to Avoid
The following anti-patterns each rely on either synchronous service to service communication or sharing of data solutions. As such, they represent solutions that should not be present within a bulkhead.
When to use the Bulkhead Pattern
- Apply the bulkhead pattern whenever you want to scale a service independent of other services.
- Apply the bulkhead pattern to fault isolate components of varying risk or availability requirements.
- Apply the bulkhead pattern to isolate geographies for the purposes of increased speed/reduced latency such that distant solutions do not share or communicate and thereby slow response times.
AKF Partners has helped hundreds of companies implement new microservice architectures and migrate existing monolithic products to microservice architectures. Give us a call – we can help!