November 6, 2019 | Posted By: Marty Abbott
Backend for Frontend Overview
In the Backend for Frontend pattern, a service (“the backend”) serves as a termination point for a requesting interface (“the frontend”). The backend coordinates all subsequent calls within the solution architecture pursuant to any frontend request.
Backends within this context differ from a traditional API or monolithic gateway. Public APIs are monolithic user interface endpoints, terminating all traffic regardless of modality. For instance, a public API will typically service both browser and mobile traffic.
Contrasted with the “monolithic” public API, backends are segmented by modality, allowing them to serve what may be unique requirements by interface constituent.
Benefits of BFF
The BFF pattern comes with many potential benefits.
• May reduce client chattiness, as the backend serves as an aggregator and coordinator of requests on the client’s behalf.
• Smaller and less computationally complex than an all-encompassing monolithic API (segmentation by the Y axis vis-à-vis differing modality of requests)
• Faster time to market as front end teams can have dedicated back end teams serving their unique needs, vs. a combined monolithic team servicing the needs of competing constituent front end teams.
• May offer better results for each front end constituent, vs “in between” solutions that are optimized for neither constituent.
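As a rough illustration of the aggregation benefit, here is a minimal sketch of two backends serving the same screen for different modalities. The downstream functions (`fetch_profile`, `fetch_orders`), the payload shapes, and the shared `_gather` helper are all hypothetical and stand in for real network calls; they are not part of any specific implementation described here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical downstream services; real ones would be network calls.
def fetch_profile(user_id):
    return {"id": user_id, "name": "Pat", "bio": "Long biography text"}

def fetch_orders(user_id):
    return [{"order_id": 1, "total": 25.0}, {"order_id": 2, "total": 40.0}]

def _gather(user_id):
    # One frontend request fans out to both services concurrently,
    # so the client makes a single call instead of two.
    with ThreadPoolExecutor(max_workers=2) as pool:
        profile = pool.submit(fetch_profile, user_id)
        orders = pool.submit(fetch_orders, user_id)
        return profile.result(), orders.result()

def mobile_bff_home(user_id):
    # Mobile backend: trimmed payload for small screens.
    profile, orders = _gather(user_id)
    return {"name": profile["name"], "order_count": len(orders)}

def browser_bff_home(user_id):
    # Browser backend: richer payload for the same screen.
    profile, orders = _gather(user_id)
    return {"profile": profile, "orders": orders}
```

Each frontend makes one request and receives a response shaped for its own needs, while the coordination of downstream calls stays on the server side.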
Drawbacks to the BFF Pattern
There are two very obvious drawbacks of the BFF pattern implementation, dealing with fault isolation and the propagation of blast radius for any failure. A handful of additional drawbacks need to be remediated if BFFs are employed.
• Fan Out: If engineers and architects aren’t careful, there can be a high degree of fan-out between any BFF and associated services it calls. The failure of any of those services can bring down the entire BFF for the interface in question.
• Fuse: Each service, if it responds to multiple BFFs, can bring down all of those BFFs and, as a result, halt all operations. Each individual service then becomes a fuse anti-pattern.
• Duplication and Lower Reuse: There is a high probability that each BFF may implement similar capabilities with different teams, easily doubling (or more) the cost of development. The benefits of faster time to market may warrant this downside, but if it is a major concern some lightweight overhead associated with identifying duplicate efforts may help identify opportunities for shared libraries that get developed once.
• More Services and Components: As we segment backends for each constituent frontend, the number of deployable units increases. This becomes less of a concern if teams have good DevOps practices, great monitoring, lots of automation and good ownership around quality of releases.
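To make the fan-out drawback concrete, here is a small worked calculation. It assumes (our simplification, not a measured result) that the BFF calls every downstream service synchronously, that the BFF fails whenever any one of them fails, and that failures are independent. Under those assumptions the availabilities multiply.

```python
from math import prod

def composite_availability(availabilities):
    """Availability of a BFF that fails whenever any one of its
    synchronous dependencies fails, assuming independent failures."""
    return prod(availabilities)

# Five downstream services, each available 99.9% of the time (illustrative figures).
fan_out_of_five = composite_availability([0.999] * 5)
print(f"{fan_out_of_five:.4%}")
```

With five services at 99.9% each, the BFF lands at roughly 99.5%, which is about five times the expected downtime of any single service. The more services a BFF fans out to, the lower its ceiling on availability.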
When to Use a BFF
If requirements across mobile, browser and other modality constituents vary significantly and the time to market of a single proxy or API becomes problematic, BFFs are a good solution. Just be sure to limit the downsides using the following practices.
How to Solve BFF Associated Problems
• Solve Fan Out: Because we don’t want any single subsequent service that any BFF coordinates to take down the BFF entirely, we should implement fault isolation. Each downstream service ideally will have its own BFF termination point for each modality. While this increases the number of deployments again, we get significantly better fault isolation and higher availability. If coordination is necessary between downstream components, rethink the reasons for splitting each subsequent component per our guidance on when to split services.
• Remediate Fuses: This is virtually impossible to solve without dedicating a service to each modality/interface BFF. Dedication of deployed services will work if databases aren’t involved, but will not work if each subsequent service needs to share a database as the database now becomes a fuse. So, if a service need not use a database consider separate deployments for maximum availability. If databases are required, accept the fuse as technical debt that is partially remediated by eliminating fan-out above.
• Reuse: This may or may not be a problem with your implementation. But if you suspect that functionality will overlap significantly between modalities, it may make sense to ensure teams (perhaps scrum masters and product owners) are identifying “large” work efforts that must be shared. Having teams implement these larger needs in reusable libraries will lower development costs and decrease time to market for other capabilities.
• Service Multiplication: As mentioned above, ensuring that teams “own” their services through the service life and enabling easy release and interaction through automation solves nearly all the concerns of a larger number of deployable services.
AKF Partners has helped hundreds of companies with all of their architectural needs, including implementing microservices architectures. Give us a call – we can help.
October 24, 2019 | Posted By: Marty Abbott
Many of our clients struggle with ensuring that the solutions they create meet the needs of both their business and their customers. The exact symptoms of the above failure vary between clients, and are best explained using anonymous quotes:
We make an initial release of a solution and we never return to it – it feels like it just doesn’t get ‘done’
We rarely measure the success or failure of the solutions we create. Instead we look at business performance – we either hit our numbers or we didn’t. If we don’t hit the numbers, it’s the fault of engineering and product. If we hit our numbers, sales gets credit
I don’t get it – we get a lot of velocity out of our engineering team, but we always have availability problems
A likely cause
The above quotes all point to very different symptoms of a common problem. One discusses level of investment in the product and bringing it to maturity, one to a lack of value measurement and allocation of recognition, and the last to the availability and quality of the solutions that the team creates. The most common problem for each of these symptoms in our experience is that the Agile “Definition of Done” is undefined, implicitly defined or most commonly incompletely defined.
The purpose of “Done”
“Done” is comprised of a number of criteria to be met before any element of work (e.g. a story) is considered “complete” and can be counted for the purposes of velocity.
The benefits of “Done”
• Defining done removes ambiguity and uncertainty, and forces developers, product owners, and scrum team members to align on standards for completion.
• Helps foster good and efficient discussion around paths, tradeoffs, and completion during standups.
• If employed consistently between teams, helps to align and make useful velocity related metrics.
• Limits rework, or post-sprint work for important elements of the solution in question.
• When combined with velocity, incorporates a notion of “earned value” – incenting teams to completion of a solution rather than the typical “effort expended” accounted for at the actual expense of that effort. Put another way, teams don’t get credit for work until something has been completed.
A typical generic definition of “Done”
A solution is done when it:
• Is implemented to standards
• Has been code reviewed consistent with standards
• Has automated unit tests created to the unit coverage standard (70+%)
• Has passed automated integration testing and all other continuous integration checks
• Has all necessary support and end user documentation complete
• Has been reviewed by the product owner
Necessary but Insufficient Definition
The above definition, while necessary, is incomplete and therefore insufficient for the needs of a company. The definition fails to account for:
• Business value creation (is it really “done” if we don’t achieve the desired results?)
• Non-functional requirements necessary to produce value such as availability, scalability, response time, cost of maintenance (cost of operations or goods sold), etc.
As a result, because metrics tied against value creation are not available, attribution for credit of results becomes a subjective process.
Towards a Better Definition
Given the above, a better definition of done should include:
• Non-functional requirements necessary to achieve value creation
• Evaluation that the solution achieves some desired result that is ideally also incorporated into the stories themselves (some measurement or outcome the effort is to achieve)
Modifying the prior definition, we might now have:
A solution is done when it:
• Is implemented to standards
• Has been code reviewed consistent with standards
• Has automated unit tests created to the unit coverage standard (70+%)
• Has passed automated integration testing and all other continuous integration checks
• Has delivered all necessary support and end user documentation
• Has been reviewed by the product owner
• Meets the response time objective to end users at peak traffic
• Meets the availability target for one week and has passed an availability review
• Meets the cost of goods sold target after one week for infrastructure or IaaS costs
• Meets all other company NFRs (above are examples)
• Shows progress towards or achieves the business metrics it was meant to achieve (may be none for a partial release, or full metrics for the completion of an epic)
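One way a team might keep the expanded definition explicit is to record it as a checklist and evaluate each story against it. The sketch below is only an illustration: the criterion keys paraphrase the list above, and the function names are hypothetical.

```python
# Criterion keys paraphrase the expanded definition above; names are illustrative.
DONE_CRITERIA = [
    "implemented_to_standards",
    "code_reviewed",
    "unit_coverage_met",
    "ci_checks_passed",
    "documentation_delivered",
    "product_owner_reviewed",
    "response_time_met_at_peak",
    "availability_met_for_one_week",
    "cogs_target_met_after_one_week",
    "other_nfrs_met",
    "business_metric_progress",
]

def unmet_criteria(results):
    """Return criteria that are missing or False, in checklist order."""
    return [c for c in DONE_CRITERIA if not results.get(c, False)]

def is_done(results):
    # A story counts as "done" only when every criterion is satisfied.
    return not unmet_criteria(results)
```

A story that has shipped but not yet completed its week in production would simply show the availability and cost criteria as unmet until the evaluation window closes.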
Who is Responsible for Evaluating “Done”
This is an agile process, so the team is responsible for their own measurement. In most teams this means the PO and Scrum Master. As a business we also “trust and verify”, so leaders should double-check that value metrics have indeed been met in normal operations reviews.
The Cons of the “Right Definition of Done”
The largest impact is to the time one can realize the earned value component of velocity. Here we’ve been careful to say that the bound is one week after delivery such that velocity is just pushed out by a week to allow for evaluation in production. In addition, there is a bit more record keeping (ostensibly for a scrum master and product owner to evaluate) but the cost of that is incredibly low relative to the alignment to business objectives and customer needs.
Keeping the Old Velocity Metric
There is some value in understanding what gets completed as well as what is truly “done”. If this is the case, just track both velocities. Call one “release velocity” and the other “done velocity” or “value velocity”. The overhead is not that high – scrum masters should easily be able to do this. Now you’ll have metrics to help you understand the gap between what you release first time versus when something finally creates value. This gap is as useful for problem identification as “find-fix” charts in evaluating completion of quality assurance checks.
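Tracking both velocities takes very little bookkeeping. Here is a minimal sketch, assuming each story carries its points plus two flags (the `Story` fields are our own illustrative names):

```python
from dataclasses import dataclass

@dataclass
class Story:
    points: int
    released: bool  # shipped to production
    done: bool      # also met the full definition of done

def velocities(stories):
    """Return (release velocity, done velocity, gap between them)."""
    release_velocity = sum(s.points for s in stories if s.released)
    done_velocity = sum(s.points for s in stories if s.released and s.done)
    return release_velocity, done_velocity, release_velocity - done_velocity

sprint = [Story(5, True, True), Story(3, True, False), Story(8, False, False)]
print(velocities(sprint))  # (8, 5, 3)
```

The third number is the gap between what was released and what has so far created value.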
The Biggest Reason for The Right Definition of Done
Hopefully the answer to this is somewhat obvious: By changing the definition of done, we align ourselves to both our customer and business needs. It helps engineers focus on customer outcomes – rather than just how something should “work”. Engineers too often focus on a problem from their perspective forward rather than from the customer needs backwards.
Forcing architects to think back from the customer rather than forward from the engineer helps solve problems associated with response time and availability (some of the NFRs above).
September 26, 2019 | Posted By: Marty Abbott
The Problem – Too Much Planning, Too Little Execution
How many of you spend a significant portion of your year planning for the next year, two years, or five years of activities? How often are these plans useful beyond the first three to four months of execution?
We have many large clients who will begin one-, two-, or (gasp!) five-year plans in July or August of the current year. They spend a significant amount of effort creating these plans over a five-to-six-month span of time. The plans are often very specific as to what they will do: what projects they will deliver, what products they will create, how many people they will hire, what training their teams will undertake, etc. The plan is typically well followed in months 1 and 2 and starts to degrade significantly in month 3. By month 6, just before they start the next annual planning cycle, the original plan is at best 50% accurate; the original projects have been replaced, new market intelligence has informed different product solutions, new skills and different teammates are needed, etc.
Hurricanes always have an associated cone of uncertainty. The current position of the hurricane and current direction and velocity are well known. But several factors may cause the hurricane to act differently an hour or a day from now than it is behaving at exactly this moment. The same is true with businesses. We know what we need to do today to maintain our position or gain market share, but those activities may change in priority and number in the next handful of months.
So why do we spend so much time on solutions and approaches when at best 25% of the plan we produce is accurate? We don’t have to waste time as we do today; there is a better way.
Financial vs Operational Plans
First, let’s acknowledge that there is a difference between a financial plan (how much we will spend as a company and what we expect to make as a return on that spend) and an operational plan. The board of directors for your company has a fiduciary responsibility to exercise, in non-expert legal terms:
- A duty of loyalty – the director must put the interests of the institution and its shareholders before his or her own.
- A duty of care – the director must behave prudently, diligently and with skill.
- A duty of obedience – the director must ensure consistency with the purpose of the company – and in a for-profit company, this means ensuring profitability.
Any board of directors, to ensure they are consistent with the law, will require a financial plan. At the very least, they need to govern the spend of the company relative to its revenues to ensure profitability, and ideally over time, an increase in profitability. But that does not mean they need to go into great detail regarding the exact path and actions to achieve the financial plan. We all likely agree that we also have a duty of loyalty, care, and obedience to ourselves and our families – but how many of us go beyond creating an annual budget (financial plan) for any given year?
One of the best known and most successful directors and investors of all time, Warren Buffett, has what amounts to (in another author’s terms) a list of “10 Commandments” for boards and directors. A quick scan of these makes it clear that, in Buffett’s view, a board’s focus should be on the performance of the CEO and the company itself – not the detailed operational plan to achieve a financial plan. The board does arguably need to ensure that a strategy exists and is viable – but a strategy need not be a list of tasks for every subordinate organization for an entire year. In fact, given the arguments above, such a task list (or deep operational plan) won’t be followed past a handful of months anyway.
The Fix: Reduce Planning, Increase Execution
If the problem is too much time wasted creating plans that are good for only a short period of time, the fix should be obvious. For this, we offer the AKF 5-95 Rule: spend 5 percent of your time planning and 95% of your time executing. This stands in stark contrast to the “Soviet-esque” way in which many companies operate with executives spending as much as 25 percent of a year involved in financial and execution plans.
- Decrease the horizon (focused endpoint) of planning and decrease the specificity of plans. Take a portion of the five percent of your total time and create a good financial plan of what you would like to achieve. The remainder should be used to iteratively identify the short-term paths to achieve that plan using windows no greater than 3 months. Anything beyond 3 months has a high degree of waste.
- Adopt development methodologies that maximize execution value. Adopt Agile development methodologies meant to embrace low levels of specificity and rely on discovery to identify the “right solution” to maximize market adoption.
The best way to get the most from the AKF 5-95 rule is to implement OKRs – (O)bjectives and (K)ey (R)esults – as a business, along with the bowling alley methodology of product focus and Agile product development practices.
September 16, 2019 | Posted By: Marty Abbott
Two of the most common statements we hear from our clients are:
Business: “Our product and engineering teams lack the agility to quickly pivot to the needs of the business”.
Product and Engineering: “Our business lacks the focus and discipline to complete any initiative. We are subject to the ‘Bright Shiny Object’ (BSO) or ‘Squirrel!’ phenomenon”.
These two teams seem to be at an impasse in perspective requiring a change by one team or the other for the company to be successful.
Companies need both focus and agility to be successful. While these two concepts may appear to be in conflict, a team need only three things to break the apparent deadlock:
- Shared Context.
- Shared agreement as to the meaning of some key terms.
- Three process approaches across product, the business, and engineering.
First, let’s discuss a common context within which successful businesses in competitive environments operate. Second, we’ll define a common set of terms that should be agreed upon by both the business and engineering. Finally, we’ll dig into the approaches necessary to be successful.
Successful businesses operating within interesting industries attract competition. Competitors seek innovative approaches to disrupt each other and gain market share within the industry. Time to market (TTM) in such an environment is critical, as the company that finds an approach (feature, product, etc.) to shift or gain market share has a compelling advantage for some period. As such, any business in a growth industry must be able to move and pivot quickly (be agile) within its product development initiatives. Put another way, businesses that can afford to stick to a dedicated plan likely are not in a competitive or growing segment, probably don’t have competition, and aren’t likely attractive to investors or employees.
The focus that matters within business is a focus on outcomes. Why focus on outcomes instead of the path to achieve them? Focusing on a path implies a static path, and when is the last time you saw a static path be successful? (Hint: most of us have never seen a static path be successful). Obviously, sometimes outcomes need to change, and we need a process by which we change desired outcomes. But outcomes should change much less frequently than path.
Agility enables changing directions (paths) to achieve focused outcomes. ‘Nuff said.
Commonly known as (O)bjectives and (K)ey (R)esults, or in AKF parlance Outcomes and Key Results, OKRs are the primary mechanism of focus while allowing for some level of agility in changing outcomes for business needs. Consider the O (objectives or outcomes) as the thing upon which a company is focused, and the Key Results as the activities to achieve those outcomes. KRs should change more frequently than the Os as companies attempt to define better activities to achieve the desired outcomes. An objective/outcome could be “Improve Add-To-Cart/Search ratio by 10%”.
Each objective/outcome should have 3 to 5 supporting activities. For the add-to-cart example above, the activities may be to implement personalization to drive a 3% improvement, add re-targeting for a net 4% improvement, and improve descriptive meta-tags in search for a 3% improvement.
OKRs help enforce transparency across the organization and help create the causal roadmap to success. Subordinate organizations understand how their initiatives nest into the high-level company objectives by following the OKR “tree” from leaf to root. By adhering to a strict and small number of high-level objectives, the company creates focus. When tradeoffs must happen, activities not aligned with high-level objectives get deprioritized or deferred.
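The add-to-cart example can be written down as a small tree. This is a sketch only: the `Node` class and helper names are hypothetical, and the percentages are the ones from the example above.

```python
class Node:
    """One objective or key result in an OKR tree (illustrative structure)."""
    def __init__(self, name, pct):
        self.name = name
        self.pct = pct          # target (objective) or expected gain (key result)
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

def path_to_root(node):
    # Follow the tree from leaf to root to see what an initiative supports.
    path = []
    while node is not None:
        path.append(node.name)
        node = node.parent
    return path

def planned_gain(objective):
    # Sum of the expected gains of the supporting key results.
    return sum(child.pct for child in objective.children)

objective = Node("Improve Add-To-Cart/Search ratio by 10%", 10.0)
personalization = objective.add(Node("Implement personalization", 3.0))
objective.add(Node("Add re-targeting", 4.0))
objective.add(Node("Improve descriptive meta-tags in search", 3.0))
print(planned_gain(objective) >= objective.pct)  # True
```

If the planned gains of the key results do not add up to the objective’s target, that is an early signal that the activities need to change while the outcome stays fixed.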
Geoffrey Moore outlines an approach for product organizations to stay focused in their product development efforts. When combined with the notion of a Minimum Viable Product the approach is to stay focused on a single product, initially small, focused on the needs of the pioneers within the technology adoption lifecycle (TALC) for a single target market or industry.
The single product for a single industry (P1T1) or need is the headpin of the bowling alley. The company maintains focus on this until such time as they gain significant adoption within the TALC – ideally a beachhead in the early majority of the TALC.
Only after significant adoption through the TALC (above) does the company then introduce the existing product to target market 2 (P1T2) and begin work on product 2 (or significant extension of product 1) in target market 1 (P2T1).
While OKRs and the Bowling Alley help create focus, Agile product methodologies help product and engineering teams maintain flexibility and agility in development. Epics and stories map to key results within the OKR framework. Short duration development cycles help limit the loss in effort associated with changing key results and help to provide feedback as to whether the current path is likely to meet the objectives and key results within OKRs. Backlogs visible to any Agile team are deep enough to allow for grooming and sizing, but shallow enough such that churn and the resulting morale impact do not jeopardize the velocity of development teams.
Putting it all together:
There is no discrepancy between agility and focus if you:
- Agree to shared definitions of both agility and focus per above
- Jointly agree that both agility and focus are necessary
- Implement OKRs to aid with both agility and focus
- Employ an Agile methodology for product and product development
- Use the TALC in your product management efforts and to help enforce focus on winning markets
September 10, 2019 | Posted By: Greg Fennewald
Let’s briefly review the AKF Scale Cube (read more here), a model describing three methods for scaling technology platforms.
The three axes of the cube are:
- X - horizontal duplication
- Y - segmentation by service or function
- Z - segmentation by customer or geography
Of the three axes by which you can scale your systems, the X axis (horizontal replication) is often the first used at the web and application layers. Load balancing workloads across a pool of identical web and app servers is a standard approach to scale and also improves availability by eliminating SPOFs at the web and app layer. The scale and availability benefits of horizontally duplicating VMs and containers far outweigh the cost. The cost vs. benefit analysis gets more complex as we look at the persistence tier – DBs and storage. DBs are typically one of the most expensive portions of the technology stack, especially if licensed RDBMS are used. Storage costs have been dropping but can still be considerable. Does X axis scaling make sense at the persistence tier? In many cases, yes, particularly if there is a defined time period the X axis split is expected to serve.
As compared to Y and Z axis scaling, X axis scaling features some advantages:
- Relatively easy to do
- Fast to implement
- Scales transactional systems well
X axis scaling also has a disadvantage compared to the other axes – cost. Duplicating systems and storage gets expensive quickly. This reinforces the notion of applying X axis scaling in the persistence tier for a defined period before shifting to Y and/or Z axis scaling.
What situations lend themselves well to X axis scaling in the persistence tier?
- High read to write ratios – employ a write master and multiple read slaves
- Reporting and BI – run reporting and BI workloads against a replica DB
- Search – deploy a caching layer backed by a replica DB
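For the high read-to-write case, the application needs some way to send writes to the write master and spread reads across the replicas. Here is a deliberately naive sketch; the class name, the round-robin policy, and the SQL prefix check are all our own assumptions, and connections are represented by plain strings.

```python
import itertools

class ReplicatedDatabase:
    """Routes writes to a single primary (the write master) and
    spreads reads across replicas in round-robin order."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, sql):
        # Naive classification: only plain SELECT statements go to a replica.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary

db = ReplicatedDatabase("primary", ["replica-1", "replica-2"])
print(db.connection_for("SELECT * FROM orders"))        # replica-1
print(db.connection_for("INSERT INTO orders VALUES (1)"))  # primary
```

Replicas typically lag the primary slightly, so reads that must see a just-completed write may still need to go to the primary.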
Consider a situation where a monolithic codebase was written to get the minimum viable product out the door – a sound choice in the early days of a startup. Success-driven growth is now straining the system. Multiple web and app servers are in use, but they all call a single database. Reporting is also run against the same DB. There is increasing interest in productizing reporting, something sure to increase DB workload. Company leadership realizes they need to architect for scale quickly and have chosen a services-oriented architecture. Engineering will refactor the monolith into independent services, communicating asynchronously – a Y axis split. This choice will improve scalability and performance, all good things, but it will take time. 12 months is the planning figure. The existing persistence tier will not survive that long.
In this situation, applying X axis scaling to the persistence tier can buy time to complete the Y axis refactoring. The additional cost of the replica DB and storage are for a defined period and the cost rate will decline as the Y axis refactoring is implemented.
Interested in learning more? Contact us, we’ve walked a mile in your shoes.
September 6, 2019 | Posted By: Pete Ferguson
In many of our technical due diligence engagements, it is common to find that companies are building tools with considerable development effort (and ongoing maintenance) for something that is not part of their core strength and thus does not provide a competitive advantage. What criteria does your organization use in deciding when to build vs. buy?
If you perform a simple web search for “build vs. buy” you will find hundreds of articles, process flows, and decision trees on when to build and when to buy. Many of these are cost-centric decisions including discounted cash flows for maintenance of internal development and others are focused on strategy. Some of the articles blend the two.
We have many examples from our customers developing load balancing software, building their own databases, etc. In nearly every case, a significant percentage of the engineering team (and engineering cost) go into a solution that:
- Does not offer long term competitive differentiation
- Costs more than purchasing an existing product
- Steals focus away from the engineering team
- Is not aligned with the skills or business level outcomes of the team
If You Can’t Beat Them - Join Them
(or buy, rent, or license from them)
Here is a simple set of questions that we often ask our customers to help them with the build v. buy decision:
1. DOES THIS “THING” (PRODUCT / ARCHITECTURAL COMPONENT / FUNCTION) CREATE STRATEGIC DIFFERENTIATION IN OUR BUSINESS
Shiny object distraction is a very real thing we observe regularly. Companies start - innocently enough - building a custom tool in a pinch to get them by, but never go back and reassess the decision. Over time the solution snowballs and consumes more and more resources that should be focused on innovating strategic differentiation.
- We have yet to hear a tech exec say “we just have too many developers, we aren’t sure what to do with them.”
- More often than not “resource constraints” is mentioned within the first few hours of our engagements.
- If building instead of buying is going to distract from focusing efforts on the next “big thing” – then 99% of the time you should just stop here and attempt to find a packaged product, open-source solution, or outsourcing vendor to build what you need.
If, after reviewing these points, the answer is “Yes, it will provide a strategic differentiation,” then proceed to question 2.
2. ARE WE THE BEST COMPANY TO BUILD THIS “THING”?
This question helps inform whether you can effectively build it and achieve the value you need. This is a “core v. context” question; it asks both whether your business model supports building the item in question and also if you have the appropriate skills to build it better than anyone else.
For instance, if you are a social networking site, you probably don’t have any business building relational databases for your own use. Go to question number (3) if you can answer “Yes” to this question and stop here and find an outside solution if the answer is “No”.
And please, don’t fool yourself – if you answer “Yes” because you believe you have the smartest people in the world (and you may), do you really need to dilute their efforts by focusing on more than just the things that will guarantee your success?
3. ARE THERE FEW OR NO COMPETING PRODUCTS TO THIS “THING” THAT YOU WANT TO CREATE?
We know the question is awkwardly worded – but the intent is to be able to exit these four questions by answering “yes” everywhere in order to get to a “build” decision.
- If there are many providers of the “thing” to be created, it is a potential indication that the space might become a commodity.
- Commodity products differ little in feature sets over time and ultimately compete on price which in turn also lowers over time.
- A “build” decision today will look bad tomorrow as features converge and pricing declines.
If you answer “Yes” (i.e. “Yes, there are few or no competing products”), proceed to question (4).
4. CAN WE BUILD THIS “THING” COST EFFECTIVELY?
- Is it cheaper to build than buy when considering the total lifecycle (implementation through end-of-life) of the “thing” in question? Many companies use cost as a justification, but all too often they miss the key points of how much it costs to maintain a proprietary “thing”, “widget”, “function”, etc.
- If your business REALLY grows and is extremely successful, do you want to be continuing to support internally-developed monitoring and logging solutions, mobile architecture, payments, etc. through the life of your product?
Don’t fool yourself into answering this affirmatively just because you want to work on something “neat.” Your job is to create shareholder value – not work on “neat things” – unless your “neat thing” creates shareholder value.
There are many more complex questions that can be asked and may justify the building rather than purchasing of your “thing,” but we feel these four questions are sufficient for most cases.
A “build” decision is indicated when the answers to all 4 questions are “Yes.”
We suggest seriously considering buying or outsourcing (with appropriate contractual protection when intellectual property is a concern) anytime you answer “No” to any question above.
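The four questions form a simple sequential gate, which can be sketched as follows. The question keys and the function name are our own shorthand for the questions above.

```python
# Shorthand keys for the four questions, in the order they are asked.
QUESTIONS = [
    "creates_strategic_differentiation",
    "best_company_to_build",
    "few_or_no_competing_products",
    "cost_effective_to_build",
]

def build_or_buy(answers):
    """Walk the questions in order; the first "No" (or missing answer)
    ends the walk with a buy decision and names the failing question."""
    for question in QUESTIONS:
        if not answers.get(question, False):
            return "buy", question
    return "build", None

print(build_or_buy({q: True for q in QUESTIONS}))  # ('build', None)
```

Treating a missing answer as “No” keeps the default on the buy side, consistent with the guidance that a build decision requires a “Yes” to all four questions.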
While startups and small companies roll their own tools early on to get product out the door, as they grow, the timeline of planning (and related costs) needs to increase from the next sprint to a longer-term annual and multi-year strategy. That, plus growth, tips the scale to buy instead of build. The more internal products an organization produces and supports, the more tech debt it accrues, which distracts medium-to-large organizations from competing against the next startup.
While building custom tools and products seems to make sense in the immediate term, the long-term strategy and desired outcome of your organization need to be fully weighted in the decision process. Distraction from focus is the number one harm we have seen many times with our clients as they fall behind the competition and burn sprint cycles on maintaining products that don’t move the needle with their customers. The crippling cost of distractions is what causes successful companies to lose their competitive advantage and slip into oblivion.
Like the ugly couch your auntie gave you for your first apartment, it can often be difficult to assess what makes sense without an outside opinion. Contact us, we can help!
September 2, 2019 | Posted By: Greg Fennewald
As a company matures from a startup to a growing business, there are a number of measurables that become table stakes – basic tools for managing a business. These measurables include financial reporting statements, departmental budgets, KPIs, and OKRs. Another key measurable is the availability of your product or service and this measurable should be owned by the technology team.
When we ask clients about availability goals or SLAs, some do not have them documented and say something along the lines of “we want our service to always be available.” While a nice sentiment, unblemished availability is virtually impossible to achieve and prohibitively expensive to pursue. Availability goals must be relevant to the shared business outcomes of the company.
If you are not measuring availability, start. If nothing else, the data will inform what your architecture and process can do today, providing a starting point if the business chooses to pursue availability improvements.
Some clients who do have an availability measurable use a percentage of clock time – 99.95% for example. This is certainly better than no measurable at all, but still leaves a lot to be desired.
Reasons why clock time is not the best measure for availability:
- Units of time are not equal in terms of business impact – a disruption during the busiest part of the day is worse than an issue during a slow period. Most companies already understand this intuitively: they schedule maintenance windows for late at night or early in the morning, when the impact of disruption is smaller.
- The business communicates in business terms (revenue, cost, margin, return on investment) and these terms are measured in dollars, not clock time.
- Using the uptime figure from a server or other infrastructure component as an availability measure is inaccurate because it does not capture software bugs or other issues rendering your service inoperative despite the server uptime status.
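To make the first point concrete, here is a minimal sketch comparing two one-hour outages. The hourly order volumes are invented for illustration, and the function names are hypothetical. Both outages produce the same clock-time availability, but the share of orders that could still be placed differs sharply.

```python
# Hypothetical hourly order volumes for one day (index = hour of day).
# These numbers are invented for illustration only.
HOURLY_ORDERS = [20, 10, 5, 5, 5, 10, 40, 80, 120, 150, 180, 200,
                 220, 200, 180, 160, 150, 170, 210, 240, 200, 120, 60, 30]


def clock_availability(outage_hours):
    """Fraction of the day's hours in which the service was up."""
    return 1 - len(outage_hours) / len(HOURLY_ORDERS)


def transaction_availability(outage_hours):
    """Fraction of the day's expected orders that could still be placed."""
    lost = sum(HOURLY_ORDERS[h] for h in outage_hours)
    return 1 - lost / sum(HOURLY_ORDERS)


off_peak = [3]   # one-hour outage at 03:00
peak = [19]      # one-hour outage at 19:00

# Same clock-time availability for both outages...
print(clock_availability(off_peak), clock_availability(peak))
# ...but very different business impact.
print(transaction_availability(off_peak), transaction_availability(peak))
```

In this toy data set the peak-hour outage costs far more orders than the off-peak one, even though a clock-time report would show the two days as identical.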
Now that we’ve established that availability should be measured and that clock time is not the best unit of measure, what is a better choice? Transactional metrics aligned to the desired business outcome are the better choice.
- Rates – log transactional rates such as logins, add to cart, registration, downloads, orders, etc. Apply Statistical Process Control or other analysis methods to establish thresholds indicating an unusual deviation in the transaction rate.
- Ratios – the proportion of undesired or failed outcomes such as failed logins, abandoned shopping carts, and HTTP 400s can be useful for measuring the quality of service. Analysis of such ratios will establish unusual deviation levels.
- Patterns – transaction patterns can identify expected activity, such as order rates increasing when an item is first available for sale or download rates increasing in response to a viral social media video. The absence of an expected pattern change can signal an availability issue with your product or service.
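As a rough sketch of the rate and ratio ideas above, the following applies a simplified control-limit rule (mean plus or minus three sample standard deviations) to a transaction rate. This is a simplification chosen for illustration, not a full Statistical Process Control implementation, and the login counts and function names are assumptions rather than anything from a real system.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits as mean +/- sigmas * sample std dev."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(observations, baseline, sigmas=3.0):
    """Return the (index, value) pairs that fall outside the control limits."""
    lower, upper = control_limits(baseline, sigmas)
    return [(i, x) for i, x in enumerate(observations) if x < lower or x > upper]


def failure_ratio(failed, total):
    """Proportion of failed outcomes; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


# Invented logins-per-minute samples from a normal period.
baseline_logins = [98, 102, 101, 97, 100, 103, 99, 100, 101, 99]
# New samples: the drop to 40 is the kind of deviation worth alerting on.
recent_logins = [100, 98, 40, 101]

print(out_of_control(recent_logins, baseline_logins))  # → [(2, 40)]
```

The same `out_of_control` check can be run over a series of `failure_ratio` values (failed logins divided by login attempts, for example) to flag unusual deviations in quality of service.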
Alignment with Desired Outcomes
What are the goals of your business? What is your value proposition? Choose metrics that comprehensively measure the availability of your product or service. Examples include the ability of a customer to buy a product from your website (login, search, add to cart, and check out), the proportion of file downloads successfully completed in less than 4 seconds, and the success rate of posting a message to a social media platform along with the ability of others to view it. Measuring availability with metrics aligned with the desired outcomes keeps the big picture at the forefront and helps business colleagues understand how the technology team contributes to value creation.
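One of the outcome-aligned metrics mentioned above, the proportion of downloads successfully completed in less than 4 seconds, could be computed along these lines. The `Download` record and the sample values are hypothetical; a real implementation would read them from your own logs.

```python
from dataclasses import dataclass


@dataclass
class Download:
    succeeded: bool   # did the transfer complete?
    seconds: float    # elapsed time for the attempt


def fast_success_ratio(downloads, threshold_seconds=4.0):
    """Share of all attempts that succeeded in under threshold_seconds."""
    if not downloads:
        return None  # no traffic: the ratio is undefined, not 100%
    good = sum(1 for d in downloads if d.succeeded and d.seconds < threshold_seconds)
    return good / len(downloads)


# Invented sample: two fast successes, one slow success, one failure.
sample = [Download(True, 1.2), Download(True, 3.9), Download(True, 6.5), Download(False, 0.4)]
print(fast_success_ratio(sample))  # → 0.5
```

Returning `None` for an empty window is a deliberate choice in this sketch: a period with no attempts at all may itself signal an availability problem and should not be reported as perfect.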
Not measuring availability is bad. Measuring it in clock time is better, but still leaves something to be desired. Measuring availability with transactional metrics tied to the desired business outcome is best. Don’t settle for better when you can be best.
Interested in learning more? Struggling with analyzing data? Unsure of how to apply architectural principles to achieve higher availability? Contact us, we’ve been in your shoes.
(Image Credit: Sarah Pflug from Burst)
August 22, 2019 | Posted By: AKF
For over a decade, AKF has been on a number of engagements where we have seen technology organizations spend a large portion of engineering effort wrangling with technical debt. Technical due diligence, as laid out in Technology Due Diligence Checklist, should help identify the amount of technical debt and quantify the amount of engineering resources dedicated to servicing the debt.
What is Technical Debt?
Technical debt is the difference between doing something the desired or best way and doing something quickly. Technical debt is a conscious choice, made knowingly and by commission rather than omission, to take a shortcut in the technology arena – the delta between the desired or intended way and the quicker way. The shortcut is usually taken for time to market reasons and is a sound business decision within reason.
Is Technical Debt Bad?
Technical debt is analogous in many ways to financial debt – a complete lack of it probably means missed business opportunities while an excess means disaster around the corner. Just like financial debt, technical debt is not necessarily bad.
Accruing some debt allows the technology organization to release a minimal viable product to customers, just as some financial debt allows a company to start new investments earlier than capital growth would allow.
Too little debt can result in a product late to market in a competitive environment and too much debt can choke business innovation and cause availability and scalability issues later in life. Tech debt becomes bad when the engineering organization can no longer service that debt.
Technical Debt Maintenance
Similar to financial debt, technical debt must be serviced, and it is serviced by the efforts of the engineering team. A failure to service technical debt will result in high-interest payments as seen by slowing time to market for new product initiatives post-investment.
Our experience indicates that most companies should expect to spend 12% to 25% of engineering effort on servicing technical debt. Whether that resource allocation keeps the debt static, reduces it, or allows it to grow depends upon the amount of technical debt and also influences the level of spend. It is easy to see how a company delinquent in servicing their technical debt will have to increase the resource allocation to deal with it, reducing resources for product innovation and market responsiveness.
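As a back-of-the-envelope illustration of the 12% to 25% guideline, the sketch below converts it into engineer-days per sprint. The team size and sprint length are invented, and the function name is hypothetical.

```python
def debt_servicing_budget(engineers, days_per_sprint, low=0.12, high=0.25):
    """Return (min_days, max_days) of sprint capacity reserved for tech debt."""
    capacity = engineers * days_per_sprint
    return capacity * low, capacity * high


# Hypothetical team: 20 engineers on a 10-working-day sprint.
low_days, high_days = debt_servicing_budget(engineers=20, days_per_sprint=10)
print(f"{low_days:.0f} to {high_days:.0f} engineer-days per sprint")
```

For this hypothetical team, that works out to roughly 24 to 50 of the 200 available engineer-days in each sprint.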
Technical Debt Takeaways:
- Choosing to take on tech debt by delaying attention to address technical issues allows greater resources to be focused on higher priority endeavors
- The absence of technical debt probably means missed business opportunities – use technical debt as a tool to best meet the needs of the business
- Excessive technical debt will cause availability and scalability issues, and can choke business innovation (too much engineering time dealing with debt rather than focusing on the product)
- The interest of tech debt is the difficulty or increased level of effort in modifying something in subsequent releases
- The principal of technical debt is the difference between desired and actual quality or features in a service or product
- Technology resources to continually service technical debt should be clearly planned in product road maps – 12% to 25% is suggested
AKF’s Technical Due Diligence can discover a team’s ability to quantify the amount of debt accrued and the engineering effort to service the debt. Contact us, we can help!
August 21, 2019 | Posted By: Bill Armelin
At AKF Partners, we believe in learning aggressively, not just from your successes, but also from your failures. One common failure we see is the service-disrupting incident. These are the events that either make your systems unavailable or significantly degrade performance for your customers. They result in lost revenue, poor customer satisfaction and hours of lost sleep. While there are many things we can do to reduce the probability of an incident occurring or the impact if it does happen, we know that all systems fail.
We like to say, “An incident is a terrible thing to waste.” The damage is already done. Now, we need to learn as much about the causes of the incident to prevent the same failures from happening again. A common process for determining the causes of failure and preventing them from reoccurring is the postmortem. In the Army, it is called an After-Action Review. In many companies it is called a Root Cause Analysis. It doesn’t matter what you call it, as long as you do it.
We actually avoid using the term Root Cause Analysis. Many of our clients that use the term focus too much on finding that one “root cause” of the issue. There will never be a single cause to an incident. There will always be a chain of problems with a trigger or proximate event. This is the one event that causes the system to finally topple over. We need a process that digs into the entire chain of events inclusive of the trigger. This is where the postmortem comes in. It is a cross-functional brainstorming meeting that not only identifies the root causes of a problem, but also helps in identifying issues with process and training.
Postmortem Process – TIA
The purpose of a good postmortem is to find all of the contributing events and problems that caused an incident. We use a simple three-step process called TIA. TIA stands for Timeline, Issues, and Actions.
First, we create a timeline of events leading up to the issue, as well as the timeline of all the actions taken to restore service. There are multiple ways to collect the timeline of events. Some companies have a scribe that records events during the incident process. Increasingly, we are seeing companies use chat tools like Slack to record events related to restoration. The timestamp in Slack for the message is a good place to extract the timeline. Don’t start your timeline at the beginning of the incident. Start it with the activities prior to the incident that caused the triggering event (e.g. a code deployment). During the postmortem meeting, augment the timeline with additional details.
The second part of TIA is Issues. This is where we walk through the timeline and identify issues. We want to focus on people, process, and technology. We want to capture all of the things that either allowed the incident to happen (e.g. lack of monitoring), directly triggered it (e.g. a code push), or increased the time to restore the system to a stable state (e.g. couldn’t get the right people on the call). List each issue separately. At this point, there is no discussion about fixing issues; we only focus on the timeline and identifying issues. There is also no reference to ownership. We also don’t want to assign blame. We want a process that provides constructive feedback to solve problems.
Avoid the tendency to find a single triggering event and stop. Make sure you continue to dig into the issues to determine why things happened the way they did. We like to use the “5-whys” methodology to explore root causes. This entails repeatedly asking questions about why something happened. The answer to one question becomes the basis for the next. We continue to ask why until we have identified the true causes of the problems.
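The Timeline and Issues steps described above could be captured in a simple record like the one sketched below. All class and field names are hypothetical and the sample events are invented. The sketch assumes each event carries a timestamp (for example, taken from a chat message) and that each issue is tagged by category and by how it contributed to the incident.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Category(Enum):
    PEOPLE = "people"
    PROCESS = "process"
    TECHNOLOGY = "technology"


class Effect(Enum):
    ALLOWED = "allowed the incident"
    TRIGGERED = "triggered the incident"
    PROLONGED = "prolonged restoration"


@dataclass
class Event:
    at: datetime
    description: str


@dataclass
class Issue:
    description: str
    category: Category
    effect: Effect
    whys: list = field(default_factory=list)  # chain of 5-whys answers


@dataclass
class Postmortem:
    events: list = field(default_factory=list)
    issues: list = field(default_factory=list)

    def timeline(self):
        """Events in chronological order, whatever order they were logged in."""
        return sorted(self.events, key=lambda e: e.at)


pm = Postmortem()
pm.events.append(Event(datetime(2019, 8, 1, 14, 5), "Error rate alert fires"))
pm.events.append(Event(datetime(2019, 8, 1, 13, 50), "Code deployment starts"))
pm.issues.append(Issue(
    "No alert on checkout failures",
    Category.TECHNOLOGY,
    Effect.ALLOWED,
    whys=["Monitoring only covered server uptime"],
))
print([e.description for e in pm.timeline()])
```

Note that `Issue` deliberately has no owner field: at this stage of the process there is no reference to ownership and no assignment of blame.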
Here is a summary of anti-patterns we see when companies conduct postmortems:
| Anti-pattern | Recommended practice |
|---|---|
| Not conducting a postmortem after a serious (e.g. Sev 1) incident | Conduct a postmortem within a week after a serious incident |
| | Avoid blame and keep it constructive |
| Not having the right people involved | Assemble a cross-functional team of people involved or needed to resolve problems |
| Using a postmortem block (e.g. multiple postmortems during a 1-hour session every two weeks) | Dedicate time for a postmortem based on the severity of the incident |
| Lack of ownership of identified tasks | Make one person accountable to complete a task within an appropriate timeframe |
| Not digging far enough into issues (finding a single root cause) | Use the 5-Why methodology to identify all of the causes for an issue |
Incidents will always happen. What you do after service restoration will determine if the problem occurs again. A structured, timely postmortem process will help identify the issues causing outages and help prevent their reoccurrence in the future. It also fosters a culture of learning from your mistakes without blame.
Are you struggling with the same issues impacting your site? Do you know you should be conducting postmortems but don’t know how to get started? AKF can help you establish critical incident management and postmortem processes. Call us – we can help!
August 20, 2019 | Posted By: Dave Berardi
If your company doesn’t utilize one of the big cloud providers for either IaaS or PaaS as part of product infrastructure, it’s only a matter of time. We often find our clients in situations where they are pressured to move quickly for benefit-realization to improve many aspects of their business.
Drivers of this trend that exist across our client base and the industry include:
- The Need For Speed and Time To Market: The need to scale capacity quickly without waiting weeks or months for hardware procurement and provisioning in your own datacenter or colo.
- Traditional On-Prem Software Dying by 1000 Cuts: Demand-side (buyer) forces are encouraging companies to get services and software out of data centers. Cloud-native SaaS competition is pressuring what’s left of the on-prem software providers.
- Legacy Company Talent Challenges: The inability of the old economy companies to hire engineering talent to support on-prem software in house.
Several different approaches can be used for migration. We’ve seen many of them and there are two on opposite ends of the spectrum – Lift and Shift and Cloud-Native – that we want to unpack.
The Lift and Shift Approach:
What is it?
Put simply, this is when the same architecture, resources, and services from an on-prem or colo data center are moved up into a cloud provider. Often VMs from on-prem hosting centers are converted and dropped into reserved virtual compute instances. Tools such as AWS Connector for vCenter or GCP’s Velostrata, in theory, allow for an easy transition.
Benefits:
- Fastest path to cloud
- Same architecture and tech stack minimizes training need – infrastructure management does require knowledge of the console
- Least costly in terms of planning, architecture changes, refactoring
Drawbacks:
- Monolithic nature of the architecture can prove to be costly through BYOL (bring your own license) and compute requirements
- Minimal use of native elasticity and resources creates cost-inefficient use of compute, memory, and storage and may not perform as needed
- Technical debt migrates with the product and cost could be magnified with additional problems and a shift to the pay-for-use model
While Lift and Shift seems to be the easiest path, you need to be aware of the strong potential for an increase in cost in the cloud. Running VMs in your own DC and colo masks the cost inefficiencies since they are all part of Capex for your compute, storage, and network. When you move to public cloud the provider will promise to be cheaper. But in the cloud you will pay for every reserved CPU that isn’t utilized, storage that isn’t used, and other idle resources. Further, your availability can only be as good as the provider’s uptime for a given Region and/or Availability Zone.
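The idle-resource point can be illustrated with a little arithmetic. In the sketch below, the resource names, monthly prices, and utilization figures are all invented, and `idle_spend` is a hypothetical helper, not a provider API.

```python
def idle_spend(monthly_cost, utilization):
    """Portion of a provisioned resource's monthly cost that buys idle capacity."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return monthly_cost * (1.0 - utilization)


# Hypothetical fleet: (name, monthly cost in dollars, average utilization).
fleet = [
    ("app-vm", 400.0, 0.25),
    ("db-vm", 900.0, 0.50),
    ("block-storage", 200.0, 0.50),
]

wasted = sum(idle_spend(cost, used) for _, cost, used in fleet)
total = sum(cost for _, cost, _ in fleet)
print(f"${wasted:.2f} of ${total:.2f} pays for idle capacity")
# → $850.00 of $1500.00 pays for idle capacity
```

In your own datacenter the same inefficiency exists but is buried in Capex. On a cloud bill it shows up every month.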
Cloud Native Approach:
What is it?
A Cloud-Native approach ultimately lets you consume a provider’s cloud services only while product users are creating requests and demand. This approach almost always requires investment in splitting the monolith and moving to a services-separated architecture. In addition, it could require you to use native services in your provider of choice. Doing so lets you move from paying for provisioned infrastructure to consumption-based services with better cost-efficiency.
Benefits:
- Less time needed to manage infrastructure and more time for features and experimentation
- Easier to scale out using native services
- Most cost-efficient
Drawbacks:
- Slowest path to cloud
- More discovery and training – this approach requires your teams to understand the current tech stack in order to recreate it in the cloud. From a cloud perspective they must understand how the provider of choice works so that decisions can be made on native services.
- Increased risk of vendor lock-in (e.g. building out event-driven services with rules inside of native serverless)
The Cloud Native path is a longer one, but provides several benefits that will yield more value over time. With this approach you must spend time determining how to split up your monolith and how to best leverage the right combination of Availability Zones, Regions, and native services depending on your Recovery Time Objective (RTO) and Recovery Point Objective (RPO). We prefer to solve scalability and availability problems with systems and software architecture to avoid vendor lock-in. All of the trade-offs on such a journey must be understood.
We have helped several companies of various sizes move to the cloud through SaaS transformations and have engaged in reviewing proposed architectures. Contact us to see how we can help.