October 24, 2019 | Posted By: Marty Abbott
Many of our clients struggle with ensuring that the solutions they create meet the needs of both their business and their customers. The exact symptoms of this failure vary between clients and are best explained using anonymized quotes:
We make an initial release of a solution and we never return to it – it feels like it just doesn’t get ‘done’
We rarely measure the success or failure of the solutions we create. Instead we look at business performance – we either hit our numbers or we didn’t. If we don’t hit the numbers, it’s the fault of engineering and product. If we hit our numbers, sales gets credit
I don’t get it – we get a lot of velocity out of our engineering team, but we always have availability problems
A likely cause
The above quotes all point to very different symptoms of a common problem. One speaks to the level of investment in a product and bringing it to maturity, one to a lack of value measurement and allocation of recognition, and the last to the availability and quality of the solutions the team creates. In our experience, the most common cause of these symptoms is that the Agile “Definition of Done” is undefined, implicitly defined or, most commonly, incompletely defined.
The purpose of “Done”
“Done” is a set of criteria that must be met before any element of work (e.g. a story) is considered “complete” and can be counted for the purposes of velocity.
The benefits of “Done”
• Removes ambiguity and uncertainty, forcing developers, product owners, and scrum team members to align on standards for completion.
• Fosters good, efficient discussion around paths, tradeoffs, and completion during standups.
• If employed consistently between teams, aligns velocity-related metrics and makes them useful for comparison.
• Limits rework and post-sprint work on important elements of the solution in question.
• When combined with velocity, incorporates a notion of “earned value” – incenting teams to complete a solution rather than merely accounting for effort expended. Put another way, teams don’t get credit for work until something has been completed.
A typical generic definition of “Done”
A solution is done when it:
• Is implemented to standards
• Has been code reviewed consistent with standards
• Has automated unit tests created to the unit coverage standard (70+%)
• Has passed automated integration testing and all other continuous integration checks
• Has all necessary support and end user documentation complete
• Has been reviewed by the product owner
Necessary but Insufficient Definition
The above definition, while necessary, is incomplete and therefore insufficient for the needs of a company. The definition fails to account for:
• Business value creation (is it really “done” if we don’t achieve the desired results?)
• Non-functional requirements necessary to produce value such as availability, scalability, response time, cost of maintenance (cost of operations or goods sold), etc.
As a result, because metrics tied to value creation are not available, attributing credit for results becomes a subjective process.
Towards a Better Definition
Given the above, a better definition of done should include:
• Non-functional requirements necessary to achieve value creation
• Evaluation that the solution achieves some desired result, ideally also incorporated into the stories themselves (some measurement or outcome the effort is meant to achieve)
Modifying the prior definition, we might now have:
A solution is done when it:
• Is implemented to standards
• Has been code reviewed consistent with standards
• Has automated unit tests created to the unit coverage standard (70+%)
• Has passed automated integration testing and all other continuous integration checks
• Has delivered all necessary support and end user documentation
• Has been reviewed by the product owner
• Meets the response time objective to end users at peak traffic
• Meets the availability target over one week in production and has passed an availability review
• Meets the cost-of-goods-sold target for infrastructure or IaaS costs after one week
• Meets all other company NFRs (above are examples)
• Shows progress towards or achieves the business metrics it was meant to achieve (may be none for a partial release, or full metrics for the completion of an epic)
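Several of the criteria above lend themselves to an automated gate. Here is a minimal sketch in Python; the metric names and thresholds are illustrative assumptions, not AKF standards, and a real gate would pull these values from monitoring and cost systems:

```python
# Sketch of an automated "definition of done" gate for NFR-style criteria.
# All metric names and thresholds below are illustrative assumptions.

def done_gate(metrics: dict) -> list:
    """Return a list of unmet criteria; an empty list means the gate passes."""
    checks = {
        "unit_coverage >= 70%": metrics["unit_coverage_pct"] >= 70,
        "response time within objective": metrics["p99_ms"] <= metrics["p99_objective_ms"],
        "one week of availability within target": metrics["week_availability_pct"] >= metrics["availability_target_pct"],
        "infrastructure cost within COGS target": metrics["weekly_infra_cost"] <= metrics["cogs_target"],
    }
    return [name for name, passed in checks.items() if not passed]

failures = done_gate({
    "unit_coverage_pct": 74,
    "p99_ms": 310, "p99_objective_ms": 400,
    "week_availability_pct": 99.97, "availability_target_pct": 99.95,
    "weekly_infra_cost": 1800, "cogs_target": 2000,
})
print(failures)  # [] -- an empty list means the story may be counted as done
```

The point of the sketch is simply that "done" becomes a checkable list rather than a judgment call.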
Who is Responsible for Evaluating “Done”
This is an agile process, so the team is responsible for its own measurement. In most teams this means the PO and Scrum Master. As a business we also “trust but verify”, so leaders should double-check that value metrics have indeed been met in normal operations reviews.
The Cons of the “Right Definition of Done”
The largest impact is to the time at which one can realize the earned-value component of velocity. We have been careful to bound the evaluation at one week after delivery, so velocity is simply pushed out by a week to allow for evaluation in production. There is also a bit more record keeping (ostensibly for a scrum master and product owner to evaluate), but its cost is incredibly low relative to the benefit of alignment with business objectives and customer needs.
Keeping the Old Velocity Metric
There is some value in understanding what gets completed as well as what is truly “done”. If this is the case, just track both velocities. Call one “release velocity” and the other “done velocity” or “value velocity”. The overhead is not that high – scrum masters should easily be able to do this. Now you’ll have metrics to help you understand the gap between what you release first time versus when something finally creates value. This gap is as useful for problem identification as “find-fix” charts in evaluating completion of quality assurance checks.
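The dual-velocity bookkeeping described above is simple arithmetic. A hypothetical sketch in Python; the story fields are assumptions, not a prescribed schema:

```python
# Compute "release velocity" (work released) versus "done velocity"
# (work that also met its value/NFR criteria). Field names are illustrative.

stories = [
    {"points": 5, "released": True, "value_verified": True},
    {"points": 3, "released": True, "value_verified": False},
    {"points": 8, "released": True, "value_verified": True},
    {"points": 2, "released": False, "value_verified": False},
]

release_velocity = sum(s["points"] for s in stories if s["released"])
done_velocity = sum(s["points"] for s in stories if s["released"] and s["value_verified"])

print(release_velocity, done_velocity)  # 16 13
# The gap (3 points here) is the work that shipped but has not yet created value.
```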
The Biggest Reason for The Right Definition of Done
Hopefully the answer is somewhat obvious: by changing the definition of done, we align ourselves to both our customer and business needs. It helps engineers focus on customer outcomes rather than just how something should “work”. Engineers too often approach a problem from their own perspective forward rather than from the customer’s needs backwards.
Forcing architects to think back from the customer rather than forward from the engineer helps solve problems associated with response time and availability (some of the NFRs above).
September 26, 2019 | Posted By: Marty Abbott
The Problem – Too Much Planning, Too Little Execution
How many of you spend a significant portion of your year planning for the next year, two years, or five years of activities? How often are these plans useful beyond the first three to four months of execution?
We have many large clients who will begin one-, two-, or (gasp!) five-year plans in July or August of the current year. They spend a significant amount of effort creating these plans over a five-to-six-month span. The plans are often very specific as to what they will do: what projects they will deliver, what products they will create, how many people they will hire, what training their teams will undertake, etc. The plan is typically well followed in months 1 and 2 and starts to degrade significantly in month 3. By month 6, just before the next annual planning cycle begins, the original plan is at best 50% accurate; the original projects have been replaced, new market intelligence has informed different product solutions, new skills and different teammates are needed, etc.
Hurricanes always have an associated cone of uncertainty. The current position of the hurricane and current direction and velocity are well known. But several factors may cause the hurricane to act differently an hour or a day from now than it is behaving at exactly this moment. The same is true with businesses. We know what we need to do today to maintain our position or gain market share, but those activities may change in priority and number in the next handful of months.
So why do we spend so much time planning solutions and approaches when at best 25% of the plan we produce is accurate? We don’t have to waste time as we do today; there is a better way.
Financial vs Operational Plans
First, let’s acknowledge that there is a difference between a financial plan (how much we will spend as a company and what we expect to make as a return on that spend) and an operational plan. The board of directors for your company has a fiduciary responsibility to exercise, in non-expert legal terms:
- A duty of loyalty – the director must put the interests of the institution and its shareholders before his or her own.
- A duty of care – the director must behave prudently, diligently and with skill.
- A duty of obedience – the director must ensure consistency with the purpose of the company – and in a for-profit company, this means ensuring profitability.
Any board of directors, to ensure they are consistent with the law, will require a financial plan. At the very least, they need to govern the spend of the company relative to its revenues to ensure profitability, and ideally over time, an increase in profitability. But that does not mean they need to go into great detail regarding the exact path and actions to achieve the financial plan. We all likely agree that we also have a duty of loyalty, care, and obedience to ourselves and our families – but how many of us go beyond creating an annual budget (financial plan) for any given year?
One of the best-known and most successful directors and investors of all time, Warren Buffett, has what amounts to (in another author’s terms) a list of “10 Commandments” for boards and directors. A quick scan of these makes it clear that in Buffett’s view a board’s focus should be on the performance of the CEO and the company itself, not the detailed operational plan to achieve a financial plan. The board does arguably need to ensure that a strategy exists and is viable, but a strategy need not be a list of tasks for every subordinate organization for an entire year. In fact, given the arguments above, such a task list (or deep operational plan) won’t be followed past a handful of months anyway.
The Fix: Reduce Planning, Increase Execution
If the problem is too much time wasted creating plans that are good for only a short period of time, the fix should be obvious. For this, we offer the AKF 5-95 Rule: spend 5 percent of your time planning and 95% of your time executing. This stands in stark contrast to the “Soviet-esque” way in which many companies operate with executives spending as much as 25 percent of a year involved in financial and execution plans.
- Decrease the horizon (focused endpoint) of planning and decrease the specificity of plans. Take a portion of the five percent of your total time and create a good financial plan of what you would like to achieve. The remainder should be used to iteratively identify the short-term paths to achieve that plan using windows no greater than 3 months. Anything beyond 3 months has a high degree of waste.
- Adopt development methodologies that maximize execution value. Adopt Agile development methodologies meant to embrace low levels of specificity and rely on discovery to identify the “right solution” to maximize market adoption.
The best way to apply the AKF 5-95 rule is to implement OKRs ((O)bjectives and (K)ey (R)esults) as a business, adopt the bowling alley methodology of product focus, and follow Agile product development practices.
September 16, 2019 | Posted By: Marty Abbott
Two of the most common statements we hear from our clients are:
Business: “Our product and engineering teams lack the agility to quickly pivot to the needs of the business”.
Product and Engineering: “Our business lacks the focus and discipline to complete any initiative. We are subject to the ‘Bright Shiny Object (BSO)’ or ‘Squirrel!’ phenomenon”.
These two teams seem to be at an impasse in perspective requiring a change by one team or the other for the company to be successful.
Companies need both focus and agility to be successful. While these two concepts may appear to be in conflict, a team needs only three things to break the apparent deadlock:
- Shared Context.
- Shared agreement as to the meaning of some key terms.
- Three process approaches across product, the business, and engineering.
First, let’s discuss a common context within which successful businesses in competitive environments operate. Second, we’ll define a common set of terms that should be agreed upon by both the business and engineering. Finally, we’ll dig into the approaches necessary to be successful.
Successful businesses operating within interesting industries attract competition. Competitors seek innovative approaches to disrupt each other and gain market share within the industry. Time to market (TTM) in such an environment is critical, as the company that finds an approach (feature, product, etc.) to shift or gain market share has a compelling advantage for some period. As such, any business in a growth industry must be able to move and pivot quickly (be agile) within its product development initiatives. Put another way, businesses that can afford to stick to a dedicated plan likely are not in a competitive or growing segment, probably don’t have competition, and aren’t likely attractive to investors or employees.
The focus that matters within business is a focus on outcomes. Why focus on outcomes instead of the path to achieve them? Focusing on a path implies a static path, and when is the last time you saw a static path be successful? (Hint: most of us have never seen a static path be successful). Obviously, sometimes outcomes need to change, and we need a process by which we change desired outcomes. But outcomes should change much less frequently than path.
Agility enables changing directions (paths) to achieve focused outcomes. ‘Nuff said.
Commonly known as (O)bjectives and (K)ey (R)esults, or in AKF parlance Outcomes and Key Results, OKRs are the primary mechanism of focus while allowing for some level of agility in changing outcomes for business needs. Consider the O (objectives or outcomes) as the thing upon which a company is focused, and the Key Results as the activities to achieve those outcomes. KRs should change more frequently than the Os as companies attempt to define better activities to achieve the desired outcomes. An objective/outcome could be “Improve Add-To-Cart/Search ratio by 10%”.
Each objective/outcome should have 3 to 5 supporting activities. For the add-to-cart example above, the activities might be: implement personalization to drive a 3% improvement, add re-targeting for a net 4% improvement, and improve descriptive meta-tags in search for a 3% improvement.
OKRs help enforce transparency across the organization and help create the causal roadmap to success. Subordinate organizations understand how their initiatives nest into the high-level company objectives by following the OKR “tree” from leaf to root. By adhering to a strict and small number of high-level objectives, the company creates focus. When tradeoffs must happen, activities not aligned with high-level objectives get deprioritized or deferred.
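The nesting of OKRs can be made concrete with a simple parent/child structure. A minimal Python sketch; the objective and key-result names are the illustrative examples from this post, and the structure is an assumption rather than a prescribed OKR tool:

```python
# Walk an OKR "tree" from a leaf key result up to the company objective,
# showing how a team's work nests into the top-level goal.

class OKR:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

company = OKR("Improve Add-To-Cart/Search ratio by 10%")
product = OKR("Drive 3% lift via personalization", parent=company)
team_kr = OKR("Ship personalized recommendations on search results", parent=product)

def path_to_root(okr):
    """Return the chain of objectives from a leaf KR up to the company objective."""
    chain = []
    while okr is not None:
        chain.append(okr.name)
        okr = okr.parent
    return chain

print(path_to_root(team_kr))  # leaf first, company objective last
```

Following the chain answers the question every team should be able to answer: which company objective does this work serve?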
Geoffrey Moore outlines an approach for product organizations to stay focused in their product development efforts. When combined with the notion of a Minimum Viable Product the approach is to stay focused on a single product, initially small, focused on the needs of the pioneers within the technology adoption lifecycle (TALC) for a single target market or industry.
The single product for a single industry (P1T1) or need is the headpin of the bowling alley. The company maintains focus on this until such time as they gain significant adoption within the TALC – ideally a beachhead in the early majority of the TALC.
Only after significant adoption through the TALC (above) does the company then introduce the existing product to target market 2 (P1T2) and begin work on product 2 (or significant extension of product 1) in target market 1 (P2T1).
While OKRs and the Bowling Alley help create focus, Agile product methodologies help product and engineering teams maintain flexibility and agility in development. Epics and stories map to key results within the OKR framework. Short duration development cycles help limit the loss in effort associated with changing key results and help to provide feedback as to whether the current path is likely to meet the objectives and key results within OKRs. Backlogs visible to any Agile team are deep enough to allow for grooming and sizing, but shallow enough such that churn and the resulting morale impact do not jeopardize the velocity of development teams.
Putting it all together:
There is no discrepancy between agility and focus if you:
- Agree to shared definitions of both agility and focus per above
- Jointly agree that both agility and focus are necessary
- Implement OKRs to aid with both agility and focus
- Employ an Agile methodology for product and product development
- Use the TALC in your product management efforts and to help enforce focus on winning markets
September 10, 2019 | Posted By: Greg Fennewald
Let’s briefly review the AKF Scale Cube, a model describing three methods for scaling technology platforms.
The three axes of the cube are:
- X - horizontal duplication
- Y - segmentation by service or function
- Z - segmentation by customer or geography
Of the three axes by which you can scale your systems, the X axis (horizontal duplication) is often the first used at the web and application layers. Load balancing workloads across a pool of identical web and app servers is a standard approach to scale and also improves availability by eliminating single points of failure (SPOFs) at those layers. The scale and availability benefits of horizontally duplicating VMs and containers far outweigh the cost. The cost vs. benefit analysis gets more complex as we look at the persistence tier – databases and storage. Databases are typically among the most expensive portions of the technology stack, especially if licensed RDBMS products are used. Storage costs have been dropping but can still be considerable. Does X axis scaling make sense at the persistence tier? In many cases, yes, particularly if there is a defined time period the X axis split is expected to serve.
Compared to Y and Z axis scaling, X axis scaling offers some advantages:
- Relatively easy to do
- Fast to implement
- Scales transactional systems well
X axis scaling also has a disadvantage compared to the other axes: cost. Duplicating systems and storage gets expensive quickly. This reinforces the notion of applying X axis scaling at the persistence tier for a defined period before shifting to Y and/or Z axis scaling.
What situations lend themselves well to X axis scaling in the persistence tier?
- High read to write ratios – employ a write master and multiple read slaves
- Reporting and BI – run reporting and BI workloads against a replica DB
- Search – deploy a caching layer backed by a replica DB
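The write-master/read-replica pattern above can be sketched at the application layer. Below is a hypothetical routing shim in Python; the connection names, the SELECT-prefix heuristic, and the round-robin choice are illustrative assumptions, not a production design:

```python
import itertools

# Route writes to the master and distribute reads across replicas --
# a minimal sketch of X axis scaling at the persistence tier.

class ReplicatedDB:
    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)  # simple round-robin over replicas

    def execute(self, sql):
        # Naive read detection: real routers also consider transactions,
        # replication lag, and read-your-own-writes consistency.
        is_read = sql.lstrip().upper().startswith("SELECT")
        target = next(self._replicas) if is_read else self.master
        return target, sql  # a real shim would run the query on `target`

db = ReplicatedDB("master", ["replica-1", "replica-2"])
print(db.execute("SELECT * FROM orders")[0])           # replica-1
print(db.execute("SELECT * FROM carts")[0])            # replica-2
print(db.execute("INSERT INTO orders VALUES (1)")[0])  # master
```

Note the comment about consistency: replicas lag the master, which is why this pattern fits the high read-to-write and reporting workloads listed above.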
Consider a situation where a monolithic codebase was written to get the minimum viable product out the door – a sound choice in the early days of a startup. Success-driven growth is now straining the system. Multiple web and app servers are in use, but they all call a single database. Reporting is also run against the same DB. There is increasing interest in productizing reporting, something sure to increase DB workload. Company leadership realizes they need to architect for scale quickly and have chosen a services-oriented architecture. Engineering will refactor the monolith into independent services, communicating asynchronously – a Y axis split. This choice will improve scalability and performance, all good things, but it will take time; 12 months is the planning figure. The existing persistence tier will not survive that long.
In this situation, applying X axis scaling to the persistence tier can buy time to complete the Y axis refactoring. The additional cost of the replica DB and storage are for a defined period and the cost rate will decline as the Y axis refactoring is implemented.
Interested in learning more? Contact us, we’ve walked a mile in your shoes.
September 6, 2019 | Posted By: Pete Ferguson
In many of our technical due diligence engagements, we commonly find companies expending considerable development effort (and ongoing maintenance) building tools for something that is not part of their core strength and thus provides no competitive advantage. What criteria does your organization use in deciding when to build vs. buy?
If you perform a simple web search for “build vs. buy” you will find hundreds of articles, process flows, and decision trees on when to build and when to buy. Many of these are cost-centric decisions including discounted cash flows for maintenance of internal development and others are focused on strategy. Some of the articles blend the two.
We have many examples from our customers developing load balancing software, building their own databases, etc. In nearly every case, a significant percentage of the engineering team (and engineering cost) go into a solution that:
- Does not offer long term competitive differentiation
- Costs more than purchasing an existing product
- Steals focus away from the engineering team
- Is not aligned with the skills or business level outcomes of the team
If You Can’t Beat Them - Join Them
(or buy, rent, or license from them)
Here is a simple set of questions that we often ask our customers to help them with the build v. buy decision:
1. DOES THIS “THING” (PRODUCT / ARCHITECTURAL COMPONENT / FUNCTION) CREATE STRATEGIC DIFFERENTIATION IN OUR BUSINESS?
Shiny object distraction is a very real thing we observe regularly. Companies start - innocently enough - building a custom tool in a pinch to get them by, but never go back and reassess the decision. Over time the solution snowballs and consumes more and more resources that should be focused on innovating strategic differentiation.
- We have yet to hear a tech exec say “we just have too many developers, we aren’t sure what to do with them.”
- More often than not “resource constraints” is mentioned within the first few hours of our engagements.
- If building instead of buying is going to distract from focusing efforts on the next “big thing” – then 99% of the time you should just stop here and attempt to find a packaged product, open-source solution, or outsourcing vendor to build what you need.
If, after reviewing these points, the answer is “Yes, it will provide strategic differentiation,” then proceed to question 2.
2. ARE WE THE BEST COMPANY TO BUILD THIS “THING”?
This question helps inform whether you can effectively build it and achieve the value you need. This is a “core v. context” question; it asks both whether your business model supports building the item in question and also if you have the appropriate skills to build it better than anyone else.
For instance, if you are a social networking site, you probably have no business building relational databases for your own use. Go to question (3) if you can answer “Yes” to this question; stop here and find an outside solution if the answer is “No”.
And please, don’t fool yourself – if you answer “Yes” because you believe you have the smartest people in the world (and you may), do you really need to dilute their efforts by focusing on more than just the things that will guarantee your success?
3. ARE THERE FEW OR NO COMPETING PRODUCTS TO THIS “THING” THAT YOU WANT TO CREATE?
We know the question is awkwardly worded – but the intent is to be able to exit these four questions by answering “yes” everywhere in order to get to a “build” decision.
- If there are many providers of the “thing” to be created, it is a potential indication that the space might become a commodity.
- Commodity products differ little in feature sets over time and ultimately compete on price which in turn also lowers over time.
- A “build” decision today will look bad tomorrow as features converge and pricing declines.
If you answer “Yes” (i.e. “Yes, there are few or no competing products”), proceed to question (4).
4. CAN WE BUILD THIS “THING” COST EFFECTIVELY?
- Is it cheaper to build than buy when considering the total lifecycle (implementation through end-of-life) of the “thing” in question? Many companies use cost as a justification, but all too often they miss how much it costs to maintain a proprietary “thing”, “widget”, “function”, etc.
- If your business REALLY grows and is extremely successful, do you want to be continuing to support internally-developed monitoring and logging solutions, mobile architecture, payments, etc. through the life of your product?
Don’t fool yourself into answering this affirmatively just because you want to work on something “neat.” Your job is to create shareholder value – not work on “neat things” – unless your “neat thing” creates shareholder value.
There are many more complex questions that can be asked and may justify the building rather than purchasing of your “thing,” but we feel these four questions are sufficient for most cases.
A “build” decision is indicated when the answers to all 4 questions are “Yes.”
We suggest seriously considering buying or outsourcing (with appropriate contractual protection when intellectual property is a concern) anytime you answer “No” to any question above.
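The four-question flow reduces to a simple conjunction: build only if every answer is "Yes", and stop at the first "No". A toy sketch, purely illustrative:

```python
# The build-vs-buy flow above: a "build" decision requires four "Yes" answers;
# any "No" short-circuits to a buy/outsource decision.

QUESTIONS = [
    "Does this thing create strategic differentiation?",
    "Are we the best company to build it?",
    "Are there few or no competing products?",
    "Can we build it cost effectively?",
]

def build_or_buy(answers):
    """answers: list of booleans, one per question, in order."""
    if all(answers):
        return "build"
    first_no = answers.index(False)
    return f"buy (stopped at question {first_no + 1}: {QUESTIONS[first_no]})"

print(build_or_buy([True, True, True, True]))   # build
print(build_or_buy([True, False, True, True]))  # stops at question 2
```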
While startups and small companies roll their own tools early on to get product out the door, as they grow, the planning horizon (and related costs) needs to extend from the next sprint to a longer-term annual and multi-year strategy. That, plus growth, tips the scale toward buy instead of build. The more internal products produced and supported, the more tech debt accrues, distracting medium-to-large organizations from competing against the next startup.
While building custom tools and products seems to make sense in the immediate term, the long-term strategy and desired outcomes of your organization need to be fully weighted in the decision process. Distraction from focus is the number one harm we have seen: clients fall behind the competition and burn sprint cycles maintaining products that don’t move the needle with their customers. The crippling cost of distraction causes successful companies to lose their competitive advantage and slip into oblivion.
Like the ugly couch your auntie gave you for your first apartment, it can often be difficult to assess what makes sense without an outside opinion. Contact us, we can help!
September 2, 2019 | Posted By: Greg Fennewald
As a company matures from a startup to a growing business, there are a number of measurables that become table stakes – basic tools for managing a business. These measurables include financial reporting statements, departmental budgets, KPIs, and OKRs. Another key measurable is the availability of your product or service and this measurable should be owned by the technology team.
When we ask clients about availability goals or SLAs, some have nothing documented and say something along the lines of “we want our service to always be available”. While a nice sentiment, unblemished availability is virtually impossible to achieve and prohibitively expensive to pursue. Availability goals must be relevant to the shared business outcomes of the company.
If you are not measuring availability, start. If nothing else, the data will inform what your architecture and process can do today, providing a starting point if the business chooses to pursue availability improvements.
Some clients who do have an availability measurable use a percentage of clock time – 99.95% for example. This is certainly better than no measurable at all, but still leaves a lot to be desired.
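To put a clock-time SLA in perspective, 99.95% over a 30-day month still permits roughly 21.6 minutes of downtime. A quick check (the 30-day month is an assumption for the arithmetic):

```python
# Allowed downtime per month implied by a clock-time availability SLA.

def allowed_downtime_minutes(availability_pct, days=30):
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)

print(round(allowed_downtime_minutes(99.95), 1))  # 21.6
print(round(allowed_downtime_minutes(99.9), 1))   # 43.2
```

Note that the arithmetic treats every minute as equal, which is exactly the weakness discussed next.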
Reasons why clock time is not the best measure for availability:
- Units of time are not equal in terms of business impact – a disruption during the busiest part of the day is worse than an issue during a slow period. This is implicitly understood: many companies schedule maintenance windows late at night or early in the morning, periods when the impact of a disruption is smaller.
- The business communicates in business terms (revenue, cost, margin, return on investment) and these terms are measured in dollars, not clock time.
- Using the uptime figure from a server or other infrastructure component as an availability measure is inaccurate because it does not capture software bugs or other issues rendering your service inoperative despite the server uptime status.
Now that we’ve established that availability should be measured and that clock time is not the best unit of measure, what is a better choice? Transactional metrics aligned to the desired business outcome are the better choice.
- Rates – log transactional rates such as logins, add to cart, registration, downloads, orders, etc. Apply Statistical Process Control or other analysis methods to establish thresholds indicating an unusual deviation in the transaction rate.
- Ratios – the proportion of undesired or failed outcomes such as failed logins, abandoned shopping carts, and HTTP 400s can be useful for measuring the quality of service. Analysis of such ratios will establish unusual deviation levels.
- Patterns – transaction patterns can identify expected activity, such as order rates increasing when an item is first available for sale or download rates increasing in response to a viral social media video. The absence of an expected pattern change can signal an availability issue with your product or service.
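The rate-based approach above can be illustrated with simple 3-sigma control limits, one of the most basic Statistical Process Control techniques. The baseline data (logins per minute) is made up for illustration:

```python
import statistics

# Flag unusual deviations in a transaction rate (e.g., logins per minute)
# using 3-sigma control limits derived from a baseline. Data is illustrative.

baseline = [120, 118, 125, 122, 119, 121, 124, 117, 123, 120]
mean = statistics.mean(baseline)
sigma = statistics.stdev(baseline)
lower, upper = mean - 3 * sigma, mean + 3 * sigma

def is_anomalous(rate):
    """True when an observed rate falls outside the control limits."""
    return rate < lower or rate > upper

print(is_anomalous(121))  # False: within normal variation
print(is_anomalous(60))   # True: a likely availability issue
```

A real implementation would compute limits over a rolling window and per time-of-day, since normal traffic varies through the day.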
Alignment with Desired Outcomes
What are the goals of your business? What is your value proposition? Choose metrics that comprehensively measure the availability of your product or service: the ability of a customer to buy a product from your website (login, search, add to cart, and check out); the proportion of file downloads successfully completed in less than 4 seconds; the success rate of posting a message to a social media platform and the ability of others to view it. Measuring availability with metrics aligned to desired outcomes keeps the big picture at the forefront and helps business colleagues understand how the technology team contributes to value creation.
Not measuring availability is bad. Measuring it in clock time is better, but still leaves something to be desired. Measuring availability with transactional metrics tied to the desired business outcome is best. Don’t settle for better when you can be best.
Interested in learning more? Struggling with analyzing data? Unsure of how to apply architectural principles to achieve higher availability? Contact us, we’ve been in your shoes.
August 22, 2019 | Posted By: AKF
For over a decade, AKF has engaged with technology organizations that expend a large portion of their engineering effort wrangling with technical debt. Technical due diligence, as laid out in the Technology Due Diligence Checklist, should help identify the amount of technical debt and quantify the engineering resources dedicated to servicing it.
What is Technical Debt?
Technical debt is the difference between doing something the desired or best way and doing something quickly. It is a conscious, deliberate choice to take a shortcut in the technology arena – the delta between the desired or intended way and the quicker way. The shortcut is usually taken for time-to-market reasons and, within reason, is a sound business decision.
Is Technical Debt Bad?
Technical debt is analogous in many ways to financial debt – a complete lack of it probably means missed business opportunities while an excess means disaster around the corner. Just like financial debt, technical debt is not necessarily bad.
Accruing some debt allows the technology organization to release a minimal viable product to customers, just as some financial debt allows a company to start new investments earlier than capital growth would allow.
Too little debt can result in a product late to market in a competitive environment and too much debt can choke business innovation and cause availability and scalability issues later in life. Tech debt becomes bad when the engineering organization can no longer service that debt.
Technical Debt Maintenance
Similar to financial debt, technical debt must be serviced, and it is serviced by the efforts of the engineering team. A failure to service technical debt results in high-interest payments, seen as slowing time to market for new product initiatives post-investment.
Our experience indicates that most companies should expect to spend 12% to 25% of engineering effort on servicing technical debt. The right allocation depends on the amount of debt outstanding and on whether the goal is to keep the debt static, reduce it, or allow it to grow. It is easy to see how a company delinquent in servicing its technical debt will have to increase this allocation, reducing the resources available for product innovation and market responsiveness.
Technical Debt Takeaways:
Choosing to take on tech debt by deferring attention to technical issues frees greater resources to focus on higher-priority endeavors
The absence of technical debt probably means missed business opportunities – use technical debt as a tool to best meet the needs of the business
Excessive technical debt will cause availability and scalability issues, and can choke business innovation (too much engineering time dealing with debt rather than focusing on the product)
The interest of tech debt is the difficulty or increased level of effort in modifying something in subsequent releases
The principal of technical debt is the difference between desired and actual quality or features in a service or product
Technology resources to continually service technical debt should be clearly planned in product road maps - 12 to 25% is suggested
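The 12-25% guideline above can be folded directly into sprint planning. A hypothetical sketch, with invented point totals and a `plan_sprint` helper that is not from any AKF tool:

```python
# Hypothetical sketch: reserving a slice of sprint capacity for debt
# service, using the 12-25% guideline from the post.

def plan_sprint(total_points, debt_ratio=0.15):
    """Split a sprint's capacity between debt service and feature work."""
    if not 0.12 <= debt_ratio <= 0.25:
        raise ValueError("guideline suggests 12-25% for debt service")
    debt_points = round(total_points * debt_ratio)
    return {"debt": debt_points, "features": total_points - debt_points}

print(plan_sprint(100))        # 15% default allocation
print(plan_sprint(100, 0.25))  # aggressive paydown for delinquent debt
```

Making the allocation explicit in the roadmap keeps debt service from silently crowding out innovation, and vice versa.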
AKF’s Technical Due Diligence can discover a team’s ability to quantify the amount of debt accrued and the engineering effort to service the debt. Contact us, we can help!
August 21, 2019 | Posted By: Bill Armelin
At AKF Partners, we believe in learning aggressively, not just from your successes, but also from your failures. One common failure we see is service-disrupting incidents. These are the events that either make your systems unavailable or significantly degrade performance for your customers. They result in lost revenue, poor customer satisfaction, and hours of lost sleep. While there are many things we can do to reduce the probability of an incident occurring or the impact if it does happen, we know that all systems fail.
We like to say, “An incident is a terrible thing to waste.” The damage is already done. Now, we need to learn as much as possible about the causes of the incident to prevent the same failures from happening again. A common process for determining the causes of failure and preventing them from reoccurring is the postmortem. In the Army, it is called an After-Action Review. In many companies it is called a Root Cause Analysis. It doesn’t matter what you call it, as long as you do it.
We actually avoid using the term Root Cause Analysis. Many of our clients that use it focus too much on finding that one “root cause” of the issue. There will never be a single cause of an incident. There will always be a chain of problems with a trigger or proximate event. This is the one event that causes the system to finally topple over. We need a process that digs into the entire chain of events, inclusive of the trigger. This is where the postmortem comes in. It is a cross-functional brainstorming meeting that not only identifies the root causes of a problem, but also helps identify issues with process and training.
Postmortem Process – TIA
The purpose of a good postmortem is to find all of the contributing events and problems that caused an incident. We use a simple three-step process called TIA. TIA stands for Timeline, Issues, and Actions.
First, we create a timeline of events leading up to the incident, as well as a timeline of all the actions taken to restore service. There are multiple ways to collect the timeline of events. Some companies have a scribe that records events during the incident process. Increasingly, we are seeing companies use chat tools like Slack to record events related to restoration. The timestamp on each Slack message is a good place to extract the timeline. Don’t start your timeline at the beginning of the incident. It starts with the activities prior to the incident that caused the triggering event (e.g. a code deployment). During the postmortem meeting, augment the timeline with additional details.
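A minimal sketch of assembling such a timeline from timestamped chat messages. The message format and incident details are assumptions for illustration, not a Slack API integration:

```python
# Sketch: building a chronological incident timeline from timestamped
# chat messages (e.g. exported from Slack). Data below is invented.
from datetime import datetime

messages = [
    ("2019-08-21 14:55", "deploy", "Pushed release 2.4.1 to production"),
    ("2019-08-21 15:02", "alerts", "Checkout error rate above 5% threshold"),
    ("2019-08-21 15:10", "ops",    "Rolled back to 2.4.0"),
    ("2019-08-21 15:14", "alerts", "Error rate back to baseline"),
]

def build_timeline(msgs):
    """Sort events chronologically, including pre-incident activity."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M"), src, text)
              for ts, src, text in msgs]
    return sorted(parsed)

for ts, src, text in build_timeline(messages):
    print(f"{ts:%H:%M} [{src}] {text}")
```

Note that the first entry predates the incident itself: the deployment that caused the triggering event belongs on the timeline.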
The second part of TIA is Issues. This is where we walk through the timeline and identify issues. We want to focus on people, process, and technology. We want to capture all of the things that either allowed the incident to happen (e.g. lack of monitoring), directly triggered it (e.g. a code push), or increased the time to restore the system to a stable state (e.g. couldn’t get the right people on the call). List each issue separately. At this point, there is no discussion about fixing issues; we only focus on the timeline and identifying issues. There is also no reference to ownership. We also don’t want to assign blame. We want a process that provides constructive feedback to solve problems.
Avoid the tendency to find a single triggering event and stop. Make sure you continue to dig into the issues to determine why things happened the way they did. We like to use the “5-whys” methodology to explore root causes. This entails repeatedly asking questions about why something happened. The answer to one question becomes the basis for the next. We continue to ask why until we have identified the true causes of the problems. The final part of TIA is Actions: each identified issue becomes a task with a single accountable owner and an appropriate timeframe for completion.
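The 5-whys chaining can be sketched as a tiny helper, where each answer becomes the next question. The incident details in the example are invented for illustration:

```python
# Toy sketch of a 5-whys chain: each answer is re-asked as the next
# "why" until a root cause is reached. Incident details are made up.

def five_whys(problem, answers):
    """Pair each 'why' question with its answer; the chain ends at a root cause."""
    chain = []
    question = problem
    for answer in answers:
        chain.append((f"Why did '{question}' happen?", answer))
        question = answer  # the answer becomes the basis for the next why
    return chain

chain = five_whys(
    "checkout errors spiked",
    ["a schema migration locked the orders table",
     "the migration ran during peak traffic",
     "deploys are not gated by traffic windows",
     "no deployment policy exists",
     "incident learnings were never turned into process"],
)
for question, answer in chain:
    print(question, "->", answer)
```

The point is not the data structure but the discipline: the chain keeps the discussion moving past the proximate trigger toward process and training causes.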
Here is a summary of anti-patterns we see when companies conduct postmortems:
| Anti-pattern | Remedy |
| --- | --- |
| Not conducting a postmortem after a serious (e.g. Sev 1) incident | Conduct a postmortem within a week after a serious incident; avoid blame and keep it constructive |
| Not having the right people involved | Assemble a cross-functional team of the people involved in, or needed to resolve, the problems |
| Using a postmortem block (e.g. multiple postmortems during a 1-hour session every two weeks) | Dedicate time for each postmortem based on the severity of the incident |
| Lack of ownership of identified tasks | Make one person accountable for completing each task within an appropriate timeframe |
| Not digging far enough into issues (finding a single root cause) | Use the 5-Why methodology to identify all of the causes of an issue |
Incidents will always happen. What you do after service restoration will determine whether the problem occurs again. A structured, timely postmortem process will help identify the issues causing outages and prevent their recurrence. It also fosters a culture of learning from your mistakes without blame.
Are you struggling with the same issues impacting your site? Do you know you should be conducting postmortems but don’t know how to get started? AKF can help you establish critical incident management and postmortem processes. Call us – we can help!
August 20, 2019 | Posted By: Dave Berardi
If your company doesn’t utilize one of the big cloud providers for either IaaS or PaaS as part of its product infrastructure, it’s only a matter of time. We often find our clients in situations where they are pressured to move quickly to realize the cloud’s benefits across many aspects of their business.
Drivers of this trend that exist across our client base and the industry include:
- The Need For Speed and Time To Market: The need to scale capacity quickly without waiting weeks or months for hardware procurement and provisioning in your own datacenter or colo.
- Traditional On-Prem Software Dying by 1000 Cuts: Demand-side (buyer) forces are encouraging companies to get services and software out of data centers. Cloud-native SaaS competition is pressuring what’s left of the on-prem software providers.
- Legacy Company Talent Challenges: The inability of the old economy companies to hire engineering talent to support on-prem software in house.
Several different approaches can be used for migration. We’ve seen many of them and there are two on opposite ends of the spectrum – Lift and Shift and Cloud-Native – that we want to unpack.
The Lift and Shift Approach:
What is it?
Put simply, this is when the same architecture, resources, and services from an on-prem or colo data center are moved up into a cloud provider. Often VMs from on-prem hosting centers are converted and dropped into reserved virtual compute instances. Tools such as AWS Connector for vCenter or GCP’s Velostrata, in theory, allow for an easy transition.
Pros:
- Fastest path to cloud
- Same architecture and tech stack minimize training needs – though infrastructure management still requires knowledge of the provider’s console
- Least costly in terms of planning, architecture changes, and refactoring
Cons:
- The monolithic nature of the architecture can prove costly through BYOL and compute requirements
- Minimal use of native elasticity and resources creates cost-inefficient use of compute, memory, and storage, and may not perform as needed
- Technical debt migrates with the product, and its cost can be magnified by additional problems and the shift to a pay-for-use model
While Lift and Shift seems to be the easiest path, you need to be aware of the strong potential for increased cost in the cloud. Running VMs in your own DC or colo masks cost inefficiencies, since compute, storage, and network are all part of Capex. When you move to the public cloud, the provider will promise to be cheaper. But in the cloud you will pay for every reserved CPU that isn’t utilized, storage that isn’t used, and other idle resources. Further, your availability can only be as good as the provider’s uptime for a given Region and/or Availability Zone.
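The idle-capacity cost can be sketched with a back-of-the-envelope calculation. The hourly rate, instance count, and utilization below are invented, not any provider's actual pricing:

```python
# Illustrative sketch of why lift-and-shift can cost more in the cloud:
# reserved instances bill for provisioned capacity, not used capacity.
# All prices and utilization figures below are made up.

def monthly_cost(instances, hourly_rate, hours=730):
    """Cost of reserved capacity, billed whether or not it is used."""
    return instances * hourly_rate * hours

def wasted_spend(instances, hourly_rate, avg_utilization):
    """Spend attributable to idle capacity at a given average utilization."""
    return monthly_cost(instances, hourly_rate) * (1 - avg_utilization)

# 20 VMs sized for peak load but averaging only 30% utilization:
total = monthly_cost(20, 0.20)
idle = wasted_spend(20, 0.20, 0.30)
print(f"total: ${total:,.0f}/mo, idle spend: ${idle:,.0f}/mo")
```

On-prem, that idle 70% is invisible inside Capex; in the cloud it shows up on every monthly bill, which is what makes the cloud-native, consumption-based model attractive.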
Cloud Native Approach:
What is it?
The Cloud-Native approach ultimately lets you pay for a provider’s cloud services only while product users are generating requests and demand. This approach almost always requires investment in splitting the monolith and moving to a services-separated architecture. It may also require you to adopt native services from your provider of choice. Doing so lets you move from paying for provisioned infrastructure to consumption-based services with better cost-efficiency.
Pros:
- Less time spent managing infrastructure and more time for features and experimentation
- Easier to scale out using native services
- Most cost-efficient
Cons:
- Slowest path to cloud
- More discovery and training – this approach requires your teams to understand the current tech stack well enough to recreate it in the cloud, and to understand how the chosen provider works so that sound decisions can be made about native services
- Increased risk of vendor lock-in (e.g. building out event-driven services with rules inside a provider’s native serverless offering)
The Cloud Native path is a longer one, but provides several benefits that will yield more value over time. With this approach you must spend time determining how to split up your monolith and how to best leverage the right combination of Availability Zones, Regions, and use of native services depending on your Recovery Time Objective (RTO) and Recovery Point Objectives (RPO). We prefer to solve scalability and availability problems with systems and software architecture to avoid vendor lock-in. All of the trade-offs on such a journey must be understood.
We have helped several companies of various sizes move to the cloud while going through SaaS transformations, and we have engaged in reviewing proposed architectures. Contact us to see how we can help.
August 7, 2019 | Posted By: Pete Ferguson
Scalability doesn’t somehow magically appear when you trust a cloud provider to host your systems. While Amazon, Google, Microsoft, and others likely will be able to provide a lot more redundancy in power, network, cooling, and expertise in infrastructure than hosting yourself – how you are set up using their tools is still very much up to your budget and which tools you choose to utilize. Additionally, how well your code is written to take advantage of additional resources will affect scalability and availability.
We see more and more new startups in AWS, Google, and Azure – in addition to assisting well-established companies making the transition to the cloud. Regardless of the hosting platform, in our technical due diligence reviews we often see the same scalability gaps common to hosted solutions, which we wrote about in the first edition of “Scalability Rules.” (Abbott, Martin L. Scalability Rules: Principles for Scaling Web Sites. Pearson Education.)
This blog is a summary recap of the AKF Scale Cube (much of the content contains direct quotes from the original text), an explanation of each axis, and how you can be better prepared to scale within the cloud.
Scalability Rules – Chapter 2: Distribute Your Work
Using ServiceNow as an early example of designing, implementing, and deploying for scale early in its life, we outlined how building in fault tolerance helped the company scale during early development. A decade-plus later, the once little-known company has kept up with fast growth, reaching over $2B in revenue, with some forecasts expecting that number to climb to $15B in the coming years.
So how did they do it? ServiceNow contracted with AKF Partners over a number of engagements to help them think through their future architectural needs and ultimately hired one of the founding partners to augment their already-talented engineering staff.
“The AKF Scale Cube was helpful in offsetting both the increasing size of our customers and the increased demands of rapid functionality extensions and value creation.”
~ Tom Keevan (Founding Partner, AKF Partners & former VP of Architecture at eBay & ServiceNow)
The original scale cube has stood the test of time and we have used the same three-dimensional model with security, people development, and many other crucial organizational areas needing to rapidly expand with high availability.
At the heart of the AKF Scale Cube are three simple axes, each with an associated rule for scalability. The cube is a great way to represent the path from minimal scale (lower left front of the cube) to near-infinite scalability (upper right back corner of the cube). Sometimes, it’s easier to see these three axes without the confined space of the cube.
X Axis – Horizontal Duplication
The X Axis allows transaction volumes to increase easily and quickly. If data is starting to become unwieldy on databases, a distributed architecture allows you to reduce the degree of multi-tenancy (Z Axis) or split discrete services off (Y Axis) onto similarly sized hardware.
A simple example of an X Axis split is cloning web servers and application servers and placing them behind a load balancer. This cloning allows transactions to be distributed evenly across systems for horizontal scale. Cloning application or web services tends to be relatively easy to perform and allows us to scale the number of transactions processed. Unfortunately, it doesn’t really help us when trying to scale the data we must manipulate to perform these transactions, as memory caching of data unique to several customers or unique to disparate functions might create a bottleneck that keeps us from scaling these services without significant impact on customer response time. To solve these memory constraints, we’ll look to the Y and Z Axes of our scale cube.
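The cloning-behind-a-load-balancer idea can be sketched in a few lines. The server names and round-robin policy are illustrative assumptions; real load balancers offer many other policies:

```python
# Minimal sketch of an X-axis split: identical clones behind a
# round-robin load balancer. Server names are placeholders.
from itertools import cycle

class RoundRobinBalancer:
    """Spreads requests evenly across cloned, interchangeable servers."""
    def __init__(self, servers):
        self._pool = cycle(servers)

    def route(self, request):
        # Any clone can serve any request, so distribution is trivial.
        return next(self._pool), request

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
for i in range(5):
    server, req = lb.route(f"GET /search?q={i}")
    print(server, req)
```

Because every clone is interchangeable, adding capacity is just adding another name to the pool; that interchangeability is exactly what breaks down once per-customer or per-function data enters the picture.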
Y Axis – Split by Function, Service, or Resource
Looking at a relatively simple e-commerce site, Y Axis splits resources by the verbs of signup, login, search, browse, view, add to cart, and purchase/buy. The data necessary to perform any one of these transactions can vary significantly from the data necessary for the other transactions.
In terms of security, using the Y Axis to segregate and encrypt Personally Identifiable Information (PII) to a separate database provides the required security without requiring all other services to go through a firewall and encryption. This decreases cost, puts less load on your firewall, and ensures greater availability and uptime.
Y Axis splits also apply to a noun approach. Within a simple e-commerce site data can be split by product catalog, product inventory, user account information, marketing information, and so on.
While Y axis splits are most useful in scaling data sets, they are also useful in scaling code bases. Because services or resources are now split, the actions performed and the code necessary to perform them are split up as well. This works very well for small Agile development teams as each team can become experts in subsets of larger systems and don’t need to worry about or become experts on every other part of the system.
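A Y-axis split can be sketched as routing by function, with each "verb" owned by its own service. The verb-to-service mapping below is a hypothetical example for the simple e-commerce site discussed above:

```python
# Sketch of a Y-axis split: route by function ("verb") so each service
# owns its own code and data. Service names are illustrative.

SERVICE_BY_VERB = {
    "signup":   "identity-service",
    "login":    "identity-service",
    "search":   "search-service",
    "browse":   "catalog-service",
    "add":      "cart-service",
    "checkout": "checkout-service",
}

def route_by_function(verb):
    """Each functional split can scale, fail, and deploy independently."""
    try:
        return SERVICE_BY_VERB[verb]
    except KeyError:
        raise ValueError(f"no service owns the '{verb}' function")

print(route_by_function("search"))
print(route_by_function("checkout"))
```

Because each service owns a subset of the code and data, a small Agile team can own search end to end without ever touching checkout.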
Z Axis – Separate Similar Things
Z Axis splits are effective at helping you to scale customer bases but can also be applied to other very large data sets that can’t be pulled apart using the Y Axis methodology. Z Axis separation is useful for containerizing customers or a geographical replication of data. If Y Axis splits are the layers in a cake with each verb or noun having their own separate layer, a Z Axis split is having a separate cake (sharding) for each customer, geography, or other subset of data.
This means that each larger customer or geography could have its own dedicated Web, application, and database servers. Given that we also want to leverage the cost efficiencies enabled by multitenancy, we also want to have multiple small customers exist within a single shard which can later be isolated when one of the customers grows to a predetermined size that makes financial or contractual sense.
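The multitenancy-with-isolation idea can be sketched as a shard lookup: small customers hash into shared shards, while large customers are carved out into dedicated ones. The tenant names and shard count are invented for illustration:

```python
# Sketch of a Z-axis split: small customers share hash-assigned shards,
# large customers get dedicated shards. Tenants and counts are made up.
import hashlib

DEDICATED = {"mega-corp": "shard-mega-corp"}  # isolated large tenants
SHARDS = 4                                    # shared multi-tenant shards

def shard_for(customer_id):
    """Route a tenant to its dedicated shard or a stable shared shard."""
    if customer_id in DEDICATED:
        return DEDICATED[customer_id]
    # A stable hash keeps each small tenant on the same shard every time.
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return f"shard-{int(digest, 16) % SHARDS}"

print(shard_for("mega-corp"))
print(shard_for("acme"))  # same tenant always maps to the same shard
```

When a small tenant outgrows its shared shard, isolation is just an entry added to the dedicated map plus a one-time data migration.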
For hyper-growth companies, the speed with which any request can be answered is at least partially determined by the cache hit ratio of near and distant caches. This speed in turn determines how many transactions any given system can process, which in turn determines how many systems are needed to process a given number of requests.
Splitting up data by geography or customer allows each segment higher availability, scalability, and reliability as problems within one subset will not affect other subsets. In continuous deployment environments, it also allows fragmented code rollout and testing of new features a little at a time instead of an all-or-nothing approach.
This is a quick and dirty breakdown of Scalability Rules that have been applied at thousands of successful companies and provided near infinite scalability when properly implemented. We love helping companies of all shapes and sizes (we have experience with development teams of 2-3 engineers to thousands). Contact us to explore how we can help guide your company to scale your organization, processes, and technology for hyper growth!