GROWTH BLOG: AKF Partners Announces European Expansion

Growth Blog

Scalability and Technology Consulting Advice for SaaS and Technology Companies

The Best Definition of Done for Agile

October 24, 2019  |  Posted By: Marty Abbott

Many of our clients struggle with ensuring that the solutions they create meet the needs of both their business and their customers.  The exact symptoms of the above failure vary between clients, and are best explained using anonymous quotes:

We make an initial release of a solution and we never return to it – it feels like it just doesn’t get ‘done’

We rarely measure the success or failure of the solutions we create.  Instead we look at business performance – we either hit our numbers or we didn’t.  If we don’t hit the numbers, it’s the fault of engineering and product.  If we hit our numbers, sales gets credit

I don’t get it – we get a lot of velocity out of our engineering team, but we always have availability problems

A likely cause

The above quotes all point to very different symptoms of a common problem.  One concerns the level of investment in the product and bringing it to maturity, one a lack of value measurement and allocation of recognition, and the last the availability and quality of the solutions that the team creates.  In our experience, the most common cause of each of these symptoms is that the Agile “Definition of Done” is undefined, implicitly defined or, most commonly, incompletely defined.

The purpose of “Done”

“Done” consists of a number of criteria to be met before any element of work (e.g. a story) is considered “complete” and can be counted for the purposes of velocity.
The benefits of “Done”
• Defining done removes ambiguity and uncertainty, and forces developers, product owners, and scrum team members to align on standards for completion.
• Helps foster good and efficient discussion around paths, tradeoffs, and completion during standups.
• If employed consistently between teams, helps to align and make useful velocity related metrics.
• Limits rework, or post-sprint work for important elements of the solution in question.
• When combined with velocity, incorporates a notion of “earned value” – giving teams an incentive to complete a solution rather than simply accounting for “effort expended” at the actual expense of that effort.  Put another way, teams don’t get credit for work until something has been completed.

A typical generic definition of “Done”

A solution is done when it:
• Is implemented to standards
• Has been code reviewed consistent with standards
• Has automated unit tests created to the unit coverage standard (70+%)
• Has passed automated integration testing and all other continuous integration checks
• Has all necessary support and end user documentation complete
• Has been reviewed by the product owner

Necessary but Insufficient Definition

The above definition, while necessary, is incomplete and therefore insufficient for the needs of a company.  The definition fails to account for:
• Business value creation (is it really “done” if we don’t achieve the desired results?)
• Non-functional requirements necessary to produce value such as availability, scalability, response time, cost of maintenance (cost of operations or goods sold), etc.
As a result, because metrics tied to value creation are not available, attribution of credit for results becomes a subjective process.

Towards a Better Definition

Given the above, a better definition of done should include:
• Non-functional requirements necessary to achieve value creation
• Evaluation that the solution achieves some desired result that is ideally also incorporated into the stories themselves (some measurement or outcome the effort is meant to achieve)
Modifying the prior definition, we might now have:
A solution is done when it:
• Is implemented to standards
• Has been code reviewed consistent with standards
• Has automated unit tests created to the unit coverage standard (70+%)
• Has passed automated integration testing and all other continuous integration checks
• Has delivered all necessary support and end user documentation
• Has been reviewed by the product owner
• Meets the response time objective to end users at peak traffic
• Meets the availability target for one week in production and has passed an availability review
• Meets the cost of goods sold target after one week for infrastructure or IaaS costs
• Meets all other company NFRs (above are examples)
• Shows progress towards or achieves the business metrics it was meant to achieve (may be none for a partial release, or full metrics for the completion of an epic)
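
The expanded definition can be treated as a checklist that is evaluated once the one-week production window has closed. The following is a minimal sketch, assuming hypothetical criterion names and a hypothetical `Story` record that simply mirror the list above; none of these names come from a specific Agile tool.

```python
from dataclasses import dataclass, field

# Hypothetical criterion names mirroring the list above; adapt them to your own standards.
DONE_CRITERIA = (
    "implemented_to_standards",
    "code_reviewed",
    "unit_coverage_met",
    "ci_checks_passed",
    "documentation_delivered",
    "product_owner_reviewed",
    "response_time_met_at_peak",
    "availability_target_met",
    "cogs_target_met",
    "other_nfrs_met",
    "business_metric_progress",
)


@dataclass
class Story:
    name: str
    points: int
    # Criteria verified so far (a subset of DONE_CRITERIA).
    satisfied: set = field(default_factory=set)

    def missing_criteria(self):
        """Return the criteria not yet satisfied, in checklist order."""
        return [c for c in DONE_CRITERIA if c not in self.satisfied]

    def is_done(self):
        """A story counts as done only when every criterion is satisfied."""
        return not self.missing_criteria()


# A story that meets the original six criteria but none of the new ones.
story = Story("one-click checkout", points=5, satisfied=set(DONE_CRITERIA[:6]))
print(story.is_done())  # False
print(story.missing_criteria())
```

A story that would have counted as done under the generic definition stays open here until the non-functional and business criteria are verified.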

Who is Responsible for Evaluating “Done”

This is an agile process, so the team is responsible for their own measurement.  In most teams this means the PO and Scrum Master.  As a business we also “trust and verify”, so leaders should double-check that value metrics have indeed been met in normal operations reviews. 

The Cons of the “Right Definition of Done”

The largest impact is to the time one can realize the earned value component of velocity.  Here we’ve been careful to say that the bound is one week after delivery such that velocity is just pushed out by a week to allow for evaluation in production.  In addition, there is a bit more record keeping (ostensibly for a scrum master and product owner to evaluate) but the cost of that is incredibly low relative to the alignment to business objectives and customer needs.

Keeping the Old Velocity Metric

There is some value in understanding what gets completed as well as what is truly “done”.  If this is the case, just track both velocities.  Call one “release velocity” and the other “done velocity” or “value velocity”.  The overhead is not that high – scrum masters should easily be able to do this.  Now you’ll have metrics to help you understand the gap between what you release first time versus when something finally creates value.  This gap is as useful for problem identification as “find-fix” charts in evaluating completion of quality assurance checks.
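
Tracking both numbers takes very little bookkeeping. Here is a small sketch, assuming each story is recorded with its points and two flags; the sample data is invented purely for illustration.

```python
# Each tuple: (story points, released this sprint?, met the full definition of done?)
# The values below are invented for illustration only.
stories = [
    (5, True, True),
    (3, True, False),
    (8, True, True),
    (2, False, False),
]

# Release velocity counts everything that shipped.
release_velocity = sum(points for points, released, _ in stories if released)

# Done (value) velocity counts only shipped work that also met every "done" criterion.
done_velocity = sum(points for points, released, done in stories if released and done)

# The gap is the released work that has not yet proven its value.
gap = release_velocity - done_velocity

print(release_velocity, done_velocity, gap)  # 16 13 3
```

Because the “done” evaluation happens a week after delivery, the done velocity for a sprint would be filled in one week later than its release velocity.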

The Biggest Reason for The Right Definition of Done

Hopefully the answer to this is somewhat obvious:  By changing the definition of done, we align ourselves to both our customer and business needs.  It helps engineers focus on customer outcomes – rather than just how something should “work”.  Engineers too often focus on a problem from their perspective forward rather than from the customer needs backwards.

Forcing architects to think back from the customer rather than forward from the engineer helps solve problems associated with response time and availability (some of the NFRs above).

Subscribe to the AKF Newsletter

Contact Us

The 5-95 Rule

September 26, 2019  |  Posted By: Marty Abbott

Picture of short term paved vs long term off road
The Problem – Too Much Planning, Too Little Execution

How many of you spend a significant portion of your year planning for the next year, two years, or five years of activities?  How often are these plans useful beyond the first three to four months of execution?

We have many large clients who will begin one-, two-, or (gasp!) five-year plans in July or August of the current year.  They spend a significant amount of effort creating these plans over a five-to-six-month span of time.  The plans are often very specific as to what they will do: what projects they will deliver, what products they will create, how many people they will hire, what training their teams will undertake, etc.  The plan is typically well followed in months 1 and 2 and starts to degrade significantly in month 3.  By month 6, just before they start the next annual planning cycle, the original plan is at best 50% accurate; the original projects have been replaced, new market intelligence has informed different product solutions, new skills and different teammates are needed, etc.

Hurricanes always have an associated cone of uncertainty.  The current position of the hurricane and current direction and velocity are well known.  But several factors may cause the hurricane to act differently an hour or a day from now than it is behaving at exactly this moment.  The same is true with businesses.  We know what we need to do today to maintain our position or gain market share, but those activities may change in priority and number in the next handful of months.

So why do we spend so much time on solutions and approaches when at best 25% of the plan we produce is accurate?  We don’t have to waste time as we do today; there is a better way.

Financial vs Operational Plans

First, let’s acknowledge that there is a difference between a financial plan (how much we will spend as a company and what we expect to make as a return on that spend) and an operational plan.  The board of directors for your company has a fiduciary responsibility to exercise, in non-expert legal terms:

  1. A duty of loyalty – the director must put the interests of the institution and its shareholders before his or her own.
  2. A duty of care – the director must behave prudently, diligently and with skill.
  3. A duty of obedience – the director must ensure consistency with the purpose of the company – and in a for-profit company, this means ensuring profitability.

Any board of directors, to ensure they are consistent with the law, will require a financial plan.  At the very least, they need to govern the spend of the company relative to its revenues to ensure profitability, and ideally over time, an increase in profitability.  But that does not mean they need to go into great detail regarding the exact path and actions to achieve the financial plan.  We all likely agree that we also have a duty of loyalty, care, and obedience to ourselves and our families – but how many of us go beyond creating an annual budget (financial plan) for any given year?

One of the best known and most successful directors and investors of all time, Warren Buffett, has what amounts to (in another author’s terms) a list of “10 Commandments” for boards and directors.  A quick scan of these makes it clear that, in Buffett’s view, a board’s focus should be on the performance of the CEO and the company itself – not the detailed operational plan to achieve a financial plan.  The board does arguably need to ensure that a strategy exists and is viable – but a strategy need not be a list of tasks for every subordinate organization for an entire year.  In fact, given the arguments above, such a task list (or deep operational plan) won’t be followed past a handful of months anyway.

The Fix: Reduce Planning, Increase Execution

If the problem is too much time wasted creating plans that are good for only a short period of time, the fix should be obvious.  For this, we offer the AKF 5-95 Rule:  spend 5 percent of your time planning and 95% of your time executing.  This stands in stark contrast to the “Soviet-esque” way in which many companies operate with executives spending as much as 25 percent of a year involved in financial and execution plans. 

  1. Decrease the horizon (focused endpoint) of planning and decrease the specificity of plans. Take a portion of the five percent of your total time and create a good financial plan of what you would like to achieve.  The remainder should be used to iteratively identify the short-term paths to achieve that plan using windows no greater than 3 months.  Anything beyond 3 months has a high degree of waste.
  2. Adopt development methodologies that maximize execution value.  Adopt Agile development methodologies meant to embrace low levels of specificity and rely on discovery to identify the “right solution” to maximize market adoption. 

The best way to get the most from the AKF 5-95 Rule is to implement OKRs – (O)bjectives and (K)ey (R)esults – as a business, the bowling alley methodology of product focus, and Agile product development practices.

Focus Versus Agility in Business, Product Management and Product Development

September 16, 2019  |  Posted By: Marty Abbott

Focus vs Agility
Two of the most common statements we hear from our clients are:

Business: “Our product and engineering teams lack the agility to quickly pivot to the needs of the business”.

Product and Engineering: “Our business lacks the focus and discipline to complete any initiative.  We are subject to the ‘Bright Shiny Object’ (BSO) or ‘Squirrel!’ phenomenon”.

These two teams seem to be at an impasse in perspective requiring a change by one team or the other for the company to be successful.

Not true.

Companies need both focus and agility to be successful.  While these two concepts may appear to be in conflict, a team need only three things to break the apparent deadlock:

  1. Shared Context.
  2. Shared agreement as to the meaning of some key terms.
  3. Three process approaches across product, the business, and engineering. 

First, let’s discuss a common context within which successful businesses in competitive environments operate.  Second, we’ll define a common set of terms that should be agreed upon by both the business and engineering.  Finally, we’ll dig into the approaches necessary to be successful.

Business Context

Successful businesses operating within interesting industries attract competition.  Competitors seek innovative approaches to disrupt each other and gain market share within the industry.  Time to market (TTM) in such an environment is critical, as the company that finds an approach (feature, product, etc.) to shift or gain market share has a compelling advantage for some period.  As such, any business in a growth industry must be able to move and pivot quickly (be agile) within its product development initiatives.  Put another way, businesses that can afford to stick to a dedicated plan likely are not in a competitive or growing segment, probably don’t have competition, and aren’t likely attractive to investors or employees.

Important Terms

Power of Focus - A man's face clearly with intense focus
The focus that matters within business is a focus on outcomes.  Why focus on outcomes instead of the path to achieve them?  Focusing on a path implies a static path, and when is the last time you saw a static path be successful? (Hint:  most of us have never seen a static path be successful).  Obviously, sometimes outcomes need to change, and we need a process by which we change desired outcomes.  But outcomes should change much less frequently than path. 

Agility enables changing directions (paths) to achieve focused outcomes.  ‘Nuff said.

Key Approaches

Commonly known as (O)bjectives and (K)ey (R)esults, or in AKF parlance Outcomes and Key Results, OKRs are the primary mechanism of focus while allowing for some level of agility in changing outcomes for business needs.  Consider the O (objectives or outcomes) as the thing upon which a company is focused, and the Key Results as the activities to achieve those outcomes.  KRs should change more frequently than the Os as companies attempt to define better activities to achieve the desired outcomes.  An objective/outcome could be “Improve Add-To-Cart/Search ratio by 10%”.

Each objective/outcome should have 3 to 5 supporting activities.  For the add-to-cart example above, the activities may implement personalization to drive 3% improvement, add re-targeting for a net 4% improvement, and improve descriptive meta-tags in search for a 3% improvement. 

OKRs help enforce transparency across the organization and help create the causal roadmap to success.  Subordinate organizations understand how their initiatives nest into the high-level company objectives by following the OKR “tree” from leaf to root.  By adhering to a strict and small number of high-level objectives, the company creates focus.  When tradeoffs must happen, activities not aligned with high-level objectives get deprioritized or deferred.
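
The add-to-cart example can be written down as a small tree to check that the planned activities actually add up to the objective. This is a minimal sketch: the `Objective` and `KeyResult` classes are hypothetical, and simply summing the expected gains is our simplifying assumption (real improvements may overlap or compound).

```python
from dataclasses import dataclass, field


@dataclass
class KeyResult:
    description: str
    expected_gain_pct: float  # estimated contribution toward the objective


@dataclass
class Objective:
    description: str
    target_gain_pct: float
    key_results: list = field(default_factory=list)

    def planned_gain_pct(self):
        """Sum of expected gains (assumes the gains are simply additive)."""
        return sum(kr.expected_gain_pct for kr in self.key_results)

    def is_covered(self):
        """True when the planned activities reach the target."""
        return self.planned_gain_pct() >= self.target_gain_pct

    def has_focused_scope(self):
        """True when the objective has the suggested 3 to 5 supporting activities."""
        return 3 <= len(self.key_results) <= 5


objective = Objective(
    "Improve Add-To-Cart/Search ratio by 10%",
    target_gain_pct=10.0,
    key_results=[
        KeyResult("Implement personalization", 3.0),
        KeyResult("Add re-targeting", 4.0),
        KeyResult("Improve descriptive meta-tags in search", 3.0),
    ],
)

print(objective.planned_gain_pct())   # 10.0
print(objective.is_covered())         # True
print(objective.has_focused_scope())  # True
```

If a key result is dropped or its estimate falls, `is_covered()` turns false, which is the signal to find a better activity rather than to change the objective.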

Bowling Alley: 
Geoffrey Moore outlines an approach for product organizations to stay focused in their product development efforts.  When combined with the notion of a Minimum Viable Product the approach is to stay focused on a single product, initially small, focused on the needs of the pioneers within the technology adoption lifecycle (TALC) for a single target market or industry.

The Technology Adoption Lifecycle Graph

The single product for a single industry (P1T1) or need is the headpin of the bowling alley.  The company maintains focus on this until such time as they gain significant adoption within the TALC – ideally a beachhead in the early majority of the TALC. 

Bowling Alley Headpin - small product, small market

Only after significant adoption through the TALC (above) does the company then introduce the existing product to target market 2 (P1T2) and begin work on product 2 (or significant extension of product 1) in target market 1 (P2T1).

Bowling Alley 2d Pin - same product new small market

Bowling Alley 3d Pin - old market, larger or new product

Agile Methods: 
While OKRs and the Bowling Alley help create focus, Agile product methodologies help product and engineering teams maintain flexibility and agility in development.  Epics and stories map to key results within the OKR framework.  Short duration development cycles help limit the loss in effort associated with changing key results and help to provide feedback as to whether the current path is likely to meet the objectives and key results within OKRs.  Backlogs visible to any Agile team are deep enough to allow for grooming and sizing, but shallow enough such that churn and the resulting morale impact do not jeopardize the velocity of development teams.

Putting it all together:

There is no discrepancy between agility and focus if you:

  • Agree to shared definitions of both agility and focus per above
  • Jointly agree that both agility and focus are necessary
  • Implement OKRs to aid with both agility and focus
  • Employ an Agile methodology for product and product development
  • Use the TALC in your product management efforts and to help enforce focus on winning markets

Architectural Principles: Build vs. Buy

September 6, 2019  |  Posted By: Pete Ferguson

build vs buy graphic
In many of our technical due diligence engagements, it is common to find that companies are building tools with considerable development effort (and ongoing maintenance) for something that is not part of their core strength and thus does not provide a competitive advantage. What criteria does your organization use in deciding when to build vs. buy?

If you perform a simple web search for “build vs. buy” you will find hundreds of articles, process flows, and decision trees on when to build and when to buy. Many of these are cost-centric decisions including discounted cash flows for maintenance of internal development and others are focused on strategy. Some of the articles blend the two.

Buy When Non-Core – If you aren’t the best at building it and it doesn’t offer a competitive differentiation, buy it. Regardless of how smart you and your team are – you simply aren’t the best at everything.

We have many examples from our customers developing load balancing software, building their own databases, etc. In nearly every case, a significant percentage of the engineering team (and engineering cost) goes into a solution that:

  1. Does not offer long term competitive differentiation
  2. Costs more than purchasing an existing product
  3. Steals focus away from the engineering team
  4. Is not aligned with the skills or business level outcomes of the team

If You Can’t Beat Them - Join Them
(or buy, rent, or license from them)

Here is a simple set of questions that we often ask our customers to help them with the build v. buy decision:

1. Does this “thing” create strategic differentiation for us?

Shiny object distraction is a very real thing we observe regularly. Companies start - innocently enough - building a custom tool in a pinch to get them by, but never go back and reassess the decision. Over time the solution snowballs and consumes more and more resources that should be focused on innovating strategic differentiation.

  • We have yet to hear a tech exec say “we just have too many developers, we aren’t sure what to do with them.”
  • More often than not “resource constraints” is mentioned within the first few hours of our engagements.
  • If building instead of buying is going to distract from focusing efforts on the next “big thing” – then 99% of the time you should just stop here and attempt to find a packaged product, open-source solution, or outsourcing vendor to build what you need.

If, after reviewing these points, the answer is “Yes, it will provide a strategic differentiation,” then proceed to question 2.

2. Are we the best company to build this “thing”?

This question helps inform whether you can effectively build it and achieve the value you need. This is a “core v. context” question; it asks both whether your business model supports building the item in question and also if you have the appropriate skills to build it better than anyone else.

For instance, if you are a social networking site, you probably don’t have any business building relational databases for your own use. Go to question number (3) if you can answer “Yes” to this question and stop here and find an outside solution if the answer is “No”.

And please, don’t fool yourself – if you answer “Yes” because you believe you have the smartest people in the world (and you may), do you really need to dilute their efforts by focusing on more than just the things that will guarantee your success?

We can differentiate with our own solution, but don’t need to build everything (e.g. shopping cart or reviews)

3. Are there few or no competing products to this “thing”?

We know the question is awkwardly worded – but the intent is to be able to exit these four questions by answering “yes” everywhere in order to get to a “build” decision.

  • If there are many providers of the “thing” to be created, it is a potential indication that the space might become a commodity.
  • Commodity products differ little in feature sets over time and ultimately compete on price which in turn also lowers over time.
  • A “build” decision today will look bad tomorrow as features converge and pricing declines.

If you answer “Yes” (i.e. “Yes, there are few or no competing products”), proceed to question (4).


4. Can we build this “thing” cost effectively?

  • Is it cheaper to build than buy when considering the total lifecycle (implementation through end-of-life) of the “thing” in question? Many companies use cost as a justification, but all too often they miss the key points of how much it costs to maintain a proprietary “thing”, “widget”, “function”, etc.
  • If your business REALLY grows and is extremely successful, do you want to be continuing to support internally-developed monitoring and logging solutions, mobile architecture, payments, etc. through the life of your product?

Don’t fool yourself into answering this affirmatively just because you want to work on something “neat.” Your job is to create shareholder value – not work on “neat things” – unless your “neat thing” creates shareholder value.

There are many more complex questions that can be asked and may justify the building rather than purchasing of your “thing,” but we feel these four questions are sufficient for most cases.

A “build” decision is indicated when the answers to all 4 questions are “Yes.”

We suggest seriously considering buying or outsourcing (with appropriate contractual protection when intellectual property is a concern) anytime you answer “No” to any question above.
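
The four-question test reduces to a short-circuit check: the first “No” ends the evaluation. Below is a minimal sketch, assuming hypothetical field names that paraphrase the four questions (strategic differentiation, being best placed to build, few or no competing products, and lifecycle cost); “buy” here stands for buying, licensing, or outsourcing.

```python
from dataclasses import dataclass


@dataclass
class BuildVsBuyAnswers:
    creates_strategic_differentiation: bool
    best_positioned_to_build: bool
    few_or_no_competing_products: bool
    cheaper_to_build_over_lifecycle: bool


def decide(answers):
    """Return ("build", None) only if every answer is yes.

    Otherwise return ("buy", label) where label names the first "No".
    """
    ordered = [
        ("strategic differentiation", answers.creates_strategic_differentiation),
        ("best positioned to build", answers.best_positioned_to_build),
        ("few or no competing products", answers.few_or_no_competing_products),
        ("cheaper to build over lifecycle", answers.cheaper_to_build_over_lifecycle),
    ]
    for label, answer in ordered:
        if not answer:
            return ("buy", label)
    return ("build", None)


# A homegrown logging stack: arguably cheap to start, but the market is crowded.
print(decide(BuildVsBuyAnswers(True, True, False, True)))
# ('buy', 'few or no competing products')

print(decide(BuildVsBuyAnswers(True, True, True, True)))
# ('build', None)
```

Returning the first failing question keeps the reason for the decision on record, which helps when the decision is reassessed later.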


While startups and small companies roll their own tools early on to get product out the door, as they grow, the timeline of planning (and related costs) needs to increase from the next sprint to a longer-term annual and multi-year strategy. That, plus growth, tips the scale to buy instead of build. The more internal products a company produces and supports, the more tech debt it accrues, distracting medium-to-large organizations from competing against the next startup.

While building custom tools and products seems to make sense in the immediate term, the long-term strategy and desired outcome of your organization need to be fully weighted in the decision process. Distraction from focus is the number one harm we have seen with our clients as they fall behind the competition and burn sprint cycles on maintaining products that don’t move the needle with their customers. The crippling cost of distractions is what causes successful companies to lose their competitive advantage and slip into oblivion.

Like the ugly couch your auntie gave you for your first apartment, it can often be difficult to assess what makes sense without an outside opinion. Contact us, we can help!

Measuring Availability

September 2, 2019  |  Posted By: Greg Fennewald

tape measurer graphic measuring the word availability
As a company matures from a startup to a growing business, there are a number of measurables that become table stakes – basic tools for managing a business.  These measurables include financial reporting statements, departmental budgets, KPIs, and OKRs.  Another key measurable is the availability of your product or service and this measurable should be owned by the technology team.

When we ask clients about availability goals or SLAs, some do not have them documented and say something along the lines of “we want our service to always be available”.  While a nice sentiment, unblemished availability is virtually impossible to achieve and prohibitively expensive to pursue.  Availability goals must be relevant to the shared business outcomes of the company.

If you are not measuring availability, start.  If nothing else, the data will inform what your architecture and process can do today, providing a starting point if the business chooses to pursue availability improvements.

Some clients who do have an availability measurable use a percentage of clock time – 99.95% for example.  This is certainly better than no measurable at all, but still leaves a lot to be desired.

Reasons why clock time is not the best measure for availability:

  • Units of time are not equal in terms of business impact – a disruption during the busiest part of the day would be worse than an issue during a slow period.  Most companies already know this intuitively: they schedule maintenance windows for late at night or early in the morning, periods where the impact of disruption is smaller.
  • The business communicates in business terms (revenue, cost, margin, return on investment) and these terms are measured in dollars, not clock time.
  • Using the uptime figure from a server or other infrastructure component as an availability measure is inaccurate because it does not capture software bugs or other issues rendering your service inoperative despite the server uptime status.
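
The first point is easy to show with numbers. The sketch below compares two outages of equal length using a hypothetical hourly revenue profile (the percentages are invented and sum to 100). It assumes revenue is spread evenly within each hour and that the outage stays within a single hour.

```python
# Hypothetical percent of daily revenue earned in each hour of the day (24 values).
# Invented profile: quiet overnight, busier midday, busiest in the evening.
hourly_revenue_pct = [1] * 6 + [3] * 4 + [6] * 4 + [5] * 4 + [8] * 4 + [3] * 2


def clock_time_availability(outage_minutes, period_minutes=24 * 60):
    """Availability as a fraction of clock time over the period."""
    return 1 - outage_minutes / period_minutes


def revenue_at_risk_pct(outage_start_hour, outage_minutes):
    """Percent of the day's revenue exposed by an outage within one hour."""
    return hourly_revenue_pct[outage_start_hour] * outage_minutes / 60


# Two 30-minute outages: one at 03:00, one at 19:00.
print(round(clock_time_availability(30), 4))  # 0.9792 for both outages
print(revenue_at_risk_pct(3, 30))             # 0.5
print(revenue_at_risk_pct(19, 30))            # 4.0
```

Both outages produce the same clock-time availability, yet under this profile the evening outage puts eight times as much revenue at risk.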

slide showing outage of equal time plotted against company revenue at time of outage

Now that we’ve established that availability should be measured and that clock time is not the best unit of measure, what is a better choice?  Transactional metrics aligned to the desired business outcome are the better choice.

Transactional Metrics

  • Rates – log transactional rates such as logins, add to cart, registration, downloads, orders, etc.  Apply Statistical Process Control or other analysis methods to establish thresholds indicating an unusual deviation in the transaction rate.
  • Ratios – the proportion of undesired or failed outcomes such as failed logins, abandoned shopping carts, and HTTP 400s can be useful for measuring the quality of service.  Analysis of such ratios will establish unusual deviation levels.
  • Patterns – transaction patterns can identify expected activity, such as order rates increasing when an item is first available for sale or download rates increasing in response to a viral social media video.  The absence of an expected pattern change can signal an availability issue with your product or service.
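
As an illustration of the threshold idea for rates, here is a simplified sketch that flags a transaction rate falling outside the mean plus or minus three standard deviations of a baseline. This is only one basic form of control limit, not a full Statistical Process Control implementation, and the baseline values are invented.

```python
from statistics import mean, pstdev


def control_limits(samples, sigmas=3.0):
    """Return (lower, upper) limits at the given number of standard deviations."""
    center = mean(samples)
    spread = pstdev(samples)
    return center - sigmas * spread, center + sigmas * spread


def is_unusual(value, samples, sigmas=3.0):
    """True when the value falls outside the control limits of the baseline."""
    lower, upper = control_limits(samples, sigmas)
    return value < lower or value > upper


# Hypothetical logins per minute at the same time of day on previous days.
baseline = [98, 102, 100, 97, 103, 101, 99, 100]

print(is_unusual(100, baseline))  # False
print(is_unusual(40, baseline))   # True
```

In practice the baseline would be chosen per time of day and day of week, so that a normal overnight lull is not flagged as an incident.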

Alignment with Desired Outcomes

What are the goals of your business?  What is your value proposition?  Choose metrics that comprehensively measure the availability of your product or service.  Examples include the ability of a customer to buy a product from your website (login, search, add to cart, and check out), the proportion of file downloads successfully completed in less than 4 seconds, and the success rate of posting a message to a social media platform and the ability of others to view it.  Measuring availability with metrics aligned with the desired outcomes keeps the big picture at the forefront and helps business colleagues understand how the technology team contributes to value creation.
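
Taking the file download example, an outcome-aligned availability figure is simply the share of attempts that succeeded within the target time. A minimal sketch follows, assuming a hypothetical list of download durations in which `None` marks a failed attempt.

```python
def transactional_availability(durations_seconds, threshold_seconds=4.0):
    """Fraction of attempts that completed in under the threshold.

    A duration of None marks a failed attempt and counts against availability.
    """
    if not durations_seconds:
        raise ValueError("no transactions observed")
    good = sum(
        1 for d in durations_seconds
        if d is not None and d < threshold_seconds
    )
    return good / len(durations_seconds)


# Invented sample: ten download attempts, two failures and one slow completion.
downloads = [1.2, 3.9, None, 2.5, 6.0, 0.8, 3.1, None, 2.2, 1.0]

print(transactional_availability(downloads))  # 0.7
```

A server uptime figure could read 100% for the same period; the transactional figure shows that only 70% of customers got the outcome they came for.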


Not measuring availability is bad.  Measuring it in clock time is better, but still leaves something to be desired.  Measuring availability with transactional metrics tied to the desired business outcome is best.  Don’t settle for better when you can be best.

Interested in learning more?  Struggling with analyzing data?  Unsure of how to apply architectural principles to achieve higher availability?  Contact us, we’ve been in your shoes.

(Image Credit: Sarah Pflug from Burst)

Learning from Failure: Conducting Postmortems

August 21, 2019  |  Posted By: Bill Armelin

picture of a detective looking through a magnifying glass

At AKF Partners, we believe in learning aggressively, not just from your successes, but also your failures. One common type of failure we see is the service-disrupting incident. These are the events that either make your systems unavailable or significantly degrade performance for your customers. They result in lost revenue, poor customer satisfaction and hours of lost sleep. While there are many things we can do to reduce the probability of an incident occurring or the impact if it does happen, we know that all systems fail.

We like to say, “An incident is a terrible thing to waste.” The damage is already done. Now, we need to learn as much as we can about the causes of the incident to prevent the same failures from happening again. A common process for determining the causes of failure and preventing them from reoccurring is the postmortem. In the Army, it is called an After-Action Review. In many companies it is called a Root Cause Analysis. It doesn’t matter what you call it, as long as you do it.

We actually avoid using the term Root Cause Analysis. Many of our clients that use the term focus too much on finding that one “root cause” of the issue. There will never be a single cause of an incident. There will always be a chain of problems with a trigger or proximate event. This is the one event that causes the system to finally topple over. We need a process that digs into the entire chain of events inclusive of the trigger. This is where the postmortem comes in. It is a cross-functional brainstorming meeting that not only identifies the root causes of a problem, but also helps in identifying issues with process and training.

Postmortem Process – TIA

The purpose of a good postmortem is to find all of the contributing events and problems that caused an incident. We use a simple three step process called TIA. TIA stands for Timeline, Issues, and Actions.

AKF postmortem process

First, we create a timeline of events leading up to the issue, as well as the timeline of all the actions taken to restore service. There are multiple ways to collect the timeline of events. Some companies have a scribe that records events during the incident process. Increasingly, we are seeing companies use chat tools like Slack to record events related to restoration. The timestamp on each Slack message is a good source from which to extract the timeline. Don’t start your timeline at the beginning of the incident. It starts with the activities prior to the incident that cause the triggering event (e.g. a code deployment). During the postmortem meeting, augment the timeline with additional details.
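
However the events are collected, the first step is to put them in order and measure the gaps. The sketch below assumes a hypothetical list of timestamped notes (for example, copied from a scribe’s log or chat history); the timestamps, event kinds, and wording are all invented for illustration.

```python
from datetime import datetime

# Hypothetical incident log, deliberately out of order.
# Each entry: (timestamp, kind, note), where kind is one of
# "trigger", "detect", "action", "restore".
raw_events = [
    ("2019-08-20 14:32", "detect", "Alert fires on checkout error rate"),
    ("2019-08-20 14:05", "trigger", "Code deployment to checkout service"),
    ("2019-08-20 15:10", "restore", "Rollback complete, service restored"),
    ("2019-08-20 14:45", "action", "Incident call opened"),
]

# Parse the timestamps and sort so the timeline starts before the incident itself.
timeline = sorted(
    (datetime.strptime(stamp, "%Y-%m-%d %H:%M"), kind, note)
    for stamp, kind, note in raw_events
)


def minutes_between(start_kind, end_kind):
    """Minutes from the first event of one kind to the first event of another."""
    start = next(when for when, kind, _ in timeline if kind == start_kind)
    end = next(when for when, kind, _ in timeline if kind == end_kind)
    return int((end - start).total_seconds() // 60)


for when, kind, note in timeline:
    print(when.strftime("%H:%M"), kind, note)

print("minutes to detect:", minutes_between("trigger", "detect"))    # 27
print("minutes to restore:", minutes_between("trigger", "restore"))  # 65
```

Long gaps between the trigger, detection, and restoration are natural places to start looking for issues in the next step.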

The second part of TIA is Issues. This is where we walk through the timeline and identify issues. We want to focus on people, process, and technology. We want to capture all of the things that either allowed the incident to happen (e.g. lack of monitoring), directly triggered it (e.g. a code push), or increased the time to restore the system to a stable state (e.g. couldn’t get the right people on the call). List each issue separately. At this point, there is no discussion about fixing issues; we focus only on the timeline and on identifying issues. There is also no reference to ownership, and we don’t want to assign blame. We want a process that provides constructive feedback to solve problems.

Avoid the tendency to find a single triggering event and stop. Make sure you continue to dig into the issues to determine why things happened the way they did. We like to use the “5-whys” methodology to explore root causes. This entails repeatedly asking questions about why something happened. The answer to one question becomes the basis for the next. We continue to ask why until we have identified the true causes of the problems.

AKF postmortem process

Postmortem Anti-Patterns

Here is a summary of anti-patterns we see when companies conduct postmortems:

| Anti-Pattern | Best Practice |
| --- | --- |
| Not conducting a postmortem after a serious (e.g. Sev 1) incident | Conduct a postmortem within a week after a serious incident |
| Assigning blame | Avoid blame and keep it constructive |
| Not having the right people involved | Assemble a cross-functional team of people involved or needed to resolve problems |
| Using a postmortem block (e.g. multiple postmortems during a 1-hour session every two weeks) | Dedicate time for a postmortem based on the severity of the incident |
| Lack of ownership of identified tasks | Make one person accountable to complete a task within an appropriate timeframe |
| Not digging far enough into issues (finding a single root cause) | Use the 5-Why methodology to identify all of the causes for an issue |

Incidents will always happen. What you do after service restoration will determine if the problem occurs again. A structured, timely postmortem process will help identify the issues causing outages and help prevent their reoccurrence in the future. It also fosters a culture of learning from your mistakes without blame.

Are you struggling with the same issues impacting your site? Do you know you should be conducting postmortems but don’t know how to get started? AKF can help you establish critical incident management and postmortem processes. Call us – we can help!

Top 3 Failures in Digital Transformations

July 11, 2019  |  Posted By: Marty Abbott

Attempting to transform a company to compete effectively in the Digital Economy is difficult, to say the least.  In the experience of AKF Partners, it is easier to be “born digital” than to transform a successful, long-tenured business to compete effectively in the Digital age.

There is no single guaranteed fail-safe path to transformation.  There are, however, 10 principles by which you should abide and 3 guaranteed paths to failure. 

Avoid these 3 common mistakes at all costs or suffer a failed transformation.

Top 3 Digital Transformation Failures

Having the Wrong Team and the Wrong Structure

If you have a successful business, you very likely have a very bright and engaged team.  But unless a good portion of your existing team has run a successful “born digital” business, or better yet transformed a business in the digital age, they don’t have the experience necessary to complete your transformation in the timeframe necessary for you to compete.  If you needed lifesaving surgery, you wouldn’t bet your life on a doctor learning “on the job”.  At the very least, you’d ensure that doctor was alongside a veteran and more than likely you would find a doctor with a successful track record of the surgery in question.  You should take the same approach with your transformation.

This does not mean that you need to completely replace your team.  Companies have been successful with organization strategies that include augmenting the current team with veterans.  But you need new, experienced help, as employees on your team. 

Further, to meet the need for speed of the new digital world, you need to think differently about how you organize.  The best, fastest performing Digital teams organize themselves around the outcomes they hope to achieve, not the functions that they perform.  High performing digital teams are

It also helps to hire a firm that has helped guide companies through a transformation.  AKF Partners can help. 

Planning Instead of Doing

The digital world is ever evolving.  Plans that you make today will be incorrect within 6 months.  In the digital world, no plan survives first contact with the enemy.  In the old days of packaged software and brick and mortar retail, we had to put great effort into planning to reduce the risk associated with being incorrect after rather long lead times to project completion.  In the new world, we can iterate nearly at the speed of thought.  Whereas being incorrect in the old world may have meant project failure, in the new world we strive to be incorrect early such that we can iterate and make the final solution correct with respect to the needs of the market.  Speed kills the enemy.

Eschew waterfall models, prescriptive financial models and static planning in favor of Agile methodologies, near-term adaptive financial plans and OKRs.  Spend 5% of your time planning and 95% of your time doing.  While in the doing phase, learn to adapt quickly to failures and adjust your approach to market feedback and available data.

The successful transformation starts with a compelling vision that is outcome based, followed by a clear near-term path of multiple small steps.  The remainder of the path is unclear as we want the results of our first few steps to inform what we should do in the next iteration of steps to our final outcome.  Transformation isn’t one large investment, but a series of small investments, each having a measurable return to the business.

Knowing Instead of Discovering

Few companies thrive by repeatedly being smarter than the market.  In fact, the opposite is true – the Digital landscape is strewn with the corpses of companies whose hubris prevented them from developing the real-time feedback mechanisms necessary to sense and respond to changing market dynamics.  Yesterday’s approaches to success at best have diminishing returns today and at worst put you at a competitive disadvantage.

Begin your journey as a campaign of exploration.  You are finding the best path to success, and you will do it by ensuring that every solution you deploy is instrumented with sensors that help you identify the efficacy of the solution in real time.  Real-time data allows us to inductively identify patterns that form specific hypotheses.  We then deductively test these hypotheses through comparatively low-cost solutions, the results of which help inform further induction.  This circle of induction and deduction propels us through our journey to success.

If You Are Not Measuring, You Are Not Managing

July 10, 2019  |  Posted By: Bill Armelin

image of hands holding a caliper with the word goals in between

We are surprised at how often we go into a client and find that management does not have any metrics for their teams. The managers respond that they don’t want to negatively affect the team’s autonomy or that they trust the team to do the right things. While trusting your teams is a good thing, how do you know what they are doing is right for the company? How can you compare one team to another? How do you know where to focus on improvements?

Recently, we wrote an article about team autonomy, discussing how an empowered team is autonomous within a set of constraints. The article creates an analogy to driving a car, with the driver required to reach a specific destination, but empowered to determine WHAT path to take and WHY she takes it. She has gauges, such as a speedometer to give feedback on whether she is going too fast or too slow. Imagine driving a car without a speedometer. You will never know if you are sticking to the standard (the speed limit) or when you will get to where you need to go (velocity).

As a manager, it is your responsibility to set the appropriate metrics to help your teams navigate through the path to building your product. How can you hold your teams to certain goals or standards if you can’t tell them where they are in relation to the goal or standard today? How do you know if the actions you are taking are creating or improving shareholder value?

What metrics do you set for your teams? It is an important question. Years ago, while working at a Big 6 consulting firm, I had the pleasure of working with a very astute senior manager. We were redesigning manufacturing floors into what became Lean Manufacturing. He would walk into a client and ask them what the key metrics were. He would then proceed to tell them what their key issues were. He was always right. With metrics, you get what you measure. If you align the correct metrics with key company goals, then all is great. If you misalign them, you end up with poor performance and questionable behaviors.

So, what are the right metrics for a technology team? In 2017, we published an article on what we believe are the engineering metrics by which you should measure your teams. Some of the common metrics we focused on were velocity, efficiency, and cost. At first glance, you might think that these seem “big brother-ish.” In reality, these metrics provide your engineering teams with critical feedback on how they are doing. Velocity helps a team identify structural defects within the team (and should not be used to compare against other teams or push them to get more done). Efficiency helps the teams identify where they are losing precious development time to less valuable activities, such as meetings, interviews and HR training. It helps them and their managers quantify the impact of non-development work and reduce such activities.

Cost helps the team identify how much they are spending on technology. We have seen this metric used particularly effectively in companies deploying to the cloud. Many companies allow cloud spending to increase significantly and uncontrollably as they grow. Looking at costs exposes things like the need for autoscaling to reduce the number of instances required during off-peak times, or the need to purge unused instances that should be shut down.

The key to keeping metrics from being perceived as overbearing is to keep them transparent. The teams must understand the purpose of each metric and how it is calculated. Don’t use them punitively. Use them to help the teams understand how they are doing in relation to the larger goals.

How do you align the higher-level company goals to the work your teams are performing? We like to use Objectives and Key Results, or OKRs. This concept was created by Andy Grove at Intel and brought to Google by John Doerr. The framework aims to align higher-level “objectives” to measurable “key results.” An objective at one level has several key results. These key results become the objectives for the next level down and define another set of key results at that level. This continues all the way down to the lowest levels of the company, resulting in alignment of key results and objectives across the entire company.
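To make the cascade concrete, here is a small sketch in which each key result is itself an objective for the level below. The `Objective` class, the `outline` helper, and the sample goals are hypothetical illustrations, not part of any formal OKR tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Objective:
    title: str
    key_results: list["Objective"] = field(default_factory=list)

    def add_key_result(self, title: str) -> "Objective":
        """Add a key result; it doubles as the objective for the level below."""
        child = Objective(title)
        self.key_results.append(child)
        return child


def outline(objective: Objective, depth: int = 0) -> list[str]:
    """Return an indented outline of the cascade, one line per node."""
    lines = ["  " * depth + objective.title]
    for key_result in objective.key_results:
        lines.extend(outline(key_result, depth + 1))
    return lines


# Hypothetical example goals, for illustration only.
company = Objective("Grow recurring revenue")
retention = company.add_key_result("Raise customer retention to 95%")
retention.add_key_result("Hold checkout availability at 99.95%")
retention.add_key_result("Cut Sev 1 incidents to fewer than 2 per quarter")

print("\n".join(outline(company)))
```

Reading the outline from any leaf back up to the root shows which company objective a team-level key result supports.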

Choosing the Right Metric

Metrics-driven institutions demonstrably outperform those that rely on intuition or “gut feel.” This stated, poorly chosen metrics or simply too many metrics may hinder performance.

  1. A handful of carefully chosen metrics. Choose a small number of key metrics over a large volume. Ideally, each Agile team should be evaluated on, and tasked with improving, 2-3 metrics (no more than 5). (Of note, in numerous psychological studies, the quality of decision-making has actually been shown to decrease when too much information is presented.)
  2. Easy to collect and/or calculate. A metric such as “Number of Customer Service Tickets per Week”, although crude, is better than “Engineer Hours spent fixing service”, as the latter requires costly time and effort to collect.
  3. Directly Controllable by the Team. Assigning a metric such as “Speed and Accuracy of Search” to a Search Service is preferred to “Overall Revenue”, which is less directly controllable.
  4. Reflect the Quality of Service. The number of abandoned shopping carts reflects the quality of a Shopping Cart service, whereas the number of shopping cart views is not necessarily reflective of service quality.
  5. Difficult to Game. The innate human tendency to game any system should be held in check by selecting the right metrics. Simple velocity measures are easily gamed, while the number of Sev 1 incidents cannot be easily gamed.
  6. Near Real Time Feedback. Metrics that can be collected and presented over short time intervals are most desirable. Information is most valuable when fresh: availability week over week is better than a yearly availability measure.
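As one illustration of a metric that is easy to calculate and gives near real time feedback, the sketch below turns weekly downtime into a week-over-week availability figure. The `weekly_availability` helper and the downtime numbers are hypothetical.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def weekly_availability(downtime_minutes: float) -> float:
    """Fraction of the week the service was available."""
    if not 0 <= downtime_minutes <= MINUTES_PER_WEEK:
        raise ValueError("downtime must fit within one week")
    return 1 - downtime_minutes / MINUTES_PER_WEEK


# Hypothetical downtime per week, in minutes.
downtime_by_week = {"week 27": 0, "week 28": 42, "week 29": 5}

availability_by_week = {
    week: weekly_availability(minutes) for week, minutes in downtime_by_week.items()
}
for week, availability in availability_by_week.items():
    print(f"{week}: {availability:.3%}")
```

A yearly figure would average the bad week away; the weekly view shows it while the team can still act on it.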

Managers are responsible for the performance of their teams in relation to the company’s objectives and for how those teams create shareholder value. Judging how your teams are performing against those goals, or how much they contribute to them, is only speculation if you don’t have the correct measurements and metrics in place. The bottom line is, “If you are not measuring, you are not managing.”

Are you having difficulty defining the right metrics for your teams? Are you interested in defining OKRs but don’t know where or how to get started? AKF has helped many companies identify and implement key metrics, as well as implement OKRs.  We have over 200 years of combined experience helping companies ensure their organizations, processes, and architecture are aligned to the outcomes they desire. Contact us, we can help.

Data Center Lifespan Risk

July 1, 2019  |  Posted By: Greg Fennewald

As technology professionals, managing risk is an important part of the value we provide to the business.  Risk can take many forms, including threats to availability, scalability, information security, and time to market.  Physical layer risks from the data center realm can severely impact availability, as the events of the February 2019 Wells Fargo outage demonstrate.

Transitioning Away from On-Prem Hosting

Over the last decade, knowledge of data center architecture, operating principles, capabilities, and associated risks has decreased in general due to the rise of managed hosting and especially cloud hosting.  This is particularly true for small and medium sized companies, which may have chosen cloud hosting early on and thus never have dealt with colocation or owned data centers.  This is not necessarily a bad trend – why devote resources to learn domains that are not core to your value proposition? 

While knowledge of data center geekdom may have decreased, the risks associated with data centers have not substantially changed.  Even the magic pixie dust of cloud hosting is a data center at its core, albeit with a degree of operational excellence exceeding the stereotypical company-owned data center + colo combination.

Given that technologists can mitigate data center risks by choosing cloud hosting with a major provider capable of mastering data center operations, why spend any time to learn about data center risks?

  • Cloud hosting sites do encounter failures.  The ability to ask informed questions during the vendor selection process can help optimize the availability for your business.
  • Business or regulatory changes may force a company to use colocation to meet data residency or other requirements.
  • A company may grow to the size where owning data centers makes business sense for a portion of their hosting need.
  • A hosting provider could exit the business or face bankruptcy, forcing tenants to take over or move on short notice.  Been there, done that, got the T shirt.

Data Center Lifespan Risk

For the purposes of this article, we will consider data center lifespan risk.  We define this risk as the probability of an infrastructure failure causing significant, and possibly complete, business disruption and the level of difficulty in restoring functionality.

A chart of data center lifespan risk resembles a bathtub – a high level of failures as the site is first built and undergoing 5 levels of commissioning towards the left side of the chart, followed by a long period of lower threat that can extend 15 years or more.  As time continues to march on, the risk rises again, creating the right-hand side of the bathtub curve.

data center lifespan risk

The risk of failure increases over time as infrastructure components approach the end of their useful service life.  The risk of failure approaches unity over a sufficiently long time span.
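The bathtub shape and the drift toward unity can be sketched with a simple hazard model: a decaying early-failure term, a constant mid-life term, and a growing wear-out term. The functional form and every parameter below are assumptions chosen only to produce the shape; they are not calibrated to any real site. The cumulative probability uses the standard reliability relation F(t) = 1 - exp(-H(t)), where H is the integral of the hazard rate.

```python
import math

# Hypothetical parameters (per year), chosen only to produce a bathtub shape.
INFANT_RATE = 0.5     # extra failure rate at commissioning
INFANT_DECAY = 0.5    # years over which early failures fade
BASELINE_RATE = 0.02  # steady failure rate during mid-life
WEAR_OUT = 1e-6       # coefficient of the wear-out term


def hazard(t: float) -> float:
    """Instantaneous failure rate at site age t (years)."""
    return INFANT_RATE * math.exp(-t / INFANT_DECAY) + BASELINE_RATE + WEAR_OUT * t**4


def cumulative_hazard(t: float) -> float:
    """Closed-form integral of hazard() from 0 to t."""
    infant = INFANT_RATE * INFANT_DECAY * (1 - math.exp(-t / INFANT_DECAY))
    return infant + BASELINE_RATE * t + WEAR_OUT * t**5 / 5


def failure_probability(t: float) -> float:
    """Probability of at least one failure by age t: F(t) = 1 - exp(-H(t))."""
    return 1 - math.exp(-cumulative_hazard(t))


for age in (0, 1, 5, 15, 25, 40):
    print(f"age {age:>2}: hazard {hazard(age):.4f}, "
          f"P(failure by now) {failure_probability(age):.4f}")
```

The hazard falls after commissioning, stays low through mid-life, and climbs again with age, while the cumulative probability only ever moves toward 1.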

Service Life Examples

Below are some service life estimates, based on our experience, for critical data center components that are properly maintained:

| Component | Service Life | Comment |
| --- | --- | --- |
| UPS batteries | 4 years VRLA, 12+ wet cell | Battery string monitoring strongly recommended |
| Diesel generator | 30+ years | 12,000+ hours before overhaul, run 100 or less annually |
| Main switchgear PLC | 15+ years | PLC model EOL is the risk |
| CRAH/CRAC fan motors | 12+ years | The magic smoke wants to escape |
| Galvanized cooling tower wet surfaces | 10 years | Varies with water chemistry, stainless steel worth the cost |
| Electrical distribution board | 25+ years | EOL of breaker style and PLC is the risk |
| Chilled water piping | 30 years | Design for continuous duty, ~7 FPS flow velocity |
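A quick way to use these estimates is to flag components that are approaching the lower bound of their service life. The sketch below uses the figures from the table; the `components_near_end_of_life` helper, the 80% warning threshold, and the assumption that every component dates from site construction are ours, purely for illustration.

```python
# Lower-bound service life estimates in years, taken from the table above.
SERVICE_LIFE_YEARS = {
    "UPS batteries (VRLA)": 4,
    "UPS batteries (wet cell)": 12,
    "Diesel generator": 30,
    "Main switchgear PLC": 15,
    "CRAH/CRAC fan motors": 12,
    "Galvanized cooling tower wet surfaces": 10,
    "Electrical distribution board": 25,
    "Chilled water piping": 30,
}


def components_near_end_of_life(
    site_age_years: float, warning_fraction: float = 0.8
) -> list[str]:
    """List components whose age has reached warning_fraction of their service life.

    Simplification: assumes every component was installed when the site was
    built and has never been replaced.
    """
    return [
        name
        for name, life in SERVICE_LIFE_YEARS.items()
        if site_age_years >= warning_fraction * life
    ]


flagged = components_near_end_of_life(site_age_years=10)
print(flagged)
```

In practice you would track the install or replacement date per component rather than the age of the site as a whole.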

All the above examples are measured in years.  If you are in the early years of a data center lifespan, there’s not a lot to worry about other than batteries.  Most growing companies are more concerned about adequate capacity, availability, and cost when they create their hosting strategy.  Not much thought is given to an exit strategy.  Such an effort is probably not worth it for a startup company, but established companies need to be thinking beyond next quarter and next year.

If your product or service can survive the loss of a single hosting site without impact (i.e. multi-active sites with validated traffic shifts), you can afford to run a bit deeper into the service life timeline.  If you can’t - or if, like Wells Fargo, you thought you could but learned the hard way that was not the case - you need to plan ahead to mitigate these risks.

The Risks

As mentioned before, the risks we want to mitigate are an impactful failure and a complex restoration after a failure.  By complex, we mean trying to find parts and trained technicians for components that were EOL 5 years ago and end of OEM support 18 months ago.  Not a fun place to be.  Would you feel comfortable running your online business with switches and routers that are EOL and EOS?  Hopefully not.  Why would you do so for your hosting location?

Mitigating the Risks

The best way to mitigate the risk of an impactful infrastructure failure is to be able to survive the loss of a hosting site, regardless of type, with a level of business disruption that is acceptable to the business and its customers.  That level will vary; your hosting solution should be tailored to the needs of the business.

Some thoughts on aging hosting sites:

  • All the characteristics that make cloud hosting taste great and be less filling (containerization, automation, infrastructure as code, orchestration, etc.) can also make the effort to stand up a new site and exit an old one much less onerous.
  • If you are committed to an owned data center or colo, moving to a newer site is the best choice.  Could you combine a move with a tech refresh cycle?  Could the aging data center fulfill a different purpose such as hosting development and QA environments?  Such environments should have less business impact from a failure, and you can squeeze out the last few years of life from that site.
  • You can purchase extra spare parts for components nearing EOL or EOS and send technicians to training courses.  This can mitigate risk but is really analogous to convincing yourself that you can scale your DB by tuning the SQL queries.  Viable only to add 6 or 12 months to a move/exit timeline.

Just about any of the components mentioned above in the service life estimates can be replaced, especially if the data center can be shut down for weeks or months to make the replacement and test the systems.  Trying to replace components while still serving traffic is extremely risky.  Very few data centers have the redundancy to replace electrical components while still providing conditioned power and cooling to the server rooms.  Those sites that can usually cannot do so without reducing their availability.  We’ve had to take a dual UPS (2N) site to a single UPS source (N) for a week to correct a serious design flaw.  Single corded is not appropriate if your DR plan checks an audit box and not much else.


The tremendous popularity of cloud hosting does not alleviate the need to understand physical layer risks, including data center lifespan risks.  Understanding them enables technology leaders to mitigate the risks.

Interested in learning more?  Need assistance with hosting strategy?  Considering a transition to SaaS? AKF Partners can help.


Transforming to SaaS is not just a technology change

June 19, 2019  |  Posted By: Larry Steinberg

Transforming a traditional on-premise product and company to a SaaS model is currently in vogue and has many broad-reaching benefits for both producers and consumers of services. These benefits span financial, supportability, and consumption simplification. 

To achieve the SaaS benefits, your company must address the broad changes necessitated by SaaS and not just the product delivery and technology changes. You must consider the impact on employees, delivery, operations/security, professional services, go-to-market, and finance.

Employee Transformation

The employee base is one key element of change – moving from a traditional ‘boxed’ software vendor to an ‘as a Service’ company changes not only the required skill set but also the dynamics of the engagement with your customers. Traditionally, staff have been accustomed to an arms-length approach for the majority of customer interactions. This traditional process has consisted of a factory that builds and bundles the software, and a small group (if any) who ‘touch’ the customer. As you move to an ‘as a service’ based product, the onus for ensuring the solution is available 24x7 is on everyone within your SaaS company. Not only do you require the skill sets to ensure infrastructure and reliability, but the support model can and should change as well.

Now that you are building SaaS solutions and deploying them into environments you control, the operations are much ‘closer’ to the engineers who are deploying builds. Version upgrades happen faster, defects will surface more rapidly, and engineers can build in monitoring to detect issues before or as your customers encounter them. Fixes can be provided much faster than in the past, and strict separation of support and engineering organizations is no longer warranted. Escalations across organizations and separate repro steps can be collapsed. There is a significant cultural shift for your staff that has to be managed properly. It will not be natural for legacy staff to adopt a 24x7 mindset, and newly minted SaaS engineers likely don’t have the industry or technology experience needed. Finding a balance of shifting culture, training, and bringing new ‘blood’ into the organization is the best course of action.

Passion For Service Delivery

Keeping services available 24x7 requires a real passion for service delivery. Teams no longer have to wait for escalations from customers; engineers now control the product and the operating environment, which means they can see availability and performance in real time. Staff should be proactive about identifying issues or abnormalities in the service. To do this, the health and performance of what they built needs to be at the forefront of their everyday activities. This mindset is very different from the traditional on-premise software delivery model.

Security Implications

Shifting the operations from your customer environments to your own environments also has a security aspect. Operating the SaaS environment for customers shifts the liability from them to you. The security responsibilities expand to include protecting your customer data, hardening the environments, having a practiced plan of incident remediation, and rapid response to identified vulnerabilities in the environment or application.

Finance & Accounting

Finance and accounting are also impacted by this shift to SaaS - both in the spend/capitalization strategy and in cost recovery models. The operational and security components are a large cost consideration that has shifted from your customers to your organization and needs to be modeled into SaaS financials. Pricing and licensing shifts are also very common. Moving to a utility or consumption model is fairly mainstream but is generally new for the traditional product company and its customers. Traditional billing models with annual invoices might not fit the new approach, and systems and processes will need to be enhanced to handle it. If you move to a utility-based model, both the product and accounting teams need to partner on a solution to ensure you get paid appropriately by your customers.
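To show what product and accounting teams have to agree on in a consumption model, here is a minimal sketch of one possible pricing rule: a base fee that includes some usage, plus a per-unit overage charge. The `monthly_invoice` function, the rule itself, and all of the numbers are hypothetical; your own pricing model will differ.

```python
from decimal import Decimal


def monthly_invoice(
    units_used: int,
    included_units: int,
    base_fee: Decimal,
    overage_rate: Decimal,
) -> Decimal:
    """Base subscription fee plus a per-unit charge for usage beyond the included amount."""
    overage_units = max(0, units_used - included_units)
    total = base_fee + overage_units * overage_rate
    return total.quantize(Decimal("0.01"))  # round to whole cents


# Hypothetical plan: 10,000 API calls included, 0.02 per extra call.
invoice = monthly_invoice(
    units_used=12_500,
    included_units=10_000,
    base_fee=Decimal("500.00"),
    overage_rate=Decimal("0.02"),
)
print(invoice)
```

The product has to meter `units_used` reliably and the billing system has to apply the rule consistently, which is why the two teams need to design it together.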

Customer Service

Think through the impacts on your customer support team. Given the speed at which new releases and fixes become available, the support team will need a new model to stay up to date. Delivery timeframes will be much shorter than in the past, and support must stay ahead of your customer base.

Your go to market strategies will most likely also need to be altered depending on the market and industry. To your benefit, as a SaaS company, you now have access to customer behavior and can utilize this data in order to approach opportunities within the customer base. Regarding migration, you’ll need a plan which ensures you are the best option amongst your competitors.

Most often at AKF we see companies that have focused only on product and technology changes when moving to SaaS, but if the whole company doesn’t move in lockstep, progress will be limited.  You are only as strong as your weakest link.

We’ve helped companies of all sizes transition their technology – AND organization – from on-premises to the cloud through SaaS conversion. Give us a call – we can help!

