Results and Outcomes – Why Companies Fail
April 21, 2019 | Posted By: Pete Ferguson
Results = Results
Apple, Google, and Amazon don’t exist based on a Utopian promise of what is to come – though certainly those promises keep their customers engaged and hopeful for the future. These companies exist because of the value they have delivered to date and created expectations for us as consumers for a consistent result.
I’m amazed at how simple of a concept Results = Results is – yet constantly we see companies struggle with the concept and we see it as a recurring theme in our 2-3 day workshops with our clients and something we look for in our technical due diligence reviews.
As a corporate survivor of 18 years, looking back I can see where I was distracted by day-today meetings, firefighting, and getting hijacked by initiatives that seemed urgent to some senior leader somewhere – but were not really all that important.
Suddenly the quarter or half was over and it was time to do a self-evaluation and realize all the effort, all the stress, all the work, wasn’t getting the desired results I’d committed to earlier in the year and I’d have to quickly shuffle and focus on getting stuff done.
While keeping the lights on is important, it diminishes in importance when to do so is at the expense of innovating and adding value to our customers – not just struggling to maintain the status quo.
Outcomes and Key Results (OKRs)
Adapted from John Doerr’s “Objectives” and key results – at AKF we find it more to the point to focus on “outcomes.” Objectives (definition: a thing aimed at or sought) are a path where as “outcomes” are a destination that is clearly defined to know you have arrived.
Outcomes are the only things that matter to our customers. Hearing about a desired Utopian state is great and may excite customers to stick around for awhile and put up with current limitations or lack of functionality – but being able to clearly define that you have delivered an outcome and the value to your customers is money in the bank and puts us ahead of our competition.
Yet the majority of our clients have teams who are so focused on cost-cutting for many years that they leave a wide open berth for young startups and their competition to move in and start delivering better outcomes for the customer.
How to Focus on Results and Outcomes
It is easy to become distracted in the day-to-day meetings, incident escalations, post mortems, ect. As an outside third party, however, it is blatantly obvious to us usually within the first hour of meeting with a new team whether or not they are properly focused.
Here are some of the common themes and questions to ask:
- Is there effective monitoring to discover issues before our customers do?
- Do we monitor business metrics and weigh the success (and failure) of initiatives based not on pushing out a new platform or product but whether or not there was significant ROI?
- How much time is spent limping along to keep a legacy application up and running vs. innovating?
- Do we continually push off hardware/software upgrades until we are held hostage by compliance and/or end-of-life serviceability by the vendor?
Hopefully the common theme here is obvious – what is the customer experience and how focused are we on them vs internal castle building or day-to-day distractions?
Recently in a team interview the IT “keep the lights on” team told us they were working to be strategic and innovative by hiring new interns. While the younger generations are definitely less prone to accepting the status quo, the older generation are conceding that they don’t want to be part of the future. And unfortunately they may not be sooner than planned if they don’t grasp their role in driving innovation and the importance of applying their institutional knowledge.
Not focusing on customer/shareholder related outcomes means that shareholders and customers are negatively impacted. Here are a few problems with the associated outcomes I’ve seen in my short tenure with AKF and previously as a corporate crusader:
Monolithic applications to save costs: Why organizations do it? Short term cost savings focus development on one application. Allows teams to only focus on development of their one area.
- One failure means everyone fails.
- Organizations are unable to scale vis-a-vis Conway’s Law (organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations).
- Often the teams who develop the monolith don’t have to support it, so they don’t understand why it is a problem.
- Teams become very focused on solving the problems caused by the monolith just long enough to get it back up and running but fail to see the long-term recurrent loss to the business and wasted hours that could have been spent on innovating new products and services.
- Catastrophic failure - Intuit pre SaaS, early renditions of iTunes and annual outages when everyone tried to redeem gift cards Christmas morning, early days of eBay, stay tuned, many more yet to come.
Ongoing cost cutting to “make the quarter.”
- MIssed tech refresh results in machines and operating systems no longer supported and vulnerable to external attacks.
- Teams become hyper focused on shutting down additional spending, but never take the time to calculate how much wasted effort is spent on keeping the lights on for aging systems with a declining market share or slowed new customer adoption rate.
- Start saying no to the customer based on cost opening the door for new upstarts and the competition to take away market share.
Focusing efforts on Sales Department’s latest contract.
- Too much investment in legacy applications instead of innovating new products.
- “A-team” developers become firefighters to keep customers happy.
- Sales team creates moral hazards for development teams (i.e. “I smoke, but you get lung cancer” - teams create problems for other teams to fix instead of owning the end-to-end lifecycle of a product)
Focus is on mergers and acquisitions instead of core strengths and products.
- Distracted organizations give way for upstarts and competition.
- Become okay or maybe even good at a lot of things but not great at one or two things.
- Company culture becomes very fragmented and silos create red tape that slows or stifles innovation.
Results = Results. And nothing else equals results.
If OKRs are not measuring the results needed to compete and win, then teams are wasting a lot of effort, time, and money and the competition is getting a free pass to innovate and outperform your ability to delight and please your customers.
Need an outside view of your organization to help drive better results and outcomes? Contact us!
Photo by rawpixel.com from Pexels
Subscribe to the AKF Newsletter
The AKF Difference
December 4, 2018 | Posted By: Marty Abbott
During the last 12 years, many prospective clients have asked us some variation of the following questions: “What makes you different?”, “Why should we consider hiring you?”, or “How are you differentiated as a firm?”.
The answer has many components. Sometimes our answers are clear indications that we are NOT the right firm for you. Here are the reasons you should, or should not, hire AKF Partners:
Operators and Executives – Not Consultants
Most technology consulting firms are largely comprised of employees who have only been consultants or have only run consulting companies. We’ve been in your shoes as engineers, managers and executives. We make decisions and provide advice based on practical experience with living with the decisions we’ve made in the past.
Engineers – Not Technicians
Educational institutions haven’t graduated enough engineers to keep up with demand within the United States for at least forty years. To make up for the delta between supply and demand, technical training services have sprung up throughout the US to teach people technical skills in a handful of weeks or months. These technicians understand how to put building blocks together, but they are not especially skilled in how to architect highly available, low latency, low cost to develop and operate solutions.
The largest technology consulting companies are built around programs that hire employees with non-technical college degrees. These companies then teach these employees internally using “boot camps” – creating their own technicians.
Our company is comprised almost entirely of “engineers”; employees with highly technical backgrounds who understand both how and why the “building blocks” work as well as how to put those blocks together.
Product – Not “IT”
Most technology consulting firms are comprised of consultants who have a deep understanding of employee-facing “Information Technology” solutions. These companies are great at helping you implement packaged software solutions or SaaS solutions such as Enterprise Resource Management systems, Customer Relationship Management Systems and the like. Put bluntly, these companies help you with solutions that you see as a cost center in your business. While we’ve helped some partners who refuse to use anyone else with these systems, it’s not our focus and not where we consider ourselves to be differentiated.
Very few firms have experience building complex product (revenue generating) services and platforms online. Products (not IT) represent nearly all of AKF’s work and most of AKF’s collective experience as engineers, managers and executives within companies. If you want back-office IT consulting help focused on employee productivity there are likely better firms with which you can work. If you are building a product, you do not want to hire the firms that specialize in back office IT work.
Business First – Not Technology First
Products only exist to further the needs of customers and through that relationship, further the needs of the business. We take a business-first approach in all our engagements, seeking to answer the questions of: Can we help a way to build it faster, better, or cheaper? Can we find ways to make it respond to customers faster, be more highly available or be more scalable? We are technology agnostic and believe that of the several “right” solutions for a company, a small handful will emerge displaying comparatively low cost, fast time to market, appropriate availability, scalability, appropriate quality, and low cost of operations.
Cure the Disease – Don’t Just Treat the Symptoms
Most consulting firms will gladly help you with your technology needs but stop short of solving the underlying causes creating your needs: the skill, focus, processes, or organizational construction of your product team. The reason for this is obvious, most consulting companies are betting that if the causes aren’t fixed, you will need them back again in the future.
At AKF Partners, we approach things differently. We believe that we have failed if we haven’t helped you solve the reasons why you called us in the first place. To that end, we try to find the source of any problem you may have. Whether that be missing skillsets, the need for additional leadership, organization related work impediments, or processes that stand in the way of your success – we will bring these causes to your attention in a clear and concise manner. Moreover, we will help you understand how to fix them. If necessary, we will stay until they are fixed.
We recognize that in taking the above approach, you may not need us back. Our hope is that you will instead refer us to other clients in the future.
Are We “Right” for You?
That’s a question for you, not for us, to answer. We don’t employ sales people who help “close deals” or “shape demand”. We won’t pressure you into making a decision or hound you with multiple calls. We want to work with clients who “want” us to partner with them – partners with whom we can join forces to create an even better product solution.
Subscribe to the AKF Newsletter
The Importance of QA
November 20, 2018 | Posted By: Robin McGlothin
“Quality in a service or product is not what you put into it. It’s what the customer gets out of it.” Peter Drucker
The Importance of QA
High levels of quality are essential to achieving company business objectives. Quality can be a competitive advantage and in many cases will be table stakes for success. High quality is not just an added value, it is an essential basic requirement. With high market competition, quality has become the market differentiator for almost all products and services.
There are many methods followed by organizations to achieve and maintain the required level of quality. So, let’s review how world-class product organizations make the most out of their QA roles. But first, let’s define QA.
According to Wikipedia, quality assurance is “a way of preventing mistakes or defects in products and avoiding problems when delivering solutions or services to customers. But there’s much more to quality assurance.”
There are numerous benefits of having a QA team in place:
- Helps increase productivity while decreasing costs (QA HC typically costs less)
- Effective for saving costs by detecting and fixing issues and flaws before they reach the client
- Shifts focus from detecting issues to issue prevention
Teams and organizations looking to get serious about (or to further improve) their software testing efforts can learn something from looking at how the industry leaders organize their testing and quality assurance activities. It stands to reason that companies such as Google, Microsoft, and Amazon would not be as successful as they are without paying proper attention to the quality of the products they’re releasing into the world. Taking a look at these software giants reveals that there is no one single recipe for success. Here is how five of the world’s best-known product companies organize their QA and what we can learn from them.
Google: Searching for best practices
How does the company responsible for the world’s most widely used search engine organize its testing efforts? It depends on the product. The team responsible for the Google search engine, for example, maintains a large and rigorous testing framework. Since search is Google’s core business, the team wants to make sure that it keeps delivering the highest possible quality, and that it doesn’t screw it up.
To that end, Google employs a four-stage testing process for changes to the search engine, consisting of:
- Testing by dedicated, internal testers (Google employees)
- Further testing on a crowdtesting platform
- “Dogfooding,” which involves having Google employees use the product in their daily work
- Beta testing, which involves releasing the product to a small group of Google product end users
Even though this seems like a solid testing process, there is room for improvement, if only because communication between the different stages and the people responsible for them is suboptimal (leading to things being tested either twice over or not at all).
But the teams responsible for Google products that are further away from the company’s core business employ a much less strict QA process. In some cases, the only testing done by the developer responsible for a specific product, with no dedicated testers providing a safety net.
In any case, Google takes testing very seriously. In fact, testers’ and developers’ salaries are equal, something you don’t see very often in the industry.
Facebook: Developer-driven testing
Like Google, Facebook uses dogfooding to make sure its software is usable. Furthermore, it is somewhat notorious for shaming developers who mess things up (breaking a build or causing the site to go down by accident, for example) by posting a picture of the culprit wearing a clown nose on an internal Facebook group. No one wants to be seen on the wall-of-shame!
Facebook recognizes that there are significant flaws in its testing process, but rather than going to great lengths to improve, it simply accepts the flaws, since, as they say, “social media is nonessential.” Also, focusing less on testing means that more resources are available to focus on other, more valuable things.
Rather than testing its software through and through, Facebook tends to use “canary” releases and an incremental rollout strategy to test fixes, updates, and new features in production. For example, a new feature might first be made available only to a small percentage of the total number of users.
Canary Incremental Rollout
By tracking the usage of the feature and the feedback received, the company decides either to increase the rollout or to disable the feature, either improving it or discarding it altogether.
Amazon: Deployment comes first
Like Facebook, Amazon does not have a large QA infrastructure in place. It has even been suggested (at least in the past) that Amazon does not value the QA profession. Its ratio of about one test engineer to every seven developers also suggests that testing is not considered an essential activity at Amazon.
The company itself, though, takes a different view of this. To Amazon, the ratio of testers to developers is an output variable, not an input variable. In other words, as soon as it notices that revenue is decreasing or customers are moving away due to anomalies on the website, Amazon increases its testing efforts.
The feeling at Amazon is that its development and deployment processes are so mature (the company famously deploys software every 11.6 seconds!) that there is no need for elaborate and extensive testing efforts. It is all about making software easy to deploy, and, equally if not more important, easy to roll back in case of a failure.
Spotify: Squads, tribes and chapters
Spotify does employ dedicated testers. They are part of cross-functional teams, each with a specific mission. At Spotify, employees are organized according to what’s become known as the Spotify model, constructed of:
- Squads. A squad is basically the Spotify take on a Scrum team, with less focus on practices and more on principles. A Spotify dictum says, “Rules are a good start, but break them when needed.” Some squads might have one or more testers, and others might have no testers at all, depending on the mission.
- Tribes are groups of squads that belong together based on their business domain. Any tester that’s part of a squad automatically belongs to the overarching tribe of that squad.
- Chapters. Across different squads and tribes, Spotify also uses chapters to group people that have the same skillset, in order to promote learning and sharing experiences. For example, all testers from different squads are grouped together in a testing chapter.
- Guilds. Finally, there is the concept of a guild. A guild is a community of members with shared interests. These are a group of people across the organization who want to share knowledge, tools, code and practices.
Spotify Team Structure
Testing at Spotify is taken very seriously. Just like programming, testing is considered a creative process, and something that cannot be (fully) automated. Contrary to most other companies mentioned, Spotify heavily relies on dedicated testers that explore and evaluate the product, instead of trying to automate as much as possible. One final fact: In order to minimize the efforts and costs associated with spinning up and maintaining test environments, Spotify does a lot of testing in its production environment.
Microsoft: Engineers and testers are one
Microsoft’s ratio of testers to developers is currently around 2:3, and like Google, Microsoft pays testers and developers equally—except they aren’t called testers; they’re software development engineers in test (or SDETs).
The high ratio of testers to developers at Microsoft is explained by the fact that a very large chunk of the company’s revenue comes from shippable products that are installed on client computers & desktops, rather than websites and online services. Since it’s much harder (or at least much more annoying) to update these products in case of bugs or new features, Microsoft invests a lot of time, effort, and money in making sure that the quality of its products is of a high standard before shipping.
What you can learn from world-class product organizations? If the culture, views, and processes around testing and QA can vary so greatly at five of the biggest tech companies, then it may be true that there is no one right way of organizing testing efforts. All five have crafted their testing processes, choosing what fits best for them, and all five are highly successful. They must be doing something right, right?
Still, there are a few takeaways that can be derived from the stories above to apply to your testing strategy:
- There’s a “testing responsibility spectrum,” ranging from “We have dedicated testers that are primarily responsible for executing tests” to “Everybody is responsible for performing testing activities.” You should choose the one that best fits the skillset of your team.
- There is also a “testing importance spectrum,” ranging from “Nothing goes to production untested” to “We put everything in production, and then we test there, if at all.” Where your product and organization belong on this spectrum depends on the risks that will come with failure and how easy it is for you to roll back and fix problems when they emerge.
- Test automation has a significant presence in all five companies. The extent to which it is implemented differs, but all five employ tools to optimize their testing efforts. You probably should too.
Bottom line, QA is relevant and critical to the success of your product strategy. If you’d tried to implement a new QA process but failed, we can help.
Subscribe to the AKF Newsletter
Diagnosing and Fixing Software Development Performance
November 20, 2018 | Posted By: Roger Andelin
Diagnosing the cause of poor performance from your engineering team is difficult and can be costly for the organization if done incorrectly. Most everyone will agree that a high performing team is more desirable than a low performing team. However, there is rarely agreement as to why teams are not performing well and how to help them improve performance. For example, your CFO may believe the team does not have good project management and that more project management will improve the team’s performance. Alternatively, the CEO may believe engineers are not working hard enough because they arrive to the office late. The CMO may believe the team is simply bad and everyone needs to be replaced.
Often times, your CTO may not even know the root causes of poor performance or even recognize there is a performance problem until peers begin to complain. However, there are steps an organization can take to uncover the root cause of poor performance quickly, present those findings to stakeholders for greater understanding, and take steps that will properly remove the impediments to higher performance. Those steps may include some of the solutions suggested by others, but without a complete understanding of the problem, performance will not improve and incorrect remedies will often make the situation worse. In other words, adding more project management does not always solve a problem with on time delivery, but it will add more cost and overhead. Requiring engineers to start each day at 8 AM sharp may give the appearance that work is getting done, but it may not directly improve velocity. Firing good engineers who face legitimate challenges to their performance may do irreversible harm to the organization. For instance, it may appear arbitrary to others and create more fear in the department resulting in unwanted attrition. Taking improper action will make things worse rather than improve the situation.
How can you know what action to take to fix an engineering performance problem? The first step in that process is to correctly define and agree upon what good performance looks like. Good performance is comprised of two factors: velocity and value.
Velocity is defined as the speed at which the team works and value is defined as achievement of business goals. Velocity is measured in story points which represent the amount of work completed. Value is measured in business terms such as revenue, customer satisfaction or conversion. High performing engineering teams work quickly and their work has a measurable impact on business goals. High performing teams put as much focus on delivering a timely release as they do on delivering the right release to achieve a business goal.
Once you have agreement on the definition of good engineering performance, rate each of your engineering teams against the two criteria: velocity and value. You may use a chart like the one below:
Once each team has been rated, write down a narrative that justifies the rating. Here are a few examples:
Bottom Left: Velocity and Value are Low
“My requests always seem to take a long time. Even the most simple of requests takes forever. And, when the team finally gets around to completing the request, often times there are problems in production once the release is completed. These problems have negatively impacted customers’ confidence in us so not only are engineers not delivering value – they are eroding it!”
Upper/Middle Left: Velocity is Good and Value is Low
“The team does get stuff done. Of course I’d like them to go faster, but generally speaking they are able to get things done in a reasonable amount of time. However, I can’t say if they are delivering value – when we release something we are not tracking any business metrics so I have no way of knowing!”
Upper Right: Velocity is High and Value is High
“The team is really good. They are tracking their velocity in story points and have goals to improve velocity. They are already up 10% over last year. Also, they instrument all their releases to measure business value. They are actively working with product management to understand what value needs to be delivered and hypothesize with the stakeholders as to what features will be best to deliver the intended business goal. This team is a pleasure to work with.”
Unknown Velocity and Unknown Value
“I don’t know how to rate this team. I don’t know their velocity; its always changing and seems meaningless. I think the team does deliver business value, but they are not measuring it so I cannot say if it is low or high.”
With narratives in hand it’s time to begin digging for more data to support or invalidate the ratings.
Diagnosing Velocity Problems
Engineering velocity is a function of time spent developing. Therefore, the first question to answer is “what is the maximum amount of time my team is able to spend on engineering work under ideal conditions?”
This is a calculated value. For example, start with a 40 hour work week. Next, assuming your teams are following an Agile software development process, for each engineering role subtract out the time needed each week for meetings and other non-development work. For individual contributors working in an Agile process that number is about 5 hours per week (for stand up, review, planning and retro). For managers the number may be larger. For each role on the team sum up the hours. This is your ideal maximum.
Next, with the ideal maximum in hand, compare that to the actual achievement. If your teams are not logging hours against their engineering tasks, they will need to do this in order to complete this exercise. Evaluate the gap between the ideal maximum and the actual. For example, if the ideal number is 280 hours and the team is logging 200 hours, then the gap is 80 hours. You need to determine where that 80 hours is going and why. Here are some potential problems to consider:
- Teams are spending extra time in planning meetings to refine requirements and evaluating effort.
- Team members are being interrupted by customer incidents which they are required to support.
- The team must support the weekly release process in addition to their other engineering tasks.
- Miscellaneous meeting are being called by stakeholders including project status meetings and updates.
As you dig into this gap it will become clear what needs to be fixed. The results will probably surprise you. For example, one client was faced with a software quality problem. Determined to improve their software quality, the client added more quality engineers, built more unit tests, and built more automated system tests. While there is nothing inherently wrong with this, it did not address the root cause of their poor quality: Rushing. Engineers were spending about 3-4 hours per day on their engineering tasks. Context switching, interruptions and unnecessary meetings eroded quality engineering time each day. As a result, engineers rushing to complete their work tasks made novice mistakes. Improving engineering performance required a plan for reducing engineering interruptions, unnecessary meetings, and enabling engineers to spend more uninterrupted time on their development tasks.
At another client, the frequency of production support incidents were impacting team velocity. Engineers were being pulled away from their daily engineering tasks to work on problems in production. This had gone on so long that while nobody liked it, they accepted it as normal. It’s not normal! Digging into the issue, the root cause was uncovered: The process for managing production incidents was ineffective. Every incident was urgent and nearly every incident disrupted the engineering team. To improve this, a triage process was introduced whereby each incident was classified and either assigned an urgent status (which would create an interruption for the team) or something lower which was then placed on the product backlog (no interruption for the team). We also learned the old process (every incident was urgent) was in part a response to another velocity problem; stakeholders believed that unless something was considered urgent it would never get fixed by the engineering team. By having an incident triage process, a procedure for when something would get fixed based on its urgency, the engineering team and the stakeholders solved this problem.
At AKF, we are experts at helping engineering teams improve efficiency, performance, fixing velocity problems, and improving value. In many cases, the prescription for the team is not obvious. Our consultants help company leaders uncover the root causes of their performance problems, establish vision and execute prescriptions that result in meaningful change. Let us help you with your performance problems so your teams can perform at their best!
Subscribe to the AKF Newsletter
A Financial Analogy for Technical Debt
October 12, 2018 | Posted By: Bill Armelin
Understanding Technical Debt
During the course of our client engagements, there are a few common topics or themes that are always discussed, and the clients themselves usually introduce them. One such area is technical debt. Every team has it, every team believes they have too much of it, and every team struggles to explain why it’s important to address it.
Let’s start by defining what technical debt means. It is the difference between doing something the “desired” or “best” way and doing something quickly (i.e. reduce time to market). The difference results in the company taking on “debt” within the solution. Technical debt requires acting with forethought. In other words, you only assume technical debt knowingly and with commission. Acts of omission (forgetting to plan or do something) do not count as debt. Our partners in business may think we are hiding the truth if we do not clearly delineate the difference between debt (known assumptions) and mistakes, failures or other issues related to maintenance.
The following list provides examples of things that are not tech debt:
- Software defects (unless we decide to NOT fix them for an extended period of time – but defects are still human failures – not debt.)
- Failures in design that are not previously tagged as debt.
- Failures to identify scalability bottle necks.
- Poor choices in technology components that fail to scale.
- Failure to properly identify infrastructure failures, or high failure rates of vendors in infrastructure.
A Financial Analogy for Tech Debt
When you hear the words “technical debt”, it invokes a negative connotation. However, the judicious use of tech debt is a valuable addition to your product development process. Tech debt is analogous to financial debt. Companies can raise capital to grow their business by either issuing equity or issuing debt. Issuing equity means giving up a percentage of ownership in the company and dilutes current shareholder value. Issuing debt requires the payment of interest but does not give up ownership or dilute shareholder value. Issuing debt is good, until you can’t service it. Once you have too much debt and cannot pay the interest, you are in trouble.
Tech debt operates in the same manner. Companies use tech debt to defer performing work on a product. As we develop our minimum viable product, we build a prototype, gather feedback from the market, and iterate. The parts of the product that didn’t meet the definition of minimum or the decisions/shortcuts made during development represent the tech debt that was taken on to get to the MVP. This is the debt that we must service in later iterations. In fact, our definition of done must include the servicing of the resulting tech debt. Taking on tech debt early can pay big dividends by getting your product to market faster. However, like financial debt, you must service the interest. If you don’t, you will begin to see scalability and availability issues. At that point, refactoring the debt becomes more difficult and time critical. It begins to affect your customers’ experience.
Many development teams have a hard time convincing leadership that technical debt is a worthy use of their time. Why spend time refactoring something that already “works” when you could use that time to build new features customers and markets are demanding now? The danger with this philosophy is that by the time technical debt manifests itself into a noticeable customer problem, it’s often too late to address it without a major undertaking. It’s akin to not having a disaster recovery plan when a major availability outage strikes. To get the business on-board, you must make the case using language business leaders understand – again this is often financial in nature. Be clear about the cost of such efforts and quantify the business value they will bring by calculating their ROI. Demonstrate the cost avoidance that is achieved by addressing critical debt sooner rather than later - calculate how much cost would be in the future if the debt is not addressed now. The best practice is to get leadership to agree and commit to a certain percentage of development time that can be allocated to addressing technical debt on an on-going basis. If they do, it’s important not to abuse this responsibility. Do not let engineers alone determine what technical debt should be paid down and at what rate – it must have true business value that is greater than or equal to spending that time on other activities.
Just as with debt that a company assumes, in and of itself, technical debt is not bad. It can be looked at as a leveraging tool to optimize the technology resources in the short term - delaying a hardware tech refresh or the release date for HTML 5. Delaying attention to address technical issues allows greater resources to be focused on higher priority endeavors. The absence of technical debt probably means missed business opportunities– use technical debt as a tool to best meet the needs of the business. However, excessive technical debt will cause availability and scalability issues, and can choke business innovation (too much engineering time dealing with debt rather than focusing on the product).
Develop a technology balance sheet and profit and loss (income) statement to discuss tech debt with the business in a manner they understand – finance. Let’s first look at the balance sheet, where Assets = Liabilities + Equity. Our assets are the engineering time spent creating the product. Liabilities are the principle of the tech debt (i.e. the difference between “desired” and “actual.” Equity is the remainder, or the engineering resources spent creating the product while not contributing to tech debt.
Here is an example of a technology balance sheet:
To further the financial analogy, we need to have a technology P&L statement. Here, the interest on tech debt is the difficulty or increased level of effort in modifying something in subsequent releases. This manifests as a reduction in developer productivity per value created. The more debt you take on or less principle you pay down, the higher your interest payment becomes, and the cost to the organization.
Dedicating resources on an ongoing basis to service technical debt can be a challenging discussion with the business. Resources are always limited and employing them in the manner which best benefits the business is a critical business priority decision. Similar to the notion of debt within business, you should never take on technical debt without a plan to pay the interest (increased future cost of development) and principal (fixing the difference between appropriate and as-is). Relating technical debt to financial debt can help those outside of your technology organization grasp the concept and understand the need to keep technical debt under control.
One way to make the concept of debt real is to estimate, for any debt item, the amount of “interest” one will need to pay in the future to modify the solution in question.
- For the benefit of time to market, you decide to “hard code” a number of “display strings” that you’d rather set aside in a resource file to modify and translate later.
- You save 2 weeks of development time, creating a 2-week liability on your balance sheet. You have a 2-week principal to fix.
- You estimate that for all future string modifications (or translations) it will take an additional day of development. Your interest is 1 day, payable for each modification.
Just as retiring all financial liabilities at once does not make good business sense, trying to wipe out technical debt in one fell swoop is a bad idea. Continuous service to the technical debt is required to prevent technical liabilities from wiping out technical equity. An informed decision to increase debt service to reduce the principal will result in more productive product development time (smaller debt requires less on-going service). A short-term decision to reduce tech debt service in favor of a critical product launch may be viable if not used often. Keep track of both your principal (balance sheet) and your interest payments (income statement). Use these to help your business partners with debt related decisions.
Do NOT mix the cost of defects, or other infrastructure and software mistakes with tech debt. Doing so creates two very big problems:
- It becomes harder for the technology team to learn from past mistakes. Mistakes are mistakes and we should use them as learning opportunities. Debt is taken thoughtfully. Track them separately and treat them differently.
- Using the debt term for non-debt related items, will lower the level of trust between you and the business. Businesses don’t for instance “mistakenly” take on debt. Mixing these terms can cause relationship problems.
Additionally, be clear about how you define technical debt, so time spent paying it down is not commingled with other activities. Bugs in your code are not technical debt. Refactoring your code base to make it more scalable, however, would be. A good test is to ask if the path you chose was a conscious or unconscious decision. Meaning, if you decided to go in one direction knowing that you would later need to refactor. You are making a specific decision to do or not to do something knowing that you will need to address it later. Bugs are found in sloppy code, and that is not tech debt, it is just bad code.
Prioritizing Tech Debt
So how do you decide what tech debt should be addressed and how do you prioritize? If you have been tracking work with Agile storyboards and product backlogs, you should have an idea where to begin. Also, if you track your problems and incidents like we recommend, then this will show elements of tech debt that have begun to manifest themselves as scalability and availability concerns. Set a budget and begin paying down the debt. If you are working on less 12%, you are not spending enough effort. If you are spending over 25%, you are probably fixing issues that have already manifested themselves, and you are trying to catch up. Setting an appropriate budget and maintaining it over the course of your development efforts will pay down the interest and help prevent issues from arising.
Taking on technical debt to fund your product development efforts is an effective method to get your product to market quicker. But, just like financial debt, you need to take on an appropriate amount of tech debt that you can service by making the necessary interest and principle payments to reduce the outstanding balance. Failing to set an appropriate budget will result in a technical “bankruptcy” that will be much harder to dig yourself out of later.
Tech Debt Takeaways
Here is a list of our tech debt takeaways:
Want help reducing your tech debt? We can help.
Subscribe to the AKF Newsletter
The Domino or Multiplicative Effect of Failure
September 18, 2018 | Posted By: Pete Ferguson
As part of our Technical Due Diligence and Architectural reviews, we always want to see a company’s system architecture, understand their process, and review their org chart. Without ever stepping foot at a client we can begin to see the forensic evidence of potential problems.
Like that ugly couch you bought when you were in college and still have in your front room, often inefficiencies in architecture, process, and organization are nostalgic memories that have long since outlived their purpose – and while you have become used to the ugly couch, outsiders look in and recognize it as the eyesore it is immediately and often customers feel the inefficiencies through slow page loads and shopping cart issues. “That’s how it has always been” is never a good motto when designing systems, processes, and organizations for flexibility, availability, and scalability.
It is always interesting to hear companies talk with the pride of a parent about their unruly kid when they use words like “our architecture/organization is very complex” or “our systems/organization has a lot of interdependent components” – as if either of these things are something special or desirable! Great architectures are sketched out on a napkin in seconds, not hours.
Great architectures are sketched out on a napkin in seconds, not hours.
All systems fail. Complex systems fail miserably, and – like Dominos – take down neighboring systems as well resulting in latency, down time, and/or flat out failure.
ARCHITECTURE & SOFTWARE
Some common observations in hardware/software we repeatedly see:
Problem: Overloaded F5 or other similar firewalls are trying to encrypt all data because Personal Identifiable Information (PII) is stored in plain text, usually left over from a business decision made long ago that no one can quite recall and an auditor once said “encrypt everything” to protect it. Because no one person is responsible for a 30,000 foot view of the architecture, each team happily works in their silo and the decision to encrypt is held up like a trophy without seeing that the F5 is often running hot, causing latency, and is now a bottleneck (resulting in costly requests for more F5s) doing something it has no business doing in the first place.
Solution: Segregate all PII, tokenize it and only encrypt the data that needs to be encrypted, speeding up throughput and better isolating and protecting PII.
Integration (or Rather Lack Thereof) Of Mergers & Acquisitions
Problem: A recent (and often not so recent) flurry of acquisitions is resulting in cross data center calls in and out of firewalls. Purchased companies are still in their own data center or public cloud and the entire workflow of a customer request is crisscrossing the country multiple times not only causing latency, but if one thing goes wrong (remember, everything fails …) timeouts result in customer frustration and lost transactions.
Solution: Integrate services within one isolated stack or swim lane – either hosted or public cloud – to avoid cross data center calls. Replicate services so that each datacenter or cloud instance has everything it needs.
Problem: As the company grew and gained more market share, the search for bigger and better has resulted in a monolithic database that is slow, requires specialized hardware, specialized support, ongoing expensive software licenses, and maintenance fees. As a result, during peak times the database slows everyone and everything down. The temptation is to buy bigger and better hardware and pay higher monthly fees for more bandwidth.
Solution: Break down databases by customer, region, or other Z-Axis splits on the AKF Scale Cube. This has multiple wins – you can use commodity servers instead of large complex file storage, failure for one database will not affect the others, you can place customer data closest to the customer by region, and adding additional servers does not required a long lead time or request for substantial capital expenditure.
PROCESSES & ORGANIZATION
What sets AKF apart is that we don’t just look at systems, we always want to understand the people and organization supporting the system architecture as well and here there are additional multiplicative effects of failure. We have considerable expertise working for and with Fortune 100 companies, startups, and agencies in many different competencies. The common mistakes we see on the organization side of the equation:
Lack of Cross Functional Teams
Problem: Agile Scrum teams do not have all the resources needed within the team to be self sufficient and autonomous. As a result, teams are waiting on other internal resources for approvals or answers to questions in order to complete a Sprint - or keep these items on the backlog because effort estimation is too high. This results in decreased time to market, losing what could have been a competitive advantage, and lower revenue.
Solution: Create cross-functional teams so that each Sprint can be completed with necessary access to security, architecture, QA, and other resources. This doesn’t mean each team needs a dedicated resource from each discipline – one resource can support multiple teams. The information needed can be greatly augmented by creating guildes where the subject matter expert (SME) can “deputize” multiple people on what is required to meet policy. Guilds utilize published standards and provide a dedicated channel of communication to the SME greatly simplifying and speeding up the approval process.
Lack of Automation
Problem: It isn’t done enough! As a result, people are waiting on other people for needed approvals. Often the excuse is that there isn’t enough time or resources. In most cases when we do the math, the cost of not automating far outweighs the short-term investment with a continuous long-term payout that automation would bring. We often see that the individual with the deployment knowledge is insecure and doesn’t want automation as they feel their job is threatened. This is a very short-sighted approach that requires coaching for them to see how much more valuable they can be to the organization by getting out of the way of stifling progress!
Solution: Automate everything possible from testing, quality assurance, security compliance, code compliance (which means you need a good architectural review board and standards), etc! Automation is the gift that keeps on giving and is part of the “secret sauce” of top companies who are our clients.
Not Empowering Teams to Get Stuff Done!
Problem: Often teams work in a silo, only focused on their own tasks and are quick to blame others for their lack of success. They have been delegated tasks, but do not have the ability to get stuff done.
Solution: Similar to cross functional teams, each team must also be given the authority to make decisions (hence why you want the right people from a variety of dependencies on the team) and get stuff done. An empowered team will iterate much faster and likely with a lot more innovation.
While each organization will have many variables both enabling and hindering success, the items listed here are common denominators we see time and time again often needing an outside perspective to identify. Back to the ugly couch analogy, it is often easy to walk into someone else’s house and immediately spot their ugly couch!
Pay attention to those you have hired away from the competition in their early days and seek their opinions and input as your organization’s old bad habits likely look ridiculous to them. Of course only do this with an intent to listen and to learn – getting defensive or stubbornly trying to explain why things are the way they are will not only bring a dead end to you learning, but will also abruptly stop any budding trust with your new hire.
And of course, we are always more than happy to pop the hood and take a look at your organization just as we have been doing for the top banks, Fortune 100, healthcare, and many other organizations. Put our experience to work for you!
Subscribe to the AKF Newsletter
Effective Incident Communications
September 17, 2018 | Posted By: Bill Armelin
Everything fails! This is a mantra that we are always espousing at AKF. At some point, these failures will manifest themselves as an outage. In a SaaS world, restoring service as quickly as possible is critical. It requires having the right people available and being able to communicate with them effectively. A lack of good communications can cause an incident to drag on.
For startups and smaller companies, problems with communications during incidents is less of an issue. Systems tend to be smaller or monolithic. Teams supporting these systems also tend to be small. When something happens, everyone jumps on a call to figure out the problem. As companies grow, the number of people needed to resolve an incident grows. Coordinating communications between a large group of people becomes difficult. Adding to the chaos are executives joining the conference bridges demanding updates about service restoration.
In order to minimize the time to restore a system during an incident, companies need the right people on the call. For large, complex systems, identify the right resources to solve a problem can be difficult. We recommend swarming an issue with everyone that could be needed to resolve an incident, and then release those that are no longer needed. But, with such a large number of people, it can be difficult to coordinate communications, especially on a single conference call bridge.
Managing the communications of a large group of people working an incident is critical to minimizing the restoration time. We recommend a communication method that many of us at AKF learned in the military. It involves using multiple voice and chat channels to coordinate work and the flow of information. Before we get into the details of managing communications, we need to first look at the leadership required to effectively work the incident.
Technical Incident Manager and Incident Communications Manager
Managing a large incident is usually too much for a single individual. She cannot manage coordinating the work occurring to resolve the incident, as well as reporting status to and answering questions from executives eager to know what is going on. We recommend that companies manage incidents with two people. The first person is the individual that is responsible for directing all activities geared towards restoration of service. We call this person the Technical Incident Manager. This individual’s main job is to reduce the mean time to restoration. She needs an overall architectural knowledge of the product and systems to direct the work. She is responsible for leading the call and deescalating after diagnosis informs who needs to be involved. She identifies and diagnoses the service issues and engages the appropriate subject matter experts to assist in restoration.
The second individual is the Incident Communications Manager. He is responsible for supporting the Technical Incident Manager be listening to the technical resolution chatter and summarizing it for a non-technical audience. His focus is on communications speed, quality, and accuracy. He is the primary communications channel for both internal and external messaging. He owns the incident communications process.
Incident Communications Process
This process involves using multiple communication channels to control information and work performed. The first channel established is the Control Channel. This is in the form of a conference bridge and a chat channel. The Technical Incident Manager controls both of these channels. The second channel created is the Status Channel. This also has a voice bridge and a chat channel. The Incident Communication Manager is responsible for managing this channel.
The Control Channel is used for all communication related to the restoration of service. People only use the voice channel for immediate communication and to announce work that is occurring or address immediate questions that need to be answered. Detailed work conducted is placed in the chat channel. This reduces the chatter on the voice channel to command and control messages. It also serves as a record of actions taken that can be referenced in the post mortem/RCA process. If specific teams need to discuss the work they are performing, separate voice and chat breakout channels are created for them. They move off the main channel into their breakout channels to perform the work. The leader of these teams periodically communicates status back up to the control channel.
As the work is progressing, the Incident Communications Manager monitors the Control Channel to provide the basis for his messaging. He formulates updates that he delivers over the Status bridge and chat channel. He keeps executives and customers informed of progress and status, keeping the control channel free of requests for frequent updates and dedicated to restoring service.
This method of communications has worked well in the military for years and has been adopted by many large companies to manage their incident communications. While it is overkill for small companies, it becomes an effective process as companies grow and systems become more complex.
Let us help your organization with incident and crisis management process.
Subscribe to the AKF Newsletter
Are you compromised?
September 14, 2018 | Posted By: Larry Steinberg
It’s important to acknowledge that a core competency for hackers is hiding their tracks and maintaining dormancy for long periods of time after they’ve infiltrated an environment. They also could be utilizing exploits which you have not protected against - so given all of this potential how do you know that you are not currently compromised by the bad guys? Hackers are great hidden operators and have many ‘customers’ to prey on. They will focus on a customer or two at a time and then shut down activities to move on to another unsuspecting victim. It’s in their best interest to keep their profile low and you might not know that they are operating (or waiting) in your environment and have access to your key resources.
Most international hackers are well organized, well educated, and have development skills that most engineering managers would admire if not for the malevolent subject matter. Rarely are these hacks performed by bots, most occur by humans setting up a chain of software elements across unsuspecting entities enabling inbound and outbound access.
What can you do? Well to start, don’t get complacent with your security, even if you have never been compromised or have been and eradicated what you know, you’ll never know for sure if you are currently compromised. As a practice, it’s best to always assume that you are and be looking for this evidence as well as identifying ways to keep them out. Hacking is dynamic and threats are constantly evolving.
There are standard practices of good security habits to follow - the NIST Cybersecurity Framework and OWASP Top 10. Further, for your highest value environments here are some questions that you should consider: would you know if these systems had configuration changes? Would you be aware of unexpected processes running? If you have interesting information in your operating or IT environment and the bad guys get in, it’s of no value unless they get that information back out of the environment; where is your traffic going? Can you model expected outbound traffic and monitor this? The answer should be yes. Then you can look for abnormalities and even correlate this traffic with other activities in your environment.
Just as you and your business are constantly evolving to service your customers and to attract new ones, the bad guys are evolving their practices too. Some of their approaches are rudimentary because we allow it but when we buckle down they have to get more innovative. Ensure that you are constantly identifying all the entry points and close them. Then remain diligent to new approaches they might take.
Don’t forget the most common attack vector - humans. Continue evolving your training and keep the awareness high within your staff - technical and non-technical alike.
Your default mental model should be that you don’t know what you don’t know. Utilize best practices for security and continue to evolve. Utilize external or build internal expertise in the security space and ensure that those skills are dynamic and expanding. Utilize recurring testing practices to identify vulnerabilities in your environment and to prepare against emerging attack patterns.
We commonly help organizations identify and prioritize security concerns through technical due diligence assessments. Contact us today.
Subscribe to the AKF Newsletter
September 10, 2018 | Posted By: Robin McGlothin
The Scalability Cube – Your Guide to Evaluating Scalability
Perhaps the most common question we get at AKF Partners when performing technical due diligence on a company is, “Will this thing scale?” After all, investors want to see a return on their investment in a company, and a common way to achieve that is to grow the number of users on an application or platform. How do they ensure that the technology can support that growth? By evaluating scalability.
Let’s start by defining scalability from the technical perspective. The Wikipedia definition of “scalability” is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. That definition is accurate when applied to common investment objectives. The question is, what are the key attributes of software that allow it to scale, along with the anti-patterns that prevent scaling? Or, in other words, what do we look for at AKF Partners when determining scalability?
While an exhaustive list is beyond the scope of this blog post, we can quickly use the Scalability Cube and apply the analytical methodology that helps us quickly determine where the application will experience issues.
AKF Partners introduced the scalability cube, a scale design model for building resilience application architectures using patterns and practices that apply broadly to any application. This is a best practices model that describes all scale dimensions from “The Art of Scalability” book (AKF Partners – Abbot, Keeven & Fisher Partners).
The “Scale Cube” is composed of an X-Axis, Y-Axis, and Z-Axis:
1. Technical Architectural Layering (X-Axis ) – No single points of failure. Duplicate everything.
2. Functional Decomposition Segmentation – Componentization to Modules & Microservices (Y-Axis). Split Report, Message, Locate, Forms, Calendar into fault isolated swim lanes.
3. Horizontal Data Partitioning - Shards (Z-Axis). Beginning with pilot users, start with “podding” users for scalability and availability.
The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed. Scalability is all about the capability of a design to support ever growing client traffic without compromising performance. It is important to understand there are no “silver bullets” in designing scalable solutions.
An architecture is scalable if each layer in the multi-layered architecture is scalable. For example, a well-designed application should be able to scale seamlessly as demand increases and decreases and be resilient enough to withstand the loss of one or more computer resources.
Let’s start by looking at the typical monolithic application. A large system that must be deployed holistically is difficult to scale. In the case where your application was designed to be stateless, scale is possible by adding more machines, virtual or physical. However, adding instances requires powerful machines that are not cost-effective to scale. Additionally, you have the added risk of extensive regression testing because you cannot update small components on their own. Instead, we recommend a microservices-based architecture using containers (e.g. Docker) that allows for independent deployment of small pieces and the scale of individual services instead of one big application.
Monolithic applications have other negative effects, such as development complexity. What is “development complexity”? As more developers are added to the team, be aware of the effects suffering from Brooks’ Law. Brooks’ law states that adding more software developers to a late project makes the project even later. For example, one large solution loaded in the development environment can slow down a developer and gets worse as more developers add components. This causes slower and slower load times on development machines, and developers stomping on each other with changes (or creating complex merges) as they modify the same files.
Another example of development complexity issue is large outdated pieces of the architecture or database where one person is an expert. That person becomes a bottleneck to changes in a specific part of the system. As well, they are now a SPOF (single point of failure) if they are the only resource that understands the monolithic beast. The monolithic complexity and the rate of code change make it hard for any developer to know all the idiosyncrasies of the system, hence more defects are introduced. A decoupled system with small components helps prevents this problem.
When validating database design for appropriate scale, there are some key anti-patterns to check. For example:
• Do synchronous database accesses block other connections to the database when retrieving or writing data? This design can end up blocking queries and holding up the application.
• Are queries written efficiently? Large data footprints, with significant locking, can quickly slow database performance to a crawl.
• Is there a heavy report function in the application that relies on a single transactional database? Report generation can severely hamper the performance of critical user scenarios. Separating out read-only data from read-write data can positively improve scale.
• Can the data be partitioned across different load databases and/or database servers (sharding)? For example, Customers in different geographies may be partitioned to various servers more compatible with their locations. In turn, separating out the data allows for enhanced scale since requests can be split out.
• Is the right database technology being used for the problem? Storing BLOBs in a relational database has negative effects – instead, use the right technology for the job, such as a NoSQL document store. Forcing less structured data into a relational database can also lead to waste and performance issues, and here, a NoSQL solution may be more suitable.
We also look for mixed presentation and business logic. A software anti-pattern that can be prevalent in legacy code is not separating out the UI code from the underlying logic. This practice makes it impossible to scale individual layers of the application and takes away the capability to easily do A/B testing to validate different UI changes. Layer separation allows putting just enough hardware against each layer for more minimal resource usage and overall cost efficiency. The separation of the business logic from SPROCs (stored procedures) also improves the maintainability and scalability of the system.
Another key area we dig for is stateful application servers. Designing an application that stores state on an individual server is problematic for scalability. For example, if some business logic runs on one server and stores user session information (or other data) in a cache on only one server, all user requests must use that same server instead of a generic machine in a cluster. This prevents adding new machine instances that can field any request that a load balancer passes its way. Caching is a great practice for performance, but it cannot interfere with horizontal scale.
Finally, long-running jobs and/or synchronous dependencies are key areas for scalability issues. Actions on the system that trigger processing times of minutes or more can affect scalability (e.g. execution of a report that requires large amounts of data to generate). Continuing to add machines to the set doesn’t help the problem as the system can never keep up in the presence of many requests. Blocking operations exasperate the problem. Look for solutions that queue up long-running requests, execute them in the background, send events when they are complete (asynchronous communication) and do not tie up key application and database servers. Communication with dependent systems for long-running requests using synchronous methods also affects performance, scale, and reliability. Common solutions for intersystem communication and asynchronous messaging include RabbitMQ and Kafka.
Again, the list above is not exhaustive but outlines some key areas that AKF Partners look for when evaluating an architecture for scalability. If you’re looking for a checklist to help you perform your own diligence, feel free to use ours. If you’re wondering more about our diligence practice, you may be interested in our thoughts on best practices, or our beliefs around diligence and how to get it right. We’ve performed technical diligence for seed rounds, A-series and beyond, carve-outs, strategic investments and taking public companies private. From $5 million invested to over $1 billion. No matter the size of company or size of the investment, we can help.
Subscribe to the AKF Newsletter
The Importance of a Post Mortem
September 6, 2018 | Posted By: James Fritz
“An incident is a terrible thing to waste” is a common mantra that AKF repeats during its Engagements. And rightfully so as many companies have an incident response plan in place but stop there. Why are incidents so important? What is the true value in doing a proper Post Mortem and actually learning from an incident?
Incidents identify issues in your product. But if that is all you take out of an incident then you are missing out on so much more information that an incident can provide. An incident is the first step to identifying a problem that exists in your product, infrastructure processes, and perhaps, people. “But aren’t incidents and problems the same thing?” Not necessarily. An incident is a one time event. It can occur multiple times if you never address the problem, but it is not isolated.
Conducting a Post Mortem
Gather as many data points as possible shortly after an incident concludes and schedule a Post Mortem review meeting.
Start with the incident timeline. Sufficiently logging events over time provides ready access to the needed data for forensic analysis. From this information you can then start to identify what went wrong, when it went wrong and how quickly you were able to respond to it. The below definitions are all factors that need to be identified:
- Time To Detect: How quickly did you identify that an incident had occurred
- Time To Escalate: How quickly did you get everyone necessary to fix the incident involved
- Time To Isolate: How quickly did you stop the incident from affecting other portions
- Time To Restore: How quickly did the system get brought back up
- Time To Repair: How quickly did you fix the incident
This all leads to the Incident Timeline Analysis.
If you can gather information from several incidents and look at them in your Post Mortem review, then you can figure out where your biggest issues are when it comes to incidents and getting the system back up and running. It is not uncommon for us to see that it often takes longer to detect the incident than to restore from it. This could be mitigated with more monitoring at more appropriate positions then you currently have.
Or maybe the time to escalate is an issue. Why does it take so long to get the proper engineers involved? Maybe a real-time alert system is required or a phone tree. And it is important to track and measure total time of an incident as beginning with when it occured (not when it was reported) all the way through to when customers were back up at 100% (not just when your systems were restored).
Problem vs. Incident
How do you know if your incident is also a problem? It’s actually fairly easy to determine. If you have an incident, you have a problem. The scale of the problem may vary by incident but every incident is caused because of something larger than itself.
During our Technical Due Diligences we always want to know how companies categorize incidents vs. problems. If the company properly categorizes problems related to incidents, they will be able to answer “Can you rank your problems to show which cause the most customer impact?” Many times, they can’t - but that ranking is critical to show which problems to attack first.
An incident, at its core, is caused by a problem. If your product crashes anytime someone attempts to access it via an unapproved protocol, the incident is the attempted access. The problem may be an improper review of your architecture. Or it may be lack of QA. Identifying the problem is much more difficult than identifying the incident. Imagine you find a termite on your deck. This small pest could be considered an incident. If you deal with the incident and get rid of the termite everything is good, right? If you don’t look any further than the incident you can’t identify the problem. And in this case the problem could be exposed, untreated wood allowing termites to slowly eat away at the inside of your house.
If you are keeping proper documentation each time you conduct a Post Mortem review, then you will have a history that will start to paint of a picture of ongoing and recurring problems that exist. Remedying the problem will stop the incident from occurring in the same exact way in the future. But small variations of the incident can still occur. If you fix the problem then you are stopping future iterations of that incident from happening again.
Looking for additional help with your incident management process? We can help!
Subscribe to the AKF Newsletter
1 2 3 >