What is Technical Debt and how does it relate to overall application health?
Technical debt, or more commonly “tech debt,” is an analogy to financial debt credited to Ward Cunningham in 1992. The idea of tech debt is that you make a conscious decision to cut corners in the implementation of an application or feature in exchange for faster time to market. The analogy to financial debt is appropriate and powerful if applied and tracked properly using the following attributes and qualifiers:
- The principal of tech debt is the effort you save or forego by taking on the debt.
- The interest of tech debt is the ramification of taking on the principal. This may be a higher rate of defects, lower availability, costlier development on the implementation in question, etc.
- The term is only used for acts of commission – specific decisions made to save cost of development for a less than optimal implementation.
As a corollary to the above, acts of omission (mistakes, defects, etc.) can never be debt. Just as in financial markets, if you find yourself holding debt you did not initiate, it is “fraud”.
The savings above (the debt) do not “dilute” the effort of the team. Rather than increasing equity (paying more for a single item) and producing less, the team makes an accretive decision to do MORE, but with less value (hence the debt-to-equity comparison).
Just as with financial debt, you must be committed to ultimately returning the principal and of course addressing the interest payments.
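To make the principal and interest vocabulary concrete, here is a minimal sketch of how a single piece of debt could be recorded and reasoned about. The class name, field names and the numbers are hypothetical, chosen only for illustration; real interest is rarely this linear.

```python
from dataclasses import dataclass


@dataclass
class TechDebtItem:
    """One conscious shortcut, tracked like a loan (illustrative model only)."""
    name: str
    principal_days: float            # effort saved by taking the shortcut
    interest_days_per_sprint: float  # recurring extra effort while the debt is outstanding

    def interest_paid(self, sprints: int) -> float:
        """Cumulative interest after a number of sprints, assuming a flat rate."""
        return self.interest_days_per_sprint * sprints

    def sprints_until_interest_exceeds_principal(self) -> int:
        """Smallest sprint count at which cumulative interest is greater than the principal."""
        if self.interest_days_per_sprint <= 0:
            raise ValueError("interest must be positive for a break-even point to exist")
        return int(self.principal_days // self.interest_days_per_sprint) + 1


# Hypothetical example: a shortcut saved 10 days up front and costs 1.5 days per sprint.
item = TechDebtItem("skip input validation layer", 10, 1.5)
break_even = item.sprints_until_interest_exceeds_principal()
```

Under these assumed numbers, the interest paid overtakes the original saving within a handful of sprints, which is the kind of concrete statement that makes the draw-down conversation with business partners easier.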
Unfortunately, the term tech debt has become conflated over time with many other aspects of application health. Lumping other aspects of platform hygiene in with tech debt makes it harder to discuss trade-offs in prioritizing tech debt, and harder for product engineering teams to align on how platform hygiene should be managed and prioritized. Further, when we combine acts of omission (again, mistakes, defects and other things from which we can learn) with acts of commission (shortcuts taken for faster time to market), we defeat the value of the analogy and sow seeds of distrust with business partners. The net result is that platform health is neglected: quality suffers, trust between engineering and the business partners declines, and business outcomes suffer (time to market, quality, revenue, profits, security, etc.).
Now that we have defined Tech Debt (TD), let us look at other ongoing engineering work that impacts platform health but is NOT TD. The following are examples of ongoing maintenance work that impacts a platform’s health but does not fit the definition of TD:
- Tracking and managing the level of defects in production
- Ongoing refactoring of code to manage complexity, optimize design patterns, simplify etc.
- Currency upgrades such as incorporating a newer version of a library or framework
- Closing security issues
The following guide to managing this kind of work addresses the ongoing management of both TD and ongoing maintenance. The reason to consider these buckets of work together is that they are both groupings of work that are heavily engineering focused, as opposed to the introduction of a new significant product feature. Given that engineering capacity must be prioritized between application health (TD and maintenance) and new feature functionality, we must have a common understanding of these bodies of work in order to prioritize them in alignment with business outcomes (e.g. revenue, product usage, uptime, etc.).
Why is it important?
Just like with financial debt, if you do not pay down your tech debt there will eventually be negative consequences. When tech debt is not paid down then the application will decay in quality over time resulting in:
- Slower time to market as changes to the code become more complex and fragile
- More and longer customer impacting incidents
- A rift between Engineering and Product based on misaligned priorities and eventually deteriorating (customer impacting) quality
- Lower engineering morale as all professionals seek mastery and are demoralized by not having time to address tech debt or application health needs
- Poor team dynamics, especially across engineering and product as finger pointing ensues when quality deteriorates but engineering is not empowered to prioritize and address the root causes.
In short, if you do not draw down tech debt and proactively manage application health related work, you will start drowning in debt and it will severely slow down your delivery, reduce quality, as well as worsen morale and team collaboration dynamics.
Key points for CEOs and Board members
- Be clear on how your team is managing tech debt and platform health related work. Ideally the team has a way to categorize tech debt and application health/maintenance work in the system they use to manage their backlog of stories so that all stakeholders have a transparent and common way of categorizing, discussing, prioritizing, and managing this work. This can be accomplished by simply adding a tag or attribute to TD and platform health related stories in the teams' agile backlog for easy and consistent tracking and aggregation.
- Encourage your engineering teams to keep track of roughly the amount of effort/time spent on tech debt and application health as it enables better planning and more concrete/data driven trade off discussions.
- When categorizing this work, differentiate between servicing tech debt (interest and capital) and basic care and feeding/proactively maintaining ongoing application health, as previously outlined. Making conscious decisions to create tech debt based on time to market imperatives is entirely different from poor coding practices, production defect management, security issues, etc., and the team should understand both subcategories of basic system management.
- Typically, teams should spend ~20-25% of their capacity on this type of care and feeding work (i.e. tech debt interest and capital draw-down plus basic ongoing application maintenance), BUT be aware this percentage can ebb and flow based on things like emerging security threats or major currency upgrades. Ideally the percentage does NOT ebb and flow because you have been neglecting tech debt draw-down and application health on an ongoing basis. If the capacity allocated to application health remains steady and adequate, ‘surges’ of work that impact new product capability delivery become less likely, and delivery becomes more predictable and less volatile.
- Vocally support a percentage of ONGOING spend (engineering effort) on application health at the senior management level. Management support is ESPECIALLY important for taking diligent care of your applications’ health. When a business is under pressure there is a tendency to go into ‘feature factory’ mode and neglect basic hygiene. This may be OK for a brief period of time, but over the medium to longer term, neglecting hygiene in favor of cranking out features usually has the opposite of the intended effect on customer outcomes (i.e. it hurts the customer satisfaction that leads to business growth). I have seen it time and time again when a business is not growing: users are not really looking for new whiz-bang features but just want the product to deliver on the basic value proposition well.
- In addition to supporting an ongoing allocation to tech debt and application health work, keep a system of categorizing this work for ongoing transparency, BUT do not ask Product to get heavily involved in prioritizing every single item in the tech debt and application health backlog. It is much more efficient to keep a reasonable percentage of time allocated to application health on an ongoing basis, as suggested above; if transparency exists, Engineering should be able to provide brief periodic updates on the work in these buckets (tech debt and application health) and its business benefits. There should, however, be a healthy dialogue about when tech debt is addressed and when major application health efforts (e.g. complex/large currency upgrades) are planned, given that the percentage of engineering time spent in these cases may temporarily need to be significantly higher than 20-25%.
- Make sure goals related to application health outcomes are shared between product and tech. It is sadly quite common to see objectives regarding production quality, uptime and the like allocated ONLY to the engineering team. This is a BIG and fatal mistake. It is critical that product have skin in the game regarding ALL important aspects of product engineering outcomes. When these health-related goals sit with Engineering alone, the work does not get prioritized, the primary blame for not addressing application health lands on Engineering, and the result is poor customer quality, low engineering morale and poor collaboration between Product and Engineering (finger pointing).
- Along this theme, encourage cross-training where needed between product and technology. Many product professionals come from business (as opposed to engineering) backgrounds and do not fully understand the criticality, or the challenges, of ongoing care and feeding. This can go both ways: basic Product Management skills for Engineers as well as Engineering skills for Product Managers.
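The tagging and tracking points above can be sketched in a few lines. This is a minimal illustration, assuming each backlog story carries a single category tag and an effort estimate; the tag names (`tech-debt`, `app-health`, `feature`) and the sample sprint are hypothetical and not tied to any particular backlog tool.

```python
from collections import defaultdict

# Hypothetical tag values; use whatever labels your backlog tool supports.
TECH_DEBT = "tech-debt"
APP_HEALTH = "app-health"
FEATURE = "feature"


def capacity_share(stories):
    """Return {tag: fraction of total effort} for an iterable of (tag, effort) pairs."""
    totals = defaultdict(float)
    for tag, effort in stories:
        totals[tag] += effort
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {tag: effort / grand_total for tag, effort in totals.items()}


def care_and_feeding_share(stories):
    """Fraction of effort spent on tech debt plus ongoing application health."""
    shares = capacity_share(stories)
    return shares.get(TECH_DEBT, 0.0) + shares.get(APP_HEALTH, 0.0)


# Hypothetical sprint: 30 points of features, 4 of tech debt, 6 of app health.
sprint = [(FEATURE, 30), (TECH_DEBT, 4), (APP_HEALTH, 6)]
share = care_and_feeding_share(sprint)
print(f"Care and feeding share: {share:.0%}")
```

Keeping tech debt and application health as separate tags, and summing them only for the capacity conversation, preserves the distinction between the two subcategories while still giving stakeholders one number to compare against the ~20-25% guideline.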
Common CEO questions about Tech Debt
What is the right amount of tech debt?
As with financial debt, there is no one-size-fits-all answer. For example, in a startup that is just beginning to explore product/market fit, a high amount of tech debt may be appropriate until there is enough hypothesis validation to define the scope for a full minimum viable product. It is therefore appropriate to add debt for new growth opportunities, but you must be prepared to pay it down. In contrast, for a well-established product, the level of tech debt would normally be lower overall than for a burgeoning startup (although it may ebb and flow with major introductions of new capabilities) and should be guided by outcome metrics for time to market and quality. The most important thing you can do to optimize the level of technical debt is to define and measure it, so that the key stakeholders in engineering, product and the business/product marketing can have aligned and concrete conversations about tech debt and application health.
How do I know when Tech Debt and/or poor platform overall health are impacting the business negatively?
The key indicator of slower delivery is time to market, which is commonly measured with two metrics. Lead Time For Change (LT4C) measures the time from when code is committed to the main production code line to when it is live in production for customers. Deployment Frequency (DF) measures the number of deployments per day (frequent deployment as a proxy for time to introduce new capabilities). Time to actually code/implement a change can also be approximated using data from the source code repository and the agile management system (e.g. Jira).
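As a rough sketch of how these two metrics could be computed, assume you can export a commit timestamp and a production deployment timestamp for each change. The function names and the sample data below are hypothetical, and the median is just one reasonable way to summarize lead times.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(changes):
    """Median time from commit on the main line to live in production.

    changes: list of (committed_at, deployed_at) datetime pairs.
    """
    seconds = [(deployed - committed).total_seconds() for committed, deployed in changes]
    return timedelta(seconds=median(seconds))


def deployment_frequency(deploy_times, window_days):
    """Average number of deployments per day over a window of window_days days."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deploy_times) / window_days


# Hypothetical week with three changes: lead times of 6 hours, 24 hours and 2 hours.
changes = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15)),
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 3, 9)),
    (datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 11)),
]
lt4c = median_lead_time(changes)
df = deployment_frequency([deployed for _, deployed in changes], window_days=7)
```

The trend of these values over time matters more than any single reading: a lead time that keeps growing, or a deployment frequency that keeps falling, is a prompt to ask whether tech debt or neglected application health is a contributing cause.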
There are a multitude of metrics to consider regarding quality but the most critical ones align to the quality the customer experiences in production. Common metrics to monitor quality outcomes include:
- Change Failure Rate (CFR)
- Mean Time To Restore (MTTR) service in the case of a customer impacting incident
- Uptime (related metrics: service level agreements, service level objectives and error budgets)
- Total bugs in production, with a focus on higher priority/severity defects as well as defects that have also been reported to service teams by end customers
- Cost to scale infrastructure can also be a second-order data point, as suboptimal platform health often shows up as scalability problems.
In short, if time to market, quality or cost indicators are going in the wrong direction or out of the optimal range your team is targeting, then the reasons why should be analyzed including aspects of tech debt and basic good application health practices.
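The first three quality metrics listed above reduce to simple arithmetic once the underlying counts and durations are available. The following is a minimal sketch with hypothetical function names and sample figures; how a "failed" deployment or an incident's duration is defined is an assumption your team would need to settle.

```python
from datetime import timedelta


def change_failure_rate(total_deployments, failed_deployments):
    """Fraction of deployments that caused a failure in production."""
    if total_deployments == 0:
        return 0.0
    return failed_deployments / total_deployments


def mean_time_to_restore(incident_durations):
    """Average duration of customer-impacting incidents, as a timedelta."""
    if not incident_durations:
        return timedelta(0)
    return sum(incident_durations, timedelta(0)) / len(incident_durations)


def uptime_fraction(period, downtime):
    """Share of the period during which the service was available."""
    return 1 - downtime / period


# Hypothetical month: 40 deployments, 2 of which failed, and two incidents.
incidents = [timedelta(minutes=30), timedelta(minutes=90)]
cfr = change_failure_rate(40, 2)
mttr = mean_time_to_restore(incidents)
uptime = uptime_fraction(timedelta(days=30), sum(incidents, timedelta(0)))
```

Comparing `uptime` against a service level objective gives the remaining error budget, which is one concrete input to the conversation about how much capacity application health needs in the next planning cycle.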
What Tech Debt is not
As previously outlined, Tech Debt is NOT:
- Just poor coding practices
- Any other mistakes or errors of omission
- ALL the other application care and feeding/maintenance work that needs to be done to keep an application healthy
Tech Debt is also NOT something that Engineering should own alone. Decisions to take on tech debt should always be discussed and agreed in partnership with Product, including a high-level plan of when the debt would potentially be drawn down (e.g. “if new experimental feature X is adopted by Y% of users then debt should be prioritized within one business quarter,” or, if the change/feature is not adopted, the experimentation-related code should be removed by a certain timeframe).
Managing Tech Debt and application health is harder in practice than it sounds as it creates a natural potential point of tension between product and engineering. It can also feel that tech debt and application health work ‘take away’ time from delivering customer value but nothing could be further from the truth. Being a good steward of overall application health and therefore having a stable application is table stakes to have the ‘license to innovate,’ or put another way, if you are very regularly fighting fires due to poor application health – you are probably not getting many new features out the door anyway. Bringing transparency, discipline and shared engineering and product goals related to application health outcomes can really help bring alignment to optimizing this category of work. At AKF, many of our clients struggle with mounting tech debt, including when quality, complexity and time to market have deteriorated to an overwhelming point where it is hard to even know where to start to address it. Please contact us and we can help!