Revisiting the 1:10:100 Rule
If you have any gray in your hair, you likely remember the 1:10:100 rule. Put simply, the rule holds that the cost of finding a defect goes up exponentially with each phase of development: a factor of 1 in requirements, 10 in development, 100 in QA and 1000 in production. The increasing cost reflects the need to go back through earlier phases, the lost opportunity associated with other work, the number of people and systems involved in identifying the problem, and the end user (or customer) impact in a production environment. A 2002 study by the National Institute of Standards and Technology estimated the cost of software bugs at $59.5 billion annually, with roughly half of that cost borne by users and the rest by developers.
While there is an argument to be made that Agile development methods reduce this exponential rise in cost, Agile alone can’t change the fact that the later you find defects, the more they cost you or your customers. But I also believe it’s our job as managers and leaders to keep reducing the growth in cost between phases, especially in production environments. If the impact in the production environment is partially a function of 1) the duration of impact, 2) the degree of functionality impacted, and 3) the number of customers impacted, then reducing any of these should reduce the cost of defect identification in production. What can we do besides considering Agile methods?
There are at least three approaches that significantly reduce the cost of finding problems in production: “swimlaning”, the ability to roll back code in XaaS environments (our term for anything as a service), and real-time monitoring of business metrics. Swimlaning limits the number of customers and the degree of functionality impacted, while rollback and monitoring shorten the duration of the impact.
We think we might have coined the term “swimlaning” as it applies to technology architectures. Swimlaning, as we’ve written about on this blog as well as in the book, is the extreme application of the “shard” or “pod” concept to create strict fault isolation within architectures. Each service or customer segment gets its own dedicated set of systems, from the point of accepting a request (usually the webserver) to the data storage tier that contains the data necessary to fulfill that request (a database, file system or other storage system). No synchronous communication is allowed across the boundaries between these “swimlanes”, or fault isolation zones. If you swimlane by the Z axis of scale (customers), you can perform phased rollouts to subsets of your customers and minimize the percentage of your customer base that a rollout impacts. An issue that would otherwise impact 100% of your customers now impacts 1%, 5% or whatever share the smallest customer swimlane holds. If you swimlane by functionality, you lose only that functionality and the rest of your site keeps functioning. The 1000x impact might now be 1/10th or 1/100th of the previous cost. Obviously the cost can’t fall below that of the previous phase, as you still need to perform new work, but it must go down.
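As a rough illustration of a phased rollout across customer (Z axis) swimlanes, here is a minimal Python sketch. Everything in it is an assumption made for the example: the lane count of 20, the hash-based assignment, and the names `swimlane_for` and `release_for` are hypothetical. Many real systems assign customers to swimlanes with a lookup table instead of a hash.

```python
import hashlib

# Hypothetical setup: 20 customer swimlanes, so each holds roughly 5% of customers.
LANE_COUNT = 20


def swimlane_for(customer_id: str, lane_count: int = LANE_COUNT) -> int:
    """Map a customer to a swimlane; the same customer always gets the same lane."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % lane_count


def release_for(customer_id: str, lanes_on_new_release: set,
                lane_count: int = LANE_COUNT) -> str:
    """Return which release serves this customer during a phased rollout."""
    lane = swimlane_for(customer_id, lane_count)
    return "new" if lane in lanes_on_new_release else "current"


if __name__ == "__main__":
    # Roll the new release out to swimlane 0 only, then measure the exposure.
    customers = [f"customer-{n}" for n in range(10_000)]
    exposed = sum(release_for(c, {0}) == "new" for c in customers)
    print(f"{exposed / len(customers):.1%} of customers see the new release")
```

With only one of the 20 lanes on the new release, a defect in that release reaches about one customer in twenty. Widening the rollout is a matter of adding lane numbers to the set once the first lane looks healthy.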
Ensuring that you can always roll back recently released code reduces the duration of customer impact. While there is absolutely an upfront cost in developing code and schemas to be backwards compatible, you should consider it an insurance policy to help ensure that you never kill your customers. If asked, most customers will probably tell you they expect that you can always roll back from major issues. One thing is for certain: if you lose customers, you have INCREASED rather than decreased the cost of production issue identification. If you can limit the duration of an issue to minutes or fractions of an hour, in many cases the issue becomes nearly imperceptible.
Monitoring Business Metrics
Monitoring the CPU, memory, and disk space on servers is important, but understanding how the system is performing from the customer’s perspective is crucial. It’s not uncommon for a system to respond normally to an internal health check while being unresponsive to customers. Network issues often cause this type of failure. The way to catch these and other failures quickly is to monitor a business metric such as logins/sec or orders/min. Comparing these week over week (for example, Monday at 3pm against last Monday at 3pm) lets you spot issues quickly and roll back or fix the problem, reducing the impact to customers.
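The week-over-week comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric is stored as a mapping from timestamp to value, the 30% alert threshold is arbitrary, and the names `week_over_week_drop` and `should_alert` are hypothetical rather than part of any monitoring product.

```python
from datetime import datetime, timedelta
from typing import Mapping


def week_over_week_drop(current: float, week_ago: float) -> float:
    """Fractional drop versus the same time last week; positive means a drop."""
    if week_ago <= 0:
        return 0.0  # no usable baseline, so report no drop
    return (week_ago - current) / week_ago


def should_alert(series: Mapping[datetime, float], now: datetime,
                 threshold: float = 0.3) -> bool:
    """Alert when the metric is down by at least `threshold` week over week."""
    current = series.get(now)
    baseline = series.get(now - timedelta(weeks=1))
    if current is None or baseline is None:
        return False  # missing data is not treated as an outage in this sketch
    return week_over_week_drop(current, baseline) >= threshold


if __name__ == "__main__":
    # Hypothetical logins/sec samples for two consecutive Mondays at 3pm.
    last_monday = datetime(2024, 1, 8, 15, 0)
    this_monday = datetime(2024, 1, 15, 15, 0)
    logins_per_sec = {last_monday: 120.0, this_monday: 60.0}
    print(should_alert(logins_per_sec, this_monday))
```

Comparing against the same hour one week earlier, instead of against a fixed number, keeps normal daily and weekly traffic cycles from triggering alerts. A production version would likely smooth over several prior weeks and handle holidays, which this sketch does not attempt.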