Archive for the ‘Uncategorized’ Category

Log Every Change

Monday, May 24th, 2010

In well-run technology organizations, any event that has the potential of impacting customers will trigger an alert that brings a cross-disciplinary team together in person or on the phone to start troubleshooting the potential (or actual) problem.  Ideally the person responsible for running the incident management and problem resolution process will ask what most recently changed and then listen (or read) as the operations team reads (or displays) the change log.  We often joke that you only need to wait for someone to say “Yeah, but that change couldn’t possibly have caused this issue” to find the root cause and fix the problem.

In our experience, changes are one of the most common causes of customer- and revenue-impacting issues.  Sometimes these changes are feature enhancements or functionality additions, and sometimes they are infrastructure or architectural changes.  Very often, they are simple configuration changes like the addition of a range of IP addresses to an access control list, or the modification of DNS.  In some companies, these changes (identified as any modification to a production environment other than that made by the actual software or system itself) happen at a rate of several thousand per day.  It is virtually impossible to track them unless a change logging system is put in place.  Very often, it is the undocumented change, which is difficult to isolate and roll back, that costs the company the greatest downtime or revenue.

Too many companies allow too many changes to go undocumented.  The most commonly cited reason for a lack of change logging is that it simply takes too long to log each and every change.  But change logging doesn’t have to be cumbersome and it need not always include the notion of risk management inherent to a change management system.  Just logging a change for later identification can save anywhere from hundreds to millions of dollars of revenue and hundreds or thousands of customers, especially in a SaaS environment.  Something as simple as always logging the time, date, reason for a change, person making the change and the system being modified can make a world of difference.  Many web-enabled tools offered by companies like Service-Now make such logging very simple.  Most tools offer SMTP interfaces that allow people to make a change and email a record of it to the system.  For a minute or two of time per change, hours of customer impact can be avoided.
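To show how little is needed, here is a minimal sketch of a change log in Python. It is not how any particular tool works. The `ChangeRecord` fields, the `log_change` and `recent_changes` helpers, and the newline-delimited JSON file are all our own assumptions for illustration. It records the time and date, the reason, the person and the system, and lets an incident manager ask "what changed recently?"

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class ChangeRecord:
    """One change log entry (field names are illustrative assumptions)."""
    system: str   # the system being modified
    person: str   # the person making the change
    reason: str   # why the change was made
    # Time and date of the change, stored as an ISO 8601 UTC timestamp.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def log_change(log_path, system, person, reason):
    """Append one change to a newline-delimited JSON file and return it."""
    record = ChangeRecord(system=system, person=person, reason=reason)
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(asdict(record)) + "\n")
    return record


def recent_changes(log_path, since_iso):
    """Return changes logged at or after `since_iso`.

    `since_iso` must be an ISO 8601 timestamp that includes a UTC offset,
    for example "2010-05-24T15:00:00+00:00".
    """
    since = datetime.fromisoformat(since_iso)
    with open(log_path, encoding="utf-8") as log_file:
        records = [json.loads(line) for line in log_file if line.strip()]
    return [
        r for r in records
        if datetime.fromisoformat(r["timestamp"]) >= since
    ]
```

During an incident, a call such as `recent_changes("changes.log", "2010-05-24T15:00:00+00:00")` would list everything logged since that time, which is the list the incident manager reads first.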

Log your changes: every change, every time.

Revisiting the 1:10:100 Rule

Wednesday, April 28th, 2010

If you have any gray in your hair, you likely remember the 1:10:100 rule.  Put simply, the rule indicates that the cost of defect identification goes up exponentially with each phase of development.  It costs a factor of 1 in requirements, 10 in development, 100 in QA and 1000 in production. The increasing cost recognizes the need to go back through various phases, the lost opportunity associated with other work, the number of people and systems involved in identifying the problem, and end user (or customer) impact in a production environment. A 2002 study by the National Institute of Standards and Technology estimated the cost of software bugs at $59.5 billion annually, with half the cost borne by the users and the other half by the developers.

While there is an argument to be made that Agile development methods reduce this exponential rise in cost, Agile alone simply can’t change the fact that the later you find defects, the more it costs you or your customers.  But I also believe it’s our job as managers and leaders to continue to reduce this cost between phases, especially in production environments.  If the impact in the production environment is partially a function of 1) the duration of impact, 2) the degree of functionality impacted, and 3) the number of customers impacted, then reducing any of these should reduce the cost of defect identification in production.  What can we do besides considering Agile methods?

There are at least three approaches that significantly reduce the cost of finding production problems.  These are “swimlaning”, having the ability to roll back code in XaaS environments (our term for anything as a service), and real-time monitoring of business metrics.  Swimlaning limits the number of customers (or the amount of functionality) impacted, while rollback and business metric monitoring shorten the duration of the impact.

Swim Lanes

We think we might have coined the term “swimlaning” as it applies to technology architectures.  Swimlaning, as we’ve written about on this blog as well as in the book, is the extreme application of the “shard” or “pod” concept to create strict fault isolation within architectures.  Each service or customer segment gets its own dedicated set of systems from the point of accepting a request (usually the webserver) to the data storage subsystem tier that contains the data necessary to fulfill that request (a database, file system or other storage system).  No synchronous communication is allowed across the “swimlanes” that exist between these fault isolation zones.  If you swimlane by the Z axis of scale (customers) you can perform phased rollouts to subsets of your customers and minimize the percentage of your customer base that a rollout impacts.  An issue that would otherwise impact 100% of your customers now impacts 1%, 5% or whatever the smallest customer swimlane is.  If swimlaned by functionality, you only lose that functionality and the rest of your site remains functioning.  The 1000x impact might now be 1/10th or 1/100th the previous cost.  Obviously you can’t have less cost than the previous phase, as you still need to perform new work, but the cost must go down.
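As a rough illustration of swimlaning by customer, the sketch below assigns each customer to a swimlane and computes what share of the customer base a phased rollout would touch. The function names, the hash-based assignment and the lane count are our assumptions for this example only. Real swimlanes are just as often assigned by a lookup table.

```python
import hashlib


def swimlane_for(customer_id, lane_count):
    """Map a customer to one of `lane_count` swimlanes.

    A stable hash keeps each customer in the same lane on every request,
    so no request needs to cross into another fault isolation zone.
    """
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % lane_count


def exposed_fraction(customer_ids, lane_count, lanes_with_new_release):
    """Fraction of customers served by lanes that carry the new release."""
    exposed = sum(
        1 for customer_id in customer_ids
        if swimlane_for(customer_id, lane_count) in lanes_with_new_release
    )
    return exposed / len(customer_ids)


if __name__ == "__main__":
    customers = range(10_000)
    # Roll the release out to a single lane out of twenty.
    share = exposed_fraction(customers, lane_count=20, lanes_with_new_release={0})
    print(f"Customers exposed to the rollout: {share:.1%}")
```

With twenty evenly loaded lanes, a bad release in one lane touches roughly one customer in twenty rather than all of them.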

Rollback

Ensuring that you can always roll back recently released code reduces the duration of customer impact.  While there is absolutely an upfront cost in developing code and schemas to be backwards compatible, you should consider it an insurance policy to help ensure that you never kill your customers.  If asked, most customers will probably tell you they expect that you can always roll back from major issues.  One thing is for certain: if you lose customers you have INCREASED rather than decreased the cost of production issue identification.  If you can limit issues to minutes or fractions of an hour, in many cases the impact becomes nearly imperceptible.

Monitoring Business Metrics

Monitoring the CPU, memory, and disk space on servers is important, but ensuring that you understand how the system is performing from the customer’s perspective is crucial. It’s not uncommon to have a system respond normally to an internal health check but be unresponsive to customers. Network issues can often produce this type of failure. The way to ensure you catch these and other failures quickly is to monitor a business metric such as logins/sec or orders/min. Comparing these week over week (e.g., Monday at 3pm compared to last Monday at 3pm) will allow you to spot issues quickly and roll back or fix the problem, reducing the impact to customers.
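The week-over-week comparison can be sketched in a few lines. The function names and the 30% threshold below are assumptions chosen for illustration. The right threshold depends on how noisy your own metric is.

```python
def week_over_week_drop(current, same_time_last_week):
    """Relative drop of a business metric versus the same time last week.

    Returns 0.4 when the metric is 40% below last week's value. A negative
    result means the metric is higher than it was last week.
    """
    if same_time_last_week <= 0:
        # No usable baseline, so report no drop rather than divide by zero.
        return 0.0
    return (same_time_last_week - current) / same_time_last_week


def should_alert(current, same_time_last_week, max_drop=0.30):
    """Alert when the metric has fallen by more than `max_drop`."""
    return week_over_week_drop(current, same_time_last_week) > max_drop


if __name__ == "__main__":
    # Orders per minute: Monday at 3pm versus last Monday at 3pm.
    print(should_alert(current=60, same_time_last_week=100))
    print(should_alert(current=95, same_time_last_week=100))
```

The first call compares 60 orders/min against 100 and crosses the threshold. The second, at 95, does not.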

Data Driven Decisions

Wednesday, April 14th, 2010

By now most of us have heard of concepts such as the wisdom of crowds or A/B testing, but still we so often make decisions without gathering data. Admittedly not every decision we make during our busy days requires data analysis, but the ones that matter, such as your product’s UI redesign, a price change, or advertisements, often get the same treatment as your choice of lunch sandwiches. Perhaps you or someone on your team claims such a connection with customers or such product expertise as to not require testing. Don’t believe this!

Allow me to share with you an anecdote from my past that shows differently. Some of the facts of this story have been obfuscated to protect intellectual property but the gist of it remains true. The company that I was working with sold a product that allowed customers who purchased it to receive a return on their investment in a variable amount of time depending on how they configured the product. When getting up to speed on the product I asked everyone from the CEO to customer account managers, many of whom had been working in this field for years, which was the optimal configuration. Everyone suggested a particular configuration. Being a bit of a stats geek from my days in Six Sigma I grabbed some data and started analyzing. The initial results shocked everyone because they indicated the exact opposite of the “optimal configuration”. After a complete A/B test the company ended up building a practice and product around the new ideal configuration, a big win for customers.

Ian Ayres in his book Super Crunchers offers several examples of random testing from companies as diverse as Monster.com and Capital One that have resulted in tens of millions of dollars of increased revenue. Companies such as Offermatica and Google offer A/B or even multivariate testing. Ayres actually used this same technique through online advertisements to determine the title of his book. Tim Ferriss in his book The 4-Hour Workweek did something very similar and recommends this approach to quick testing with advertisements for everything from business ideas to homepage redesigns.
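For readers who want to check an A/B result themselves, one common approach is a two-proportion z-test. The sketch below uses only the Python standard library. The function name and the sample numbers are ours, and this is not necessarily the analysis used in any of the examples above.

```python
import math


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Compare the conversion rates of variants A and B.

    Returns (z, p_value), where p_value is the two-sided probability of
    seeing a difference at least this large if A and B truly perform the same.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the assumption that there is no real difference.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(
        pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)
    )
    if std_error == 0:
        raise ValueError("rates are all 0% or all 100%; the test is undefined")
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up numbers: 100 of 1,000 visitors convert on A, 150 of 1,000 on B.
    z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests the difference is unlikely to be chance alone. It says nothing about whether the difference is large enough to matter to the business.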

While we caution against analysis paralysis, there is a middle ground. Our mantra for processes is “Right Time, Right Process”, meaning you need the process that fits best today for your team and for the task. As we state in The Art of Scalability, “Each and every process must be evaluated first for general fit within the organization in terms of its rigor or repeatability and then specifically for what steps are right for your particular team in terms of complexity.” The bottom line is, for decisions that matter, get the necessary amount of data to make the best decision.

Happy New Year

Thursday, December 31st, 2009

This year there are no New Year’s resolutions, technical prognostications, or wish lists, just a huge “Thank You” to all of our friends for making 2009 a terrific year and wishing everyone a happy and scalable 2010.

Is Anyone Really Surprised?

Monday, August 3rd, 2009

The folks at 37Signals posted last week about how their very cool product Basecamp, which by the way we use for our book project, has more vroom.  They claim that they have “…cut response times to about 1/3 of their previous levels even when handling over 20% more requests per minute.”  How did they do this, you ask?  They had been running their own private compute cloud with virtualized instances using Kernel-based Virtual Machine (KVM). As they state, “To make a long story a little less long, we saw some pretty extreme performance improvements from moving Basecamp out of a virtualized environment and back onto dedicated hardware.”

The posting by Zach at 37Signals didn’t imply that they were surprised by the improvement and my point is that no one should be. As we’ve stated in The Cloud Isn’t For Everyone and several other posts, virtualization is not free nor is it magical. It requires CPU cycles and memory. There are definitely advantages to running a private cloud but improved performance over dedicated hardware is not one of them. Judging by the comments on Zach’s post, I know all the virtualization fans are screaming that they should have tried VMware ESXi or Xen instead of KVM but as MI states in the comments “I didn’t mean to imply it, but I will I say it straight out: Dedicated is faster than virtualization.” And no one should be surprised by that.