Failing to design for rollback is designing for failure

Shoulder pads were in, Ronald Reagan had just begun his second term as president, and "Back to the Future" was playing in theaters. 1985 will be remembered for many reasons, but for The Coca-Cola Company, it will always be known for one of the biggest marketing and business blunders in history.

To take market share from its rival Pepsi, The Coca-Cola Company decided to change the formula for Coke, which had been in production for 99 years. When news of the new formula was released, an uproar erupted across the country. Even before tasting the new formula, Coke consumers decided they did not want a change. Some folks went so far as to stockpile Coke in their basements for emergencies.

Ignoring the uproar, Coca-Cola pressed forward with releasing the new formula and halting production of the old. Protest groups quickly sprang up, songs were written about the old Coke, and news reports were EVERYWHERE. Within three months, Coca-Cola had rolled back to the previous formula and rebranded old Coke as "Classic."

The adage "no press is bad press" definitely applies to the New Coke vs. Classic Coke debate. However, if you take one thing from this anecdote, it is this: prepare to roll back. Unless your brand, software, or application has a following as loyal as Coca-Cola's, the ramifications of NOT preparing for rollback could be disastrous.

Is preparing for rollback REALLY necessary?

When we think of failure, it's often in the context of an experiment or test that didn't work out as planned. We tend to think of these things as aberrational: an unfortunate deviation from a perfect track record of success. But failure is more common than that. Failure is something all designers, engineers, and developers must reckon with and establish a plan to mitigate.

Within software development, a good failure-mitigation plan usually includes halting the release, implementing code fixes, or rolling back the deployment. Each method has its pluses, its minuses, and specific conditions for which it is best suited. Although each technique might eventually be used, the one that should always be planned for is the ability to roll back.

Engineers who fail to account for rollback inevitably discover the gap at some point, typically when a release goes wrong and they cannot fix it quickly. How is this possible? Because if you never planned a way back, you have nowhere to retreat to when the attempt to fix forward fails.

In "Scalability Rules: Principles for Scaling Web Sites," Marty Abbott, founder of AKF Partners and former CTO of eBay, tells the story of an update PayPal was making to the timing of payment transactions between accounts. Although its parent company, eBay, had a consistent rollback strategy for every release, PayPal at that time didn't see value in ensuring releases could be rolled back to their previous version. Instead, it opted for a fix-forward approach to save time and development costs. Unfortunately, the code release that changed the timing of payment transactions was riddled with problems, and for days it was incredibly difficult for transactions to be processed properly. This caused problems for any customer wanting to transact on eBay, and it led to the CEO of eBay issuing an apology to customers.

Because PayPal relied on fixing forward and chose not to spend development time preparing for rollback, any cost savings it thought it had gained were eaten up 100-fold by the revenue lost to failed transactions.

After PayPal remedied the issues with their latest deployment, they set out to ensure the ability to roll back all future releases. When they laid out what was needed to prepare for rollback, they found it wasn't nearly as daunting as they initially thought. In fact, according to Chuck Geiger, PayPal's CTO at the time, the cost to enable rollback wasn't high, and they were able to implement the core structure in as little as two weeks.

Elements of Good Rollback Design

Most of the work in preparing for rollback is managing changes to the database. This may take a little reworking, and you may have to get your engineering team fully bought into the new direction. But once implemented, it should just be a matter of adhering to some simple rules to consistently be able to roll back.

  • Database changes must only be additive: Columns or tables should only be added. Nothing should be deleted until a later version of the code has been released that removes the dependency on those columns. Once this standard is in place, every release should have a portion dedicated to cleaning up data from the last release that is no longer needed.
  • Database changes scripted and tested: The database changes that are to take place for the release must be scripted ahead of time instead of applied by hand. This should include the rollback script. The two reasons for this are that (1) the team needs to test the rollback process in QA or staging to validate that they have not missed something that would prevent rolling back, and (2) the script needs to be tested under some amount of load to ensure that it can be executed while the application is using the database.
  • Restricted SQL queries in the application: The development team needs to remove all ambiguity from SQL by removing all SELECT * queries and explicitly naming columns in all INSERT and UPDATE statements.
  • No semantic changes of data: The development team must not change the definition of data within a release. An example would be a column in a ticket table that is currently being used as a status flag with three values, such as assigned, fixed, or closed. The new version of the application cannot add a fourth status until code that can handle the new status has first been released.
  • Wire on/wire off: The application should have a framework that allows code paths and features to be accessed by some users and not by others, based on an external configuration. This setting can live in a configuration file or database table and should allow for both role-based access and access based on a random percentage. This framework allows for beta testing of features with a limited set of users and for the quick removal of a code path in the event of a major bug in the feature, without rolling back the entire code base.
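
As a rough illustration of the wire on/wire off idea, here is a minimal Python sketch. Everything named in it is hypothetical and invented for this example: the flag names, the FLAGS dictionary, and the functions bucket and is_wired_on. A real framework would load its settings from a configuration file or database table rather than hard-coding them. One assumption worth calling out: instead of a truly random draw, the sketch hashes the user ID so that a given user lands in the same percentage bucket on every request.

```python
import hashlib

# Hypothetical flag settings for illustration only. In practice these would be
# loaded from an external configuration file or database table.
FLAGS = {
    "new_checkout": {"enabled": True, "roles": {"beta_tester"}, "percent": 10},
    "legacy_export": {"enabled": False, "roles": set(), "percent": 0},
}


def bucket(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_wired_on(feature: str, user_id: str, roles: set, flags: dict = FLAGS) -> bool:
    """Return True if this user should reach the feature's code path."""
    flag = flags.get(feature)
    if flag is None or not flag["enabled"]:
        return False  # unknown or wired-off features are never served
    if roles & flag["roles"]:
        return True  # role-based access
    return bucket(feature, user_id) < flag["percent"]  # percentage rollout


if __name__ == "__main__":
    # A beta tester always sees the feature; other users depend on their bucket.
    print(is_wired_on("new_checkout", "user-42", {"beta_tester"}))
    print(is_wired_on("legacy_export", "user-42", {"beta_tester"}))
```

Flipping enabled to False in the external configuration removes the code path for every user without rolling back the rest of the release.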

The additional engineering work and testing needed to make every change backward compatible will have the greatest ROI of any work you can do. If these rules are implemented and adhered to going forward, you will have rollback capabilities, thus limiting your liability and risk. Yes, planning for rollback takes time. But not preparing for rollback can cost you market share, customer confidence, and revenue if there is an issue.
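
To make "backward compatible" a little more concrete, here is a small Python sketch of the additive and explicit-column rules working together, using the standard library's sqlite3 module and an in-memory database. The tickets schema, the priority column, and all function names are assumptions made up for this example. It is a sketch of the idea, not a migration framework.

```python
import sqlite3

# Hypothetical schema for illustration: a tickets table with a status flag.
SCHEMA_V1 = "CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"

# Migration for the new release: additive only. The new column is nullable,
# so rows written by the previous release remain valid.
UPGRADE = "ALTER TABLE tickets ADD COLUMN priority TEXT"


def old_release_insert(conn, ticket_id, status):
    """Previous release's write path: names its columns explicitly."""
    conn.execute(
        "INSERT INTO tickets (id, status) VALUES (?, ?)", (ticket_id, status)
    )


def old_release_read(conn, ticket_id):
    """Previous release's read path: no SELECT *, so extra columns are ignored."""
    return conn.execute(
        "SELECT id, status FROM tickets WHERE id = ?", (ticket_id,)
    ).fetchone()


def new_release_insert(conn, ticket_id, status, priority):
    """New release's write path, which uses the added column."""
    conn.execute(
        "INSERT INTO tickets (id, status, priority) VALUES (?, ?, ?)",
        (ticket_id, status, priority),
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA_V1)
    old_release_insert(conn, 1, "assigned")

    conn.execute(UPGRADE)  # deploy the new release's schema change
    new_release_insert(conn, 2, "fixed", "high")

    # Simulated code rollback: the old release runs against the new schema.
    old_release_insert(conn, 3, "closed")
    rows = [old_release_read(conn, i) for i in (1, 2, 3)]
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

Because the old release names its columns, it keeps working after the column is added, so the code can be rolled back without touching the database. An unqualified INSERT INTO tickets VALUES (?, ?) would fail against the upgraded table, which is why the restricted-SQL rule matters. The unused column would then be removed in a later cleanup release.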

Check out 'The Top 20 Mistakes in Technology' for other technology pitfalls to prepare for. Whether or not Benjamin Franklin actually said, "By failing to prepare, you are preparing to fail," the sentiment holds. If you are not actively mitigating possible failures, then you are not setting yourself up for success. As a business, success is your #1 OKR; make sure you deliver on it by preparing for failure.

If you need help implementing any of the above strategies, contact us. AKF helps companies at various stages of growth, and we would love to be of assistance to yours.