AKF Partners

Abbott, Keeven & Fisher PartnersPartners In Hyper Growth

Tag » Transaction processing

Designing for Rollback

We’ve several times made reference to the need for organizations to design for rollback to be successful as a SaaS company.  Put simply, given the speed with which we want to make releases, it is critical that we limit our risk in delivering any given release by being able to easily roll back these releases.

Here are some hints on how to develop systems such that they can be easily rolled back in the event of a problem in production.

  • Database changes must only be additive – Columns or tables should only be added, not deleted, until a version of code is released that deprecates the dependency on those columns.  Once these standards are implemented every release should have a portion dedicated to cleaning up data from previous releases that is no longer needed.
  • DDL & DML scripted and tested – DBMS changes for a release must be scripted ahead of time instead of applied by hand.  This should include the script used to rollback any changes.  The two reasons for this are that:
  1. The team needs to test the rollback process in QA or staging in order to validate that they have not missed something that would prevent rolling back and
  2. The script needs to be tested under some amount of load to ensure it can be executed while the application is utilizing the database.
  • Restricted SQL queries in the application – The development team needs to disambiguate all SQL by removing all SELECT * queries and adding column names to all UPDATE statements.
  • Semantic changes of data – The development team must not change the definition of data within a release.  An example would be a column in a ticket table that is currently being used as a status semaphore indicating three values such as assigned, fixed, or closed.  The new version of the application cannot add a fourth status until code is first released to handle the new status and then code can be released to utilize the new status.
  • Wire On / Wire Off – The application should have a framework added that allows code paths and features to be accessed by some user and not by others, based on an external configuration.  This setting can be in a configuration file or a database table and should allow for both role based access as well as random percentage based.  This framework allows for beta testing of features with a limited set of users and allows for quick removal of a code path in the event of a major bug in the feature, without rolling the entire code base back.

Comments Off on Designing for Rollback

Scalability Warning Signs

Is your system trying to tell you that you're going to have scalability problems? We like to think that we couldn't have predicted problems at 10x our last year's traffic but there are often warning signs that we can heed if we know what to look for.

Unless you’re one of the incredibly lucky sites where the traffic spikes 100x overnight, scalability problems don’t sneak up on you. They give you warning signs that if you are able to recognize and react to appropriately, allow you to stay ahead of the issues. However, we’re often so head down getting the next release out the door that we don’t take the time to realize we’re experiencing warning signs until they become huge problems staring us in the face.  Here are a few of the warnings that we’ve heard teams talk about in the past couple of months that were clearly signs of problems on the horizon.

Not wanting to make changes – If you find yourself denying request for changes to certain parts of your system, this might be a warning sign that you have scalability issues with that component. A completely scalable system has components that can fail without catastrophic impact to customers. If you’re avoiding changes to a component because of the risk of problems this is a warning sign that you need to re-architect to eliminate or at least mitigate the risk.

Performance creep – If after each release you need to add hardware to a subsystem or you accept a performance degradation in a service you could have a scaling issue approaching quickly. Consistently increasing consumption of CPU or memory resources in a service with each release will lead you into an unsustainable situation. If today you’re comfortably sitting at 40% CPU utilization and you allow a modest 10% degradation in each release you have less than nine releases before you are well above 100% but the reality is you won’t get close to that without significant issues.

Investigating larger hardware – If you’ve started asking your vendors or VAR about bigger hardware you’re heading down the path of scalability problems. The scale of more computational resources per dollar is not linear, it’s closer to cubic or even exponential scales. Purchasing more expensive hardware might seem like the economical way out when you compare the cost of the first hardware upgrade versus developer time but run the calculation out several iterations. When you get to a Sun Fire™ E25K Server with Oracle Database 10g at a $6M price tag you might feel differently about the decision.

Asking vendors for advanced features – When you start exploring advanced options of your vendor’s software you’re likely heading down the path of increased complexity and this is a clear warning sign of scalability problems in your future. Besides potentially locking you into a vendor which lowers your negotiating power it puts the fate of your company in someone else’s hands, which wouldn’t make me sleep very well at night. See our post on using vendor features to scale for more information.

Watch out for these or similar warning signs that scalability problems are looming on the horizon. Dealing with the problems today while you have time to plan properly might not get you an award for being a firefighter but you’ll more likely deliver a quality product without costly interruption.

1 comment

Scalability Best Practices

Here are a baker’s dozen of items that we feel are Best Practices for Scalability:

  1. Asynchronous – Use asynchronous communication when possible. Synchronous calls tie the availability of the two services together. If one has a failure or is slow the other one is affected.
  2. Swim Lanes – Create fault isolated “swim lanes” of hardware by customer segmentation. This prevents problems with one customer from causing issues across all customers. This also helps with diagnosis of issues and code roll outs.
  3. Cache – Make use of cache at multiple layers including object caches in front of databases (such as memcached), page or item caches for content (such as squid) and edge caches (such as Akamai).
  4. Monitoring – Understand your application’s performance from a customer’s perspective. Monitor outside of your network and have tests that simulate a real user’s experience. Also monitor the internal working of the application in terms of query and transaction execution count and timing.
  5. Replication – Replicate databases for recovery as well as to off load reads to multiple instances.
  6. Sharding Split the application and databases by service and / or by customer using a modulus. While this requires slightly more complicated logic in the application it allows for massive scaling.
  7. Use Few RDBMS Features – Use the OLTP database as a persistent storage device as much as possible. The more you rely on the features offered in most RDBMS for your transactions, the greater load you are putting on the hardest item in your system to scale. Remove all business logic from the database such as stored procedures and move it into the application. When significant scaling is required join in the application and not through the SQL.
  8. Slow Roll – Roll out new code versions slowly, to a small subset of your servers without bringing the entire site down. This requires that all code be backwards compatible because you will have two versions of code running in production during the roll out. This method allows you to find problems that your quality and L&P testing missed while having minimal impact on customers.
  9. Load & Performance TestingTest the performance of the application version before it goes into production. This will not catch all the issues, which is why you need the ability to rollback, but it is very worthwhile.
  10. Capacity Planning / Scalability Summits – Know how much capacity you have on all tiers and services in your system. Use Scalability Summits to plan for the increased capacity demands.
  11. Rollback – Always have the ability to rollback a code release.
  12. Root Cause Analysis – Ensure you have a learning culture that is evident by utilizing Root Cause Analysis to find and fix the real cause of issues.
  13. Quality From The Beginning – Quality can’t be tested into a product, it must be designed in from the beginning.