AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth


Cascading Failures

I was chatting with Nanda Kishore (@nkishore), the ShareThis CTO, about the recent problems Amazon had in one of their zones. Even though ShareThis is 100% in the cloud, these regional outages didn’t affect ShareThis services at all because they have properly architected their system. Kudos to Nanda and his team for their design and implementation, but more interesting was our discussion about this being a cascading failure, in which one small problem cascades into a much bigger one. A few days later Amazon provided a bit of a postmortem confirming that a simple error during a network change started the problem. The incorrect traffic shift left the primary and secondary EBS nodes isolated, each thinking the other had failed. When they were reconnected, they rapidly searched for free space to re-mirror, which exhausted spare capacity and led to a “re-mirroring storm.”

As we were discussing the Amazon issue, I brought up another recent outage of a major service, Facebook. In September 2010 they had an outage lasting several hours for many users, caused by an invalid configuration value in their caching tier. Every client that saw the invalid value attempted to fix it, which involved a query to the database. The databases were quickly overwhelmed by hundreds of thousands of queries per second.

Both of these are prime examples of how, in complex systems, small problems can cascade into large incidents. There has been a good deal of research on cascading failures, including models of the probability distributions of outages to predict their occurrence. What I don’t believe exists, and should, is a framework to prevent them. As Chapter 9 in Scalability Rules states, the most common scalability-related failure is not designing to scale, and the second most common is not designing to fail. Everything fails, plan for it! Utilizing swim lanes or fault isolation zones will certainly minimize the impact of any of these issues, but there is a need for handling this at the application layer as well.

As an example, say we have a large number of components (storage devices, caching services, etc.) that have a failsafe plan such as refreshing the cache or re-mirroring the data. Before these actions are executed, the component should check in with an authority that determines whether the request should be executed or whether too many other components are already doing similar tasks. Alternatively, a service could monitor for these requests over the network and throttle or rate limit them, much like we do in an API. This way a small problem that causes a huge cascade of reactions can be paused and handled in a controlled and more graceful manner.
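The "check in with an authority" idea can be sketched roughly as follows. This is only an illustration under assumptions: the class `RecoveryAuthority`, the helper `attempt_recovery`, and the `max_concurrent` limit are hypothetical names, and a real system would need the authority to be a networked, highly available service rather than an in-process object.

```python
import threading


class RecoveryAuthority:
    """Hypothetical authority that caps how many components may run a
    recovery action (cache refresh, re-mirroring, etc.) at the same time."""

    def __init__(self, max_concurrent):
        self._max = max_concurrent
        self._active = set()
        self._lock = threading.Lock()

    def request(self, component_id):
        """Return True if the component may start its recovery action."""
        with self._lock:
            if component_id in self._active:
                return True  # already granted to this component
            if len(self._active) >= self._max:
                return False  # too many recoveries in flight; caller must wait
            self._active.add(component_id)
            return True

    def release(self, component_id):
        """Mark the component's recovery action as finished."""
        with self._lock:
            self._active.discard(component_id)


def attempt_recovery(authority, component_id, action):
    """Run `action` only if the authority grants permission.

    Returns False when permission is denied, so the caller can back off
    and retry later instead of joining a storm of simultaneous recoveries.
    """
    if not authority.request(component_id):
        return False
    try:
        action()
        return True
    finally:
        authority.release(component_id)
```

A component that receives False would typically wait, ideally with some randomized delay, before asking again, so the recovery work is spread out rather than arriving all at once.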



Designing for Rollback

We have referred several times to the need for SaaS companies to design for rollback in order to be successful. Put simply, given the speed with which we want to make releases, it is critical that we limit the risk in delivering any given release by being able to roll it back easily.

Here are some hints on how to develop systems such that they can be easily rolled back in the event of a problem in production.

  • Database changes must only be additive – Columns or tables should only be added, not deleted, until a version of code is released that deprecates the dependency on those columns. Once these standards are implemented, every release should have a portion dedicated to cleaning up data from previous releases that is no longer needed.
  • DDL & DML scripted and tested – DBMS changes for a release must be scripted ahead of time instead of applied by hand, and this includes the script used to roll back any changes. There are two reasons for this:
      1. The team needs to test the rollback process in QA or staging to validate that nothing has been missed that would prevent rolling back.
      2. The script needs to be tested under some amount of load to ensure it can be executed while the application is using the database.
  • Restricted SQL queries in the application – The development team needs to disambiguate all SQL by removing all SELECT * queries and naming columns explicitly in all INSERT and UPDATE statements, so that an added column does not change what existing code reads or writes.
  • Semantic changes of data – The development team must not change the definition of data within a release. An example would be a column in a ticket table that is currently used as a status semaphore with three values: assigned, fixed, or closed. A fourth status cannot be introduced in a single release. Code that can handle the new status must be released first, and only then can a later release start using it.
  • Wire On / Wire Off – The application should have a framework that allows code paths and features to be accessed by some users and not by others, based on an external configuration. This setting can be in a configuration file or a database table and should allow for both role-based access and random percentage-based access. This framework allows for beta testing of features with a limited set of users and allows for quick removal of a code path in the event of a major bug in the feature, without rolling the entire code base back.
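As a rough illustration of the Wire On / Wire Off item, the sketch below gates a feature by role and by percentage. The class `FeatureGate`, the configuration keys (`enabled`, `roles`, `percent`), and the feature name are all hypothetical. One assumption worth calling out: instead of a truly random draw, it hashes the feature name and user ID into a stable bucket so that a given user sees the same behavior on every request.

```python
import hashlib


class FeatureGate:
    """Hypothetical wire on / wire off check driven by external configuration.

    `config` maps a feature name to a rule such as:
        {"enabled": True, "roles": ["beta"], "percent": 10}
    In practice the rules would be loaded from a config file or database table.
    """

    def __init__(self, config):
        self._config = config

    def is_enabled(self, feature, user_id, roles=()):
        rule = self._config.get(feature)
        # Unknown or wired-off features are never shown to anyone.
        if rule is None or not rule.get("enabled", False):
            return False
        # Role-based access: any matching role turns the feature on.
        if set(roles) & set(rule.get("roles", ())):
            return True
        # Percentage-based access: stable bucket in the range 0..99 per user.
        return self._bucket(feature, user_id) < rule.get("percent", 0)

    @staticmethod
    def _bucket(feature, user_id):
        key = f"{feature}:{user_id}".encode("utf-8")
        return int(hashlib.sha256(key).hexdigest(), 16) % 100


# Example: wire a feature on for beta users only, then wire it off entirely.
gate = FeatureGate({"new_checkout": {"enabled": True, "roles": ["beta"], "percent": 0}})
beta_sees_it = gate.is_enabled("new_checkout", "user-1", roles=["beta"])
others_see_it = gate.is_enabled("new_checkout", "user-2")
```

Setting `enabled` to false in the external configuration removes the code path for everyone without a redeploy, which is the quick-removal property described above.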


The Purpose of QA

 

What is the purpose of functional testing, regression testing, load and performance testing, stress testing, and any other type of testing done at the end of the product development life cycle? If you said something like, “to improve the quality of your product”, keep reading. You cannot QA quality into your product. The quality of your product or service is determined to a large degree long before any test is performed. QA’s purpose is not to ensure quality but rather to check whether all the other factors that affect quality have been included, and to provide a warning if they have not.

 

We would argue that feature prioritization and resource allocation are the very first step in determining the quality of your product. Mess this up and you are building your product on a shaky foundation. Ensuring that the product team has clear guidance on business priorities, and that these do not change every week, sets the groundwork for a high quality product. Changing direction is intensely distracting for the entire organization and should only be done when there is a clear business necessity. As a litmus test, if a change in direction happens more than once per quarter, there is a problem.

The next crucial step in ensuring high quality is a set of well-defined requirements that include the purpose, expected benefits, user functionality, and methods of verification. Depending on the development methodology, this set can be developed all at once or incrementally.

Of course engineering has the largest and most direct role in determining the quality of the product. A professional engineering shop that can continuously deliver high quality features is usually a place that is a joy to work in and that makes everyone better for being part of the team. Some things that such a team is likely to have in place are mentoring programs, coding standards, unit tests, a logging framework, and even documentation requirements.

Don’t make the mistake that so many technology executives make: either blaming QA for poor quality or thinking that dedicating more time or resources to QA will improve your quality. Do this and you will likely get more warning signs, such as more bugs found, but you will not improve the overall quality of the product. For that you must look further back in the product development life cycle.

