AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth


Engineering Efficiency

Recently, several of our clients have been interested in how they can make their engineers more efficient. The need for this usually arises when someone notices that they used to deliver much more when the team was smaller. This is a very common problem because as the teams grow larger more coordination is required and technical debt builds up.

Our first recommendation is to measure. As we’ve pointed out before in our post, Data Driven Decisions, without data you are just guessing at the solution. More importantly there is no way of knowing if you are improving anything unless you have the data. We recommend you collect data on where your engineers spend time and produce the following ratio:
actual engineering time spent on development (per time period) / available engineering time (per time period)

The numerator is how much time the engineers spend building products. Excluded from this number are meetings not directly associated with development (design meetings are part of the development process), tasks such as building their environments, and firefighting such as production bug fixes. The denominator is the average engineering time available in that period, typically with vacation time, holidays, etc. removed. Once you’ve calculated this ratio from the underlying time data, you have a good idea of which tasks are taking time away from engineers actually building products. When our clients calculate this they often see ratios as low as 40%.
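As a rough illustration of the ratio above, here is a minimal Python sketch. The category names (`non_dev_meetings`, `environment_setup`, `firefighting`) and the sample hours are assumptions made for the example, not a prescribed time-tracking scheme; use whatever categories your own data supports.

```python
from dataclasses import dataclass


@dataclass
class PeriodHours:
    """Hours for one engineer in one time period (hypothetical categories)."""
    scheduled: float          # nominal working hours in the period
    time_off: float           # vacation, holidays, etc.
    non_dev_meetings: float   # meetings not directly tied to development
    environment_setup: float  # building or repairing dev environments
    firefighting: float       # production bug fixes and incidents


def efficiency_ratio(p: PeriodHours) -> float:
    """Development time divided by available time for the period."""
    available = p.scheduled - p.time_off
    if available <= 0:
        raise ValueError("no available engineering time in this period")
    overhead = p.non_dev_meetings + p.environment_setup + p.firefighting
    development = max(available - overhead, 0.0)
    return development / available


# Sample week (made-up numbers): 24 of 40 available hours go to overhead.
week = PeriodHours(scheduled=40, time_off=0, non_dev_meetings=12,
                   environment_setup=4, firefighting=8)
print(f"{efficiency_ratio(week):.0%}")  # prints 40%
```

Tracking the overhead categories separately, rather than only the final ratio, is what shows which activity to attack first.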

One of the largest culprits of reduced engineering efficiency is meetings unrelated to product development. A simple fix for this is to set aside 4 hr blocks of no-meeting time for engineers to work. We typically recommend 8am – noon as non-meeting, noon – 2pm for meetings, and then 2pm – TBD for non-meetings. This does two things: first, it gives everyone time to get actual work done, and second, it forces people to prioritize meetings and limit who should attend, since they all have to occur in a 2 hr window.

Start measuring your engineers’ efficiency and see what you can change to improve it.



Cascading Failures

I was chatting with Nanda Kishore (@nkishore), the ShareThis CTO, about the recent problems Amazon had in one of their zones. Even though ShareThis is 100% in the cloud, because they have properly architected their system, these regional outages didn’t affect ShareThis services at all. Of course, kudos to Nanda and his team for their design and implementation, but more interesting was our discussion about this being a cascading failure, in which one small problem cascades into a much bigger problem. A few days later Amazon provided a bit of a postmortem confirming that a simple error during a network change started the problem. The incorrect traffic shift left the primary and secondary EBS nodes isolated, each thinking the other had failed. When they were reconnected they rapidly searched for free space to re-mirror, which exhausted spare capacity and led to a “re-mirroring storm.”

As we were discussing the Amazon issue, I brought up another recent outage of a major service, Facebook. In Sep 2010 they had a several-hour outage for many users caused by an invalid configuration value in their caching tier. This caused every client that saw the value to attempt to fix it, which involved a query to the database. The DBs were quickly overwhelmed by hundreds of thousands of queries per second.

Both of these are prime examples of how, in complex systems, small problems can cascade into large incidents. Of course there has been a good deal of research on cascading failures, including models of the probability distributions of outages to predict their occurrence. What I don’t believe exists, and should, is a framework to prevent them. As Chapter 9 in Scalability Rules states, the most common scalability-related failure is not designing to scale and the second most common is not designing to fail. Everything fails, plan for it! Of course, utilizing swim lanes or fault isolation zones will certainly minimize the impact of any of these issues, but there is a need for handling this at the application layer as well.

As an example, say we have a large number of components (storage devices, caching services, etc.) that have a failsafe plan such as refreshing the cache or re-mirroring the data. Before these actions are executed, the component should check in with an authority that determines whether the request should be executed or whether too many other components are doing similar tasks. Alternatively, a service could monitor for these requests over the network and throttle or rate limit them, much like we do in an API. This way a small problem that causes a huge cascade of reactions can be paused and handled in a controlled and more graceful manner.
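One way to picture the "check in with an authority" idea is the small in-process Python sketch below. The `RecoveryAuthority` class, its `max_concurrent` limit, and the node names are all hypothetical, invented for illustration; a real deployment would need a networked, highly available coordinator rather than a single object guarded by a lock.

```python
import threading


class RecoveryAuthority:
    """Grants or defers failsafe actions so only a bounded number run at once."""

    def __init__(self, max_concurrent: int) -> None:
        self._max = max_concurrent
        self._active = set()           # ids of components currently recovering
        self._lock = threading.Lock()  # protects _active across threads

    def request(self, component_id: str) -> bool:
        """Return True if the component may start (or continue) its recovery."""
        with self._lock:
            if component_id in self._active:
                return True   # already granted earlier
            if len(self._active) >= self._max:
                return False  # too many similar actions in flight; back off
            self._active.add(component_id)
            return True

    def release(self, component_id: str) -> None:
        """Signal that the component has finished its recovery action."""
        with self._lock:
            self._active.discard(component_id)


authority = RecoveryAuthority(max_concurrent=2)
decisions = {node: authority.request(node) for node in ("node-a", "node-b", "node-c")}
print(decisions)  # node-a and node-b are granted; node-c is told to wait

authority.release("node-a")
print(authority.request("node-c"))  # a slot has freed up, so node-c may proceed
```

A component that is refused would retry later, ideally with randomized backoff, so that the deferred recoveries do not themselves arrive as a second storm.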



Designing for Rollback

We have referred several times to the need for organizations to design for rollback in order to be successful as a SaaS company.  Put simply, given the speed with which we want to make releases, it is critical that we limit the risk in delivering any given release by being able to roll it back easily.

Here are some hints on how to develop systems such that they can be easily rolled back in the event of a problem in production.

  • Database changes must only be additive – Columns or tables should only be added, not deleted, until a version of code is released that deprecates the dependency on those columns.  Once these standards are implemented every release should have a portion dedicated to cleaning up data from previous releases that is no longer needed.
  • DDL & DML scripted and tested – DBMS changes for a release must be scripted ahead of time instead of applied by hand.  This should include the script used to roll back any changes.  The two reasons for this are that:
      1. The team needs to test the rollback process in QA or staging in order to validate that they have not missed something that would prevent rolling back, and
      2. The script needs to be tested under some amount of load to ensure it can be executed while the application is utilizing the database.
  • Restricted SQL queries in the application – The development team needs to disambiguate all SQL by removing all SELECT * queries and adding explicit column names to all INSERT statements, so that added columns do not change what existing code reads or writes.
  • Semantic changes of data – The development team must not change the definition of data within a release.  An example would be a column in a ticket table that is currently being used as a status semaphore indicating three values such as assigned, fixed, or closed.  The new version of the application cannot add a fourth status in a single step: first a release must ship code that can handle the new status, and only a later release may start writing it.
  • Wire On / Wire Off – The application should have a framework added that allows code paths and features to be accessed by some users and not by others, based on an external configuration.  This setting can be in a configuration file or a database table and should allow for both role-based and random percentage-based access.  This framework allows for beta testing of features with a limited set of users and allows for quick removal of a code path in the event of a major bug in the feature, without rolling the entire code base back.
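To make the last hint more concrete, here is a minimal Python sketch of a wire on / wire off check. The feature names, the `FEATURE_CONFIG` layout, and the function names are assumptions made for the example. The percentage rule hashes the user id rather than drawing a random number, which is a design choice of this sketch so that a given user sees the same behavior on every request.

```python
import hashlib

# Hypothetical external configuration; in practice this would be loaded from
# a configuration file or a database table.
FEATURE_CONFIG = {
    "new_checkout": {"enabled": True, "roles": {"beta_tester"}, "percentage": 10},
    "legacy_export": {"enabled": False, "roles": set(), "percentage": 100},
}


def bucket(feature: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_wired_on(feature: str, user_id: str, user_roles: set,
                config: dict = FEATURE_CONFIG) -> bool:
    """Decide whether this user should reach the feature's code path."""
    settings = config.get(feature)
    if settings is None or not settings["enabled"]:
        return False  # unknown or wired-off features are never exposed
    if settings["roles"] & user_roles:
        return True   # role-based access
    return bucket(feature, user_id) < settings["percentage"]  # percentage rollout


print(is_wired_on("new_checkout", "user-42", {"beta_tester"}))  # True: role match
print(is_wired_on("legacy_export", "user-42", {"admin"}))       # False: wired off
```

Setting `enabled` to false in the external configuration removes the code path for everyone without a deployment, which is the quick removal the hint describes.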


The Purpose of QA

 

What is the purpose of functional testing, regression testing, load and performance testing, stress testing, and any other type of testing done at the end of the product development life cycle?  If you said something like, “to improve the quality of your product”, keep reading.  You cannot QA quality into your product.  The quality of your product or service is determined to a large degree long before any test is performed.  The reason for this is that QA’s purpose is not to ensure quality but rather to check whether all the other factors that affect quality have been included, and to provide a warning if they have not.

 

We would argue that feature prioritization and resource allocation are the very first step in determining the quality of your product.  Mess this up and you are building your product on a shaky foundation.  Ensuring that the product team has clear guidance on business priorities and that these do not change every week sets the groundwork for a high-quality product.  Changing direction is intensely distracting for the entire organization and should only be done when there is a clear business necessity.  A litmus test is that if a change in direction happens more than once per quarter, there is a problem.

The next crucial step in ensuring high quality is a set of well-defined requirements that include the purpose, expected benefits, user functionality, and methods of verification.  Depending on the development methodology, this set can be developed all at once or incrementally.

Of course engineering has the largest and most direct role in determining the quality of the product.  A professional engineering shop that can continuously deliver high-quality features is usually a joy to work in and makes everyone better for being part of the team.  Some things that such a team is likely to have in place are mentoring programs, coding standards, unit tests, a logging framework, and even documentation requirements.

Don’t make the mistake that so many technology executives do and either blame QA for poor quality or think that by dedicating more time or resources to QA your quality will improve.  Do this and you will likely get more warning signs such as more bugs but you will not improve the overall quality of the product.  For that you must look further back in the product development life cycle.

