AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

The Biggest Mistake with Agile

At least 75% of the dev shops that we see are using some form of Agile. Very few follow a pure form of any specific flavor (e.g. Scrum or Extreme Programming); most use some hybrid method. Some teams measure velocity while others don’t. Some teams have dedicated ScrumMasters while others have the engineering managers perform this role. While most teams’ processes could be tweaked, none of these is the real problem.

The biggest mistake companies make when implementing Agile, and thus the cause of most of their problems, is that they don’t understand that Agile is a business process, not a software development methodology. Thus, the business owners or their delegates, the product managers, must be involved at every step.

We’ve argued before that Agile teams must sit together because communication degrades with the square of the distance. Not having product managers involved with the Agile team throughout the entire process and (if you’ve moved from a Waterfall methodology) not having detailed specifications is the worst possible scenario. Developers need either someone sitting beside them to help with product decisions (Agile) or a detailed spec to work from (Waterfall).

Agile is a business process that requires the business to be involved in the product development process. It does not mean you get to stop writing specs and also stop being involved.


Engineering Metrics

A topic that often results in great debate is “how to measure engineers?” I’m a pretty data-driven guy, so I’m a fan of metrics as long as they are 1) measured correctly, 2) used properly, and 3) not taken in isolation. I’ll touch on these issues with metrics later in the post; let’s first discuss a few possible metrics that you might consider using. Three of my favorites are velocity, efficiency, and cost.

  • Velocity – This is a measurement that comes from the Agile development methodology. Velocity is the aggregate of story points (or any other unit of estimate that you use, e.g. ideal days) that engineers on a team complete in a sprint. As we will discuss later, there is no standard good or bad for this metric and it is not intended to be used to compare one engineer to another. This metric should be used to help the engineers get better at estimating; that’s it. No pushing for more story points or comparing one team to another, just use it as feedback to the engineers and team so they can get more predictable in their work.
  • Efficiency – The amount of time a software developer spends doing development-related activities (e.g. coding, designing, discussing with the product manager, etc.) divided by their total time available (assume 8 – 10 hours per day) provides the Engineering Efficiency. This is a metric designed to see how much time software developers are actually spending on developing software. This metric often surprises people. Achieving 60% or more is exceptional. We often see dev groups below 40% efficiency. This metric is useful for identifying where else engineers are spending their time. Are there too many company meetings not directly related to getting products out the door? Are you doing too many HR training sessions, etc.? This metric is really for the management team, who can then identify what is eating up the non-development time and get rid of it.
  • Cost – Tech cost as a percentage of revenue is a good cost-based metric to see how much you are spending on technology. This is very useful as it can be compared to other tech companies (SaaS or other web-based companies) and you can watch this metric change over time. Most startups begin with their total tech cost (engineers, hosting, etc.) at well over 50% of revenue, but this should quickly reduce as revenue grows and the business scales. Yes, scaling a business involves growing it cost-effectively. Established companies with revenues in the tens of millions range usually have this percentage below 10%. Very large companies in the hundreds of millions in revenue often drive this down to 5-7%.
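
As a rough illustration of the arithmetic behind velocity and cost, here is a minimal Python sketch. The function names and the sample figures are hypothetical, and `estimate_accuracy` is an assumed helper (not something defined in this post) that shows one way velocity can serve as estimating feedback rather than as a target.

```python
def velocity(completed_story_points):
    """Total story points the team completed in one sprint."""
    return sum(completed_story_points)


def estimate_accuracy(committed_points, completed_points):
    """Completed / committed points; values near 1.0 suggest predictable estimates.

    Assumed helper for illustration: it gives the team feedback on estimating,
    not a score to push upward.
    """
    return completed_points / committed_points


def tech_cost_pct_of_revenue(tech_cost, revenue):
    """Total tech cost (engineers, hosting, etc.) as a percentage of revenue."""
    return 100.0 * tech_cost / revenue


# Hypothetical numbers, for illustration only.
completed = velocity([5, 3, 8, 2])
print(completed)                                          # 18
print(estimate_accuracy(20, completed))                   # 0.9
print(tech_cost_pct_of_revenue(1_200_000, 15_000_000))    # 8.0
```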

Now that we know about some of the most common metrics, how should they be used? The most common way managers and executives want to use metrics is to compare engineers to each other or to track a team over time. This works for the Efficiency and Cost metrics, which by the way are primarily measurements of management effectiveness. Managers make most of the cost decisions, including staffing, vendor contracts, etc., so they should be on the hook to improve these metrics. Velocity, which measures product out the door in story points completed each sprint, is, as mentioned above, to be used to improve estimates, not to try to speed up developers. Using this metric incorrectly will just result in bloated estimates, not faster development.

An interesting comparison of developers comes from a 1967 article by Grant and Sackman in which they stated a ratio of 28:1 for the time required by the slowest versus the fastest programmer to complete a task. This has been a widely cited ratio, but a paper from 2000 revised this number to 4:1 at the most and more likely 2:1. While a 2x difference in speed is still impressive, speed alone says nothing about the overall quality of the product. An engineer who is very fast and produces high-quality work but doesn’t interact with the product managers isn’t necessarily the most effective overall. My point is that there are many factors to consider beyond story points per release when comparing engineers.


Agile Communication

In the Agile software development methodology, communication between team members is critical. Two of the twelve principles deal directly with this issue:

  • Business people and developers must work together daily throughout the project.
  • The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.

Despite the importance of this, time and time again we see teams that are spread out across floors and even buildings trying to work in an Agile fashion. Every foot of separation between two people decreases the likelihood that they will communicate directly or overhear something they can provide input on. In physics, there is such a thing as an inverse-square law, where a quantity is inversely proportional to the square of the distance from the source. Newton’s law of universal gravitation is an example of an inverse-square law. I don’t believe anyone has associated this law with verbal communication, but I’m convinced it applies.

communication ∝ 1 / (distance × distance)
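
To make the proposed relationship concrete, the short Python sketch below evaluates it at a few distances. The function name and the choice of one foot as the reference distance are assumptions for illustration; the relationship itself is the author's conjecture, not a measured result.

```python
def relative_communication(distance_feet):
    """Communication relative to a 1-foot separation, assuming inverse-square falloff."""
    if distance_feet <= 0:
        raise ValueError("distance must be positive")
    return 1.0 / (distance_feet * distance_feet)


for feet in (1, 2, 5, 10):
    print(f"{feet:>2} ft -> {relative_communication(feet):.2f}")
# Output:
#  1 ft -> 1.00
#  2 ft -> 0.25
#  5 ft -> 0.04
# 10 ft -> 0.01
```

Under this assumption, doubling the distance between two people cuts their communication to a quarter.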

This isn’t to say that remote teams can’t work. In fact, I’m actually a proponent of remote workers, but I think they work because, since they are not in the building or across the street, special arrangements are made, like an open Skype call with the remote office. People might worry about being seen as lazy if they Skype someone across the room, but across the country, that’s fine.

The takeaway is to put people who work with each other as close together as possible. If you need to move people’s desks, do it. The temporary disruption is worth the gained communication over the length of the project.


Engineering Efficiency

Recently, several of our clients have been interested in how they can make their engineers more efficient. The need for this usually arises when someone notices that they used to deliver much more when the team was smaller. This is a very common problem because, as teams grow larger, more coordination is required and technical debt builds up.

Our first recommendation is to measure. As we’ve pointed out before in our post, Data Driven Decisions, without data you are just guessing at the solution. More importantly there is no way of knowing if you are improving anything unless you have the data. We recommend you collect data on where your engineers spend time and produce the following ratio:
actual engineering time spent on development (per time period) / available engineering time (per time period)

The numerator is how much time the engineers are spending building products. Excluded from this number are meetings not directly associated with development (design meetings are part of the development process), tasks such as building their environments, and firefighting such as production bug fixes. The denominator is the average time available in that period. Typically, vacation time, holidays, etc. are removed from this time. Once you’ve calculated this ratio you have a good idea of which tasks are taking time away from engineers actually building products. When our clients calculate this they often see ratios as low as 40%.
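
The ratio above can be sketched in a few lines of Python. The activity names, the hours, and the set of activities counted as development are all hypothetical; the split between numerator and excluded time follows the description in the preceding paragraph.

```python
# Hypothetical time log for one engineer over one week, in hours.
time_log = {
    "coding": 14,
    "design meetings": 3,
    "product manager discussions": 1,
    "company meetings": 8,
    "environment setup": 4,
    "production firefighting": 6,
    "HR training": 4,
}

# Activities counted in the numerator (assumed classification).
DEVELOPMENT_ACTIVITIES = {"coding", "design meetings", "product manager discussions"}


def engineering_efficiency(log, available_hours):
    """Development hours divided by available hours for the period."""
    development_hours = sum(
        hours for activity, hours in log.items() if activity in DEVELOPMENT_ACTIVITIES
    )
    return development_hours / available_hours


def time_sinks(log):
    """Non-development activities, largest first."""
    others = [(a, h) for a, h in log.items() if a not in DEVELOPMENT_ACTIVITIES]
    return sorted(others, key=lambda item: -item[1])


print(f"{engineering_efficiency(time_log, available_hours=40):.0%}")  # 45%
for activity, hours in time_sinks(time_log):
    print(f"{activity}: {hours} h")
```

Listing the non-development activities by size is the part management can act on, since it shows what to cut first.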

One of the largest culprits of reduced engineering efficiency is meetings unrelated to product development. A simple fix for this is to set aside 4 hr blocks of no-meeting time for engineers to work. We typically recommend 8am – noon as non-meeting, noon – 2pm for meetings, and then 2pm – TBD for non-meetings. This does two things: first, it gives everyone time to get actual work done, and second, it forces people to prioritize meetings and limit who should attend, since they all have to occur in a 2 hr window.

Start measuring your engineers’ efficiency and see what you can change to make it improve.


Cascading Failures

I was chatting with Nanda Kishore (@nkishore), the ShareThis CTO, about the recent problems Amazon had in one of their zones. Even though ShareThis is 100% in the cloud, because they have properly architected their system, these regional outages didn’t affect ShareThis services at all. Of course kudos to Nanda and his team for their design and implementation, but more interesting was our discussion about this being a cascading failure, in which one small problem cascades into a much bigger problem. A few days later Amazon provided a bit of a postmortem confirming that a simple error during a network change started the problem. The incorrect traffic shift left the primary and secondary EBS nodes isolated, each thinking the other had failed. When they were reconnected they rapidly searched for free space to re-mirror, which exhausted spare capacity and led to a “re-mirroring storm.”

As we were discussing the Amazon issue, I brought up another recent outage of a major service, Facebook. In September 2010 they had a several-hour outage for many users caused by an invalid configuration value in their caching tier. This caused every client that saw the value to attempt to fix it, which involved a query to the database. The DBs were quickly overwhelmed by hundreds of thousands of queries per second.

Both of these are prime examples of how, in complex systems, small problems can cascade into large incidents. Of course there has been a good deal of research on cascading failures, including models of the probability distributions of outages to predict their occurrence. What I don’t believe exists, and should, is a framework to prevent them. As Chapter 9 in Scalability Rules states, the most common scalability-related failure is not designing to scale and the second most common is not designing to fail. Everything fails, plan for it! Of course utilizing swim lanes or fault isolation zones will certainly minimize the impact of any of these issues, but there is a need for handling this at the application layer as well.

As an example, say we have a large number of components (storage devices, caching services, etc) that have a failsafe plan such as refreshing the cache or re-mirroring the data. Before these actions are executed, the component should check in with an authority that determines if the request should be executed or if too many other components are doing similar tasks. Alternatively, a service could monitor for these requests over the network and throttle/rate limit them much like we do in an API. This way a small problem that causes a huge cascade of reactions can be paused and handled in a controlled and more graceful manner.
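
The check-in idea could look something like the following minimal Python sketch. It is not an existing framework: `RecoveryAuthority`, `run_failsafe`, and the fixed concurrency cap are all hypothetical names and choices made for illustration, and a real deployment would run the authority as a shared service rather than an in-process object.

```python
import threading


class RecoveryAuthority:
    """Hypothetical authority that caps how many failsafe actions run at once."""

    def __init__(self, max_concurrent):
        self._max_concurrent = max_concurrent
        self._active = set()
        self._lock = threading.Lock()

    def request(self, component_id):
        """Return True if the component may start its failsafe action now."""
        with self._lock:
            if component_id in self._active:
                return True  # already granted
            if len(self._active) >= self._max_concurrent:
                return False  # too many similar actions in flight; retry later
            self._active.add(component_id)
            return True

    def release(self, component_id):
        """Signal that the component has finished its failsafe action."""
        with self._lock:
            self._active.discard(component_id)


def run_failsafe(component_id, authority, action):
    """Check in with the authority before doing any expensive recovery work."""
    if not authority.request(component_id):
        return "deferred"
    try:
        action()  # e.g. refresh the cache or re-mirror the data
        return "completed"
    finally:
        authority.release(component_id)


authority = RecoveryAuthority(max_concurrent=2)
print(authority.request("node-1"))  # True
print(authority.request("node-2"))  # True
print(authority.request("node-3"))  # False: limit reached, node-3 should back off
authority.release("node-1")
print(authority.request("node-3"))  # True
```

A deferred component would typically wait and retry with some backoff, which turns a simultaneous storm of recovery actions into a controlled queue.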


Designing for Rollback

We’ve referred several times to the need for organizations to design for rollback in order to be successful as SaaS companies.  Put simply, given the speed with which we want to make releases, it is critical that we limit our risk in delivering any given release by being able to easily roll back these releases.

Here are some hints on how to develop systems such that they can be easily rolled back in the event of a problem in production.

  • Database changes must only be additive – Columns or tables should only be added, not deleted, until a version of code is released that deprecates the dependency on those columns.  Once these standards are implemented every release should have a portion dedicated to cleaning up data from previous releases that is no longer needed.
  • DDL & DML scripted and tested – DBMS changes for a release must be scripted ahead of time instead of applied by hand.  This should include the script used to roll back any changes.  There are two reasons for this:
      1. The team needs to test the rollback process in QA or staging in order to validate that they have not missed something that would prevent rolling back, and
      2. The script needs to be tested under some amount of load to ensure it can be executed while the application is utilizing the database.
  • Restricted SQL queries in the application – The development team needs to disambiguate all SQL by removing all SELECT * queries and naming columns explicitly in all INSERT and UPDATE statements.
  • Semantic changes of data – The development team must not change the definition of data within a release.  An example would be a column in a ticket table that is currently being used as a status semaphore indicating three values such as assigned, fixed, or closed.  The new version of the application cannot add a fourth status until code that can handle the new status has been released; only then can code that uses the new status be released.
  • Wire On / Wire Off – The application should have a framework added that allows code paths and features to be accessed by some users and not by others, based on an external configuration.  This setting can be in a configuration file or a database table and should allow for both role-based access and random percentage-based access.  This framework allows for beta testing of features with a limited set of users and allows for quick removal of a code path in the event of a major bug in the feature, without rolling the entire code base back.


The Purpose of QA


What is the purpose of functional testing, regression testing, load and performance testing, stress testing, and any other type of testing done at the end of the product development life cycle?  If you said something like, “to improve the quality of your product”, keep reading.  You cannot QA quality into your product.  The quality of your product or service is determined to a large degree long before any test is performed.  The reason for this is that QA’s purpose is not to ensure quality but rather to check whether all the other factors that affect quality have been included, providing a warning if they have not been.


We would put forth an argument that feature prioritization and resource allocation are the very first step in determining the quality of your product.  Mess this up and you are building your product on a shaky foundation.  Ensuring that the product team has clear guidance on business priorities and that these do not change every week sets the groundwork for a high-quality product.  Changing direction is intensely distracting for the entire organization and should only be done when there is a clear business necessity.  A litmus test is that if a change in direction happens more than once per quarter there is a problem.

The next crucial step in ensuring high quality is a set of well-defined requirements that include the purpose, expected benefits, user functionality, and methods of verification.  Depending on the development methodology this set can be developed all at once or incrementally.

Of course engineering has the largest and most direct role in determining the quality of the product.  Professional engineering shops that can continuously deliver high-quality features are usually places that are a joy to work in and that make everyone better for being part of the team.  Some things that such a team is likely to have in place are mentoring programs, coding standards, unit tests, a logging framework, and even documentation requirements.

Don’t make the mistake that so many technology executives make: either blaming QA for poor quality or thinking that dedicating more time or resources to QA will improve your quality.  Do this and you will likely get more warning signs, such as more reported bugs, but you will not improve the overall quality of the product.  For that you must look further back in the product development life cycle.