GROWTH BLOG: Learning from Failure: Conducting Postmortems
AKF Partners Logo Technology ConsultingScalability - We wrote the book on it ℠

Growth Blog

Scalability and Technology Consulting Advice for SaaS and Technology Companies

Learning from Failure: Conducting Postmortems

August 21, 2019  |  Posted By: Bill Armelin

picture of a detective looking through a magnifying glass

At AKF Partners, we believe in learning aggressively, not just from your successes, but also your failures. One common failure we see are service disrupting incidents. These are the events that either make your systems unavailable or significantly degrade performance for your customers. They result in lost revenue, poor customer satisfaction and hours of lost sleep. While there are many things we can do to reduce the probability of an incident occurring or the impact if it does happen, we know that all systems fail.

We like to say, “An incident is a terrible thing to waste.” The damage is already done. Now, we need to learn as much about the causes of the incident to prevent the same failures from happening again. A common process for determining the causes of failure and preventing them from reoccurring is the postmortem. In the Army, it is called an After-Action Review. In many companies it is called a Root Cause Analysis. It doesn’t matter what you call it, as long as you do it.

We actually avoid using Root Cause Analysis. Many of our clients that use the term focus too much on finding that one “root cause” of the issue. There will never be a single cause to an incident. There will always be a chain of problems with a trigger or proximate event. This is the one event that causes the system to finally topple over. We need a process that digs into the entire chain of events inclusive of the trigger. This is where the postmortem comes in. It is a cross-functional brainstorming meeting that not only identifies the root causes of a problem, but also help in identifying issues with process and training.

Postmortem Process – TIA

The purpose of a good postmortem is to find all of the contributing events and problems that caused an incident. We use a simple three step process called TIA. TIA stands for Timeline, Issues, and Actions.

AKF postmortem process

First, we create a timeline of events leading up the issue, as well as the timeline of all the actions taken to restore service. There are multiple ways to collect the timeline of events. Some companies have a scribe that records events during the incident process. Increasingly, we are seeing companies use chat tools like Slack to record events related to restoration. The timestamp in Slack for the message is a good place to extract the timeline. Don’t start your timeline at the beginning of the incident. It starts with the activities prior to the incident that cause the triggering event (e.g. a code deployment). During the postmortem meeting, augment the timeline with additional details.

The second part of TIA is Issues. This is where we walkthrough the timeline and identify issues. We want to focus on people, process, and technology. We want to capture all of the things that either allowed the incident to happen (e.g. lack of monitoring), directly triggered it (e.g. a code push), or increased the time to restore the system to a stable state (e.g. could get the right people on the call). List each issue separately.  At this point, there is no discussion about fixing issues, we only focus on the timeline and identifying issues. There is also no reference to ownership. We also don’t want to assign blame. We want a process that provides constructive feedback to solve problems.

Avoid the tendency to find a single triggering event and stop. Make sure you continue to dig into the issues to determine why things happened the way they did. We like to use the “5-whys” methodology to explore root causes. This entails repeatedly asking questions about why something happened. The answer to one question becomes the basis for the next. We continue to ask why until we have identified the true causes of the problems.

AKF postmortem process
AKF postmortem process

Postmortem Anti-Patterns

Here is a summary of anti-patterns we see when companies conduct postmortems:

                                                             
Anti-PatternBest Practice
Not conducting a postmortem after a serious (e.g. Sev 1) incidentConduct a postmortem within a week after a serious incident
Assigning blameAvoid blame and keep it constructive
Not having the right people involvedAssemble a cross functional team of people involved or needed to resolve problems
Using a postmortem block (e.g. multiple postmortems during a 1-hour session every two weeks)Dedicate time for a postmortem based on the severity of the incident
Lack of ownership of identified tasksMake one person accountable to complete a task within an appropriate timeframe
Not digging far enough into issues (finding a single root cause)Use the 5-Why methodology to identify all of the causes for an issue

Incidents will always happen. What you do after service restoration will determine if the problem occurs again. A structured, timely postmortem process will help identify the issues causing outages and help prevent their reoccurrence in the future. It also fosters a culture of learning from your mistakes without blame.

Are you struggling with the same issues impacting your site? Do you know you should be conducting postmortems but don’t know how to get started? AKF can help you establish critical incident management and postmortem processes. Call us – we can help!

Subscribe to the AKF Newsletter

Contact Us

Top 3 Failures in Digital Transformations

July 11, 2019  |  Posted By: Marty Abbott

Attempting to transform a company to compete effectively in the Digital Economy is difficult to say the least.  In the experience of AKF Partners, it is easier to be “born digital” than to transform a successful, long tenured business, to compete effectively in the Digital age. 

There is no single guaranteed fail-safe path to transformation.  There are, however, 10 principles by which you should abide and 3 guaranteed paths to failure. 

Avoid these 3 common mistakes at all costs or suffer a failed transformation.

Top 3 Digital Transformation Failures

Having the Wrong Team and the Wrong Structure

If you have a successful business, you very likely have a very bright and engaged team.  But unless a good portion of your existing team has run a successful “born digital” business, or better yet transformed a business in the digital age, they don’t have the experience necessary to complete your transformation in the timeframe necessary for you to compete.  If you needed lifesaving surgery, you wouldn’t bet your life on a doctor learning “on the job”.  At the very least, you’d ensure that doctor was alongside a veteran and more than likely you would find a doctor with a successful track record of the surgery in question.  You should take the same approach with your transformation.

This does not mean that you need to completely replace your team.  Companies have been successful with organization strategies that include augmenting the current team with veterans.  But you need new, experienced help, as employees on your team. 

Further, to meet the need for speed of the new digital world, you need to think differently about how you organize.  The best, fastest performing Digital teams organize themselves around the outcomes they hope to achieve, not the functions that they perform.  High performing digital teams are

It also helps to hire a firm that has helped guide companies through a transformation.  AKF Partners can help. 

Planning Instead of Doing

The digital world is ever evolving.  Plans that you make today will be incorrect within 6 months.  In the digital world, no plan survives first contact with the enemy.  In the old days of packaged software and brick and mortar retail, we had to put great effort into planning to reduce the risk associated with being incorrect after rather long lead times to project completion.  In the new world, we can iterate nearly at the speed of thought.  Whereas being incorrect in the old world may have meant project failure, in the new world we strive to be incorrect early such that we can iterate and make the final solution correct with respect to the needs of the market.  Speed kills the enemy.

Eschew waterfall models, prescriptive financial models and static planning in favor of Agile methodologies, near term adaptive financial plans and OKRs.  Spend 5 percent of your time planning and 95% of your time doing.  While in the doing phase, learn to adapt quickly to failures and quickly adjust your approach to market feedback and available data. 

The successful transformation starts with a compelling vision that is outcome based, followed by a clear near-term path of multiple small steps.  The remainder of the path is unclear as we want the results of our first few steps to inform what we should do in the next iteration of steps to our final outcome.  Transformation isn’t one large investment, but a series of small investments, each having a measurable return to the business.

Knowing Instead of Discovering

Few companies thrive by repeatedly being smarter than the market.  In fact, the opposite is true – the Digital landscape is strewn with the corpses of companies whose hubris prevented them from developing the real time feedback mechanisms necessary to sense and respond to changing market dynamics.  Yesterdays approaches to success at best have diminishing returns today and at worst put you at a competitive disadvantage.

Begin your journey as a campaign of exploration.  You are finding the best path to success, and you will do it by ensuring that every solution you deploy is instrumented with sensors that help you identify the efficacy of the solution in real time.  Real time data allows us to inductively identify patterns that form specific hypothesis.  We then deductively test these hypotheses through comparatively low-cost solutions, the results of which help inform further induction.  This circle of induction and deduction propels us through our journey to success.

Subscribe to the AKF Newsletter

Contact Us

If You Are Not Measuring, You Are Not Managing

July 10, 2019  |  Posted By: Bill Armelin

image of hands holding a caliper with the word goals in between

We are surprised at how often we go into a client and find that management does not have any metrics for their teams. The managers respond that they don’t want to negatively affect the team’s autonomy or that they trust the team to do the right things. While trusting your teams is a good thing, how do you know what they are doing is right for the company? How can you compare one team to another? How do you know where to focus on improvements?

Recently, we wrote an article about team autonomy, discussing how an empowered team is autonomous within a set of constraints. The article creates an analogy to driving a car, with the driver required to reach a specific destination, but empowered to determine WHAT path to take and WHY she takes it. She has gauges, such as a speedometer to give feedback on whether she is going too fast or too slow. Imagine driving a car without a speedometer. You will never know if you are sticking to the standard (the speed limit) or when you will get to where you need to go (velocity).

As a manager, it is your responsibility to set the appropriate metrics to help your teams navigate through the path to building your product. How can you hold your teams to certain goals or standards if you can’t tell them where they are in relation to the goal or standard today? How do you know if the actions you are taking are creating or improving shareholder value?

What metrics do you set for your teams? It is an important question. Years ago, while working at a Big 6 consulting firm, I had the pleasure of working with a very astute senior manager. We were redesigning manufacturing floors into what became Lean Manufacturing. He would walk into a client and ask them what the key metrics were. He would then proceed to tell them what their key issues were. He was always right. With metrics, you get what you measure. If you align the correct metrics with key company goals, then all is great. If you misalign them, you end up with poor performance and questionable behaviors.

So, what are the right metrics for a technology team? In 2017, we published an article on what we believe are the engineering metrics by which you should measure your teams. Some of the common metrics we focused on were velocity, efficiency, and cost. At initial glance, you might think that these seem “big brother-ish.” But, in reality, these metrics will provide your engineering teams with critical feedback to how they are doing. Velocity helps a team identify structural defects within the team (and should not be used to compare against other teams or push them to get more done). Efficiency helps the teams identify where they are losing precious development time to less valuable activities, such as meetings, interviews and HR training. It helps them and their managers quantify the impact of non-development and reduce such activities.

Cost helps the team identify how much they are spending on technology. We have seen this metric particularly used effectively in companies deploying to the cloud. Many companies allow cloud spending to significantly and uncontrollably increase as they grow. Looking at costs exposes things like the need for autoscaling to reduce the number of instances required during off peak times, or to purge unused instances that should be shut down.

The key to avoiding metrics from being perceived as overbearing is to keep them transparent. The teams must understand the purpose of the metric and how it is calculated. Don’t use them punitively. Use them to help the teams understand how they are doing in relation to the larger goals. How do you align the higher-level company goals to the work you teams are performing? We like to use Objectives and Key Results, or OKRs. This concept was created by Andy Grove at Intel and brought to Google by John Doerr. The framework aims to align higher level “objectives” to measurable “key results.” An objective at one level has several key results. These key results become the objectives for the next level down and defines another set of key results at that level. This continues all the way down to the lowest levels of the company resulting in alignment of key results and objectives across the entire company.

Choosing the Right Metric

Metrics-driven institutions demonstrably outperform those that rely on intuition or “gut feel.” This stated, poorly chosen metrics or simply too many metrics may hinder performance.

  1. A handful of carefully chosen metrics. Choose a small number of key metrics over a large volume. Ideally, each Agile team should be evaluated/tasked with improving 2-3 metrics (no more than 5). (Of note, in numerous psychological studies, the quality of decision-making has actual been shown to decrease when too much information is presented).
  2. Easy to collect and or calculate. A metric such as “Number of Customer Service Tickets per Week” although crude, is better than “Engineer Hours spent fixing service” as it requires costly time/effort to collect.
  3. Directly Controllable by the Team. Assigning a metric such as “Speed and Accuracy of Search” to a Search Service is preferred to “Overall Revenue” which is less directly controllable.
  4. Reflect the Quality of Service. The number of abandoned shopping carts reflects the quality of a Shopping Cart service, whereas number of shopping cart views is not necessarily reflective of service quality.
  5. Difficult to Game. The innate human tendency to game any system should be held in check by selecting the right metrics. Simple velocity measures are easily gamed while the number of Sev 1 incidents cannot be easily gamed.
  6. Near Real Time Feedback. Metrics that can be collected and presented over short-time intervals are most desirable. Information is most valuable when fresh — Availability week over week is better than a yearly availability measure.

Managers are responsible for the performance of their teams in relation to the company’s objectives and how they create shareholder value. Measuring how your teams are performing against or their contribution to those goals is only speculation if you don’t have the correct measurements and metrics in place. The bottom line is, “If you are not measuring, you are not managing.”

Are you having difficult defining the right metrics for your teams? Are you interested in defining OKRs but don’t know where or how to get started? AKF has helped many companies identify and implement key metrics, as well as implementing OKRs.  We have over 200 years of combined experience helping companies ensure their organizations, processes, and architecture are aligned to the outcomes they desire. Contact us, we can help.

Subscribe to the AKF Newsletter

Contact Us

Data Center Lifespan Risk

July 1, 2019  |  Posted By: Greg Fennewald

As technology professionals, managing risk is an important part of the value we provide to the business.  Risk can take many forms, including threats to availability, scalability, information security, and time to market.  Physical layer risks from the data center realm can severely impact availability, as the events of the February 2019 Wells Fargo outage demonstrate.

Transitioning Away from On-Prem Hosting

Over the last decade, knowledge of data center architecture, operating principles, capabilities, and associated risks has decreased in general due to the rise of managed hosting and especially cloud hosting.  This is particularly true for small and medium sized companies, which may have chosen cloud hosting early on and thus never have dealt with colocation or owned data centers.  This is not necessarily a bad trend – why devote resources to learn domains that are not core to your value proposition? 

While knowledge of data center geekdom may have decreased, the risks associated with data centers has not substantially changed.  Even the magic pixie dust of cloud hosting is a data center at its core, albeit with a degree of operational excellence exceeding the stereotypical company-owned data center + colo combination.

Given that technologists can mitigate data center risks by choosing cloud hosting with a major provider capable of mastering data center operations, why spend any time to learn about data center risks?

  • Cloud hosting sites do encounter failures.  The ability to ask informed questions during the vendor selection process can help optimize the availability for your business.
  • Business or regulatory changes may force a company to use colocation to meet data residency or other requirements.
  • A company may grow to the size where owning data centers makes business sense for a portion of their hosting need.
  • A hosting provider could exit the business or face bankruptcy, forcing tenants to take over or move on short notice.  Been there, done that, got the T shirt.

Data Center Lifespan Risk

For the purposes of this article, we will consider data center lifespan risk.  We define this risk as the probability of an infrastructure failure causing significant, and possibly complete, business disruption and the level of difficulty in restoring functionality.

A chart of data center lifespan risk resembles a bathtub – a high level of failures as the site is first built and undergoing 5 levels of commissioning towards the left side of the chart, followed by a long period of lower threat that can extend 15 years or more.  As time continues to march on, the risk rises again, creating the right-hand side of the bathtub curve.

data center lifespan risk

The risk of failure increases over time as the useful service life of infrastructure components approach their end.  The risk of failure approaches unity over a sufficiently long-time span.

Service Life Examples

Below are some service life estimates based on our experience for critical data center components that are properly maintained;

                                               
ComponentService LifeComment
UPS batteries4 years VRLA, 12+ wet cellBattery string monitoring strongly recommended
Diesel generator30+ years12,000+ hours before overhaul, run 100 or less annually
Main switchgear PLC15 + yearsPLC model EOL is the risk
CRAH/CRAC fan motors12+ yearsThe magic smoke wants to escape
Galvanized cooling tower wet surfaces10 yearsVaries with water chemistry, stainless steel worth the cost
Electrical distribution board25+ yearsEOL of breaker style and PLC is the risk
Chilled water piping30 yearsDesign for continuous duty, ~ 7 FPS flow velocity


All the above examples are measured in years.  If you are in the early years of a data center lifespan, there’s not a lot to worry about other than batteries.  Most growing companies are more concerned about adequate capacity, availability, and cost when they create their hosting strategy.  Not much thought is given to an exit strategy.  Such an effort is probably not worth it for a startup company, but established companies need to be thinking beyond next quarter and next year.

If your product or service can survive the loss of a single hosting site without impact (i.e. multi-active sites with validated traffic shifts), you could afford to run a bit deeper into the service life timeline.  If you can’t - or, like Wells Fargo thought you could but learned the hard way that was not the case - you need to plan ahead to mitigate these risks.

The Risks

As mentioned before, the risks we want to mitigate are an impactful failure and a complex restoration after a failure.  By complex, we mean trying to find parts and trained technicians for components that were EOL 5 years ago and end of OEM support 18 months ago.  Not a fun place to be.  Would you feel comfortable running your online business with switches and routers that are EOL and EOS?  Hopefully not.  Why would you do so for your hosting location?

Mitigating the Risks

The best way to mitigate the risk of an impactful infrastructure failure is to be able to survive the loss of a hosting site regardless of type with business disruption that is acceptable to the business and customers.  That could vary, your hosting solution should be tailored to the needs of the business.

Some thoughts on aging hosting sites;

     
  • All the characteristics that make cloud hosting taste great and be less filling (containerization, automation, infrastructure as code, orchestration, etc.) can also make the effort to stand up a new site and exit an old one much less onerous.
  •  
  • If you are committed to an owned data center or colo, moving to a newer site is the best choice.  Could you combine a move with a tech refresh cycle?  Could the aging data center fulfill a different purpose such as hosting development and QA environments?  Such environments should have less business impact from a failure, and you can squeeze out the last few years of life from that site.
  •  
  • You can purchase extra spare parts for components nearing EOL or EOS and send technicians to training courses.  This can mitigate risk but is really analogous to convincing yourself that you can scale your DB by tuning the SQL queries.  Viable only to add 6 or 12 months to a move/exit timeline.

Just about any of the components mentioned above in the useful life estimate can be replaced, especially if the data center can be shut down for weeks or months to make the replacement and test the systems.  Trying to replace components while still serving traffic is extremely risky.  Very few data centers have the redundancy to replace electrical components while still providing conditioned power and cooling to the server rooms.  Those sites that can usually cannot do so without reducing their availability.  We’ve had to take a dual UPS (2N) site to a single UPS source (N) for a week to correct a serious design flaw.  Single corded is not appropriate if your DR plan checks an audit box and not much else

Conclusion

The tremendous popularity of cloud hosting does not alleviate the need to understand physical layer risks, including data center lifespan risks.  Understanding them enables technology leaders to mitigate the risks.

Interested in learning more?  Need assistance with hosting strategy?  Considering a transition to SaaS? AKF Partners can help.

 

Subscribe to the AKF Newsletter

Contact Us

Transforming to SaaS is not just a technology change

June 19, 2019  |  Posted By: Larry Steinberg


Transforming a traditional on-premise product and company to a SaaS model is currently in vogue and has many broad-reaching benefits for both producers and consumers of services. These benefits span financial, supportability, and consumption simplification. 

In order to achieve the SaaS benefits, your company must address the broad changes necessitated by SaaS and not just the product delivery and technology changes. You must consider the impact on employees, delivery, operations/security, professional services, go to market, and financial impacts.

Employee Transformation

The employee base is one key element of change – moving from a traditional ‘boxed’ software vendor to an ‘as a Service’ company changes not only the skill set but also the dynamics of the engagement with your customers. Traditionally staff has been accustomed to having an arms-length approach for the majority of customer interactions. This traditional process has consisted of a factory that’s building the software, bundling, and a small group (if any) who ‘touch’ the customer. As you move ‘to as a service’ based product, the onus on ensuring the solution is available 24x7 is on everyone within your SaaS company. Not only do you require the skillsets to ensure infrastructure and reliability – but also the support model can and should change.

Now that you are building SaaS solutions and deploy them into environments you control, the operations are much ‘closer’ to the engineers who are deploying builds. Version upgrades happen faster, defects will surface more rapidly, and engineers can build in monitoring to detect issues before or as your customers encounter them. Fixes can be provided much faster than in the past and strict separation of support and engineering organizations are no longer warranted. Escalations across organizations and separate repro steps can be collapsed. There is a significant cultural shift for your staff that has to be managed properly. It will not be natural for legacy staff to adopt a 24x7 mindset and newly minted SaaS engineers likely don’t have the industry or technology experience needed. Finding a balance of shifting culture, training, and new ‘blood’ into the organization is the best course of action.

Passion For Service Delivery

Having Services available 24x7 requires a real passion for service delivery as teams no longer have to wait for escalations from customers, now engineers control the product and operating environment. This means they can see availability and performance in real time. Staff should be proactive about identifying issues or abnormalities in the service. In order to do this the health and performance of what they built needs to be at the forefront of their everyday activities. This mindset is very different than the traditional onpremise software delivery model.

Security Implications

Shifting the operations from your customer environments to your own environments also has a security aspect. Operating the SaaS environment for customers shifts the liability from them to you. The security responsibilities expand to include protecting your customer data, hardening the environments, having a practiced plan of incident remediation, and rapid response to identified vulnerabilities in the environment or application.

Finance & Accounting

Finance and accounting are also impacted by this shift to SaaS - both on the spend/capitalization strategy as well as cost recovery models. The operational and security components are a large cost consideration which has been shifted from your customers to your organization and needs to be modeling into SaaS financials. Pricing and licensing shifts are also very common. Moving to a utility or consumption model is fairly mainstream but is generally new for the traditional product company and its customers. Traditional billing models with annual invoices might not fit with the new approach and systems + processes will need to be enhanced to handle. If you move to a utility-based model both the product and accounting teams need to partner on a solution to ensure you get paid appropriately by your customers.

Customer Service

Think through the impacts on your customer support team. Given the speed at which new releases and fixes become available the support team will need a new model to ensure they remain up to date as these delivery timeframes will be much more rapid than in the past and they must stay ahead of your customer base.

Your go to market strategies will most likely also need to be altered depending on the market and industry. To your benefit, as a SaaS company, you now have access to customer behavior and can utilize this data in order to approach opportunities within the customer base. Regarding migration, you’ll need a plan which ensures you are the best option amongst your competitors.

Most times at AKF we see companies who have only focused on product and technology changes when moving to SaaS but if the whole company doesn’t move in lockstep then progress will be limited.  You are only as strong as your weakest link.

We’ve helped companies of all sizes transition their technology – AND organization – from on-premises to the cloud through SaaS conversion. Give us a call – we can help!

 

Subscribe to the AKF Newsletter

Contact Us

What's the difference between VMs & Containers?

May 29, 2019  |  Posted By: Robin McGlothin

VMs vs Containers

Inefficiency and down time have traditionally kept CTO’s and IT decision makers up at night.  Now, new challenges are emerging driven by infrastructure inflexibility and vendor lock-in, limiting Technology more than ever and making strategic decisions more complex than ever.  Both VMs and containers can help get the most out of available hardware and software resources while easing the risk of vendor lock-in limitation. 

Containers are the new kids on the block, but VMs have been, and continue to be, tremendously popular in data centers of all sizes.  Having said that, the first lesson to learn, is containers are not virtual machines.  When I was first introduced to containers, I thought of them as light weight or trimmed down virtual instances.  This comparison made sense since most advertising material leaned on the concepts that containers use less memory and start much faster than virtual machines – basically marketing themselves as VMs.  Everywhere I looked, Docker was comparing themselves to VMs.  No wonder I was a bit confused when I started to dig into the benefits and differences between the two.

As containers evolved, they are bringing forth abstraction capabilities that are now being broadly applied to make enterprise IT more flexible. Thanks to the rise of Docker containers it’s now possible to more easily move workloads between different versions of Linux as well as orchestrate containers to create microservices.  Much like containers, a microservice is not a new idea either. The concept harkens back to service-oriented architectures (SOA). What is different is that microservices based on containers are more granular and simpler to manage. More on this topic in a blog post for another day!
If you’re looking for the best solution for running your own services in the cloud, you need to understand these virtualization technologies, how they compare to each other, and what are the best uses for each. Here’s our quick read.

VM’s vs. Containers – What’s the real scoop?

One way to think of containers vs. VMs is that while VMs run several different operating systems on one server, container technology offers the opportunity to virtualize the operating system itself.


                                                               
      Figure 1 – Virtual Machine                                                         Figure 2 - Container    

VMs help reduce expenses. Instead of running an application on a single server, a virtual machine enables utilizing one physical resource to do the job of many. Therefore, you do not have to buy, maintain and service several servers.  Because there is one host machine, it allows you to efficiently manage all the virtual environments with a centralized tool – the hypervisor. The decision to use VMs is typically made by DevOps/Infrastructure Team.  Containers help reduce expenses as well and they are remarkably lightweight and fast to launch.  Because of their small size, you can quickly scale in and out of containers and add identical containers as needed. 

Containers are excellent for Continuous Integration and Continuous Deployment (CI/CD) implementation. They foster collaborative development by distributing and merging images among developers.  Therefore, developers tend to favor Containers over VMs.  Most importantly, if the two teams work together (DevOps & Development) the decision on which technology to apply (VMs or Containers) can be made collaboratively with the best overall benefit to the product, client and company.

What are VMs?

The operating systems and their applications share hardware resources from a single host server, or from a pool of host servers. Each VM requires its own underlying OS, and the hardware is virtualized. A hypervisor, or a virtual machine monitor, is software, firmware, or hardware that creates and runs VMs. It sits between the hardware and the virtual machine and is necessary to virtualize the server.

IT departments, both large and small, have embraced virtual machines to lower costs and increase efficiencies.  However, VMs can take up a lot of system resources because each VM needs a full copy of an operating system AND a virtual copy of all the hardware that the OS needs to run.  This quickly adds up to a lot of RAM and CPU cycles. And while this is still more economical than bare metal for some applications this is still overkill and thus, containers enter the scene.

Benefits of VMs
• Reduced hardware costs from server virtualization
• Multiple OS environments can exist simultaneously on the same machine, isolated from each other.
• Easy maintenance, application provisioning, availability and convenient recovery.
• Perhaps the greatest benefit of server virtualization is the capability to move a virtual machine from one server to another quickly and safely. Backing up critical data is done quickly and effectively because you can effortlessly create a replication site.
Popular VM Providers
• VMware vSphere ESXi, VMware has been active in the virtual space since 1998 and is an industry leader setting standards for reliability, performance, and support.
• Oracle VM VirtualBox - Not sure what operating systems you are likely to use? Then VirtualBox is a good choice because it supports an amazingly wide selection of host and client combinations. VirtualBox is powerful, comes with terrific features and, best of all, it’s free.
• Xen - Xen is the open source hypervisor included in the Linux kernel and, as such, it is available in all Linux distributions. The Xen Project is one of the many open source projects managed by the Linux Foundation.
• Hyper-V - is Microsoft’s virtualization platform, or ‘hypervisor’, which enables administrators to make better use of their hardware by virtualizing multiple operating systems to run off the same physical server simultaneously.
• KVM - Kernel-based Virtual Machine (KVM) is an open source virtualization technology built into Linux. Specifically, KVM lets you turn Linux into a hypervisor that allows a host machine to run multiple, isolated virtual environments called guests or virtual machines (VMs).

What are Containers?

Containers are a way to wrap up an application into its own isolated ”box”. For the application in its container, it has no knowledge of any other applications or processes that exist outside of its box. Everything the application depends on to run successfully also lives inside this container. Wherever the box may move, the application will always be satisfied because it is bundled up with everything it needs to run.

Containers virtualize the OS instead of virtualizing the underlying computer like a virtual machine.  They sit on top of a physical server and its host OS — typically Linux or Windows. Each container shares the host OS kernel and, usually, the binaries and libraries, too. Shared components are read-only. Sharing OS resources such as libraries significantly reduces the need to reproduce the operating system code and means that a server can run multiple workloads with a single operating system installation. Containers are thus exceptionally light — they are only megabytes in size and take just seconds to start. Compared to containers, VMs take minutes to run and are an order of magnitude larger than an equivalent container.

In contrast to VMs, all that a container requires is enough of an operating system, supporting programs and libraries, and system resources to run a specific program. This means you can put two to three times as many applications on a single server with containers than you can with VMs. In addition, with containers, you can create a portable, consistent operating environment for development, testing, and deployment. This is a huge benefit to keep the environments consistent.

Containers help isolate processes through differentiation in the operating system namespace and storage.  Leveraging operating system native capabilities, the container isolates process space, may create temporary file systems and relocate process “root” file system, etc.

Benefits of Containers

One of the biggest advantages to a container is the fact you can set aside less resources per container than you might per virtual machine. Keep in mind, containers are essentially for a single application while virtual machines need resources to run an entire operating system. For example, if you need to run multiple instances of MySQL, NGINX, or other services, using containers makes a lot of sense. If, however you need a full web server (LAMP) stack running on its own server, there is a lot to be said for running a virtual machine. A virtual machine gives you greater flexibility to choose your operating system and upgrading it as you see fit. A container by contrast, means that the container running the configured application is isolated in terms of OS upgrades from the host.

Popular Container Providers

1. Docker - Nearly synonymous with containerization, Docker is the name of both the world’s leading containerization platform and the company that is the primary sponsor of the Docker open source project.
2. Kubernetes - Google’s most significant contribution to the containerization trend is the open source containerization orchestration platform it created.
3. Although much of early work on containers was done on the Linux platform, Microsoft has fully embraced both Docker and Kubernetes containerization in general.  Azure offers two container orchestrators Azure Kubernetes Service (AKS) and Azure Service Fabric.  Service Fabric represents the next-generation platform for building and managing these enterprise-class, tier-1, applications running in containers.
4. Of course, Microsoft and Google aren’t the only vendors offering a cloud-based container service. Amazon Web Services (AWS) has its own EC2 Container Service (ECS).
5. Like the other major public cloud vendors, IBM Bluemix also offers a Docker-based container service.
6. One of the early proponents of container technology, Red Hat claims to be “the second largest contributor to the Docker and Kubernetes codebases,” and it is also part of the Open Container Initiative and the Cloud Native Computing Foundation. Its flagship container product is its OpenShift platform as a service (PaaS), which is based on Docker and Kubernetes.

Uses for VMs vs Uses for Containers

Both containers and VMs have benefits and drawbacks, and the ultimate decision will depend on your specific needs, but there are some general rules of thumb.
• VMs are a better choice for running applications that require all of the operating system’s resources and functionality when you need to run multiple applications on servers or have a wide variety of operating systems to manage.
• Containers are a better choice when your biggest priority is maximizing the number of applications running on a minimal number of servers.

Container orchestrators

Because of their small size and application orientation, containers are well suited for agile delivery environments and microservice-based architectures. When you use containers and microservices, however, you can easily have hundreds or thousands of components in your environment. You may be able to manually manage a few dozen virtual machines or physical servers, but there is no way you can manage a production-scale container environment without automation. The task of automating and managing a large number of containers and how they interact is known as orchestration.

Scaling Workloads

Scalability of containerized workloads is a completely different process from VM workloads. Modern containers include only the basic services their functions require, but one of them can be a web server, such as NGINX, which also acts as a load balancer. An orchestration system, such as Google Kubernetes is capable of determining, based upon traffic patterns, when the quantity of containers needs to scale out; can replicate container images automatically; and can then remove them from the system.
For most, the ideal setup is likely to include both. With the current state of virtualization technology, the flexibility of VMs and the minimal resource requirements of containers work together to provide environments with maximum functionality.

If your organization is running many instances of the same operating system, then you should look into whether containers are a good fit. They just might save you significant time and money over VMs.

Subscribe to the AKF Newsletter

Contact Us

Results and Outcomes – Why Companies Fail

April 21, 2019  |  Posted By: Pete Ferguson

picture of woman looking at iPad with graphs and reports

Results = Results

Apple, Google, and Amazon don’t exist based on a Utopian promise of what is to come – though certainly those promises keep their customers engaged and hopeful for the future.  These companies exist because of the value they have delivered to date and created expectations for us as consumers for a consistent result.

I’m amazed at how simple of a concept Results = Results is – yet constantly we see companies struggle with the concept and we see it as a recurring theme in our 2-3 day workshops with our clients and something we look for in our technical due diligence reviews.

As a corporate survivor of 18 years, looking back I can see where I was distracted by day-today meetings, firefighting, and getting hijacked by initiatives that seemed urgent to some senior leader somewhere – but were not really all that important. 

Suddenly the quarter or half was over and it was time to do a self-evaluation and realize all the effort, all the stress, all the work, wasn’t getting the desired results I’d committed to earlier in the year and I’d have to quickly shuffle and focus on getting stuff done.

While keeping the lights on is important, it diminishes in importance when to do so is at the expense of innovating and adding value to our customers – not just struggling to maintain the status quo.

Outcomes and Key Results (OKRs)

Adapted from John Doerr’s “Objectives” and key results – at AKF we find it more to the point to focus on “outcomes.”  Objectives (definition: a thing aimed at or sought) are a path where as “outcomes” are a destination that is clearly defined to know you have arrived.

Outcomes are the only things that matter to our customers.  Hearing about a desired Utopian state is great and may excite customers to stick around for awhile and put up with current limitations or lack of functionality – but being able to clearly define that you have delivered an outcome and the value to your customers is money in the bank and puts us ahead of our competition.

Yet the majority of our clients have teams who are so focused on cost-cutting for many years that they leave a wide open berth for young startups and their competition to move in and start delivering better outcomes for the customer.

How to Focus on Results and Outcomes

It is easy to become distracted in the day-to-day meetings, incident escalations, post mortems, ect.  As an outside third party, however, it is blatantly obvious to us usually within the first hour of meeting with a new team whether or not they are properly focused.

Here are some of the common themes and questions to ask:

  • Is there effective monitoring to discover issues before our customers do?
  • Do we monitor business metrics and weigh the success (and failure) of initiatives based not on pushing out a new platform or product but whether or not there was significant ROI?
  • How much time is spent limping along to keep a legacy application up and running vs. innovating?
  • Do we continually push off hardware/software upgrades until we are held hostage by compliance and/or end-of-life serviceability by the vendor?

Hopefully the common theme here is obvious – what is the customer experience and how focused are we on them vs internal castle building or day-to-day distractions?

Recently in a team interview the IT “keep the lights on” team told us they were working to be strategic and innovative by hiring new interns.  While the younger generations are definitely less prone to accepting the status quo, the older generation are conceding that they don’t want to be part of the future.  And unfortunately they may not be sooner than planned if they don’t grasp their role in driving innovation and the importance of applying their institutional knowledge.

Not focusing on customer/shareholder related outcomes means that shareholders and customers are negatively impacted.  Here are a few problems with the associated outcomes I’ve seen in my short tenure with AKF and previously as a corporate crusader: 

Observation:

Monolithic applications to save costs: Why organizations do it?  Short term cost savings focus development on one application.  Allows teams to only focus on development of their one area.

Outcomes:

  • One failure means everyone fails.
  • Organizations are unable to scale vis-a-vis Conway’s Law (organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations).
  • Often the teams who develop the monolith don’t have to support it, so they don’t understand why it is a problem.
  • Teams become very focused on solving the problems caused by the monolith just long enough to get it back up and running but fail to see the long-term recurrent loss to the business and wasted hours that could have been spent on innovating new products and services.
  • Catastrophic failure - Intuit pre SaaS, early renditions of iTunes and annual outages when everyone tried to redeem gift cards Christmas morning, early days of eBay, stay tuned, many more yet to come.

Observation:

Ongoing cost cutting to “make the quarter.”

Outcomes:

  • MIssed tech refresh results in machines and operating systems no longer supported and vulnerable to external attacks.
  • Teams become hyper focused on shutting down additional spending, but never take the time to calculate how much wasted effort is spent on keeping the lights on for aging systems with a declining market share or slowed new customer adoption rate.
  • Start saying no to the customer based on cost opening the door for new upstarts and the competition to take away market share.

Observation:

Focusing efforts on Sales Department’s latest contract.

Outcomes:

  • Too much investment in legacy applications instead of innovating new products.
  • “A-team” developers become firefighters to keep customers happy.
  • Sales team creates moral hazards for development teams (i.e. “I smoke, but you get lung cancer” - teams create problems for other teams to fix instead of owning the end-to-end lifecycle of a product)

Observation:

Focus is on mergers and acquisitions instead of core strengths and products.

Outcomes:

  • Distracted organizations give way for upstarts and competition.
  • Become okay or maybe even good at a lot of things but not great at one or two things.
  • Company culture becomes very fragmented and silos create red tape that slows or stifles innovation.

Conclusions

Results = Results.  And nothing else equals results.


If OKRs are not measuring the results needed to compete and win, then teams are wasting a lot of effort, time, and money and the competition is getting a free pass to innovate and outperform your ability to delight and please your customers.

Need an outside view of your organization to help drive better results and outcomes?  Contact us!

Photo by rawpixel.com from Pexels

Subscribe to the AKF Newsletter

Contact Us

The AKF Difference

December 4, 2018  |  Posted By: Marty Abbott

akf difference

During the last 12 years, many prospective clients have asked us some variation of the following questions: “What makes you different?”, “Why should we consider hiring you?”, or “How are you differentiated as a firm?”.

The answer has many components.  Sometimes our answers are clear indications that we are NOT the right firm for you.  Here are the reasons you should, or should not, hire AKF Partners:

Operators and Executives – Not Consultants

Most technology consulting firms are largely comprised of employees who have only been consultants or have only run consulting companies.  We’ve been in your shoes as engineers, managers and executives.  We make decisions and provide advice based on practical experience with living with the decisions we’ve made in the past.

Engineers – Not Technicians

Educational institutions haven’t graduated enough engineers to keep up with demand within the United States for at least forty years.  To make up for the delta between supply and demand, technical training services have sprung up throughout the US to teach people technical skills in a handful of weeks or months.  These technicians understand how to put building blocks together, but they are not especially skilled in how to architect highly available, low latency, low cost to develop and operate solutions.

The largest technology consulting companies are built around programs that hire employees with non-technical college degrees.  These companies then teach these employees internally using “boot camps” – creating their own technicians.

Our company is comprised almost entirely of “engineers”; employees with highly technical backgrounds who understand both how and why the “building blocks” work as well as how to put those blocks together.

Product – Not “IT”

Most technology consulting firms are comprised of consultants who have a deep understanding of employee-facing “Information Technology” solutions.  These companies are great at helping you implement packaged software solutions or SaaS solutions such as Enterprise Resource Management systems, Customer Relationship Management Systems and the like.  Put bluntly, these companies help you with solutions that you see as a cost center in your business.  While we’ve helped some partners who refuse to use anyone else with these systems, it’s not our focus and not where we consider ourselves to be differentiated.

Very few firms have experience building complex product (revenue generating) services and platforms online.  Products (not IT) represent nearly all of AKF’s work and most of AKF’s collective experience as engineers, managers and executives within companies.  If you want back-office IT consulting help focused on employee productivity there are likely better firms with which you can work.  If you are building a product, you do not want to hire the firms that specialize in back office IT work.

Business First – Not Technology First

Products only exist to further the needs of customers and through that relationship, further the needs of the business.  We take a business-first approach in all our engagements, seeking to answer the questions of:  Can we help a way to build it faster, better, or cheaper?  Can we find ways to make it respond to customers faster, be more highly available or be more scalable?  We are technology agnostic and believe that of the several “right” solutions for a company, a small handful will emerge displaying comparatively low cost, fast time to market, appropriate availability, scalability, appropriate quality, and low cost of operations.

Cure the Disease – Don’t Just Treat the Symptoms

Most consulting firms will gladly help you with your technology needs but stop short of solving the underlying causes creating your needs:  the skill, focus, processes, or organizational construction of your product team.  The reason for this is obvious, most consulting companies are betting that if the causes aren’t fixed, you will need them back again in the future.

At AKF Partners, we approach things differently.  We believe that we have failed if we haven’t helped you solve the reasons why you called us in the first place.  To that end, we try to find the source of any problem you may have.  Whether that be missing skillsets, the need for additional leadership, organization related work impediments, or processes that stand in the way of your success – we will bring these causes to your attention in a clear and concise manner.  Moreover, we will help you understand how to fix them.  If necessary, we will stay until they are fixed.

We recognize that in taking the above approach, you may not need us back.  Our hope is that you will instead refer us to other clients in the future.

Are We “Right” for You?

That’s a question for you, not for us, to answer.  We don’t employ sales people who help “close deals” or “shape demand”.  We won’t pressure you into making a decision or hound you with multiple calls.  We want to work with clients who “want” us to partner with them – partners with whom we can join forces to create an even better product solution.

 

Subscribe to the AKF Newsletter

Contact Us

The Importance of QA

November 20, 2018  |  Posted By: Robin McGlothin

“Quality in a service or product is not what you put into it.  It’s what the customer gets out of it.”  Peter Drucker


Importance of QA - Team sitting at conference table

The Importance of QA

High levels of quality are essential to achieving company business objectives. Quality can be a competitive advantage and in many cases will be table stakes for success. High quality is not just an added value, it is an essential basic requirement. With high market competition, quality has become the market differentiator for almost all products and services.

There are many methods followed by organizations to achieve and maintain the required level of quality. So, let’s review how world-class product organizations make the most out of their QA roles. But first, let’s define QA. 

According to Wikipedia, quality assurance is “a way of preventing mistakes or defects in products and avoiding problems when delivering solutions or services to customers. But there’s much more to quality assurance.”

There are numerous benefits of having a QA team in place:

  1. Helps increase productivity while decreasing costs (QA HC typically costs less)
  2. Effective for saving costs by detecting and fixing issues and flaws before they reach the client
  3. Shifts focus from detecting issues to issue prevention

Teams and organizations looking to get serious about (or to further improve) their software testing efforts can learn something from looking at how the industry leaders organize their testing and quality assurance activities. It stands to reason that companies such as Google, Microsoft, and Amazon would not be as successful as they are without paying proper attention to the quality of the products they’re releasing into the world.  Taking a look at these software giants reveals that there is no one single recipe for success. Here is how five of the world’s best-known product companies organize their QA and what we can learn from them.

Google: Searching for best practices

Google Search Logo
How does the company responsible for the world’s most widely used search engine organize its testing efforts? It depends on the product. The team responsible for the Google search engine, for example, maintains a large and rigorous testing framework. Since search is Google’s core business, the team wants to make sure that it keeps delivering the highest possible quality, and that it doesn’t screw it up.

To that end, Google employs a four-stage testing process for changes to the search engine, consisting of:

  1. Testing by dedicated, internal testers (Google employees)
  2. Further testing on a crowdtesting platform
  3. “Dogfooding,” which involves having Google employees use the product in their daily work
  4. Beta testing, which involves releasing the product to a small group of Google product end users

Even though this seems like a solid testing process, there is room for improvement, if only because communication between the different stages and the people responsible for them is suboptimal (leading to things being tested either twice over or not at all).

But the teams responsible for Google products that are further away from the company’s core business employ a much less strict QA process. In some cases, the only testing done by the developer responsible for a specific product, with no dedicated testers providing a safety net.

In any case, Google takes testing very seriously. In fact, testers’ and developers’ salaries are equal, something you don’t see very often in the industry.

Facebook: Developer-driven testing

Facebook does not employ any dedicated testers at all. Instead, the social media giant relies on its developers to test their own (as well as one another’s) work. Facebook employs a wide variety of automated testing solutions. The tools that are used range from PHPUnit for back-end unit testing to Jest (a JavaScript test tool developed internally at Facebook) to Watir for end-to-end testing efforts.

Like Google, Facebook uses dogfooding to make sure its software is usable. Furthermore, it is somewhat notorious for shaming developers who mess things up (breaking a build or causing the site to go down by accident, for example) by posting a picture of the culprit wearing a clown nose on an internal Facebook group. No one wants to be seen on the wall-of-shame!

Facebook recognizes that there are significant flaws in its testing process, but rather than going to great lengths to improve, it simply accepts the flaws, since, as they say, “social media is nonessential.” Also, focusing less on testing means that more resources are available to focus on other, more valuable things.

Rather than testing its software through and through, Facebook tends to use “canary” releases and an incremental rollout strategy to test fixes, updates, and new features in production. For example, a new feature might first be made available only to a small percentage of the total number of users.

                            Canary Incremental Rollout
Incremental rollout of features

By tracking the usage of the feature and the feedback received, the company decides either to increase the rollout or to disable the feature, either improving it or discarding it altogether.


Amazon: Deployment comes first

Amazon logo
Like Facebook, Amazon does not have a large QA infrastructure in place. It has even been suggested (at least in the past) that Amazon does not value the QA profession. Its ratio of about one test engineer to every seven developers also suggests that testing is not considered an essential activity at Amazon.

The company itself, though, takes a different view of this. To Amazon, the ratio of testers to developers is an output variable, not an input variable. In other words, as soon as it notices that revenue is decreasing or customers are moving away due to anomalies on the website, Amazon increases its testing efforts.

The feeling at Amazon is that its development and deployment processes are so mature (the company famously deploys software every 11.6 seconds!) that there is no need for elaborate and extensive testing efforts. It is all about making software easy to deploy, and, equally if not more important, easy to roll back in case of a failure.

Spotify: Squads, tribes and chapters

Spotify does employ dedicated testers. They are part of cross-functional teams, each with a specific mission. At Spotify, employees are organized according to what’s become known as the Spotify model, constructed of:

  1. Squads. A squad is basically the Spotify take on a Scrum team, with less focus on practices and more on principles. A Spotify dictum says, “Rules are a good start, but break them when needed.” Some squads might have one or more testers, and others might have no testers at all, depending on the mission.
  2. Tribes are groups of squads that belong together based on their business domain. Any tester that’s part of a squad automatically belongs to the overarching tribe of that squad.
  3. Chapters. Across different squads and tribes, Spotify also uses chapters to group people that have the same skillset, in order to promote learning and sharing experiences. For example, all testers from different squads are grouped together in a testing chapter.
  4. Guilds. Finally, there is the concept of a guild. A guild is a community of members with shared interests. These are a group of people across the organization who want to share knowledge, tools, code and practices.

                            Spotify Team Structure
Graphic showing how team guilds are built with experts on each team

Testing at Spotify is taken very seriously. Just like programming, testing is considered a creative process, and something that cannot be (fully) automated. Contrary to most other companies mentioned, Spotify heavily relies on dedicated testers that explore and evaluate the product, instead of trying to automate as much as possible.  One final fact: In order to minimize the efforts and costs associated with spinning up and maintaining test environments, Spotify does a lot of testing in its production environment.

Microsoft: Engineers and testers are one


Microsoft’s ratio of testers to developers is currently around 2:3, and like Google, Microsoft pays testers and developers equally—except they aren’t called testers; they’re software development engineers in test (or SDETs).

The high ratio of testers to developers at Microsoft is explained by the fact that a very large chunk of the company’s revenue comes from shippable products that are installed on client computers & desktops, rather than websites and online services. Since it’s much harder (or at least much more annoying) to update these products in case of bugs or new features, Microsoft invests a lot of time, effort, and money in making sure that the quality of its products is of a high standard before shipping.

What you can learn from world-class product organizations?  If the culture, views, and processes around testing and QA can vary so greatly at five of the biggest tech companies, then it may be true that there is no one right way of organizing testing efforts. All five have crafted their testing processes, choosing what fits best for them, and all five are highly successful. They must be doing something right, right?

Still, there are a few takeaways that can be derived from the stories above to apply to your testing strategy:

  1. There’s a “testing responsibility spectrum,” ranging from “We have dedicated testers that are primarily responsible for executing tests” to “Everybody is responsible for performing testing activities.” You should choose the one that best fits the skillset of your team.
  2. There is also a “testing importance spectrum,” ranging from “Nothing goes to production untested” to “We put everything in production, and then we test there, if at all.” Where your product and organization belong on this spectrum depends on the risks that will come with failure and how easy it is for you to roll back and fix problems when they emerge.
  3. Test automation has a significant presence in all five companies. The extent to which it is implemented differs, but all five employ tools to optimize their testing efforts. You probably should too.

Bottom line, QA is relevant and critical to the success of your product strategy. If you’d tried to implement a new QA process but failed, we can help.

 

 

Subscribe to the AKF Newsletter

Contact Us

Diagnosing and Fixing Software Development Performance

November 20, 2018  |  Posted By: Roger Andelin

Two software developers at a desk computer

Diagnosing the cause of poor performance from your engineering team is difficult and can be costly for the organization if done incorrectly.  Most everyone will agree that a high performing team is more desirable than a low performing team.  However, there is rarely agreement as to why teams are not performing well and how to help them improve performance.  For example, your CFO may believe the team does not have good project management and that more project management will improve the team’s performance.  Alternatively, the CEO may believe engineers are not working hard enough because they arrive to the office late.  The CMO may believe the team is simply bad and everyone needs to be replaced.

Often times, your CTO may not even know the root causes of poor performance or even recognize there is a performance problem until peers begin to complain.  However, there are steps an organization can take to uncover the root cause of poor performance quickly, present those findings to stakeholders for greater understanding, and take steps that will properly remove the impediments to higher performance.  Those steps may include some of the solutions suggested by others, but without a complete understanding of the problem, performance will not improve and incorrect remedies will often make the situation worse.  In other words, adding more project management does not always solve a problem with on time delivery, but it will add more cost and overhead.  Requiring engineers to start each day at 8 AM sharp may give the appearance that work is getting done, but it may not directly improve velocity.  Firing good engineers who face legitimate challenges to their performance may do irreversible harm to the organization.  For instance, it may appear arbitrary to others and create more fear in the department resulting in unwanted attrition. Taking improper action will make things worse rather than improve the situation.

How can you know what action to take to fix an engineering performance problem?  The first step in that process is to correctly define and agree upon what good performance looks like.  Good performance is comprised of two factors: velocity and value.

Velocity is defined as the speed at which the team works and value is defined as achievement of business goals.  Velocity is measured in story points which represent the amount of work completed.  Value is measured in business terms such as revenue, customer satisfaction or conversion.  High performing engineering teams work quickly and their work has a measurable impact on business goals.  High performing teams put as much focus on delivering a timely release as they do on delivering the right release to achieve a business goal.

Once you have agreement on the definition of good engineering performance, rate each of your engineering teams against the two criteria: velocity and value.  You may use a chart like the one below:

AKF Partners Velocity plus Business Value equals 10 xers which are high performers

Once each team has been rated, write down a narrative that justifies the rating.  Here are a few examples:

Bottom Left: Velocity and Value are Low

“My requests always seem to take a long time.  Even the most simple of requests takes forever.  And, when the team finally gets around to completing the request, often times there are problems in production once the release is completed.  These problems have negatively impacted customers’ confidence in us so not only are engineers not delivering value – they are eroding it!”

Upper/Middle Left: Velocity is Good and Value is Low

“The team does get stuff done.  Of course I’d like them to go faster, but generally speaking they are able to get things done in a reasonable amount of time.  However, I can’t say if they are delivering value – when we release something we are not tracking any business metrics so I have no way of knowing!”

Upper Right: Velocity is High and Value is High

“The team is really good.  They are tracking their velocity in story points and have goals to improve velocity.  They are already up 10% over last year.  Also, they instrument all their releases to measure business value.  They are actively working with product management to understand what value needs to be delivered and hypothesize with the stakeholders as to what features will be best to deliver the intended business goal.  This team is a pleasure to work with.”

Unknown Velocity and Unknown Value

“I don’t know how to rate this team.  I don’t know their velocity; its always changing and seems meaningless.  I think the team does deliver business value, but they are not measuring it so I cannot say if it is low or high.”

With narratives in hand it’s time to begin digging for more data to support or invalidate the ratings.

Diagnosing Velocity Problems

Engineering velocity is a function of time spent developing.  Therefore, the first question to answer is “what is the maximum amount of time my team is able to spend on engineering work under ideal conditions?”

This is a calculated value.  For example, start with a 40 hour work week.  Next, assuming your teams are following an Agile software development process, for each engineering role subtract out the time needed each week for meetings and other non-development work.  For individual contributors working in an Agile process that number is about 5 hours per week (for stand up, review, planning and retro).  For managers the number may be larger.  For each role on the team sum up the hours.  This is your ideal maximum.

Next, with the ideal maximum in hand, compare that to the actual achievement.  If your teams are not logging hours against their engineering tasks, they will need to do this in order to complete this exercise.  Evaluate the gap between the ideal maximum and the actual.  For example, if the ideal number is 280 hours and the team is logging 200 hours, then the gap is 80 hours.  You need to determine where that 80 hours is going and why.  Here are some potential problems to consider:

  1. Teams are spending extra time in planning meetings to refine requirements and evaluating effort.
  2. Team members are being interrupted by customer incidents which they are required to support.
  3. The team must support the weekly release process in addition to their other engineering tasks.
  4. Miscellaneous meeting are being called by stakeholders including project status meetings and updates.

As you dig into this gap it will become clear what needs to be fixed.  The results will probably surprise you.  For example, one client was faced with a software quality problem.  Determined to improve their software quality, the client added more quality engineers, built more unit tests, and built more automated system tests.  While there is nothing inherently wrong with this, it did not address the root cause of their poor quality: Rushing.  Engineers were spending about 3-4 hours per day on their engineering tasks.  Context switching, interruptions and unnecessary meetings eroded quality engineering time each day.  As a result, engineers rushing to complete their work tasks made novice mistakes.  Improving engineering performance required a plan for reducing engineering interruptions, unnecessary meetings, and enabling engineers to spend more uninterrupted time on their development tasks.

At another client, the frequency of production support incidents were impacting team velocity.  Engineers were being pulled away from their daily engineering tasks to work on problems in production.  This had gone on so long that while nobody liked it, they accepted it as normal.  It’s not normal!  Digging into the issue, the root cause was uncovered: The process for managing production incidents was ineffective.  Every incident was urgent and nearly every incident disrupted the engineering team.  To improve this, a triage process was introduced whereby each incident was classified and either assigned an urgent status (which would create an interruption for the team) or something lower which was then placed on the product backlog (no interruption for the team).  We also learned the old process (every incident was urgent) was in part a response to another velocity problem; stakeholders believed that unless something was considered urgent it would never get fixed by the engineering team.  By having an incident triage process, a procedure for when something would get fixed based on its urgency, the engineering team and the stakeholders solved this problem.

At AKF, we are experts at helping engineering teams improve efficiency, performance, fixing velocity problems, and improving value.  In many cases, the prescription for the team is not obvious.  Our consultants help company leaders uncover the root causes of their performance problems, establish vision and execute prescriptions that result in meaningful change.  Let us help you with your performance problems so your teams can perform at their best!

 

Subscribe to the AKF Newsletter

Contact Us

 1 2 3 >  Last ›

Categories:

Most Popular: