What's the difference between VMs & Containers?
May 29, 2019 | Posted By: Robin McGlothin
VMs vs Containers
Inefficiency and downtime have traditionally kept CTOs and IT decision makers up at night. Now, new challenges driven by infrastructure inflexibility and vendor lock-in are limiting technology organizations and making strategic decisions more complex than ever. Both VMs and containers can help you get the most out of available hardware and software resources while easing the risk of vendor lock-in.
Containers are the new kids on the block, but VMs have been, and continue to be, tremendously popular in data centers of all sizes. That said, the first lesson to learn is that containers are not virtual machines. When I was first introduced to containers, I thought of them as lightweight or trimmed-down virtual instances. The comparison made sense, since most advertising material leaned on the idea that containers use less memory and start much faster than virtual machines – basically marketing them as VMs. Everywhere I looked, Docker was comparing itself to VMs. No wonder I was a bit confused when I started to dig into the benefits of, and differences between, the two.
As containers have evolved, they have brought abstraction capabilities that are now being broadly applied to make enterprise IT more flexible. Thanks to the rise of Docker containers, it's now possible to move workloads between different versions of Linux more easily, as well as orchestrate containers to create microservices. Much like containers, the microservice is not a new idea either; the concept harkens back to service-oriented architecture (SOA). What is different is that microservices based on containers are more granular and simpler to manage. More on that topic in a blog post for another day!
If you're looking for the best solution for running your own services in the cloud, you need to understand these virtualization technologies, how they compare to each other, and the best uses for each. Here's our quick read.
VMs vs. Containers – What's the real scoop?
One way to think of containers vs. VMs is that while VMs run several different operating systems on one server, container technology offers the opportunity to virtualize the operating system itself.
Figure 1 – Virtual Machine | Figure 2 – Container
VMs help reduce expenses. Instead of running an application on a single dedicated server, a virtual machine lets one physical resource do the job of many, so you do not have to buy, maintain, and service several servers. Because there is one host machine, you can efficiently manage all the virtual environments with a centralized tool – the hypervisor. The decision to use VMs is typically made by the DevOps/infrastructure team.
Containers help reduce expenses as well, and they are remarkably lightweight and fast to launch. Because of their small size, you can quickly scale containers in and out and add identical containers as needed.
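To make the "fast to launch, easy to scale" point concrete, here is a minimal sketch using the Python Docker SDK (the `docker` package). It assumes a local Docker daemon is running; the image, container names, and replica count are illustrative, not a prescription.

```python
# Minimal scale-out/scale-in sketch with the Python Docker SDK (pip install docker).
# Assumes a running Docker daemon; image and names are illustrative.
import docker

client = docker.from_env()

# Launch several identical containers in seconds -- the quick scale-out described above.
replicas = [
    client.containers.run("nginx:alpine", detach=True, name=f"web-{i}")
    for i in range(3)
]
for c in replicas:
    print(c.name, c.status)

# Scaling back in is just as fast: stop and remove the extra containers.
for c in replicas[1:]:
    c.stop()
    c.remove()
```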
Containers are excellent for Continuous Integration and Continuous Deployment (CI/CD) implementations. They foster collaborative development by letting developers distribute and merge images, so developers tend to favor containers over VMs. Most importantly, if the two teams (DevOps and development) work together, the decision on which technology to apply – VMs or containers – can be made collaboratively, with the best overall benefit to the product, client, and company.
What are VMs?
With VMs, guest operating systems and their applications share hardware resources from a single host server, or from a pool of host servers. Each VM requires its own underlying OS, and the hardware is virtualized. A hypervisor, or virtual machine monitor, is software, firmware, or hardware that creates and runs VMs. It sits between the hardware and the virtual machines and is necessary to virtualize the server.
IT departments, both large and small, have embraced virtual machines to lower costs and increase efficiencies. However, VMs can take up a lot of system resources because each VM needs a full copy of an operating system AND a virtual copy of all the hardware that the OS needs to run. This quickly adds up to a lot of RAM and CPU cycles. While that is still more economical than bare metal for some applications, for others it is overkill – and thus, containers enter the scene.
Benefits of VMs
• Reduced hardware costs from server virtualization
• Multiple OS environments can exist simultaneously on the same machine, isolated from each other.
• Easy maintenance, application provisioning, availability and convenient recovery.
• Perhaps the greatest benefit of server virtualization is the capability to move a virtual machine from one server to another quickly and safely. Backing up critical data is done quickly and effectively because you can effortlessly create a replication site.
Popular VM Providers
• VMware vSphere ESXi – VMware has been active in the virtualization space since 1998 and is an industry leader setting standards for reliability, performance, and support.
• Oracle VM VirtualBox - Not sure what operating systems you are likely to use? Then VirtualBox is a good choice because it supports an amazingly wide selection of host and client combinations. VirtualBox is powerful, comes with terrific features and, best of all, it’s free.
• Xen - Xen is the open source hypervisor included in the Linux kernel and, as such, it is available in all Linux distributions. The Xen Project is one of the many open source projects managed by the Linux Foundation.
• Hyper-V – Microsoft's virtualization platform, or 'hypervisor', which enables administrators to make better use of their hardware by running multiple virtualized operating systems on the same physical server simultaneously.
• KVM - Kernel-based Virtual Machine (KVM) is an open source virtualization technology built into Linux. Specifically, KVM lets you turn Linux into a hypervisor that allows a host machine to run multiple, isolated virtual environments called guests or virtual machines (VMs).
What are Containers?
Containers are a way to wrap up an application into its own isolated "box". The application in its container has no knowledge of any other applications or processes that exist outside of its box. Everything the application depends on to run successfully also lives inside this container. Wherever the box may move, the application will always be satisfied because it is bundled up with everything it needs to run.
Containers virtualize the OS instead of virtualizing the underlying computer like a virtual machine does. They sit on top of a physical server and its host OS — typically Linux or Windows. Each container shares the host OS kernel and, usually, the binaries and libraries, too. Shared components are read-only. Sharing OS resources such as libraries significantly reduces the need to reproduce operating system code and means that a server can run multiple workloads with a single operating system installation. Containers are thus exceptionally light — they are only megabytes in size and take just seconds to start. VMs, by comparison, take minutes to boot and are an order of magnitude larger than an equivalent container.
In contrast to VMs, all that a container requires is enough of an operating system, supporting programs and libraries, and system resources to run a specific program. This means you can put two to three times as many applications on a single server with containers as you can with VMs. In addition, with containers you can create a portable, consistent operating environment for development, testing, and deployment – keeping those environments consistent is a huge benefit.
Containers isolate processes from one another using operating system namespaces and storage separation. Leveraging native operating system capabilities, a container gets its own process space, can be given temporary file systems, can have its "root" file system relocated, and so on.
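To illustrate the kind of kernel primitive this relies on, here is a rough sketch using the Linux `unshare` utility from Python. It assumes a Linux host with util-linux installed and sufficient privileges; it is not a container runtime, just the namespace mechanism container engines build on.

```python
# Illustrative only: run a shell in new PID and mount namespaces. Inside, "ps"
# sees only its own process tree -- much like a process inside a container.
# Requires Linux, util-linux's `unshare`, and typically root privileges.
import subprocess

subprocess.run(
    ["unshare", "--fork", "--pid", "--mount-proc", "sh", "-c", "ps -ef"],
    check=True,
)
```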
Benefits of Containers
One of the biggest advantages of a container is that you can set aside fewer resources per container than you might per virtual machine. Keep in mind, containers are essentially for a single application, while virtual machines need resources to run an entire operating system. For example, if you need to run multiple instances of MySQL, NGINX, or other services, using containers makes a lot of sense. If, however, you need a full web server (LAMP) stack running on its own server, there is a lot to be said for running a virtual machine. A virtual machine gives you greater flexibility to choose your operating system and upgrade it as you see fit. A container, by contrast, is isolated from OS upgrades on the host, so the container running the configured application is unaffected by them.
Popular Container Providers
1. Docker - Nearly synonymous with containerization, Docker is the name of both the world’s leading containerization platform and the company that is the primary sponsor of the Docker open source project.
2. Kubernetes - Google’s most significant contribution to the containerization trend is the open source containerization orchestration platform it created.
3. Microsoft - Although much of the early work on containers was done on the Linux platform, Microsoft has fully embraced both Docker and Kubernetes, and containerization in general. Azure offers two container orchestrators, Azure Kubernetes Service (AKS) and Azure Service Fabric. Service Fabric represents the next-generation platform for building and managing these enterprise-class, tier-1 applications running in containers.
4. Amazon Web Services - Of course, Microsoft and Google aren't the only vendors offering a cloud-based container service. Amazon Web Services (AWS) has its own EC2 Container Service (ECS).
5. IBM - Like the other major public cloud vendors, IBM Bluemix also offers a Docker-based container service.
6. Red Hat - One of the early proponents of container technology, Red Hat claims to be “the second largest contributor to the Docker and Kubernetes codebases,” and it is also part of the Open Container Initiative and the Cloud Native Computing Foundation. Its flagship container product is its OpenShift platform as a service (PaaS), which is based on Docker and Kubernetes.
Uses for VMs vs Uses for Containers
Both containers and VMs have benefits and drawbacks, and the ultimate decision will depend on your specific needs, but there are some general rules of thumb.
• VMs are a better choice for running applications that require all of the operating system's resources and functionality, when you need to run multiple applications on the same servers, or when you have a wide variety of operating systems to manage.
• Containers are a better choice when your biggest priority is maximizing the number of applications running on a minimal number of servers.
Because of their small size and application orientation, containers are well suited for agile delivery environments and microservice-based architectures. When you use containers and microservices, however, you can easily have hundreds or thousands of components in your environment. You may be able to manually manage a few dozen virtual machines or physical servers, but there is no way you can manage a production-scale container environment without automation. The task of automating and managing a large number of containers and how they interact is known as orchestration.
Scaling containerized workloads is a completely different process from scaling VM workloads. Modern containers include only the basic services their functions require, but one of them can be a web server, such as NGINX, which also acts as a load balancer. An orchestration system such as Kubernetes can determine, based on traffic patterns, when the number of containers needs to scale out; it can replicate container images automatically and then remove them from the system when they are no longer needed.
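As a purely illustrative sketch of that scale-out decision (this is not the actual Kubernetes autoscaler API; the function, thresholds, and numbers below are assumptions for the example), the core idea is simply sizing the replica count to observed traffic:

```python
# Toy autoscaling decision: size the container replica count to observed traffic.
# Real orchestrators (e.g., the Kubernetes Horizontal Pod Autoscaler) apply the
# same idea against live metrics; all names and thresholds here are made up.
import math

def desired_replicas(requests_per_sec: float,
                     target_rps_per_replica: float = 100.0,
                     min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Return how many container replicas the observed traffic calls for."""
    needed = math.ceil(requests_per_sec / target_rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(850))   # traffic spike  -> 9 replicas
print(desired_replicas(120))   # quiet period   -> 2 replicas
```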
For most, the ideal setup is likely to include both. With the current state of virtualization technology, the flexibility of VMs and the minimal resource requirements of containers work together to provide environments with maximum functionality.
If your organization is running many instances of the same operating system, then you should look into whether containers are a good fit. They just might save you significant time and money over VMs.
Results and Outcomes – Why Companies Fail
April 21, 2019 | Posted By: Pete Ferguson
Results = Results
Apple, Google, and Amazon don't exist based on a Utopian promise of what is to come – though certainly those promises keep their customers engaged and hopeful for the future. These companies exist because of the value they have delivered to date and the expectation they have created in us as consumers of a consistent result.
I'm amazed at how simple a concept Results = Results is – yet we constantly see companies struggle with it. It is a recurring theme in our 2-3 day workshops with clients and something we look for in our technical due diligence reviews.
As a corporate survivor of 18 years, looking back I can see where I was distracted by day-to-day meetings, firefighting, and getting hijacked by initiatives that seemed urgent to some senior leader somewhere – but were not really all that important.
Suddenly the quarter or half was over, it was time to write a self-evaluation, and I'd realize that all the effort, all the stress, and all the work wasn't producing the results I'd committed to earlier in the year – and I'd have to quickly shuffle and focus on getting stuff done.
While keeping the lights on is important, it diminishes in importance when doing so comes at the expense of innovating and adding value for our customers rather than just struggling to maintain the status quo.
Outcomes and Key Results (OKRs)
Adapted from John Doerr's "Objectives and Key Results" – at AKF we find it more to the point to focus on "outcomes." Objectives (definition: a thing aimed at or sought) are a path, whereas outcomes are a destination defined clearly enough to know when you have arrived.
Outcomes are the only things that matter to our customers. Hearing about a desired Utopian state is great and may excite customers enough to stick around for a while and put up with current limitations or lack of functionality – but being able to clearly show that you have delivered an outcome, and its value, to your customers is money in the bank and puts you ahead of your competition.
Yet many of our clients have teams that have been so focused on cost-cutting for so many years that they leave a wide-open berth for young startups and the competition to move in and start delivering better outcomes for the customer.
How to Focus on Results and Outcomes
It is easy to become distracted by the day-to-day meetings, incident escalations, postmortems, etc. As an outside third party, however, it is usually blatantly obvious to us within the first hour of meeting with a new team whether or not they are properly focused.
Here are some of the common themes and questions to ask:
- Is there effective monitoring to discover issues before our customers do?
- Do we monitor business metrics and weigh the success (and failure) of initiatives based not on pushing out a new platform or product but whether or not there was significant ROI?
- How much time is spent limping along to keep a legacy application up and running vs. innovating?
- Do we continually push off hardware/software upgrades until we are held hostage by compliance and/or end-of-life serviceability by the vendor?
Hopefully the common theme here is obvious – what is the customer experience, and how focused are we on customers versus internal castle building and day-to-day distractions?
Recently, in a team interview, an IT "keep the lights on" team told us they were working to be strategic and innovative by hiring new interns. While the younger generations are definitely less prone to accepting the status quo, the older generation was conceding that they don't want to be part of the future. Unfortunately, they may not be part of it sooner than planned if they don't grasp their role in driving innovation and the importance of applying their institutional knowledge.
Not focusing on customer/shareholder related outcomes means that shareholders and customers are negatively impacted. Here are a few problems with the associated outcomes I’ve seen in my short tenure with AKF and previously as a corporate crusader:
Monolithic applications to save costs. Why do organizations do it? Short-term cost savings: development is concentrated on one application, and each team only has to focus on its own area. The downsides:
- One failure means everyone fails.
- Organizations are unable to scale vis-a-vis Conway’s Law (organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations).
- Often the teams who develop the monolith don’t have to support it, so they don’t understand why it is a problem.
- Teams become very focused on solving the problems caused by the monolith just long enough to get it back up and running but fail to see the long-term recurrent loss to the business and wasted hours that could have been spent on innovating new products and services.
- Catastrophic failure – Intuit pre-SaaS, early renditions of iTunes with annual outages when everyone tried to redeem gift cards on Christmas morning, the early days of eBay; stay tuned, many more yet to come.
Ongoing cost cutting to “make the quarter.”
- Missed tech refreshes result in machines and operating systems that are no longer supported and are vulnerable to external attacks.
- Teams become hyper-focused on shutting down additional spending but never take the time to calculate how much effort is wasted keeping the lights on for aging systems with declining market share or a slowing rate of new customer adoption.
- Teams start saying no to the customer based on cost, opening the door for new upstarts and the competition to take away market share.
Focusing efforts on Sales Department’s latest contract.
- Too much investment in legacy applications instead of innovating new products.
- “A-team” developers become firefighters to keep customers happy.
- Sales team creates moral hazards for development teams (i.e. “I smoke, but you get lung cancer” - teams create problems for other teams to fix instead of owning the end-to-end lifecycle of a product)
Focus is on mergers and acquisitions instead of core strengths and products.
- Distracted organizations give way for upstarts and competition.
- The company becomes okay, or maybe even good, at a lot of things, but not great at any one or two things.
- Company culture becomes very fragmented and silos create red tape that slows or stifles innovation.
Results = Results. And nothing else equals results.
If OKRs are not measuring the results needed to compete and win, then teams are wasting a lot of effort, time, and money and the competition is getting a free pass to innovate and outperform your ability to delight and please your customers.
Need an outside view of your organization to help drive better results and outcomes? Contact us!
The AKF Difference
December 4, 2018 | Posted By: Marty Abbott
During the last 12 years, many prospective clients have asked us some variation of the following questions: “What makes you different?”, “Why should we consider hiring you?”, or “How are you differentiated as a firm?”.
The answer has many components. Sometimes our answers are clear indications that we are NOT the right firm for you. Here are the reasons you should, or should not, hire AKF Partners:
Operators and Executives – Not Consultants
Most technology consulting firms are largely comprised of employees who have only been consultants or have only run consulting companies. We’ve been in your shoes as engineers, managers and executives. We make decisions and provide advice based on practical experience with living with the decisions we’ve made in the past.
Engineers – Not Technicians
Educational institutions haven’t graduated enough engineers to keep up with demand within the United States for at least forty years. To make up for the delta between supply and demand, technical training services have sprung up throughout the US to teach people technical skills in a handful of weeks or months. These technicians understand how to put building blocks together, but they are not especially skilled in how to architect highly available, low latency, low cost to develop and operate solutions.
The largest technology consulting companies are built around programs that hire employees with non-technical college degrees. These companies then teach these employees internally using “boot camps” – creating their own technicians.
Our company is comprised almost entirely of “engineers”; employees with highly technical backgrounds who understand both how and why the “building blocks” work as well as how to put those blocks together.
Product – Not “IT”
Most technology consulting firms are comprised of consultants who have a deep understanding of employee-facing "Information Technology" solutions. These companies are great at helping you implement packaged software or SaaS solutions such as Enterprise Resource Planning systems, Customer Relationship Management systems and the like. Put bluntly, these companies help you with solutions that you see as a cost center in your business. While we've helped some partners who refuse to use anyone else with these systems, it's not our focus and not where we consider ourselves to be differentiated.
Very few firms have experience building complex product (revenue generating) services and platforms online. Products (not IT) represent nearly all of AKF’s work and most of AKF’s collective experience as engineers, managers and executives within companies. If you want back-office IT consulting help focused on employee productivity there are likely better firms with which you can work. If you are building a product, you do not want to hire the firms that specialize in back office IT work.
Business First – Not Technology First
Products only exist to further the needs of customers and, through that relationship, further the needs of the business. We take a business-first approach in all our engagements, seeking to answer questions such as: Can we help find a way to build it faster, better, or cheaper? Can we find ways to make it respond to customers faster, be more highly available, or be more scalable? We are technology agnostic and believe that of the several "right" solutions for a company, a small handful will emerge displaying comparatively low cost, fast time to market, appropriate availability, scalability, appropriate quality, and low cost of operations.
Cure the Disease – Don’t Just Treat the Symptoms
Most consulting firms will gladly help you with your technology needs but stop short of solving the underlying causes creating those needs: the skills, focus, processes, or organizational construction of your product team. The reason for this is obvious: most consulting companies are betting that if the causes aren't fixed, you will need them back again in the future.
At AKF Partners, we approach things differently. We believe that we have failed if we haven’t helped you solve the reasons why you called us in the first place. To that end, we try to find the source of any problem you may have. Whether that be missing skillsets, the need for additional leadership, organization related work impediments, or processes that stand in the way of your success – we will bring these causes to your attention in a clear and concise manner. Moreover, we will help you understand how to fix them. If necessary, we will stay until they are fixed.
We recognize that in taking the above approach, you may not need us back. Our hope is that you will instead refer us to other clients in the future.
Are We “Right” for You?
That’s a question for you, not for us, to answer. We don’t employ sales people who help “close deals” or “shape demand”. We won’t pressure you into making a decision or hound you with multiple calls. We want to work with clients who “want” us to partner with them – partners with whom we can join forces to create an even better product solution.
The Importance of QA
November 20, 2018 | Posted By: Robin McGlothin
“Quality in a service or product is not what you put into it. It’s what the customer gets out of it.” Peter Drucker
The Importance of QA
High levels of quality are essential to achieving company business objectives. Quality can be a competitive advantage and in many cases will be table stakes for success. High quality is not just an added value, it is an essential basic requirement. With high market competition, quality has become the market differentiator for almost all products and services.
There are many methods followed by organizations to achieve and maintain the required level of quality. So, let’s review how world-class product organizations make the most out of their QA roles. But first, let’s define QA.
According to Wikipedia, quality assurance is "a way of preventing mistakes or defects in products and avoiding problems when delivering solutions or services to customers." But there's much more to quality assurance.
There are numerous benefits of having a QA team in place:
- Helps increase productivity while decreasing costs (QA headcount typically costs less)
- Effective for saving costs by detecting and fixing issues and flaws before they reach the client
- Shifts focus from detecting issues to issue prevention
Teams and organizations looking to get serious about (or to further improve) their software testing efforts can learn something from looking at how the industry leaders organize their testing and quality assurance activities. It stands to reason that companies such as Google, Microsoft, and Amazon would not be as successful as they are without paying proper attention to the quality of the products they’re releasing into the world. Taking a look at these software giants reveals that there is no one single recipe for success. Here is how five of the world’s best-known product companies organize their QA and what we can learn from them.
Google: Searching for best practices
How does the company responsible for the world’s most widely used search engine organize its testing efforts? It depends on the product. The team responsible for the Google search engine, for example, maintains a large and rigorous testing framework. Since search is Google’s core business, the team wants to make sure that it keeps delivering the highest possible quality, and that it doesn’t screw it up.
To that end, Google employs a four-stage testing process for changes to the search engine, consisting of:
- Testing by dedicated, internal testers (Google employees)
- Further testing on a crowdtesting platform
- “Dogfooding,” which involves having Google employees use the product in their daily work
- Beta testing, which involves releasing the product to a small group of Google product end users
Even though this seems like a solid testing process, there is room for improvement, if only because communication between the different stages and the people responsible for them is suboptimal (leading to things being tested either twice over or not at all).
But the teams responsible for Google products that are further away from the company's core business employ a much less strict QA process. In some cases, the only testing is done by the developer responsible for a specific product, with no dedicated testers providing a safety net.
In any case, Google takes testing very seriously. In fact, testers’ and developers’ salaries are equal, something you don’t see very often in the industry.
Facebook: Developer-driven testing
Like Google, Facebook uses dogfooding to make sure its software is usable. Furthermore, it is somewhat notorious for shaming developers who mess things up (breaking a build or causing the site to go down by accident, for example) by posting a picture of the culprit wearing a clown nose on an internal Facebook group. No one wants to be seen on the wall-of-shame!
Facebook recognizes that there are significant flaws in its testing process, but rather than going to great lengths to improve, it simply accepts the flaws, since, as they say, “social media is nonessential.” Also, focusing less on testing means that more resources are available to focus on other, more valuable things.
Rather than testing its software through and through, Facebook tends to use “canary” releases and an incremental rollout strategy to test fixes, updates, and new features in production. For example, a new feature might first be made available only to a small percentage of the total number of users.
Canary Incremental Rollout
By tracking the usage of the feature and the feedback received, the company decides either to increase the rollout or to disable the feature, either improving it or discarding it altogether.
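As a hedged sketch of how such a percentage-based rollout can work (the function, feature name, and bucketing scheme below are illustrative assumptions, not Facebook's implementation), users are deterministically bucketed so the same user always gets the same answer while the percentage is dialed up or down:

```python
# Illustrative canary/percentage rollout: deterministically bucket users so a
# feature can be enabled for, say, 5% of them and widened -- or rolled back --
# as usage and feedback come in. Names and thresholds are made up.
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000       # stable bucket in [0, 9999]
    return bucket < percent * 100           # e.g. percent=5.0 -> buckets 0..499

print(in_rollout("user-42", "new-news-feed", percent=5.0))
```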
Amazon: Deployment comes first
Like Facebook, Amazon does not have a large QA infrastructure in place. It has even been suggested (at least in the past) that Amazon does not value the QA profession. Its ratio of about one test engineer to every seven developers also suggests that testing is not considered an essential activity at Amazon.
The company itself, though, takes a different view of this. To Amazon, the ratio of testers to developers is an output variable, not an input variable. In other words, as soon as it notices that revenue is decreasing or customers are moving away due to anomalies on the website, Amazon increases its testing efforts.
The feeling at Amazon is that its development and deployment processes are so mature (the company famously deploys software every 11.6 seconds!) that there is no need for elaborate and extensive testing efforts. It is all about making software easy to deploy, and, equally if not more important, easy to roll back in case of a failure.
Spotify: Squads, tribes and chapters
Spotify does employ dedicated testers. They are part of cross-functional teams, each with a specific mission. At Spotify, employees are organized according to what’s become known as the Spotify model, constructed of:
- Squads. A squad is basically the Spotify take on a Scrum team, with less focus on practices and more on principles. A Spotify dictum says, “Rules are a good start, but break them when needed.” Some squads might have one or more testers, and others might have no testers at all, depending on the mission.
- Tribes are groups of squads that belong together based on their business domain. Any tester that’s part of a squad automatically belongs to the overarching tribe of that squad.
- Chapters. Across different squads and tribes, Spotify also uses chapters to group people that have the same skillset, in order to promote learning and sharing experiences. For example, all testers from different squads are grouped together in a testing chapter.
- Guilds. Finally, there is the concept of a guild. A guild is a community of members with shared interests. These are a group of people across the organization who want to share knowledge, tools, code and practices.
Spotify Team Structure
Testing at Spotify is taken very seriously. Just like programming, testing is considered a creative process, and something that cannot be (fully) automated. Contrary to most other companies mentioned, Spotify heavily relies on dedicated testers that explore and evaluate the product, instead of trying to automate as much as possible. One final fact: In order to minimize the efforts and costs associated with spinning up and maintaining test environments, Spotify does a lot of testing in its production environment.
Microsoft: Engineers and testers are one
Microsoft’s ratio of testers to developers is currently around 2:3, and like Google, Microsoft pays testers and developers equally—except they aren’t called testers; they’re software development engineers in test (or SDETs).
The high ratio of testers to developers at Microsoft is explained by the fact that a very large chunk of the company’s revenue comes from shippable products that are installed on client computers & desktops, rather than websites and online services. Since it’s much harder (or at least much more annoying) to update these products in case of bugs or new features, Microsoft invests a lot of time, effort, and money in making sure that the quality of its products is of a high standard before shipping.
What can you learn from world-class product organizations? If the culture, views, and processes around testing and QA can vary so greatly at five of the biggest tech companies, then it may be true that there is no one right way of organizing testing efforts. All five have crafted their testing processes, choosing what fits best for them, and all five are highly successful. They must be doing something right, right?
Still, there are a few takeaways that can be derived from the stories above to apply to your testing strategy:
- There’s a “testing responsibility spectrum,” ranging from “We have dedicated testers that are primarily responsible for executing tests” to “Everybody is responsible for performing testing activities.” You should choose the one that best fits the skillset of your team.
- There is also a “testing importance spectrum,” ranging from “Nothing goes to production untested” to “We put everything in production, and then we test there, if at all.” Where your product and organization belong on this spectrum depends on the risks that will come with failure and how easy it is for you to roll back and fix problems when they emerge.
- Test automation has a significant presence in all five companies. The extent to which it is implemented differs, but all five employ tools to optimize their testing efforts. You probably should too (a minimal example follows this list).
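For instance, a minimal automated check might look like the following pytest sketch; the function under test and its rules are invented purely for illustration.

```python
# Minimal pytest example of an automated check. The function under test is
# invented for this sketch; run with `pytest`.
import pytest

def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount_happy_path():
    assert apply_discount(100.0, 20) == 80.0

def test_apply_discount_rejects_bad_input():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```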
Bottom line, QA is relevant and critical to the success of your product strategy. If you've tried to implement a new QA process but failed, we can help.
Diagnosing and Fixing Software Development Performance
November 20, 2018 | Posted By: Roger Andelin
Diagnosing the cause of poor performance from your engineering team is difficult and can be costly for the organization if done incorrectly. Most everyone will agree that a high performing team is more desirable than a low performing team. However, there is rarely agreement as to why teams are not performing well and how to help them improve performance. For example, your CFO may believe the team does not have good project management and that more project management will improve the team’s performance. Alternatively, the CEO may believe engineers are not working hard enough because they arrive to the office late. The CMO may believe the team is simply bad and everyone needs to be replaced.
Oftentimes, your CTO may not even know the root causes of poor performance, or even recognize there is a performance problem, until peers begin to complain. However, there are steps an organization can take to uncover the root cause of poor performance quickly, present those findings to stakeholders for greater understanding, and properly remove the impediments to higher performance. Those steps may include some of the solutions suggested by others, but without a complete understanding of the problem, performance will not improve and incorrect remedies will often make the situation worse. In other words, adding more project management does not always solve a problem with on-time delivery, but it will add more cost and overhead. Requiring engineers to start each day at 8 AM sharp may give the appearance that work is getting done, but it may not directly improve velocity. Firing good engineers who face legitimate challenges to their performance may do irreversible harm to the organization; for instance, it may appear arbitrary to others and create more fear in the department, resulting in unwanted attrition. Taking improper action will make things worse rather than improve the situation.
How can you know what action to take to fix an engineering performance problem? The first step in that process is to correctly define and agree upon what good performance looks like. Good performance is comprised of two factors: velocity and value.
Velocity is defined as the speed at which the team works and value is defined as achievement of business goals. Velocity is measured in story points which represent the amount of work completed. Value is measured in business terms such as revenue, customer satisfaction or conversion. High performing engineering teams work quickly and their work has a measurable impact on business goals. High performing teams put as much focus on delivering a timely release as they do on delivering the right release to achieve a business goal.
Once you have agreement on the definition of good engineering performance, rate each of your engineering teams against the two criteria: velocity and value. You may use a chart like the one below:
Once each team has been rated, write down a narrative that justifies the rating. Here are a few examples:
Bottom Left: Velocity and Value are Low
“My requests always seem to take a long time. Even the most simple of requests takes forever. And, when the team finally gets around to completing the request, often times there are problems in production once the release is completed. These problems have negatively impacted customers’ confidence in us so not only are engineers not delivering value – they are eroding it!”
Upper/Middle Left: Velocity is Good and Value is Low
“The team does get stuff done. Of course I’d like them to go faster, but generally speaking they are able to get things done in a reasonable amount of time. However, I can’t say if they are delivering value – when we release something we are not tracking any business metrics so I have no way of knowing!”
Upper Right: Velocity is High and Value is High
“The team is really good. They are tracking their velocity in story points and have goals to improve velocity. They are already up 10% over last year. Also, they instrument all their releases to measure business value. They are actively working with product management to understand what value needs to be delivered and hypothesize with the stakeholders as to what features will be best to deliver the intended business goal. This team is a pleasure to work with.”
Unknown Velocity and Unknown Value
“I don’t know how to rate this team. I don’t know their velocity; it’s always changing and seems meaningless. I think the team does deliver business value, but they are not measuring it so I cannot say if it is low or high.”
With narratives in hand it’s time to begin digging for more data to support or invalidate the ratings.
Diagnosing Velocity Problems
Engineering velocity is a function of time spent developing. Therefore, the first question to answer is “what is the maximum amount of time my team is able to spend on engineering work under ideal conditions?”
This is a calculated value. For example, start with a 40 hour work week. Next, assuming your teams are following an Agile software development process, for each engineering role subtract out the time needed each week for meetings and other non-development work. For individual contributors working in an Agile process that number is about 5 hours per week (for stand up, review, planning and retro). For managers the number may be larger. For each role on the team sum up the hours. This is your ideal maximum.
Next, with the ideal maximum in hand, compare it to the actual achievement. If your teams are not logging hours against their engineering tasks, they will need to start in order to complete this exercise. Evaluate the gap between the ideal maximum and the actual. For example, if the ideal number is 280 hours and the team is logging 200 hours, then the gap is 80 hours. You need to determine where that 80 hours is going and why (a quick sketch of the arithmetic follows the list below). Here are some potential problems to consider:
- Teams are spending extra time in planning meetings to refine requirements and evaluating effort.
- Team members are being interrupted by customer incidents which they are required to support.
- The team must support the weekly release process in addition to their other engineering tasks.
- Miscellaneous meetings are being called by stakeholders, including project status meetings and updates.
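Here is the back-of-the-envelope sketch referenced above. The team size, ceremony overhead, and logged hours are illustrative assumptions chosen to match the example numbers in this post.

```python
# Back-of-the-envelope velocity-gap calculation. Team composition, overhead
# hours, and logged hours are assumptions matching the example above.
WORK_WEEK_HOURS = 40

# Weekly non-development overhead per person (stand-up, planning, review, retro).
team_overhead = {f"engineer-{i}": 5 for i in range(1, 9)}   # 8 individual contributors

ideal_max = sum(WORK_WEEK_HOURS - oh for oh in team_overhead.values())   # 8 * 35 = 280
logged_hours = 200   # hours the team actually logged against engineering tasks

gap = ideal_max - logged_hours
print(f"Ideal maximum: {ideal_max}h, logged: {logged_hours}h, gap to explain: {gap}h")
```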
As you dig into this gap it will become clear what needs to be fixed. The results will probably surprise you. For example, one client was faced with a software quality problem. Determined to improve their software quality, the client added more quality engineers, built more unit tests, and built more automated system tests. While there is nothing inherently wrong with this, it did not address the root cause of their poor quality: Rushing. Engineers were spending about 3-4 hours per day on their engineering tasks. Context switching, interruptions and unnecessary meetings eroded quality engineering time each day. As a result, engineers rushing to complete their work tasks made novice mistakes. Improving engineering performance required a plan for reducing engineering interruptions, unnecessary meetings, and enabling engineers to spend more uninterrupted time on their development tasks.
At another client, the frequency of production support incidents was impacting team velocity. Engineers were being pulled away from their daily engineering tasks to work on problems in production. This had gone on so long that, while nobody liked it, they accepted it as normal. It's not normal! Digging into the issue, the root cause was uncovered: the process for managing production incidents was ineffective. Every incident was urgent, and nearly every incident disrupted the engineering team. To improve this, a triage process was introduced whereby each incident was classified and either assigned an urgent status (which would create an interruption for the team) or something lower, which was then placed on the product backlog (no interruption for the team). We also learned the old process (every incident was urgent) was in part a response to another velocity problem: stakeholders believed that unless something was considered urgent it would never get fixed by the engineering team. By having an incident triage process – a procedure for when something would get fixed based on its urgency – the engineering team and the stakeholders solved this problem.
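As an illustration of that kind of triage rule (the severity criteria, names, and routing below are assumptions for the sketch, not the client's actual process), the key is that only genuinely urgent incidents interrupt the team while everything else goes to the backlog:

```python
# Illustrative incident triage: only genuinely urgent incidents interrupt the
# team; everything else becomes planned backlog work. Criteria are made up.
from dataclasses import dataclass

@dataclass
class Incident:
    id: str
    customer_impact: bool
    revenue_impact: bool

def triage(incident: Incident) -> str:
    """Classify an incident as an interruption or planned backlog work."""
    if incident.customer_impact and incident.revenue_impact:
        return "urgent: interrupt the team now"
    if incident.customer_impact:
        return "high: schedule into the current sprint"
    return "normal: add to the product backlog"

print(triage(Incident("INC-101", customer_impact=True, revenue_impact=True)))
print(triage(Incident("INC-102", customer_impact=False, revenue_impact=False)))
```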
At AKF, we are experts at helping engineering teams improve efficiency and performance, fix velocity problems, and increase the value they deliver. In many cases, the prescription for the team is not obvious. Our consultants help company leaders uncover the root causes of their performance problems, establish vision, and execute prescriptions that result in meaningful change. Let us help you with your performance problems so your teams can perform at their best!
A Financial Analogy for Technical Debt
October 12, 2018 | Posted By: Bill Armelin
Understanding Technical Debt
During the course of our client engagements, there are a few common topics or themes that are always discussed, and the clients themselves usually introduce them. One such area is technical debt. Every team has it, every team believes they have too much of it, and every team struggles to explain why it’s important to address it.
Let's start by defining what technical debt means. It is the difference between doing something the "desired" or "best" way and doing something quickly (i.e., to reduce time to market). The difference results in the company taking on "debt" within the solution. Technical debt requires acting with forethought: you only assume technical debt knowingly and by commission. Acts of omission (forgetting to plan or do something) do not count as debt. Our partners in business may think we are hiding the truth if we do not clearly delineate the difference between debt (known assumptions) and mistakes, failures, or other issues related to maintenance.
The following list provides examples of things that are not tech debt:
- Software defects (unless we decide to NOT fix them for an extended period of time – but defects are still human failures – not debt.)
- Failures in design that are not previously tagged as debt.
- Failures to identify scalability bottlenecks.
- Poor choices in technology components that fail to scale.
- Failure to properly identify infrastructure failures, or high failure rates of vendors in infrastructure.
A Financial Analogy for Tech Debt
When you hear the words “technical debt”, it invokes a negative connotation. However, the judicious use of tech debt is a valuable addition to your product development process. Tech debt is analogous to financial debt. Companies can raise capital to grow their business by either issuing equity or issuing debt. Issuing equity means giving up a percentage of ownership in the company and dilutes current shareholder value. Issuing debt requires the payment of interest but does not give up ownership or dilute shareholder value. Issuing debt is good, until you can’t service it. Once you have too much debt and cannot pay the interest, you are in trouble.
Tech debt operates in the same manner. Companies use tech debt to defer performing work on a product. As we develop our minimum viable product, we build a prototype, gather feedback from the market, and iterate. The parts of the product that didn’t meet the definition of minimum or the decisions/shortcuts made during development represent the tech debt that was taken on to get to the MVP. This is the debt that we must service in later iterations. In fact, our definition of done must include the servicing of the resulting tech debt. Taking on tech debt early can pay big dividends by getting your product to market faster. However, like financial debt, you must service the interest. If you don’t, you will begin to see scalability and availability issues. At that point, refactoring the debt becomes more difficult and time critical. It begins to affect your customers’ experience.
Many development teams have a hard time convincing leadership that technical debt is a worthy use of their time. Why spend time refactoring something that already “works” when you could use that time to build new features customers and markets are demanding now? The danger with this philosophy is that by the time technical debt manifests itself into a noticeable customer problem, it’s often too late to address it without a major undertaking. It’s akin to not having a disaster recovery plan when a major availability outage strikes. To get the business on-board, you must make the case using language business leaders understand – again this is often financial in nature. Be clear about the cost of such efforts and quantify the business value they will bring by calculating their ROI. Demonstrate the cost avoidance that is achieved by addressing critical debt sooner rather than later - calculate how much cost would be in the future if the debt is not addressed now. The best practice is to get leadership to agree and commit to a certain percentage of development time that can be allocated to addressing technical debt on an on-going basis. If they do, it’s important not to abuse this responsibility. Do not let engineers alone determine what technical debt should be paid down and at what rate – it must have true business value that is greater than or equal to spending that time on other activities.
Just as with debt that a company assumes, technical debt in and of itself is not bad. It can be looked at as a leveraging tool to optimize technology resources in the short term – delaying a hardware tech refresh or the release date for HTML 5, for example. Delaying attention to technical issues allows greater resources to be focused on higher-priority endeavors. The absence of technical debt probably means missed business opportunities – use technical debt as a tool to best meet the needs of the business. However, excessive technical debt will cause availability and scalability issues, and can choke business innovation (too much engineering time dealing with debt rather than focusing on the product).
Develop a technology balance sheet and profit and loss (income) statement to discuss tech debt with the business in a manner they understand – finance. Let's first look at the balance sheet, where Assets = Liabilities + Equity. Our assets are the engineering time spent creating the product. Liabilities are the principal of the tech debt (i.e., the difference between "desired" and "actual"). Equity is the remainder: the engineering resources spent creating the product while not contributing to tech debt.
Here is an example of a technology balance sheet:
To further the financial analogy, we need a technology P&L statement. Here, the interest on tech debt is the difficulty, or increased level of effort, in modifying something in subsequent releases. This manifests as a reduction in developer productivity per unit of value created. The more debt you take on, or the less principal you pay down, the higher your interest payment – and its cost to the organization – becomes.
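A toy model of the two statements might look like the sketch below; the figures (in engineer-weeks) and the interest drag rate are invented purely to show the arithmetic, not benchmarks.

```python
# Toy technology balance sheet and P&L. All numbers are illustrative
# assumptions, expressed in engineer-weeks.
assets_weeks = 120          # total engineering time invested in the product
debt_principal_weeks = 18   # known shortcuts: "desired" minus "actual"
equity_weeks = assets_weeks - debt_principal_weeks   # Assets = Liabilities + Equity

# P&L side: while the debt is outstanding, every release pays "interest" as
# extra effort. The drag rate here is an assumption for the example.
interest_drag = 0.15
release_effort_weeks = 10
interest_per_release = release_effort_weeks * interest_drag

print(f"Assets {assets_weeks}w = Liabilities {debt_principal_weeks}w + Equity {equity_weeks}w")
print(f"Interest paid each release: {interest_per_release:.1f} engineer-weeks")
```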
Dedicating resources on an ongoing basis to service technical debt can be a challenging discussion with the business. Resources are always limited and employing them in the manner which best benefits the business is a critical business priority decision. Similar to the notion of debt within business, you should never take on technical debt without a plan to pay the interest (increased future cost of development) and principal (fixing the difference between appropriate and as-is). Relating technical debt to financial debt can help those outside of your technology organization grasp the concept and understand the need to keep technical debt under control.
One way to make the concept of debt real is to estimate, for any debt item, the amount of "interest" one will need to pay in the future to modify the solution in question (the arithmetic is sketched after the example below):
- For the benefit of time to market, you decide to “hard code” a number of “display strings” that you’d rather set aside in a resource file to modify and translate later.
- You save 2 weeks of development time, creating a 2-week liability on your balance sheet. You have a 2-week principal to fix.
- You estimate that for all future string modifications (or translations) it will take an additional day of development. Your interest is 1 day, payable for each modification.
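In arithmetic form (the day counts below follow the example above; the number of future changes is an assumption), the break-even point is simply the principal divided by the per-change interest:

```python
# The hard-coded-strings example as arithmetic: a ~2-week principal saved up
# front versus one extra day of "interest" on every later string change.
principal_saved_days = 10      # roughly two work weeks saved by hard-coding
interest_per_change_days = 1   # extra day each future modification costs

for future_changes in (3, 10, 25):
    total_interest = future_changes * interest_per_change_days
    verdict = "still ahead" if total_interest < principal_saved_days else "debt now costs more than it saved"
    print(f"{future_changes} future changes -> {total_interest} days of interest ({verdict})")
```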
Just as retiring all financial liabilities at once does not make good business sense, trying to wipe out technical debt in one fell swoop is a bad idea. Continuous service to the technical debt is required to prevent technical liabilities from wiping out technical equity. An informed decision to increase debt service to reduce the principal will result in more productive product development time (smaller debt requires less on-going service). A short-term decision to reduce tech debt service in favor of a critical product launch may be viable if not used often. Keep track of both your principal (balance sheet) and your interest payments (income statement). Use these to help your business partners with debt related decisions.
Do NOT mix the cost of defects, or other infrastructure and software mistakes with tech debt. Doing so creates two very big problems:
- It becomes harder for the technology team to learn from past mistakes. Mistakes are mistakes and we should use them as learning opportunities. Debt is taken thoughtfully. Track them separately and treat them differently.
- Using the debt term for non-debt-related items will lower the level of trust between you and the business. Businesses don't, for instance, "mistakenly" take on debt. Mixing these terms can cause relationship problems.
Additionally, be clear about how you define technical debt, so time spent paying it down is not commingled with other activities. Bugs in your code are not technical debt. Refactoring your code base to make it more scalable, however, would be. A good test is to ask whether the path you chose was a conscious or unconscious decision – did you decide to go in one direction knowing that you would later need to refactor? Taking on debt means making a specific decision to do, or not do, something knowing that you will need to address it later. Bugs are found in sloppy code; that is not tech debt, it is just bad code.
Prioritizing Tech Debt
So how do you decide what tech debt should be addressed, and how do you prioritize? If you have been tracking work with Agile storyboards and product backlogs, you should have an idea where to begin. Also, if you track your problems and incidents like we recommend, they will show elements of tech debt that have begun to manifest themselves as scalability and availability concerns. Set a budget and begin paying down the debt. If you are spending less than 12% of development time on debt, you are not spending enough effort. If you are spending over 25%, you are probably fixing issues that have already manifested themselves and trying to catch up. Setting an appropriate budget and maintaining it over the course of your development efforts will service the debt and help prevent issues from arising.
Taking on technical debt to fund your product development efforts is an effective method to get your product to market quicker. But, just like financial debt, you need to take on an appropriate amount of tech debt that you can service by making the necessary interest and principal payments to reduce the outstanding balance. Failing to set an appropriate budget will result in a technical "bankruptcy" that will be much harder to dig yourself out of later.
Tech Debt Takeaways
Here is a list of our tech debt takeaways:
Want help reducing your tech debt? We can help.
The Domino or Multiplicative Effect of Failure
September 18, 2018 | Posted By: Pete Ferguson
As part of our Technical Due Diligence and Architectural reviews, we always want to see a company’s system architecture, understand their process, and review their org chart. Without ever stepping foot at a client we can begin to see the forensic evidence of potential problems.
Like that ugly couch you bought in college and still have in your front room, inefficiencies in architecture, process, and organization are often nostalgic memories that have long since outlived their purpose. While you have become used to the ugly couch, outsiders look in and immediately recognize it as the eyesore it is – and customers often feel the inefficiencies through slow page loads and shopping cart issues. "That's how it has always been" is never a good motto when designing systems, processes, and organizations for flexibility, availability, and scalability.
It is always interesting to hear companies talk with the pride of a parent about their unruly kid when they use phrases like "our architecture/organization is very complex" or "our systems/organization has a lot of interdependent components" – as if either of these things were something special or desirable! Great architectures are sketched out on a napkin in seconds, not hours.
All systems fail. Complex systems fail miserably and – like dominoes – take down neighboring systems as well, resulting in latency, downtime, and/or flat-out failure.
ARCHITECTURE & SOFTWARE
Some common observations in hardware/software we repeatedly see:
Problem: An overloaded F5 (or similar device) is trying to encrypt all data because Personally Identifiable Information (PII) is stored in plain text – usually a leftover from a business decision made long ago that no one can quite recall, and because an auditor once said "encrypt everything" to protect it. Because no one person is responsible for a 30,000-foot view of the architecture, each team happily works in its silo, and the decision to encrypt is held up like a trophy without anyone seeing that the F5 is often running hot, causing latency, and has become a bottleneck (resulting in costly requests for more F5s) doing something it has no business doing in the first place.
Solution: Segregate all PII, tokenize it and only encrypt the data that needs to be encrypted, speeding up throughput and better isolating and protecting PII.
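A simplified sketch of the tokenize-then-encrypt idea follows. It assumes the Python `cryptography` package; the in-memory dictionary stands in for a properly secured, isolated PII store, and all names are illustrative.

```python
# Simplified tokenization sketch: encrypt PII once in an isolated store and pass
# only opaque tokens through the rest of the system, so edge devices stop doing
# bulk crypto. Assumes `pip install cryptography`; the "vault" dict is a stand-in.
import secrets
from cryptography.fernet import Fernet

key = Fernet.generate_key()    # in practice, managed by a KMS/HSM, not in code
vault = {}                     # token -> encrypted PII, held in the isolated PII store

def tokenize(pii_value: str) -> str:
    """Encrypt the PII once, store it in the vault, and hand back an opaque token."""
    token = secrets.token_urlsafe(16)
    vault[token] = Fernet(key).encrypt(pii_value.encode())
    return token

# Downstream services only ever see tokens, never the raw PII.
order = {"order_id": 1234, "email_token": tokenize("jane@example.com")}
print(order)
```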
Integration (or Rather Lack Thereof) Of Mergers & Acquisitions
Problem: A recent (and often not so recent) flurry of acquisitions is resulting in cross data center calls in and out of firewalls. Purchased companies are still in their own data center or public cloud and the entire workflow of a customer request is crisscrossing the country multiple times not only causing latency, but if one thing goes wrong (remember, everything fails …) timeouts result in customer frustration and lost transactions.
Solution: Integrate services within one isolated stack or swim lane – either hosted or public cloud – to avoid cross data center calls. Replicate services so that each datacenter or cloud instance has everything it needs.
Problem: As the company grew and gained more market share, the search for bigger and better has resulted in a monolithic database that is slow, requires specialized hardware, specialized support, ongoing expensive software licenses, and maintenance fees. As a result, during peak times the database slows everyone and everything down. The temptation is to buy bigger and better hardware and pay higher monthly fees for more bandwidth.
Solution: Break down databases by customer, region, or other Z-Axis splits on the AKF Scale Cube. This has multiple wins – you can use commodity servers instead of large complex file storage, failure of one database will not affect the others, you can place customer data closest to the customer by region, and adding additional servers does not require a long lead time or a request for substantial capital expenditure.
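As a rough illustration of a Z-axis split, the sketch below routes every request for a given customer to one of several small, commodity databases. The shard list and hashing scheme are assumptions for illustration; real implementations often shard by region or use a lookup service instead of a hash.

```python
import hashlib

# Each shard is an independent, commodity database -- illustrative DSNs only.
SHARDS = [
    "postgres://db-us-east-1/customers_0",
    "postgres://db-us-east-1/customers_1",
    "postgres://db-eu-west-1/customers_2",
]

def shard_for(customer_id: str) -> str:
    """Deterministically map a customer to one shard (a Z-axis split)."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All reads and writes for this customer go to the same, smaller database.
print(shard_for("customer-8675309"))
```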
PROCESSES & ORGANIZATION
What sets AKF apart is that we don’t just look at systems; we also want to understand the people and organization supporting the system architecture, and here there are additional multiplicative effects of failure. We have considerable expertise working for and with Fortune 100 companies, startups, and agencies across many different competencies. The common mistakes we see on the organization side of the equation:
Lack of Cross Functional Teams
Problem: Agile Scrum teams do not have all the resources needed within the team to be self-sufficient and autonomous. As a result, teams wait on other internal resources for approvals or answers in order to complete a Sprint, or keep these items on the backlog because the effort estimate is too high. The result is a slower time to market, the loss of what could have been a competitive advantage, and lower revenue.
Solution: Create cross-functional teams so that each Sprint can be completed with the necessary access to security, architecture, QA, and other resources. This doesn’t mean each team needs a dedicated resource from each discipline – one resource can support multiple teams. The information needed can be greatly augmented by creating guilds where the subject matter expert (SME) can “deputize” multiple people on what is required to meet policy. Guilds utilize published standards and provide a dedicated channel of communication to the SME, greatly simplifying and speeding up the approval process.
Lack of Automation
Problem: It isn’t done enough! As a result, people are waiting on other people for needed approvals. The usual excuse is that there isn’t enough time or resources. In most cases when we do the math, the cost of not automating far outweighs the short-term investment, while automation pays a continuous long-term dividend. We often see that the individual with the deployment knowledge is insecure and resists automation because they feel their job is threatened. This is a short-sighted view that requires coaching so they can see how much more valuable they become to the organization by enabling progress rather than stifling it.
Solution: Automate everything possible from testing, quality assurance, security compliance, code compliance (which means you need a good architectural review board and standards), etc! Automation is the gift that keeps on giving and is part of the “secret sauce” of top companies who are our clients.
Not Empowering Teams to Get Stuff Done!
Problem: Often teams work in a silo, only focused on their own tasks and are quick to blame others for their lack of success. They have been delegated tasks, but do not have the ability to get stuff done.
Solution: Similar to cross functional teams, each team must also be given the authority to make decisions (hence why you want the right people from a variety of dependencies on the team) and get stuff done. An empowered team will iterate much faster and likely with a lot more innovation.
While each organization will have many variables both enabling and hindering success, the items listed here are common denominators we see time and time again, and they often need an outside perspective to identify. Back to the ugly couch analogy: it is easy to walk into someone else’s house and immediately spot their ugly couch!
Pay attention to those you have hired away from the competition in their early days and seek their opinions and input, as your organization’s old bad habits likely look ridiculous to them. Of course, only do this with an intent to listen and to learn – getting defensive or stubbornly trying to explain why things are the way they are will not only dead-end your learning, but will also abruptly stop any budding trust with your new hire.
And of course, we are always more than happy to pop the hood and take a look at your organization just as we have been doing for the top banks, Fortune 100, healthcare, and many other organizations. Put our experience to work for you!
Effective Incident Communications
September 17, 2018 | Posted By: Bill Armelin
Everything fails! This is a mantra that we are always espousing at AKF. At some point, these failures will manifest themselves as an outage. In a SaaS world, restoring service as quickly as possible is critical. It requires having the right people available and being able to communicate with them effectively. A lack of good communications can cause an incident to drag on.
For startups and smaller companies, communication during incidents is less of an issue. Systems tend to be smaller or monolithic. Teams supporting these systems also tend to be small. When something happens, everyone jumps on a call to figure out the problem. As companies grow, the number of people needed to resolve an incident grows. Coordinating communications among a large group of people becomes difficult. Adding to the chaos are executives joining the conference bridges demanding updates about service restoration.
In order to minimize the time to restore a system during an incident, companies need the right people on the call. For large, complex systems, identifying the right resources to solve a problem can be difficult. We recommend swarming an issue with everyone that could be needed to resolve the incident, and then releasing those that are no longer needed. But with such a large number of people, it can be difficult to coordinate communications, especially on a single conference call bridge.
Managing the communications of a large group of people working an incident is critical to minimizing the restoration time. We recommend a communication method that many of us at AKF learned in the military. It involves using multiple voice and chat channels to coordinate work and the flow of information. Before we get into the details of managing communications, we need to first look at the leadership required to effectively work the incident.
Technical Incident Manager and Incident Communications Manager
Managing a large incident is usually too much for a single individual. She cannot coordinate the work occurring to resolve the incident while also reporting status to, and answering questions from, executives eager to know what is going on. We recommend that companies manage incidents with two people. The first is the individual responsible for directing all activities geared toward restoration of service. We call this person the Technical Incident Manager. This individual’s main job is to reduce the mean time to restoration. She needs overall architectural knowledge of the product and systems to direct the work. She is responsible for leading the call and de-escalating once diagnosis makes clear who needs to stay involved. She identifies and diagnoses the service issues and engages the appropriate subject matter experts to assist in restoration.
The second individual is the Incident Communications Manager. He is responsible for supporting the Technical Incident Manager by listening to the technical resolution chatter and summarizing it for a non-technical audience. His focus is on communications speed, quality, and accuracy. He is the primary communications channel for both internal and external messaging. He owns the incident communications process.
Incident Communications Process
This process involves using multiple communication channels to control information and work performed. The first channel established is the Control Channel. This is in the form of a conference bridge and a chat channel. The Technical Incident Manager controls both of these channels. The second channel created is the Status Channel. This also has a voice bridge and a chat channel. The Incident Communication Manager is responsible for managing this channel.
The Control Channel is used for all communication related to the restoration of service. People only use the voice channel for immediate communication and to announce work that is occurring or address immediate questions that need to be answered. Detailed work conducted is placed in the chat channel. This reduces the chatter on the voice channel to command and control messages. It also serves as a record of actions taken that can be referenced in the post mortem/RCA process. If specific teams need to discuss the work they are performing, separate voice and chat breakout channels are created for them. They move off the main channel into their breakout channels to perform the work. The leader of these teams periodically communicates status back up to the control channel.
As the work is progressing, the Incident Communications Manager monitors the Control Channel to provide the basis for his messaging. He formulates updates that he delivers over the Status bridge and chat channel. He keeps executives and customers informed of progress and status, keeping the control channel free of requests for frequent updates and dedicated to restoring service.
This method of communications has worked well in the military for years and has been adopted by many large companies to manage their incident communications. While it is overkill for small companies, it becomes an effective process as companies grow and systems become more complex.
Let us help your organization with incident and crisis management process.
Are you compromised?
September 14, 2018 | Posted By: Larry Steinberg
It’s important to acknowledge that a core competency for hackers is hiding their tracks and maintaining dormancy for long periods of time after they’ve infiltrated an environment. They may also be utilizing exploits you have not protected against. Given all of this, how do you know that you are not currently compromised by the bad guys? Hackers are skilled hidden operators and have many ‘customers’ to prey on. They will focus on a customer or two at a time and then shut down activities to move on to another unsuspecting victim. It’s in their best interest to keep their profile low, and you might not know that they are operating (or waiting) in your environment with access to your key resources.
Most international hackers are well organized, well educated, and have development skills that most engineering managers would admire if not for the malevolent subject matter. Rarely are these hacks performed by bots; most are carried out by humans setting up a chain of software elements across unsuspecting entities, enabling inbound and outbound access.
What can you do? To start, don’t get complacent with your security. Even if you have never been compromised, or have been and eradicated what you know about, you will never know for sure that you are not currently compromised. As a practice, it’s best to assume that you are, to keep looking for the evidence, and to keep identifying ways to keep attackers out. Hacking is dynamic and threats are constantly evolving.
There are standard practices of good security to follow, such as the NIST Cybersecurity Framework and the OWASP Top 10. Further, for your highest value environments, here are some questions you should consider: would you know if these systems had configuration changes? Would you be aware of unexpected processes running? If you have interesting information in your operating or IT environment and the bad guys get in, it is of no value to them unless they can get that information back out of the environment; where is your traffic going? Can you model expected outbound traffic and monitor it? The answer should be yes. Then you can look for abnormalities and even correlate this traffic with other activities in your environment.
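As a very rough illustration of modeling outbound traffic, the sketch below learns a baseline from historical flow data and flags intervals that deviate sharply. Real monitoring uses far richer signals and dedicated tooling; the numbers and threshold here are invented.

```python
from statistics import mean, stdev

# Hourly outbound bytes from a host, gathered from flow logs (illustrative numbers).
baseline = [1.2e9, 1.1e9, 1.3e9, 1.0e9, 1.2e9, 1.4e9, 1.1e9]
mu, sigma = mean(baseline), stdev(baseline)

def is_anomalous(outbound_bytes: float, z_threshold: float = 3.0) -> bool:
    """Flag intervals well outside the learned baseline for investigation."""
    return abs(outbound_bytes - mu) > z_threshold * sigma

print(is_anomalous(1.25e9))  # normal hour -> False
print(is_anomalous(9.0e9))   # large, unexpected exfiltration-sized spike -> True
```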
Just as you and your business are constantly evolving to serve your customers and to attract new ones, the bad guys are evolving their practices too. Some of their approaches are rudimentary because we allow them to be; when we buckle down, they have to get more innovative. Ensure that you are constantly identifying all the entry points and closing them. Then remain vigilant to new approaches they might take.
Don’t forget the most common attack vector - humans. Continue evolving your training and keep the awareness high within your staff - technical and non-technical alike.
Your default mental model should be that you don’t know what you don’t know. Utilize best practices for security and continue to evolve. Utilize external or build internal expertise in the security space and ensure that those skills are dynamic and expanding. Utilize recurring testing practices to identify vulnerabilities in your environment and to prepare against emerging attack patterns.
We commonly help organizations identify and prioritize security concerns through technical due diligence assessments. Contact us today.
The Scalability Cube – Your Guide to Evaluating Scalability
September 10, 2018 | Posted By: Robin McGlothin
Perhaps the most common question we get at AKF Partners when performing technical due diligence on a company is, “Will this thing scale?” After all, investors want to see a return on their investment in a company, and a common way to achieve that is to grow the number of users on an application or platform. How do they ensure that the technology can support that growth? By evaluating scalability.
Let’s start by defining scalability from the technical perspective. The Wikipedia definition of “scalability” is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. That definition is accurate when applied to common investment objectives. The question is, what are the key attributes of software that allow it to scale, along with the anti-patterns that prevent scaling? Or, in other words, what do we look for at AKF Partners when determining scalability?
While an exhaustive list is beyond the scope of this blog post, we can quickly use the Scalability Cube and apply the analytical methodology that helps us quickly determine where the application will experience issues.
AKF Partners introduced the Scale Cube, a scale design model for building resilient application architectures using patterns and practices that apply broadly to any application. This is a best-practices model that describes all of the scale dimensions from “The Art of Scalability” book (AKF Partners – Abbott, Keeven & Fisher Partners).
The “Scale Cube” is composed of an X-Axis, Y-Axis, and Z-Axis:
1. Technical Architectural Layering (X-Axis) – No single points of failure. Duplicate everything.
2. Functional Decomposition Segmentation – Componentization to Modules & Microservices (Y-Axis). Split Report, Message, Locate, Forms, Calendar into fault isolated swim lanes.
3. Horizontal Data Partitioning - Shards (Z-Axis). Beginning with pilot users, start with “podding” users for scalability and availability.
The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed. Scalability is all about the capability of a design to support ever growing client traffic without compromising performance. It is important to understand there are no “silver bullets” in designing scalable solutions.
An architecture is scalable if each layer in the multi-layered architecture is scalable. For example, a well-designed application should be able to scale seamlessly as demand increases and decreases and be resilient enough to withstand the loss of one or more computer resources.
Let’s start by looking at the typical monolithic application. A large system that must be deployed holistically is difficult to scale. In the case where your application was designed to be stateless, scale is possible by adding more machines, virtual or physical. However, scaling a monolith typically requires large, powerful machines that are not cost-effective. Additionally, you carry the added risk of extensive regression testing because you cannot update small components on their own. Instead, we recommend a microservices-based architecture using containers (e.g. Docker) that allows for independent deployment of small pieces and the scaling of individual services instead of one big application.
Monolithic applications have other negative effects, such as development complexity. What is “development complexity”? As more developers are added to the team, be aware of the effects of Brooks’ Law, which states that adding more software developers to a late project makes the project even later. For example, one large solution loaded in the development environment slows a developer down, and it gets worse as more developers add components. This causes slower and slower load times on development machines, and developers stomping on each other’s changes (or creating complex merges) as they modify the same files.
Another example of a development complexity issue is large, outdated pieces of the architecture or database where one person is the expert. That person becomes a bottleneck to changes in a specific part of the system. They are also a SPOF (single point of failure) if they are the only resource that understands the monolithic beast. The monolithic complexity and the rate of code change make it hard for any developer to know all the idiosyncrasies of the system, so more defects are introduced. A decoupled system with small components helps prevent this problem.
When validating database design for appropriate scale, there are some key anti-patterns to check. For example:
• Do synchronous database accesses block other connections to the database when retrieving or writing data? This design can end up blocking queries and holding up the application.
• Are queries written efficiently? Large data footprints, with significant locking, can quickly slow database performance to a crawl.
• Is there a heavy report function in the application that relies on a single transactional database? Report generation can severely hamper the performance of critical user scenarios. Separating out read-only data from read-write data can positively improve scale (see the sketch after this list).
• Can the data be partitioned across different databases and/or database servers (sharding)? For example, customers in different geographies may be partitioned to servers closer to their locations. In turn, separating out the data allows for enhanced scale since requests can be split out.
• Is the right database technology being used for the problem? Storing BLOBs in a relational database has negative effects – instead, use the right technology for the job, such as a NoSQL document store. Forcing less structured data into a relational database can also lead to waste and performance issues, and here, a NoSQL solution may be more suitable.
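As a minimal sketch of the read/write separation mentioned in the list above (connection strings and routing logic are illustrative only), sending reads, including reporting queries, to replicas keeps the transactional primary free for critical user writes:

```python
import random

# Illustrative connection strings: one writer, several read-only replicas.
PRIMARY = "postgres://primary/app"
READ_REPLICAS = ["postgres://replica-1/app", "postgres://replica-2/app"]

def connection_for(query: str) -> str:
    """Send writes to the primary; spread reads (reports included) across replicas."""
    is_write = query.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
    return PRIMARY if is_write else random.choice(READ_REPLICAS)

print(connection_for("SELECT * FROM orders WHERE region = 'EU'"))
print(connection_for("UPDATE orders SET status = 'shipped' WHERE id = 42"))
```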
We also look for mixed presentation and business logic. A software anti-pattern that can be prevalent in legacy code is not separating out the UI code from the underlying logic. This practice makes it impossible to scale individual layers of the application and takes away the capability to easily do A/B testing to validate different UI changes. Layer separation allows putting just enough hardware against each layer for more minimal resource usage and overall cost efficiency. The separation of the business logic from SPROCs (stored procedures) also improves the maintainability and scalability of the system.
Another key area we dig for is stateful application servers. Designing an application that stores state on an individual server is problematic for scalability. For example, if some business logic runs on one server and stores user session information (or other data) in a cache on only one server, all user requests must use that same server instead of a generic machine in a cluster. This prevents adding new machine instances that can field any request that a load balancer passes its way. Caching is a great practice for performance, but it cannot interfere with horizontal scale.
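A minimal sketch of the alternative, assuming a shared Redis session store (the host name and key scheme are illustrative): any application server behind the load balancer can serve any request because the session no longer lives in one server’s memory.

```python
import redis

# Shared session store -- reachable by every app server behind the load balancer.
r = redis.Redis(host="session-store.internal", port=6379)

def save_session(session_id: str, data: dict, ttl_seconds: int = 1800) -> None:
    """Write session state to the shared store instead of a per-server in-memory cache."""
    r.hset(f"session:{session_id}", mapping=data)
    r.expire(f"session:{session_id}", ttl_seconds)

def load_session(session_id: str) -> dict:
    """Any node can load the session, so the load balancer needs no sticky sessions."""
    return r.hgetall(f"session:{session_id}")
```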
Finally, long-running jobs and/or synchronous dependencies are key areas for scalability issues. Actions on the system that trigger processing times of minutes or more can affect scalability (e.g. execution of a report that requires large amounts of data to generate). Continuing to add machines doesn’t help, as the system can never keep up in the presence of many requests. Blocking operations exacerbate the problem. Look for solutions that queue up long-running requests, execute them in the background, send events when they are complete (asynchronous communication), and do not tie up key application and database servers. Communication with dependent systems for long-running requests using synchronous methods also affects performance, scale, and reliability. Common solutions for intersystem communication and asynchronous messaging include RabbitMQ and Kafka.
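One common way to implement the queue-and-notify pattern is a task queue such as Celery backed by RabbitMQ. The sketch below is illustrative (the broker URL, task name, and placeholder helpers are assumptions): the web tier enqueues the report and returns immediately while a background worker does the heavy lifting.

```python
from celery import Celery

# Broker URL is illustrative; Celery also works with Redis and other brokers.
app = Celery("reports", broker="amqp://guest@rabbitmq.internal//")

def load_report_data(report_id: str) -> list:
    """Placeholder for the long-running query (ideally against a read replica)."""
    return [("2018-Q3", 1_250_000)]

def write_report(report_id: str, rows: list) -> None:
    """Placeholder for persisting the finished artifact."""
    print(f"report {report_id}: {len(rows)} rows written")

@app.task
def generate_report(report_id: str) -> None:
    """Runs on a background worker, not on the web tier handling user requests."""
    rows = load_report_data(report_id)
    write_report(report_id, rows)
    # ...emit an event or send an email here so the requester knows it is complete.

# In the request handler: enqueue and return immediately instead of blocking.
# generate_report.delay("quarterly-revenue-2018")
```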
Again, the list above is not exhaustive but outlines some key areas that AKF Partners look for when evaluating an architecture for scalability. If you’re looking for a checklist to help you perform your own diligence, feel free to use ours. If you’re wondering more about our diligence practice, you may be interested in our thoughts on best practices, or our beliefs around diligence and how to get it right. We’ve performed technical diligence for seed rounds, A-series and beyond, carve-outs, strategic investments and taking public companies private. From $5 million invested to over $1 billion. No matter the size of company or size of the investment, we can help.
The Importance of a Post Mortem
September 6, 2018 | Posted By: James Fritz
“An incident is a terrible thing to waste” is a mantra that AKF repeats during its engagements, and rightfully so, as many companies have an incident response plan in place but stop there. Why are incidents so important? What is the true value in doing a proper Post Mortem and actually learning from an incident?
Incidents identify issues in your product. But if that is all you take away from an incident, then you are missing out on much more information that an incident can provide. An incident is the first step toward identifying a problem that exists in your product, infrastructure, processes, and perhaps even your people. “But aren’t incidents and problems the same thing?” Not necessarily. An incident is a one-time event. It can recur if you never address the underlying problem, but each occurrence is still just an event, not the problem itself.
Conducting a Post Mortem
Gather as many data points as possible shortly after an incident concludes and schedule a Post Mortem review meeting.
Start with the incident timeline. Sufficiently logging events over time provides ready access to the needed data for forensic analysis. From this information you can then start to identify what went wrong, when it went wrong and how quickly you were able to respond to it. The below definitions are all factors that need to be identified:
- Time To Detect: How quickly did you identify that an incident had occurred
- Time To Escalate: How quickly did you get everyone necessary to fix the incident involved
- Time To Isolate: How quickly did you stop the incident from affecting other portions
- Time To Restore: How quickly did the system get brought back up
- Time To Repair: How quickly did you fix the incident
This all leads to the Incident Timeline Analysis.
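As a simple illustration of turning a logged timeline into these metrics, the sketch below uses invented timestamps. The exact start and end points for each interval vary by organization, but the key is to measure from when the incident occurred, not when it was reported.

```python
from datetime import datetime

# Invented timeline for a single incident, pulled from logs and the incident channel.
timeline = {
    "occurred":  datetime(2018, 9, 6, 2, 14),
    "detected":  datetime(2018, 9, 6, 2, 52),
    "escalated": datetime(2018, 9, 6, 3, 5),
    "isolated":  datetime(2018, 9, 6, 3, 20),
    "restored":  datetime(2018, 9, 6, 3, 41),
    "repaired":  datetime(2018, 9, 6, 11, 30),
}

def minutes_between(start: str, end: str) -> float:
    return (timeline[end] - timeline[start]).total_seconds() / 60

print("Time to Detect:  ", minutes_between("occurred", "detected"), "min")
print("Time to Escalate:", minutes_between("detected", "escalated"), "min")
print("Time to Isolate: ", minutes_between("detected", "isolated"), "min")
print("Time to Restore: ", minutes_between("occurred", "restored"), "min")
print("Time to Repair:  ", minutes_between("occurred", "repaired"), "min")
```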
If you can gather information from several incidents and look at them together in your Post Mortem reviews, then you can figure out where your biggest issues are when it comes to incidents and getting the system back up and running. It is not uncommon for us to see that it takes longer to detect an incident than to restore from it. This could be mitigated with more monitoring at more appropriate positions than you currently have.
Or maybe the time to escalate is an issue. Why does it take so long to get the proper engineers involved? Maybe a real-time alerting system or a phone tree is required. It is also important to track and measure the total time of an incident, beginning with when it occurred (not when it was reported) all the way through to when customers were back up at 100% (not just when your systems were restored).
Problem vs. Incident
How do you know if your incident is also a problem? It’s actually fairly easy to determine. If you have an incident, you have a problem. The scale of the problem may vary by incident but every incident is caused because of something larger than itself.
During our Technical Due Diligences we always want to know how companies categorize incidents vs. problems. If the company properly categorizes problems related to incidents, they will be able to answer “Can you rank your problems to show which cause the most customer impact?” Many times, they can’t - but that ranking is critical to show which problems to attack first.
An incident, at its core, is caused by a problem. If your product crashes anytime someone attempts to access it via an unapproved protocol, the incident is the attempted access. The problem may be an improper review of your architecture. Or it may be lack of QA. Identifying the problem is much more difficult than identifying the incident. Imagine you find a termite on your deck. This small pest could be considered an incident. If you deal with the incident and get rid of the termite everything is good, right? If you don’t look any further than the incident you can’t identify the problem. And in this case the problem could be exposed, untreated wood allowing termites to slowly eat away at the inside of your house.
If you keep proper documentation each time you conduct a Post Mortem review, then you will have a history that starts to paint a picture of the ongoing and recurring problems that exist. Remedying the problem will stop the incident from occurring in exactly the same way in the future; small variations of the incident can still occur, but fixing the underlying problem stops future iterations of that incident from happening again.
Looking for additional help with your incident management process? We can help!
Migrating from a legacy product to a SaaS service? Don't make these mistakes!!
September 4, 2018 | Posted By: Dave Swenson
AKF has been kept quite busy over the last decade helping companies move from an on-premise product to a SaaS service - often one of the most difficult transitions a company can face. We have found that the following are the top mistakes made during that migration.
With apologies to David Letterman…
The Top 5 SaaS Migration Mistakes
5. Treat Your SaaS Migration Only as a Marketing Exercise
Wall Street values SaaS companies significantly higher than traditional software companies, typically at double the revenue multiples. A key reason for this is the improved margins that true SaaS companies gain through economies of scale. If you are primarily addressing your customers’ desire to move their IT infrastructure out of their shops, an ASP model hosted in the cloud is fine. However, if your investors or Wall Street view you as a SaaS company, ASP gross margins will not be accepted. A SaaS company is expected to produce in excess of 80% gross margin, whereas an ASP model typically caps out at around 60 or 70%.
How to Avoid?
Make a decision up front on what you want your gross and operating margins to be. This decision will guide how you sell your product (e.g.: highest margins require no code-level customizations), how you architect your systems (multi-tenancy provides greater economies of scale), and even how you release (you, not your customers, control release timing and frequencies). A note of warning: without SaaS margins, you will likely face pricing pressure from an existing or entrant competitor who achieved SaaS margins.
4. Tack the word ‘cloud’ on to your existing on-prem product and host it
Often a direct result of the above mistake, this is an ASP (Application Service Provider) model, not a SaaS one. While this exercise might be useful in exploring some hosting aspects, it won’t truly inform you about what your product and organization needs to become in order to successfully migrate to SaaS. It will result in nowhere near the gross and operating margins true SaaS provides, and your Board and Wall Street expect to see. As discussed in The Many Meanings of Cloud, the danger of tacking the word “Cloud” onto your product offering is that your company will start believing you are a “Contender”, and will stop pushing for true SaaS.
How to Avoid?
Again, if an ASP model is ‘good enough’, fine - just don’t label or market yourselves as SaaS. If you start the SaaS journey with an ASP model, make sure all within your company recognize your ASP implementation is a dead-end, a short-term solution, and that the real endpoint is a true SaaS offering.
3. Target Your Existing Customer Base
Many companies are so focused upon their existing customer base that they forget about entire markets that might be better suited for SaaS. The mistaken perception is
“We need to take customers along with us”,
when in fact, the SaaS reality is
“We need to use the Technology Adoption Lifecycle, compete with ourselves and address a different or smaller customer base first”.
How to Avoid?
Ignore your current customers and find early adopters to target, even if your top customers are pushing you to move to SaaS. A move to true SaaS from your on-premise product almost always requires significant architectural changes. Putting your entire product and code base through these changes will take time.
Instead, apply the Crossing the Chasm Bowling Alley strategy to grow your SaaS offering into your current customer base rather than fork-lifting your current solution in an ASP fashion into the cloud…
Carve out key slices of functionality from your product that can provide value in a standalone fashion, and find early adopters to help shake out your new offering. Even if you are in a ‘laggard’ industry (banking, healthcare, education, insurance), you will find early adopters within your targeted customer base. Seek them out; they will likely be far better partners in your SaaS migration than your existing ones.
2. Ignore Risk Shifts
It may come as a surprise to some within your company to find out how much risk your customers bear in order to host and use your product on-premise. These risks include security, availability, capacity, scalability, and disaster recovery. They also include costs such as software licensing that have been passed through to your customers, but are now yours, part of your operating margins.
How to Avoid?
Many of these “-ilities” are likely to be new disciplines that now must be instilled in your company, some by hiring key individuals (e.g.: a CISO), some through additional focus and rigor during your PDLC. Part of the process of rearchitecting for SaaS includes ensuring you have adequate scalability, that you design with availability in mind, and that your production topology enables disaster recovery. Where vertical scalability might be acceptable in your on-prem world (“just buy a bigger machine”), you now need to ensure you have horizontal scalability, ideally in an elastic form. The cost of proprietary software (e.g.: database licenses) is now yours to carry, and a shift to open source software can significantly improve your margins. These “-ilities” are also known as non-functional requirements (NFRs), and need to be considered with at least as much weight as your functional requirements during your backlog planning and prioritization.
And now, the biggest mistake we see made in SaaS migrations…
1. Underestimate Inertia
Inertia is a powerful force. Over the years spent building up your on-premise capabilities, you’ve almost certainly developed tremendous inertia - defined for our purposes as “a tendency to do nothing or remain unchanged”. In order to achieve true SaaS, in order to satisfy your investors and reach SaaS-like multiples, nearly every part of your company needs to act differently.
How to Avoid?
First, ensure your entire company is ready to embrace change. For many companies, the move to SaaS is the only answer to an existential threat that a known (or unknown) competitor presents, one who is listening to your customers say “Get this IT stuff out of my shop!”. Examples of SaaS disruptors include:
- Salesforce destroying Siebel
- ServiceNow vs. Remedy
- Workday taking on Oracle/Peoplesoft
Is there a disruptor waiting to take over your business?
Many companies choose to disrupt themselves, and after switching to SaaS, drive their stock price through the roof. Look at Adobe’s stock price once they fully embraced (or made their customers embrace) SaaS over packaged software.
Regardless of how you position the importance of SaaS to your employees, there will still be some that are stuck by inertia in the on-prem ways. Either relegate them to stay with the on-prem effort, or ‘weed’ them out.
Once you’ve got some built up momentum and desire within your company to make the move to SaaS, make a concerted examination department-by-department to determine how they will need to change. All involved need to recognize the risk shifts as mentioned, and associated required mindsets. While you likely have excellent, seasoned on-prem employees, do you have enough SaaS experience across each team? The SaaS migration should in no way be treated solely as an engineering exercise.
It always comes as a shock how many departments need to break out of their existing inertia, and act differently. Some examples:
- Can no longer promise code customizations.
- Need to address cloud security concerns.
- Must ensure existing customers know they can no longer dictate when releases occur.
- Learn to speak ARR (annual recurring revenue).
- Should look at alternative revenue schemes (seat vs. utility).
- SaaS presents changes in revenue recognition. Be prepared.
- Should focus on iterative releases that enable product discovery.
- Must learn to balance the NFRs/”-ilities” along with new features.
- Need to consider alternative customers and markets for your new SaaS offering.
- Will likely need to become continually engaged in the PDLC process, in order to stay abreast of releases occurring at far greater frequency than before.
- Must develop (along with engineering) incident management processes that deal with multiple customers simultaneously having issues.
- Better spin up this department in a hurry!
- As no code-level customizations should be happening, you might end up reducing this team, focusing them more on integrations with your customers’ IT or 3rd party products.
Hmm, so much change for this department. Where to start? Start by bringing AKF on board to examine your SaaS migration effort. Here is where we can help you the most.
Open Source Software as a malware “on ramp”
August 21, 2018 | Posted By: Larry Steinberg
Open Source Software (OSS) is an efficient means to building out solutions rapidly with high quality. You utilize crowdsourced design, development and validation to conveniently speed your engineering. OSS also fosters a sense of sharing and building together - across company boundaries or in one’s free time.
So just pull a library down off the web, build your project, and your company is ready to go. Or is that the best approach? What could be in this library you’re including in your solution that you don’t want? This code will be running in critical environments - your SaaS servers, internal systems, or customer systems. Convenience comes at a price, and there are some well-known cases of hacks embedded in popular open source libraries.
What is the best approach to getting the benefits of OSS and maintaining the integrity of your solution?
Good practices are a necessity to ensure a high level of security. Just as you test functionality, scale, and load when you utilize OSS, you should also be validating against vulnerabilities. Pre-production vulnerability and penetration testing is a good start. Also, utilize good internal process and reviews. Keep the process simple to maintain speed, but establish internal accountability and vigilance on code that’s entering your environment. You are already practicing good coding techniques with reviews and/or peer coding - build an equivalent practice for OSS.
Always utilize known good repositories and validate the project sponsors. Perform diligence on the committers just as you would for your own employees. You likely perform some type of background check on your employees before making an offer - whether going to a third party or simply looking them up on LinkedIn and asking around. OSS committers pose the same risk to your company - why not do the same for them? Understandably, you probably wouldn’t do this for a third-party purchased solution, but your contract or expectation is that the company is already doing this and abiding by minimum security standards. That may not be true for your OSS solutions, and as such your responsibility for validation is at least slightly higher. There are plenty of projects coming from reputable sources that you can rely on.
Ensure that your path to production only includes artifacts which have been built from internal sources that were either developed or reviewed by your team. Also, be intentional about OSS library upgrades; these should be planned and part of the process.
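One lightweight way to enforce that only reviewed artifacts reach production is to record a checksum for each third-party artifact at review time and verify it during the build. The manifest format and file names in this sketch are illustrative assumptions:

```python
import hashlib
import json

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def verify_artifacts(manifest_path: str) -> None:
    """Fail the build if any OSS artifact differs from the hash approved at review time."""
    with open(manifest_path) as f:
        approved = json.load(f)  # e.g. {"vendor/left-pad-1.3.0.tgz": "ab12..."}
    for artifact, expected in approved.items():
        if sha256_of(artifact) != expected:
            raise SystemExit(f"{artifact} does not match its reviewed checksum")

# Run as a build step:
# verify_artifacts("approved-oss-manifest.json")
```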
OSS is highly leveraged in today’s software solutions and provides many benefits. Be diligent in your approach to ensure you only see the upside of open source.
Need additional help? Contact Us!
The Top 20 Technology Blunders
July 20, 2018 | Posted By: Pete Ferguson
One of the most common questions we get is “What are the most common failures you see tech and product teams make?” To answer that question we queried our database consisting of 11 years of anonymous client recommendations. Here are the top 20 most repeated failures and recommendations:
1) Failing to Design for Rollback
If you are developing a SaaS platform and you can only make one change to your current process, make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible, but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad release, in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward,” for you to agree this is of the utmost importance. The thing most likely to give you an opportunity to find other work (i.e. “get fired”) is rolling out a product that destroys your business. In other words, if you are new to your job, DO THIS BEFORE ANYTHING ELSE; if you have been in your job for a while and have not done this, DO THIS TOMORROW. (Related Content: Monitoring for Improved Fault Detection)
2) Confusing Product Release with Product Success
Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value, and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives, like a release increasing signups by 10%, increasing checkouts by 15%, increasing the average sale price of all checkouts by 12%, or increasing click-through rates by 22%. See #10 below on incenting a culture of excellence. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholders wealthy! (Related Content: Agile and the Cone of Uncertainty)
3) Insular Product Development / Engineering
How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised about some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Organize around your outcomes and “what you produce” in cross functional teams rather than around activities and “how you work.” (Related Content: The No Surprises Rule)
4) Over Engineering the Solution
One of our favorite company mottos is “simple solutions to complex problems”. The simpler the solution, the lower the cost and the faster the time to market. If you get blank stares from peers or within your organization when you explain a design do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in resolving the complexity.
5) Allowing History to Repeat itself
Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize the value of the team. In the operations world, a failure to correlate past site incidents and find thematically related root causes is a guarantee to continue to fight the same fires over and over. The best and easiest way to improve future performance is to track past failures, group them by root cause, and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues, and improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.
6) Vendor Lock
Every vendor has a quick fix for your scale issues. If you are a hyper growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. This is not to say that after you design your system to scale horizontally that you will not rely upon some technology to help you; rather, once you define how you can horizontally scale you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases (and NoSQL solutions) provide for multiple types of native replication to keep hosts in synch.
7) Relying on QA to Find Your Mistakes
You cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem and that is where the problem should be treated. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering! Consider implementing test driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. All of this is not to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering.
8) Revolutionary or “Big Bang” Fixes
In our experience, complete re-writes or re-architecture efforts end up somewhere on the spectrum of not returning the desired ROI to complete and disastrous failures. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cutover to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.
9) The Multiplicative Effect of Failure – Eliminate Synchronous Calls
Every time you have one service call another service synchronously, you lower your theoretical availability. If each of your services is designed to be 99.999% available, where a service is a database, application server, application, webserver, etc., then the product of all of the service calls in the request path is your theoretical availability. Five synchronous calls yields (.99999)^5, or 99.995% availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.
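The arithmetic is easy to sanity-check yourself; a quick sketch:

```python
def chained_availability(per_service: float, call_depth: int) -> float:
    """Theoretical availability when every request must traverse N synchronous services."""
    return per_service ** call_depth

# Five synchronous 99.999% services: (0.99999)^5 ~= 0.99995, i.e. 99.995%.
print(f"{chained_availability(0.99999, 5):.5%}")
# Grow to twenty synchronous hops and availability erodes further.
print(f"{chained_availability(0.99999, 20):.5%}")
```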
10) Failing to Create and Incentivize a Culture of Excellence
Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth. (Related Content: Three Reasons Your Software Engineers May Not Be Successful)
11) Under-Engineer for Scale
The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now! That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.
12) “Not Built Here” Culture
We see this all the time. You may even have agreed with point (6) above because you have a “we are the smartest people in the world and we must build it ourselves” culture. The point of relying upon third parties to scale was not meant as an excuse to build everything yourselves. The real point to be made is that you have to focus on your core competencies and not dilute your engineering efforts with things that other companies or open source providers can do better than you. Unless you are building databases as a business, you are probably not the best database builder. And if you are not the best database builder, you have no business building your own databases for your SaaS platform. Focus on what you should be the best at: building functionality that maximizes your shareholder wealth and scaling your platform. Let other companies focus on the other things you need like routers, operating systems, application servers, databases, firewalls, load balancers and the like.
13) A New PDLC will Fix My Problems
Too often CTOs see repeated problems in their product development life cycles, such as missed dates or dissatisfied customers, and blame the PDLC itself.
The real problem, regardless of the lifecycle you use, is likely one of commitment and measurement. For instance, in most Agile lifecycles there needs to be consistent involvement from the business or product owner. A lack of involvement leads to misunderstandings and delayed products. Another very common problem is an incomplete understanding or training on the existing PDLC. Everyone in the organization should have a working knowledge of the entire process and how their roles fit within it. Most often, the biggest problem within a PDLC is the lack of progress measurement to help understand likely dates and the lack of an appropriate “product discovery” phase to meet customer needs. (Related Content: The Top Five Most Common PDLC Failures)
14) Inability to Hire Great People Quickly
Often when growing an engineering team quickly, engineering managers will push back on hiring plans and state that they cannot possibly find, interview, and hire engineers that meet their high standards. We agree that hiring great people takes time, and hiring decisions are some of the most important decisions managers can make. A poor hiring decision takes a lot of energy and time to fix. However, there are lots of ways to streamline the hiring process in order to recruit, interview, and make offers very quickly. A useful idea that we have seen work well in the past is interview days, where potential candidates are all invited on the same day. This should be no more than 2 - 3 weeks out from the initial phone screen, so having an interview day per month is a great way to get most of your interviewing done in a single day. Because you optimize the interview process, people are much more efficient, and it is much less disruptive to the daily work that needs to get done the rest of the month. Post-interview discussions and hiring decisions should all be made that same day so that candidates get offers or letters of regret quickly; this will increase the likelihood of offers being accepted and make a professional impression on those not getting offers. The key is to start with the right answer that “there is a way to hire great people quickly,” and the myriad ways to make it happen will be generated by a motivated leadership team.
15) Diminishing or Ignoring SPOFs (Single Point of Failure)
A SPOF is a SPOF, and even if the impact to the customer is low, it still takes time away from other work to fix right away in the event of a failure. And there will be a failure… because that is what hardware and software do: they work for a long time and then eventually fail! As you should know by now, it will fail at the most inconvenient time. It will fail when you have just repurposed the host that you were saving for it, or it will fail while you are releasing code. Plan for the worst case and run it on two hosts (we actually recommend always deploying in pools of three or more hosts) so that when it does fail you can fix it when it is most convenient for you.
16) No Business Continuity Plan
No one expects a disaster, but they happen, and if you cannot keep up normal operations of the business you will lose revenue and customers that you might never get back. Disasters can be huge, like Hurricane Katrina, where it takes weeks or months to relocate and start the business back up in a new location. Disasters can also be small, like a winter snow storm that keeps everyone at home for two days or a HAZMAT spill near your office that keeps employees from coming to work. A solid business continuity plan is something that is thought through ahead of time, before you need it, and explains to everyone how they will operate in the event of an emergency. Perhaps your satellite office will pick up customer questions, or your tech team will open up an IRC channel to centralize communication for everyone capable of working remotely. Do you have enough remote connections through your VPN server to allow for remote work? Spend the time now to think through what and how you will operate in the event of a major or minor disruption of your business operations, and document the steps necessary for recovery.
17) No Disaster Recovery Plan
Even worse, in our opinion, than not having a BC plan is not having a disaster recovery plan. If your company is a SaaS-based company, the site and services provided are the company’s sole source of revenue! Moreover, with a SaaS company, you hold all the data for your customers that allows them to operate. When you are down, they are more than likely seriously impaired in attempting to conduct their own business. When your collocation facility has a power outage that takes you completely down (think of the 365 Main datacenter in San Francisco), how many of your customers will leave and never return? Our preference is to provide your own disaster recovery through multiple collocation facilities, but if that is not yet technically feasible or in the budget, at a minimum you need your code, executables, configurations, loads, and data offsite, and an agreement in place for both collocation services as well as hosts. Lots of vendors offer such packages, and they should be thought of as necessary business insurance.
If you are cloud hosted, this still applies to you! We often find in technical due diligence reviews that small companies who are rapidly growing haven’t yet initiated a second active tech stack in a different availability zone or with a second cloud provider. Just because AWS, Azure and others have a fairly reliable track record doesn’t mean they always will. You can outsource services, but you still own the liability!
18) No Product Management Team or Person
In a similar vein to #13 above, there needs to be someone or a team of people in the organization who have responsibility for the product lines. They need to have authority to make decisions about what features get added, which get delayed, and which get deprecated (yes, we know, nothing ever gets deprecated but we can always hope!). Ideally these people have ownership of business goals (see #10) so they feel the pressure to make great business decisions.
19) Failing to Implement Continuously
Just because you call it scheduled maintenance does not mean that it does not count against your uptime. While some of your customers might be willing to endure the frustration of having the site down when they want to access it in order to get some new features, most care much more about the site being available when they want it. They are on the site because the existing features serve some purpose for them; they are not there in the hopes that you will rollout a certain feature that they have been waiting on. They might want new features, but they rely on existing features. There are ways to roll code, even with database changes, without bringing the site down (back to #17 - multiple active sites also allows for continuous implementation and the ability to roll back). It is important to put these techniques and processes in place so that you plan for 100% availability instead of planning for much less because of planned down time.
20) Firewalls, Firewalls, Everywhere!
We often see technology teams that have put all public facing services behind firewalls while many go so far as to put firewalls between every tier of the application. Security is important because there are always people trying to do malicious things to your site, whether through directed attacks or random scripts port scanning your site. However, security needs to be balanced with the increased cost as well as the degradation in performance. It has been our experience that too often tech teams throw up firewalls instead of doing the real analysis to determine how they can mitigate risk in other ways such as through the use of ACLs and LAN segmentation. You as the CTO ultimately have to make the decision about what are the best risks and benefits for your site.
Whatever you do, don’t make the mistakes above! AKF Partners helps companies avoid costly product and technology mistakes - and we’ve seen most of them. Give us a call or shoot us an email. We’d love to help you achieve the success you desire.
People Due Diligence
July 12, 2018 | Posted By: Robin McGlothin
Most companies do a thorough job of financial due diligence when they acquire other companies. But all too often, dealmakers simply miss or underestimate the significance of people issues. The consequences can be severe, from talent loss after a deal’s announcement, to friction or paralysis caused by differences in decision-making styles.
When acquirers do their people homework, they can uncover skills & capability gaps, points of friction, and differences in decision making. They can also make the critical people decisions - who stays, who goes, who runs the various lines of business, what to do with the rank and file at the time the deal is announced or shortly thereafter. Making such decisions within the first 90 days is critical to the success of a deal.
Take for example, Charles Schwab’s 2000 acquisition of US Trust. Schwab & the nation’s oldest trust company set out to sign up the newly minted millionaires created by a soaring bull market. But the cultures could not have been farther apart – a discount do-it-yourself stock brokerage style and a full-service provider devoted to pampering multimillionaires can make for a difficult integration. Six years after the merger, Chuck Schwab came out of retirement to fix the issues related to culture clash. The acquisition is a textbook example of a common business problem: the dealmakers simply ignored or underestimated the significance of people and cultural issues.
Another example can be found in the 2002 acquisition of PayPal by eBay. The fact that many on the PayPal side referred to it as a merger sets the stage for conflicting cultures. eBay was often embarrassed by the fact that PayPal invoice emails for a won auction arrived before the eBay end-of-auction email - PayPal made eBay look bad in this instance, and the technology teams were not eager to combine. In addition, PayPal titles were discovered to be one level higher than eBay titles considering the scope of responsibilities. Combining the technology teams did not go well and was ultimately scrapped in favor of dual teams - not the most efficient organizational model.
People due diligence lays the groundwork for a smooth integration. Done early enough, it also helps acquirers decide whether to embrace or kill a deal and determine the price they are willing to pay. There’s a certain amount of people due diligence that companies can and must do to reduce the inevitable fallout from the acquisition process and smooth the integration.
Ultimately, the success or failure of any deal has to do with people. Empowering people and putting them in a position where they will be successful is part of our diligence evaluation at AKF Partners. In our experience with clients, an acquiring company must start with some fundamental questions:
1. What is the purpose of the deal?
2. Whose culture will the new organization adopt?
3. Will the two cultures mesh?
4. What organizational structure should be adopted?
5. How will rank-and-file employees react to the deal?
Once those questions are answered, people due diligence can focus on determining how well the target’s current structure and culture will mesh with those of the proposed new company, who should be retained and by what means, and how to manage the reaction of the employee base.
In public, deal-making executives routinely speak of acquisitions as “mergers of equals.” That’s diplomatic, politically correct speak and usually not true. In most deals, there is not only a financial acquirer, there is also a cultural acquirer, who will set the tone for the new organization after the deal is done. Often, they are one and the same, but they don’t have to be.
During our Technology Due Diligence process at AKF Partners, we evaluate the product, technology and support organizations with a focus on culture and think through how the two companies and teams are going to come together. Who the cultural acquirer is depends on the fundamental goal of the acquisition. If the objective is to strengthen the existing product lines by gaining customers and achieving economies of scale, then the financial acquirer normally assumes the role of the cultural acquirer.
People due diligence, therefore, is there to verify that the target’s culture is compatible enough with the acquirer’s to allow for the building of necessary bridges between the two organizations. Key steps that are often missed in the process:
• Decide how the two companies will operate after the acquisition — combined either as a fully integrated operating company or as autonomous operating companies.
• Determine the new organizational structure and identify areas that will need to be integrated.
• Decide on the new executive leadership team and other key management positions.
• Develop the process for making employment-related decisions.
With regard to the last bullet point, some turnover is to be expected in any company merger. Sometimes shedding employees is even planned. It is important to execute the Weed, Seed & Feed methodology on an ongoing basis, not just at acquisition time. Unplanned, significant levels of turnover negatively impact a merger’s success.
AKF Partners brings decades of hands-on executive operational experience, years of primary research, and over a decade of successful consulting experience to the realm of product organization structure, due diligence and technology evaluation. We can help your company successfully navigate the people due diligence process.
Monitoring the Good, the Bad, the Ugly for Improved Fault Detection
July 8, 2018 | Posted By: Robin McGlothin
AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident. Business metric monitors will not tell you where or what the problem is, rather – and most importantly – they tell you something appears to be abnormal and should be investigated, that something has affected your customer experience.
A significant part of recovery time (and therefore availability) is the time required to detect and localize service incidents. A 2013 study by Business Internet Group of San Francisco found that of the 40 top-performing websites (as identified by KeyNote Systems), 72% had suffered user-visible failures in common functionality, such as items not being added to a shopping cart or an error message being displayed.
Our conversations with clients confirm that detecting these failures is a significant problem. AKF Partners estimates that 75% of the time spent recovering from application-level failures is time spent detecting them! Application-level failures can sometimes take days to detect, though they are repaired quickly once found. Fast detection of these failures (Time to Detect – TTD) is, therefore, a key problem in improving service availability.
The full duration of a product impairment is the time to restore (TTR).
To improve TTR, implement a good notification system that first, based on business metrics, tells you that an error affecting your users is happening. Then, rely upon application and system monitoring to tell you where and what has failed. Make sure to have good, easily viewed logs for all errors, warnings and other critical data your application creates. Many technologies already exist in this space; we just need to employ them effectively, with a focus on safeguarding the client experience.
Framed as Statistical Process Control (SPC – defined below), two relatively simple methods can improve TTD:
- Business KPI Monitors (Monitor Real User Behavior): Passively monitor critical user transactions such as logins, queries, reports, etc. Use math to determine abnormal behavior (a minimal sketch follows this list). This is the first line of defense.
- Synthetic Transactions (Simulate User Behavior): Synthetic transactions are scripted actions that attempt to mimic real customer behavior. Examples might be sign-ons, add to cart, etc. They provide a more meaningful view of your customers’ experiences vs. just looking at page load times, error rates, and similar. Do this with Keynote or a similar product and expand it to an internal systems scope. Alerts from a passive monitor can be confirmed or denied and escalated as appropriate. This is the second line of defense.
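To make the first line of defense concrete, here is a minimal sketch, in Python, of an SPC-style business KPI monitor. It is illustrative only: the metric, the baseline series, and the three-sigma control limits are assumptions for the example, not a prescription for your system.

    import statistics

    def control_limits(baseline, sigmas=3):
        """Compute SPC lower/upper control limits from a baseline series
        (e.g., logins per minute for the same hour on prior weeks)."""
        mean = statistics.mean(baseline)
        stdev = statistics.stdev(baseline)
        return mean - sigmas * stdev, mean + sigmas * stdev

    def check_kpi(current_value, baseline):
        """Return an alert message if the current reading falls outside the
        control limits, otherwise None."""
        lower, upper = control_limits(baseline)
        if current_value < lower or current_value > upper:
            return f"ALERT: KPI {current_value} outside control limits ({lower:.0f}, {upper:.0f})"
        return None

    # Logins per minute observed at 8pm on the prior six Wednesdays.
    baseline_logins = [1210, 1185, 1240, 1198, 1225, 1215]
    print(check_kpi(840, baseline_logins))  # far below normal -> alert fires

In practice, the baseline would be the same day-of-week and time-of-day slices described in the eBay example later in this post, and the alert would feed a NOC dashboard or paging system rather than a console.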
Monitor the Bad – potential and actual bad things (alert before they happen) – and tune and continuously improve (iterate!)
If you can’t identify all problem areas, identify as many as possible. The best monitoring starts before there’s a problem and extends beyond the crisis.
Because once the crisis hits, that’s when things get ugly! That’s when things start falling apart and people point fingers.
At times, failures do not disable the whole site, but instead cause brown-outs, where part of a site’s functionality is disabled or only some users are unable to access the site. Many of these failures are application-level failures that change the user-visible functionality of a service but do not cause obvious lower-level failures detectable by service operators. Effective monitoring will detect these faults as well.
The more proactive you can be about identifying the issues, the easier it will be to resolve and prevent them.
In fault detection, the aim is to determine whether an abnormal event happened or when an application being monitored is out of control. The early detection of a fault condition is important in avoiding quality issues or system breakdown, and this can be achieved through the proper design of effective statistical process control with upper & lower limits identified. If the values of the monitoring statistics exceed the control limits of the corresponding statistics, a fault is detected. Once a fault condition has been positively detected, the next step is to determine the root cause of the out-of-control status.
One downside of the SPC method is that significant changes in amplitude (natural increases in your business metrics) can cause problems. An alternative to SPC is First and Second Derivative testing. These tests tell if the actual and expected curve forms are the same.
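Here is a minimal sketch of the derivative idea, assuming we compare this week’s metric curve against the same hours from a prior week. Rather than comparing absolute values (which natural amplitude growth would break), it compares the shape of the two curves through their first differences; the sign-agreement threshold is an arbitrary illustration.

    def first_derivative(series):
        """Approximate the first derivative as point-to-point differences."""
        return [b - a for a, b in zip(series, series[1:])]

    def same_shape(actual, expected, tolerance=0.7):
        """Return True if the two curves rise and fall together, judged by how
        often their first derivatives agree in sign."""
        d_actual = first_derivative(actual)
        d_expected = first_derivative(expected)
        agree = sum(1 for a, e in zip(d_actual, d_expected) if (a >= 0) == (e >= 0))
        return agree / len(d_actual) >= tolerance

    # Hourly bids: this week runs ~20% higher in amplitude but has the same
    # shape, so a fixed SPC limit might fire while the shape test stays quiet.
    last_week = [100, 120, 150, 140, 110, 90]
    this_week = [120, 145, 180, 168, 132, 108]
    print(same_shape(this_week, last_week))  # True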
Here’s a real-world example of where business metrics help us determine changes in normal usage at eBay.
We had near real-time graphs of user metrics such as bids, listings, logins, and new user registrations. The data was graphed week over week. Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys. These graphs were displayed in the Network Operations Center, which was staffed 24x7. Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.
Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nosedive. The NOC quickly initiated the SEV1 process and technical resources checked their areas. The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower. Roughly 20 minutes into the SEV1 process, the root cause was identified. The finale episode of American Idol was being broadcast. Our site was fine – but our customers had other things on their mind. The business metric monitors worked – they gave warning of an aberrant usage pattern.
How would your company react to this critical change in normal usage patterns? Use business metric monitors to detect workload shifts.
Agile and Dealing With The Cone of Uncertainty
July 8, 2018 | Posted By: Dave Berardi
The Leap of Faith
When we embark on building a SaaS product that will delight customers, we are taking a leap of faith. We often don’t even know whether or not the outcomes targeted are possible. Investing and building software is often risky for several reasons:
- We don’t know what the market wants.
- The market is changing around us.
- Competitors are always improving their time to market (TTM), releasing competitive products and services.
We have to assume there will be project assumptions made that will be wrong and that the underlying development technology we use to build products is constantly changing and evolving. One thing is clear on the SaaS journey – the future is always murky!
The uncertainty that plagues the SaaS journey is seen throughout the industry and is evidenced by successes and failures at companies big and small – from Facebook to Apple to Salesforce to Google. Google is one of many innovating B2C companies that have used the cone of uncertainty to help inform how to go to market and whether or not to sunset a service. The company realizes that in addition to innovating, they need to reduce uncertainty quickly.
For example, Google Notebook, a browser-based note-taking and information sharing service, was killed and resurrected as part of Google Docs and has a mobile derivative called Keep. Google Buzz, Google’s first attempt at a social network, was quickly killed after a little over a year in 2011. These are just a few B2C examples from Google. All of these are examples of investments that faced the cone of uncertainty. Predicting successful outcomes longer term and locking in specifics about a product will only be wasteful and risky.
The cone of uncertainty describes the uncertainty and risk that exist when an investment is made for a software project. The cone depicts the amount of risk and the degree of precision we can expect as we move through the funnel. The further out we try to forecast features, capabilities, and adoption, the more risk and uncertainty we must assume. This is true for what we attempt to define as a product to be delivered and the timing on when we will deliver it to market. Over time, firms must make adjustments to the planned path along the way to capture and embrace changing market needs.
In today’s market we must quickly test our hypothesis and drive innovation to be competitive. An Agile product development life cycle (PDLC) and appropriately aligned organization helps us to do just that. To address the challenge the cone represents, we must understand what an Agile PDLC can do for the firm and what it cannot do for the firm.
Address the Uncertainty of the Cone
When we use an Agile approach, we fix time and cost for the development and delivery of a product, but we allow scope to adjust and change in order to meet fixed dates. Scope can flex later in the project, but the committed delivery date does not change. We also do not add people, since Brooks’ Law teaches us that adding human resources to a late software project only delays it further. Instead, we accelerate our ability to learn with frequent deployments to market, resulting in a reduction in uncertainty. Throughout this process, we discover both what the feature set needs to be for a successful outcome and how it should work.
Agile allows for frequent iterations that can keep us close to the market through data. After a deployment, if our system is designed to be monitored, we can capture rich information that will help to inform future prioritization, new ideas about features, and modifications that may be needed to the existing feature set. Agile forces us to estimate frequently and, as such, produce valuable data for our business. The resulting velocity of our sprints can be used to revise future delivery range forecasts for both what will be delivered and when it will be delivered. Data will also be produced throughout our sprints that will help to identify what may be slowing us down, ultimately impacting our time to market. Positive morale will be injected into the teams as results can be observed and felt in the short term.
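As a small illustration of turning sprint velocity into a delivery range rather than a single date, consider the sketch below. The sprint numbers and backlog size are made up, and bounding the range with the fastest and slowest recent sprints is just one reasonable convention.

    import math

    def delivery_range(remaining_points, recent_velocities):
        """Forecast a best-case/worst-case number of sprints to finish a backlog,
        bounded by the fastest and slowest recent sprint velocities."""
        best = math.ceil(remaining_points / max(recent_velocities))
        worst = math.ceil(remaining_points / min(recent_velocities))
        return best, worst

    velocities = [28, 34, 25, 31, 29]  # points completed in the last five sprints
    best, worst = delivery_range(remaining_points=120, recent_velocities=velocities)
    print(f"Forecast: {best} to {worst} sprints")  # Forecast: 4 to 5 sprints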
What Agile is not, and how we must adjust
While using an Agile method can help address the cone of uncertainty, it’s not the answer to all challenges. Agile does not help to provide a specific date when a feature or scope will be delivered. Instead we work towards ranges. It also does not improve TTM just because our teams started practicing it. Company philosophies, principles, and rules are not defined through an Agile PDLC. Those are up to the company to define. Once defined, the teams can operate within the boundaries to innovate. Part of this boundary definition needs to start at the top. Executives need to paint a vivid picture of the desired outcome that stirs up emotion and can be measured. The vision sits at the opening of the cone. Measurable Key Results that executives define to achieve outcomes allow teams to innovate, making tradeoffs as they progress towards the vision. Agile alone does not empower teams or help them innovate. Outcomes and Key Results (OKRs) cascaded into our organization, coupled with an Agile PDLC, can be a great combination that empowers teams, giving us a better chance to innovate and achieve desirable time to market. Implementing an OKR framework helps to remove the focus on cranking out code to hit a date and redirects the needed attention to innovation and making tradeoffs to achieve the desired outcome.
Agile does not align well with annual budget cycles. While an annual perspective is often required by shareholders, an Agile approach is in conflict with annual budgeting. Since Agile responds to changing market demands, frequent budget iterations are needed, as teams may request additional funding to go after an opportunity. It’s key that finance leaders embrace the importance of adjusting the budgeting approach to align with an Agile PDLC. Otherwise the conflict created could be destructive and create a barrier to the firm’s desired outcome.
Applying Agile properly benefits a firm by helping to address the cone and reduce uncertainty, empowering teams to deliver on an outcome, and ultimately making the firm more competitive in the global marketplace. Agile is on the verge of becoming table stakes for companies that want to be world class. And as we described above, noting the importance of a different approach to something like budgeting, it’s not just for software – it’s the entire business.
Let Us Help
AKF has helped many companies of all sizes when transitioning to an Agile organization, redefining their PDLC to align with desired speed-to-market outcomes, and migrating to SaaS. All three are closely tied and, if done right, can help firms compete more effectively. Contact us for a free consultation. We would love to help!
Three Reasons Your Software Engineers May Not Be Successful
May 10, 2018 | Posted By: Pete Ferguson
At AKF Partners, we have the unique opportunity to see trends among startups and well-established companies in the dozens of technical due diligence and more in-depth technology assessments we regularly perform, in addition to filling interim leadership roles within organizations. Because we often talk with a variety of folks from the CEO, investors, business leadership, and technical talent, we get a unique top-to-bottom perspective of an organization.
Three common observations
- People mostly identify with their job title, not the service they perform.
- Software Engineers can be siloed in their own code vs. contributing to the greater outcome.
- CEO’s vision vs. frontline perception of things as they really are.
Job Titles Vs. Services
The programmer who identifies herself as “a search engineer” is likely not going to be as engaged as her counterpart who describes herself as someone who “helps improve our search platform for our customers.”
Shifting focus from a job title to a desired outcome is a best practice from top organizations. We like to describe this as separating nouns and verbs – “I am a software engineer” focuses on the noun without an action (software engineer), whereas “I simplify search” focuses on the verb of the desired outcome (simplify). It may seem minor or trivial, but this shift can have a meaningful impact on how team members understand their contribution to your overall organization.
Removing this barrier to the customer puts team members on the front line of accountability to customer needs – and hopefully also the vision and purpose of the company at large. In our experience with successful companies, instilling a customer-experience, outcome-based approach often requires reworking product teams. Creating a diverse product team (containing members of the Architecture, Product, QA and Service teams, for example) that owns the outcomes of what it produces promotes:
- Creating products customers love
If you have had experience in a Ford vehicle with the first version of Sync (bluetooth connectivity and onscreen menus) – then you are well aware of the frustration of scrolling through three layers of menus to select “bluetooth audio” ([Menu] -> [OK] -> [OK] -> [Down Arrow] -> [OK] -> [Down Arrow] -> [OK]) each time you get into your car. The novelty of wireless streaming was a key differentiator when Sync was first introduced – but is now table stakes in the auto industry – and quickly wears off when having to navigate a confusing UI likely designed by product engineers each focused on a specific task but devoid of designing for a great user experience. What was missing was someone with the vision and job description: “I design wireless streaming to be seamless and awesome - like a button that says ‘Bluetooth Audio!!!’”
Hire for – and encourage – people who believe and practice “my real job is to make things simple for our customers.”
Avoiding Siloed Approach
Creating great products requires engineers to look outside of their current project specific tasks and focus on creating great customer experiences. Moving from reactively responding to customer reported problems to proactively identifying issues with service delivery in real time goes well beyond just writing software. It moves to creating solutions.
Long gone are the “fire and forget” days of writing software, burning to a CD and pushing off tech debt until the next version. To Millennials, this Waterfall approach is foreign, but unfortunately we still see this mentality engrained in many company cultures.
Today it is all about services. A release is one of many in a very long evolution of continual improvement and progression. There isn’t Facebook V1 to be followed by V2 … it is a continual rolling out of upgrades and bug fixes that are done in the background with minimum to no downtime. Engineers can’t afford to be laggard in their approach to continual evolution, addressing tech debt, and contributing to internal libraries for the greater good.
Ensure your technical team understands and is very closely connected to the evolving customer experience and has skin in the game. Among your customers, there is likely very little patience with “wait until our next release.” They expect immediate resolution or they will start shopping the competition.
Translating the Vision of the CEO to the Front Lines
During our more in-depth technology review engagements we interview many people from different layers of management and different functions within the organization. This gives us a unique opportunity to see how the vision of the CEO migrates down through layers of management to the front-line programmers who are responsible for translating the vision into reality.
Usually - although not always - the larger the company, the larger the divide between what is being promised to investors/Wall Street and what is understood as the company vision by those who are actually doing the work. Best practices at larger companies include regular all-hands where the CEO and other leaders share their vision and are held accountable to deliverables and leadership checks that the vision is conveyed in product roadmaps and daily stand up meetings. When incentive plans focus directly on how well a team and individual understand and produce products to accomplish the company vision, communication gaps close considerably.
Creating and sustaining successful teams requires a diverse mix of individuals with a service mindset. This is why we stress that Product Teams need to be all inclusive of multiple functions. Architecture, Product, Service, QA, Customer Service, Sales and others need to be included in stand up meetings and take ownership in the outcome of the product.
The Dev Team shouldn’t be the garbage disposal for what Sales has promised in the most recent contract or what other teams have ideated without giving much thought to how it will actually be implemented.
When your team understands the vision of the company - and how customers are interacting with the services of your company - they are in a much better position to implement it into reality.
As a CTO or CIO, it is your responsibility to ensure what is promised to Wall Street, private investors, and customers is translated correctly into the services you ultimately create, improve, and publish.
As we look at new start-ups facing explosive 100-200% year-over-year growth, our question is always “how will the current laser-focused vision and culture scale?” Standardization, good Agile practices, understanding technical debt, and creating a scalable on-boarding and mentoring process all help provide the best answers to this question.
When your development teams are each appropriately sized, include good representation of functional groups, each team member identifies with verbs vs. nouns (“I improve search” vs. “I’m a software engineer”), and understand how their efforts tie into company success, your opportunities for success, scalability, and adaptability are maximized.
Experiencing growing or scaling pains? AKF is here to help! We are an industry expert in technology scalability, due diligence, and helping to fill leadership gaps with interim CIO/CTO and other positions in addition to helping you in your search for technical leaders. Put our 200+ years of combined experience to work for you today!
Enabling TTM With Contributor Model Teams
May 6, 2018 | Posted By: Dave Berardi
We often speak about the benefits of aligning agile teams with the system’s architecture. As Conway’s Law describes, product/solution architectures and organizations cannot be developed in isolation. (See https://akfpartners.com/growth-blog/conways-law) Agile autonomous teams are able to act more efficiently, with faster time to market (TTM). Ideally, each team should be able to behave like a startup with the skills and tools needed to iterate until they reach the desired outcome.
Many of our clients are under pressure to achieve both effective TTM and reduce the risk of redundant services that produce the same results. During due diligence, we will sometimes discover redundant services that individual teams develop within their own silo for a TTM benefit. Rather than competing with priorities and waiting for a shared service team to deliver code, the team will build their own flavor of a common service to get to market faster.
Instead, we recommend a shared service team own common services. In this type of team alignment, the team has a shared service or feature on which other autonomous teams depend. For example, many teams within a product may require email delivery as a feature. Asking each team to develop and operate its own email capability would be wasteful, resulting in engineers designing redundant functionality leading to cost inefficiencies and unneeded complexity. Rather than wasting time on duplicative services, we recommend that organizations create a team that would focus on email and be used by other teams.
Teams make requests in the form of stories for product enhancements that are deposited in the shared services team’s backlog (email, in this case). To mitigate the risk of having each of these requesting teams wait for requests to be fulfilled by the shared services team, we suggest thinking of the shared service as an open source project or, as some call it, the contributor model.
Open sourcing our solution (at least internally) doesn’t mean opening up the email code base to all engineers and letting them have at it. It does mean mechanisms should be established to help control the quality and design for the business. An open source project often has its own repo and typically only allows trusted engineers, called Committers, to commit. Committers have Contribution Standards defined by the project-owning team. In our email example, the team should designate trusted and experienced engineers from other Agile teams who can code and commit to the email repo. Engineers on the email team can be focused on making sure new functionality aligns with the architectural and design principles that have been established. Code reviews are conducted before code is accepted. Allowing for outside contribution will help to mitigate the potential bottleneck such a team could create.
Now that the development of email has been spread out across contributors on different teams, who really owns it?
Remember, ownership by many is ownership by none. In our example, the email team ultimately owns the services and code base. As other developers commit new code to the repo, the email team should conduct code, design, and architectural reviews and ultimately own deployments and operations. They should also confirm that the contributions align with the strategic direction of the email mission. Whatever mechanisms are put in place, teams that adopt a contributor model should be a gas pedal and not a brake for TTM.
If your organization needs help with building an Agile organization that can innovate and achieve competitive TTM, we would love to partner with you. Contact us for a free consultation.
The Top Five Most Common Agile PDLC Failures
April 27, 2018 | Posted By: Dave Swenson
Agile Software Development is a widely adopted methodology, and for good reason. When implemented properly, Agile can bring tremendous efficiencies, enabling your teams to move at their own pace, bringing your engineers closer to your customers, and delivering customer value quicker with less risk. Yet, many companies fall short of realizing the full potential of Agile, treating it merely as a project management paradigm by picking and choosing a few Agile structural elements such as standups or retrospectives without actually changing the manner in which product delivery occurs. Managers in an Agile culture often forget that they are indeed still managers who need to measure and drive improvements across teams.
All too often, Agile is treated solely as an SDLC (Software Development Lifecycle), focused only upon the manner in which software is developed versus a PDLC (Product Development Lifecycle) that leads to incremental product discovery and spans the entire company, not just the Engineering department.
Here are the five most common Agile failures that we see with our clients:
- Technology Executives Abdicate Responsibility for their Team’s Effectiveness
Management in an Agile organization is certainly different than in, say, a Waterfall-driven one. More autonomy is provided to Agile teams. Leadership within each team typically comes without a ‘Manager’ title. Often, this shift from a top-down, autocratic, “Do it this way” approach to a grass-roots, bottoms-up one swings well beyond desired autonomy towards anarchy, where teams have been given full freedom to pick their technologies, architecture, and even outcomes with no guardrails or constraints in place. See our Autonomy and Anarchy article for more on this.
Executives often become focused solely on the removal of barriers the team calls out, rather than leading teams towards desired outcomes. They forget that their primary role in the company isn’t to keep their teams happy and content, but instead to ensure their teams are effectively achieving desired business-related outcomes.
The Agile technology executive is still responsible for their teams’ effectiveness in reaching specified outcomes (e.g.: achieve 2% lift in metric Y). She can allow a team to determine how they feel best to reach the outcome, within shared standards (e.g.: unit tests must be created, code reviews are required). She can encourage teams to experiment with new technologies on a limited basis, then apply those learnings or best practices across all teams. She must be able to compare the productivity and efficiencies from one team to another, ensuring all teams are reaching their full potential.
- No Metrics Are Used
The age-old saying “If you can’t measure it, you can’t improve it” still applies in an Agile organization. Yet, frequently Agile teams drop this basic tenet, perhaps believing that teams are self-aware and critical enough to know where improvements are required. Unfortunately, even the most transparent and aware individuals are biased, fall back on subjective characteristics (“The team is really working hard”), and need the grounding that quantifiable metrics provide. We are continually surprised at how many companies aren’t even measuring velocity, not necessarily to compare one team with another, but to compare a team’s sprint output vs. their prior ones. Other metrics still applicable in an Agile world include quality, estimation accuracy, predictability, percent of time spent coding, the ratio of enhancements vs. maintenance vs. tech debt paydown.
These metrics, their definitions and the means of measuring them should be standardized across the organization, with regular focus on results vs. desired goals. They should be designed to reveal structural hazards that are impeding team performance as well as best practices that should be adopted by all teams.
- Your Velocity is a Lie
Is your definition of velocity an honest one? Does it truly measure outcomes, or only effort? Are you consistent with your definition of ‘done’? Take a good look at how your teams are defining and measuring velocity. Is velocity only counted for true ‘ready to release’ tasks? If QA hasn’t been completed within a sprint, are the associated velocity points still counted or deferred?
Velocity should not be a measurement of how hard your teams are working, but instead an indicator of whether outcomes (again, e.g.: achieve 2% lift in metric Y) are likely to be realized - take credit for completion only when in the hands of customers.
- Failure to Leverage Agile for Product Discovery
From the Agile manifesto: “Our highest priority is to satisfy the customer through early and continuous delivery of valuable software”. Many companies work hard to get an Agile structure and its artifacts in place, but ignore the biggest benefit Agile can bring: iterative and continuous product discovery. Don’t break down a six-month waterfall project plan into two week sprints with standups and velocity measurements and declare Agile victory.
Work to create and deliver MVPs to your customers that allow you to test expected value and customer satisfaction without huge investment.
- Treating Agile as an SDLC vs. a PDLC
As explained in our article PDLC or SDLC, SDLC (Software Development Lifecycle) lives within PDLC (Product Development Lifecycle). Again, Agile should not be treated as a project management methodology, nor as a means of developing software. It should focus on your product, and hopefully the related customer success your product provides them. This means that Agile should permeate well beyond your developers, and include product and business personnel.
Business owners or their delegates (product owners) must be involved at every step of the PDLC process. PO’s need to be embedded within each Agile team, ideally colocated alongside team members. In order to provide product focus, POs should first bring to the team the targeted customer problem to be solved, rather than dictating only a solution, then work together with the team to implement the most effective solution to that problem.
AKF Partners helps companies transition to Agile as well as fine-tune their existing Agile processes. We can readily assess your PDLC, organization structure, metrics and personnel to provide a roadmap for you to reach the full value and benefits Agile can provide. Contact us to discuss how we can help.
Evaluating Technology Risk Using Mathematics
April 23, 2018 | Posted By: AKF
The Relative Risk Equation
Technologists are frequently asked: what are the chances that a given software release is going to work? Do we understand the risk that each component or new feature brings to the entire release?
In this case, measuring risk is assessing the probability that a component will perform poorly or even fail. The higher the probability of failure, the higher the risk. Probabilities are just numbers, so ideally, we should be able to calculate the risk (probability of failure) of the entire release by aggregating the individual risk of each component. In reality this calculation works quite well.
Putting a number to risk is a very useful tool and this article will provide a simple and easy way to calculate risk and produce a numeric result which can be used to compare risk across a spectrum of technology changes.
When we assess a system, one of the key characteristics we want to benchmark is the probability that a system will fail. In particular, if we want to understand whether or not a system can support an availability goal of 99.95% we have to do some analysis to see if the probability that a failure occurs is lower than 0.05%. How do we calculate this?
First let’s introduce some vocabulary.
DEFINITION: Pi is the probability that a given system will experience an incident, i.
For the purposes of this article we are measuring relative and not absolute values. A system where Pi=1 is very unlikely to experience failure. On the other hand, Pi values approaching 10 indicate a system with a nearly 100% probability of failure.
DEFINITION: Ii is the impact (or blast radius) a system failure will have.
Ii=1 indicates no impact, while Ii=10 indicates a complete failure of the entire system.
DEFINITION: Pd is the probability that an incident will be detected.
Pd=1 means an incident will go completely undetected, while Pd=10 indicates that a failure will be detected 100% of the time.
Measuring across a scale from 1 to 10 is often too granular; we can reduce scale to tee-shirt sizes and replace 1, 2, …, 10 with Small (3), Medium (5), Large (7). Any series of values will work so long as we are consistent in our approach.
Relative Risk is now only a question of math:
Ri = (Pi x Ii ) / Pd
With values of 1, 2, ..., 10, the minimum Relative Risk value is 0.1 (effectively zero relative risk) and the maximum value is 100. With tee-shirt sizes, the minimum Relative Risk value is 1 2/7 (about 1.3) and the maximum value is 16 1/3 (about 16.3). Basic statistics can help us to standardize values from 1 to 10:
std(Ri) = (Ri - Min(Ri)) / (Max(Ri) - Min(Ri)) x 10
where Max(Ri) = 16 1/3 and Min(Ri) = 1 2/7 (in the case of tee-shirt sizing)
Example 1: Adding a new data file to a relational database
- Pi = 3 (low). It’s unlikely that adding a data file will cause a system failure, unless we’re already out of space.
- Ii = 5 (medium). A failure to add a datafile indicates a larger storage issue may exist, which would be very impactful for this database instance. However, as there is a backup (for this example), the impact is lowered.
- Pd = 7 (high). It is virtually certain that any failure would be noticed immediately.
- Therefore Ri = (3 x 5) / 7 = 2.1. Standardizing this to our 1 to 10 scale produces a value of about 0.57. That is a very low number, so adding datafiles is a relatively low-risk procedure.
Example 2: Database backups have been stored on tapes that have been demagnetized during transportation to offsite storage.
- Pi = 5 (medium). While restoring a backup is a relatively safe event, on a production system it is likely happening during a time of maximum stress.
- Ii = 7 (high). When we attempt to restore the backup tape, it will fail.
- Pd = 3 (low). The demagnetization of the tapes was a silent and undetected failure.
- Therefore Ri = (5 x 7) / 3 = 11.7. We arrive at a value of about 7 on the standard scale, which is quite high, so we should consider randomly testing tapes from off-site storage.
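Here is a short computational sketch of the relative risk math above, in Python, using the tee-shirt values from the two examples; the helper names are ours, and the rounding simply mirrors the hand calculations.

    def relative_risk(p_incident, impact, p_detect):
        """Ri = (Pi x Ii) / Pd"""
        return (p_incident * impact) / p_detect

    def standardize(ri, ri_min=(3 * 3) / 7, ri_max=(7 * 7) / 3):
        """std(Ri) = (Ri - Min(Ri)) / (Max(Ri) - Min(Ri)) x 10, with the min and
        max taken from the tee-shirt scale (Small=3, Medium=5, Large=7)."""
        return (ri - ri_min) / (ri_max - ri_min) * 10

    datafile = relative_risk(3, 5, 7)        # Example 1: ~2.1
    bad_tapes = relative_risk(5, 7, 3)       # Example 2: ~11.7
    print(round(standardize(datafile), 2))   # ~0.57 -> low relative risk
    print(round(standardize(bad_tapes), 1))  # ~6.9  -> high relative risk

The same two functions can be reused for the release, security, and feature-prioritization cases listed below simply by changing what the three inputs mean.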
This formula has utility across a vast spectrum of technology:
- Calculate a relative risk value for each feature in a software release, then take the total value of all features in order to compare risk of a release against other releases and consider more detailed testing for higher relative risk values.
- During security risk analysis, calculate a relative risk value for each threat vector and sort the resulting values. The result is a prioritized list of steps required to improve security based on the probabilistic likelihood that a threat vector will cause real damage.
- During feature planning and prioritization exercises, this formula can be altered to calculate feature risk. For example, Pi can mean confidence in estimate, Ii can be converted to impact of feature (e.g. higher revenue = higher impact) and Pd is perceived risk of the feature. Putting all features through this calculation then sorting from high values to low values yields a list of features ranked by value and risk.
The purpose of this formula and similar methods is not to produce a mathematically absolute estimate of risk. The real value here is to remove guessing and emotion from the process of evaluating risk and providing a framework to compare risk across a variety of changes.
Click here to see how AKF Partners can help you manage risk and other technology issues.
Avoiding the Policy Black Hole
April 18, 2018 | Posted By: Pete Ferguson
During due diligence and in-depth engagements, we often hear feedback from client employees that policies either do not exist - or are not followed.
All too often we see policies that are poorly written, difficult for employees to understand or find, and lack clear alignment with the desired outcomes. Policies are only one part of a successful program - without sound practices, policies alone will not ensure successful outcomes.
Do You Have a Policy …?
Early in my career I was volunteered to be responsible for PCI compliance shortly after eBay purchased PayPal. I’d heard folklore of auditors at other companies coming in and turning things over, with the resulting aftermath leading to people being publicly humiliated or losing their jobs. I suddenly felt I was on the firing line, asking “why me?”
I booked a quick flight to Phoenix to be in town before the auditor arrived and I prepared by walking through our data center and reviewing our written policies. When I met with the auditor, he looked to be in his early 20s and handed me a business card from a large accounting firm. I asked him about his background; he was fresh out of college and we were one of his first due diligence assignments. He pulled out his laptop and opened an Excel spreadsheet and began reading off the list:
- Do you have cameras? “Yes,” I replied and pointed to the ceiling in the lobby littered with little black domes.
- Do you record the cameras? “Yes,” and I took him into the control room and showed him that we had 90 days of recording.
- Do you have a security policy? “Yes,” and I showed him a Word Document starting with “1.1.1 Purpose of This Policy ....”
Several additional questions, and 10 minutes later, we were done. He and I had both flown some distance, so I gave him a tour of the data center and filled him full of facts about square footage and miles of cable and pipes until his eyes glazed over and his feet were tired from walking, and off he went.
I was relieved, but let down! I felt we had a really good program and wanted to see how we measured up under scrutiny. Subsequent years brought more sophisticated reviews - and reviewers - but the one question I was always waiting to be asked - but never was:
“Is your policy easily accessible, how do employees know about it, and how do you measure their comprehension and compliance?”
My first compliance exercise didn’t seem all that scary after all; it was only a due diligence “check the box” exercise and didn’t dive deeper into how effective our program was and where it needed to be reinforced.
While having a policy for compliance requirements is important, on its own, policy does not guarantee positive outcomes. Policy must be aligned with day-to-day operations and make sense to employees and customers.
The Traditional Boredom of Policy
Typically policy is written from the auditor’s point of view to ensure compliance to government and industry requirements for public health, anti-corruption, and customer data security standards.
Unfortunately, this leads to a very poor user experience, wading through the 1.1.1 … 1.1.2 … . Certainly a far cry from how a good novel or any online news story reads.
I’ve heard companies - both large and small - give great assurances that they have policies and they have shown me the 12pt Times New Roman documents that start with “1.1.1 Purpose of This Policy …” as evidence.
I had to argue the point at a former position that the first way to lose interest with any audience is to start with 1.1.1 … and with Times New Roman font in a Microsoft Word document that was not easy to find. It was a difficult argument and I was instructed to stick with the approved, and traditional, industry-accepted method.
Fast forward a decade later and our HR Legal team was reviewing policy and invited me to a meeting with the internal communications team. Before we started talking documents, the Director of Communications asked me if I’d seen the latest safety video for Virgin Atlantic Airlines. I thought it a strange question, but after she told me how surprised and inspired by it she was, I took a look.
VA thankfully took a required dull and mundane US Federal Aviation Administration ritual and instead saw it as a differentiator of their brand from the pack of other airlines. Whoever thought a safety demonstration could also be a 4-minute video on why an airline is different and fun?!? Up until that point, no one! Certainly not on any flight I had previously flown.
Thankfully, since then, Delta and others have followed their example and made something I and millions of airline crews and passengers had previously dreaded - safety policy and procedure - into a more fun, engaging, and entertaining experience.
While policy needs to comply with regulations and other requirements, for policies to move from the page to practice they need to be presented in a way that employees clearly understand what is expected of them - so in writing policy, put the desired outcome first! The regulatory document for auditors can be incorporated at the end of each policy, or consider a separate document that calls out only the required sections of your employee handbook or wherever your company policies are presented and stored.
Clarifying the Purpose of Your Policy
In her article “Why Policies Don’t Work,” HR Lawyer Heather Bussing boils down the core issue: “There are two main reasons to have employment policies: to educate and to manage risk. The trouble is that policies don’t do either.”
She further expounds on the problem in her experience:
“ … policies get handed out at a time when no one pays attention to them (first week of employment if not the first day), they are written by people who don’t know how the company really works (usually outside legal counsel), and they have very little to do with what happens. So much for education.”
As for managing risk, Bussing points out that policies are often at odds with each other, or so broad that they can’t be effectively enforced.
“Unless it is required to be on a poster, or unless you can apply it in every instance without variance, you don’t want policies. Your at-will policy covers it. And if you don’t follow your policies to the letter, you will look like a liar in a courtroom.”
Don’t let your online policy repository feel like a suppository - focus on what you want to accomplish!
Small and fast-growing companies typically have little need for formalized policies because people trust each other and can work things out. But as they grow, it has been my experience that the trust and accountability that set fast-growing companies apart as a cool place to work often get replaced with bureaucratic rituals, cemented in place as more and more executives migrate from larger, bureaucratic behemoths. If the way policy is presented is the litmus test for the true company culture, a lot of companies are in trouble!
Policy must be closely aligned to the shared outcomes of the company and interwoven into company culture. Otherwise, policies are a bureaucratic distraction and will only be adopted or sustained with a lot of uphill effort. In short, if people do not understand how a policy helps them do their job more easily, they are going to fight it.
Adapting Policy To Your Audience
In the early days of eBay, the culture was very much about collectables, and walking through the workspace many employees displayed their collections of trading cards, Legos, and comic books. When it came time to publish our security policies, we hired Foxnoggin - a professional marketing strategy company - took the time to understand our culture, and then organized a comprehensive campaign that included contests, print and online material, and other collateral.
They helped formulate an awareness campaign to educate employees and measure the effectiveness of policy through surveys and monitoring employee behavior.
To break away from the usual email method of communication, we got and held employee attention with a series of comic books which included superheroes and supervillains in a variety of scenarios highlighting our policies.
An unintended consequence from our collector employees was that they didn’t want to open their comic books and instead kept them sealed in plastic. To combat this, we provided extra copies (not sealed in plastic) in break rooms and other common areas and future editions were provided without the bags. The messages were reinforced with large movie-style posters displayed throughout the work area.
This approach was wildly popular among employees located at the customer support and developer sites, and surveys showed that security was becoming a top-of-mind topic for employees. Unfortunately, the approach was not as popular with Europeans - who felt we were talking down to them - or with the executives coming from more stodgy and formal companies like Bain & Company or GE, and it was particularly unpopular with execs from the financial industry after the purchase of PayPal.
Intertwining policy into the culture of your organization makes compliance natural and part of daily operations.
Make Sure Your Message Matches Your Audience
President and CEO of Lead From Within Lolly Daskal writes on Inc.com:
“... sometimes the dumbest rules can drive away the best employees … too many workplaces create rule-driven cultures that may keep management feeling like things are under control, but they squelch creativity and reinforce the ordinary.”
Be creative and look at the company culture and how to interweave policies. Policies need to be part of the story you tell your employees to reinforce why they should want to work for you.
Nathan Christensen writes in his Fast Company article: How to Create An Employee Handbook People Will Actually Want to Read, “let’s face it, most handbooks aren’t exactly page-turners. They’re documents designed to play defense or, worse yet, a catalog of past workplace problems.”
Christensen recommends “presenting” policies in a readable and attractive manner. It must be an opportunity to excite people in meeting a greater group purpose and cause.
Your policies need to match your company culture, be in language employees use and understand, and the ask for compliance needs to be simple enough for a new employee to be able to explain to anyone.
Writing Content Your Audience Will Actually Read and Understand
According to the Center for Plain Language - which has the goal to help organizations “write so clearly that their intended audience understands what they are saying the first time they read or hear it” - there are five steps to plain language:
- Identify and describe the target audience: “The audience definition works when you know who you are and are not designing for, what they want to do, and what they know and need to learn.”
- Structure the content to guide the reader through it: “The structure works when readers can quickly and confidently find the information they are looking for.”
- Write the content in plain language: “Use a conversational, rather than legal or bureaucratic, tone … pick strong verbs in the active voice and use words the audience knows.”
- Use information design to help readers see and understand: Font choice, line spacing, and use of graphics help break up long sections of text and increase the readability score.
- Work with target user groups to test design and content: Ask readers to describe the content and have them show you where they would find relevant content.
As an illustration, here is a before and after comparison of the AARP Financial policy on giving and receiving gifts:
In reading the “before” example, my eyes immediately glazed over and my mind began to wander until the mention of “courtesies of a de minimus ... ” Did the guy who wrote that go home that night to his family and instruct his kids, “you will need to consume a courtesy of a de minimus amount of broccoli if you want videogame time after dinner”? I sure hope not!
On the “after” example, notice the change in line spacing, switching of font and use of bullet points. Overall the presentation is a lot more conversational and less formal. It also has a call to action in the title starting with two verbs “give and accept …”
I’d add as the 6th step to remember K.I.S.S. - Keep It Simple Stupid! You get a few seconds to grab your audience’s attention and only a few more minutes to keep it.
As a content editor, I was feeling proud of myself when I distilled 146 pages of confusing policies, procedures and “how to” down to 14 pages over the course of several weeks. But when I mentioned this to my wife, she said “you are going to make them read 14 pages?!?”
So I looked at it a few days later with fresh eyes and realized I could condense it down again to two pages by making it more of a table of contents with a brief description of each bullet point and then include links after each section if employees wanted to learn more, and I was able to retain a font size of 14 and plenty of white space.
In reading the two pages, people would understand what was expected of them and could easily learn more - but only if they were interested.
Write policy in language a new employee will quickly understand and be thoughtful in how much you present to employees on their first day, week, and month.
Document Readability is How You Show Your People Love - And Soon To Be the Law In the EU
Speaking more in terms of content marketing, VisibleThread author “Fergle” quotes Neil Patel, columnist for Forbes, Inc, as stating “content that people love and content that people can read is almost the same thing.” Yet, as Fergle points out, “a lot of content being created is not the stuff people love. Or read.”
Writing content with the aim of it being easy to read as something people love may seem a bit altruistic. But for information regarding data privacy, it is also soon to be the law - at least in the EU and for any international policy which would reach an EU resident. On May 25th of 2018 the General Data Protection Regulation (GDPR) goes into effect. From the GDPR:
“The principle of transparency requires that any information and communication relating to the processing of those personal data be easily accessible and easy to understand, and that clear and plain language be used.”
There are a number of ways to measure readability ease and grade level of your content, and a good communications expert will be able to help you identify the proper tools.
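As one illustration of such a tool, the sketch below uses the open-source textstat Python package (one option among many - any readability library, or a hand-rolled Flesch formula, works the same way) to score a draft policy passage before it is published; the sample text is made up.

    # pip install textstat
    import textstat

    policy_draft = (
        "Employees may give or accept gifts of nominal value. "
        "Ask your manager before accepting anything worth more than $50."
    )

    # Flesch Reading Ease: higher is easier (60-70 reads as plain English).
    print(textstat.flesch_reading_ease(policy_draft))

    # Flesch-Kincaid Grade: the US school grade level needed to follow the text.
    print(textstat.flesch_kincaid_grade(policy_draft))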
Scores are a good benchmark, but don’t forget the most important resource for feedback - your potential audience!
Buy them lunch, have them come and review your plan and provide their feedback. Bring them back in later when you have content to review and provide an environment where they can be brutally honest - again a communications expert outside of your department will help provide a bit of a buffer and allow your audience to be open, honest, and direct.
But don’t just write policy to comply with due diligence or for policy’s sake - be sure it is part of the company culture, easy to search, and placed where and when your employees or customers will need it. When there are shared outcomes between compliance and how employees operate, policy is integrated and effective.
Timing is Important
Think of ways to break down your policy content not just by audience, but by timing and when the information will actually be relevant.
In retail, the term “point of sale” refers to the checkout process - when taxes, final cost, and payment are all settled. The placement of “last minute items” at the POS is carefully and competitively assigned, and only to items with a high ROI measured per inch of the limited shelf space each item takes up. This careful placement has also been adapted to the online marketplace: when you add an item to your shopping cart, a prompt appears suggesting additional items others have purchased along with yours.
This same methodology should be applied to where - and when - you introduce your policies to your audience.
For years we made the mistake of pushing our travel safety program and policies to everyone during new hire orientation, when only about half of the population traveled and most of them wouldn’t be traveling for several weeks or months. It made a lot more sense to move the travel policies to the travel booking page.
If you only give out corporate credit cards to Directors and above, there is no sense pushing policies on spend limits to the global population. It makes a lot more sense to push the policy when someone is applying for the card and as a reminder each time their credit card expires and they are being issued a new one.
Your audience will appreciate only being told what they need to know when they need the information and will be more likely to not only retain the information, but to comply!
For similar content on our Growth Blog, click here
Know How You Will Measure Successful Outcomes
Perhaps the most important question to ask when designing policy is “how will we know we are successful?”
Having good policy written in a clear and concise manner and stored in an easy to find location is still a very passive approach. Good policy should evolve as your company evolves and should be flexible and realistic to business, customer, and employee needs. It must be modeled by company leadership and hold true to the daily actions of your company.
Tests at the end of annual compliance training are only a “check the box” measure of compliance. Think back to how much you actually learned - or, better yet, retained - the last time you were subjected to hours of compliance training!
If metrics cannot support that your policies are known and followed, then you need to re-evaluate the purpose of your policies and if they are contributing to the benefit of your employees and customers or just ticking compliance boxes.
While compliance is important, compliance alone does not make for better business practices or a competitive edge. Effective, measurable compliance protects your employees and provides value to your customers.
Subject-matter experts are often too close to the policies to be objective. A little tough love is needed and it is best to bring in experts in marketing and communications who will not be biased to the content, but biased to the reader who is the intended audience.
A good communications plan will cover the following:
- Be clear on the desired behavior the policy is to encourage and enforce - and ensure that behavior is aligned with the overall company purpose
- Identify the target audiences and each of their self-interests
- Outline which channels each audience is receptive to (email/print/video, etc.)
- Identify the inside jargon and language styles needed
- Decide when and where each audience will want to find relevant information
- Plan how often policies will be reviewed - and include as many stakeholders as possible in the review process
- Decide how implementation of policies and compliance to the policies will be measured
Only AFTER the communications plan is agreed upon - with plenty of input from representatives of the target audiences - should the content review begin. Otherwise the temptation from subject-matter experts will be to tell people everything they know.
Pulling it All Together
Poorly written policies that are difficult for employees to search or find do little to meet the mission of policy: to provide a consistent approach to how your company does business and satisfies regulatory compliance. Policies on their own do not make for good operations or guarantee overall success. Remember the true test of policies is not whether they exist, but if they are tightly aligned and incorporated into daily operations, how they contribute to the success of your employees and customers, and if their effectiveness can be measured in a tangible way.
Experiencing growing pains? AKF is here to help! We are an industry expert in technology scalability and due diligence. Put our 200+ years of combined experience to work for you today!
Get this article and others like it by signing up for our newsletter.
Subscribe to the AKF Newsletter
SaaS Migration Challenges
March 12, 2018 | Posted By: Dave Swenson
More and more companies are waking up from the 20th century, realizing that their on-premise, packaged, waterfall paradigms no longer play in today’s SaaS, agile world. SaaS (Software as a Service) has taken over, and for good reason. Companies (and investors) long for the higher valuation and increased margins that SaaS’ economies of scale provide. Many of these same companies realize that in order to fully benefit from a SaaS model, they need to release far more frequently, enhancing their products through frequent iterative cycles rather than massive upgrades occurring only 4 times a year. So, they not only perform a ‘lift and shift’ into the cloud, they also move to an Agile PDLC. Customers, tired of incurring on-premise IT costs and risks, are also pushing their software vendors towards SaaS.
SaaS Migration is About More Than Just Technology – It is An Organization Reboot
But, what many of the companies migrating to SaaS don’t realize is that migrating to SaaS is not just a technology exercise. Successful SaaS migrations require a ‘reboot’ of the entire company. Certainly, the technology organization will be most affected, but almost every department in a company will need to change. Sales teams need to pitch the product differently, selling a leased service vs. a purchased product, and must learn to address customers’ typical concerns around security. The role of professional services teams in SaaS drastically changes, and in most cases, shrinks. Customer support personnel should have far greater insight into reported problems. Product management in a SaaS world requires small, nimble enhancements vs. massive, ‘big-bang’ upgrades. Your marketing organization will potentially need to target a different type of customer for your initial SaaS releases - leveraging the Technology Adoption Lifecycle to identify early adopters of your product in order to inform a small initial release (Minimum Viable Product).
It is important to recognize the risks that will shift from your customers to you. In an on-premise (“on-prem”) product, your customer carries the burden of capacity planning, security, availability, and disaster recovery. SaaS companies sell a service (we like to say an outcome), not just a bundle of software. That service represents a shift of the risks once held by a customer to the company provisioning the service. In most cases, understanding and properly addressing these risks are new undertakings for the company in question, and not something for which they yet have the mindset or skills needed to be successful.
This company-wide reboot can certainly be a daunting challenge, but if approached carefully and honestly, addressing key questions up front, communicating, educating, and transparently addressing likely organizational and personnel changes along the way, it is an accomplishment that transforms, even reignites, a company.
This is the first in a series of articles that captures AKF’s observations and first-hand experiences in guiding companies through this process.
Don’t treat this as a simple rewrite of your existing product – answer these questions first…
Any company about to launch into a SaaS migration should first take a long, hard look at their current product, determining what out of the legacy product is not worth carrying forward. Is all of that existing functionality really being used, and still relevant? Prior to any move towards SaaS, the following questions and issues need to be addressed:
Customization or Configuration?
SaaS efficiencies come from many angles, but certainly one of those is having a single codebase for all customers. If your product today is highly customized, where code has been written and is in use for specific customers, you’ve got a tough question to address. Most product variances can likely be handled through configuration, a data-driven mechanism that enables/disables or otherwise shapes functionality for each customer. No customer-specific code from the legacy product should be carried forward unless it is expected to be used by multiple clients. Note that this shift has implications on how a sales force promotes the product (they can no longer promise to build whatever a potential customer wants, but must sell the current, existing functionality) as well as professional services (no customizations means less work for them).
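To make the distinction concrete, here is a minimal sketch of configuration as a data-driven mechanism: a single code path whose per-customer behavior is shaped by settings rather than customer-specific code. The tenant names, settings, and defaults below are hypothetical.

```python
# A minimal sketch of configuration over customization: a single codebase
# whose per-customer behavior is shaped by data, not customer-specific code.
# Tenant names, settings, and defaults here are hypothetical.
from dataclasses import dataclass, field

DEFAULTS = {
    "invoice_approval_required": False,
    "max_line_items": 100,
}

@dataclass
class TenantConfig:
    tenant_id: str
    overrides: dict = field(default_factory=dict)

    def get(self, key: str):
        # Fall back to the single, shared default when a tenant has no override.
        return self.overrides.get(key, DEFAULTS[key])

acme = TenantConfig("acme", overrides={"invoice_approval_required": True})
globex = TenantConfig("globex")  # runs entirely on defaults

def create_invoice(config: TenantConfig, line_items: list) -> str:
    if len(line_items) > config.get("max_line_items"):
        raise ValueError("Too many line items for this tenant's configuration")
    # Same code path for every customer; the data decides the behavior.
    return "pending_approval" if config.get("invoice_approval_required") else "posted"

print(create_invoice(acme, ["widget"]))    # pending_approval
print(create_invoice(globex, ["widget"]))  # posted
```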
Single-Tenant or Multi-Tenant?
Many customers, even those who accept the improved security posture a cloud-hosted product provides over their own on-premise infrastructure, absolutely freak when they hear that their data will coexist with other customers’ data in a single multi-tenant instance, no matter what access management mechanisms exist. Multi-tenancy is another key to achieving the economies of scale that bring greater SaaS efficiencies. Don’t let go of it easily, but if you must, price extra for it.
Who Owns the Data?
Many products focus only on the transactional set of functionality, leaving the analytics side to their customers. In an on-premise scenario, where the data resides in the customers’ facilities, ownership of the data is clear. Customers are free to slice & dice the data as they please. When that data is hosted, particularly in a multi-tenant scenario where multiple customers’ data lives in the same database, direct customer access presents significant challenges. Beyond the obvious security issues is the need to keep your customers abreast of the more frequent updates that occur with SaaS product iterations. The decision: do you replicate customer data into read-only instances, provide bulk exports into customers’ own hosted databases, or build analytics into your product?
All of these have costs - ensure you’re passing those on to your customers who need this functionality.
May I Upgrade Now?
Today, do your customers require that you get their permission before upgrading their installation? You’ll need to change that behavior to realize another SaaS efficiency - supporting as few versions as possible. Ideally, you’ll support only a single version (other than during a deployment). If your customers need to ‘bless’ a release before migrating on to it, you’re doing it wrong. Your releases should be small, incremental enhancements, potentially even reaching continuous deployment. Therefore, the changes should be far easier to accept and learn than the big-bang, huge upgrades of the past. If absolutely necessary, create a sandbox for customers to access new releases, but be prepared to deal with the potentially unwanted, non-representative feedback from the select few who try it out in that sandbox.
Wait? Who Are We Targeting?
All of the questions above lead to this fundamental issue: Are tomorrow’s SaaS customers the same as today’s? The answer? Not necessarily. First, in order to migrate existing customers on to your bright, shiny new SaaS platform, you’ll need to have functional parity with the legacy product. Reaching that parity will take significant effort and lead to a big-bang approach. Instead, pick a subset or an MVP of existing functionality, and find new customers who will be satisfied with that. Then, after proving out the SaaS architecture and related processes, gradually migrate more and more functionality, and once functional parity is close, move existing customers on to your SaaS platform.
To find those new customers interested in placing their bets on your initial SaaS MVP, you’ll need to shift your current focus from the right side of the Technology Adoption Lifecycle (TALC) to the left - from your current ‘Late Majority’ or ‘Laggards’ to ‘Early Adopters’ or ‘Early Majority’. Ideally, those customers on the left side of the TALC will be slightly more forgiving of the ‘learnings’ you’ll face along the way, as well as prove to be far more valuable partners as you further enhance your MVP.
The key is to think out of the existing box your customers are in, to reset your TALC targeting and to consider a new breed of customer, one that doesn’t need all that you’ve built, is willing to be an early adopter, and will be a cooperative partner throughout the process.
Our next article on SaaS migration will touch on organizational approaches, particularly during the build-out of the SaaS product, and the paradigm shifts your product and engineering teams need to embrace in order to be successful.
AKF has led many companies on their journey to SaaS, often getting called in as that journey has been derailed. We’ve seen the many potholes and pitfalls and have learned how to avoid them. Let us help you move your product into the 21st century. See our SaaS Migration service
Subscribe to the AKF Newsletter
Managing Risk with Technical Due Diligence
February 20, 2018 | Posted By: Greg Fennewald
You should not buy a home without an inspection by a licensed home inspector and you should not buy a used car without having a mechanic check it out for you. Diligence - it just makes good sense. Similarly, it is prudent to include technical diligence as part of the evaluation for a potential technology company investment.
Diligence Informs Risk Management
Private equity and venture capital firms typically evaluate many areas preceding a potential investment. The business case, legal structure, competitive analysis, product strategy, financial audits and contractual landscape are all examples of diligence deemed necessary prior to an investment. A company with a great product but three years left on an extremely expensive office lease will probably have a lower value. Breaking the lease or living with it until the term expires means higher costs and thus lower EBITDA. A hot start up with an inexperienced CFO that has run on cash-based accounting from day 1 and is rapidly approaching $6 million in annual revenue needs to move to accrual-based accounting. That takes time and effort and possibly a talent search - this affects the value of the investment.
But what about the technical underpinnings of the product itself? A company with a solitary production database and a marketing analyst with access to directly query that database is likely headed for performance and availability incidents. Single points of failure create a high probability of non-availability. Solutions that don’t allow for seamless and elastic scalability may run into either capacity or cost of operations problems.
Preventing these incidents and altering the conditions that enabled them to exist takes time and effort. All of these assessment areas boil down to risk management. Further, understanding the cost of fixing these solutions helps a company understand their true cost of investment. Your investment includes not just the “PIC” or capital that you put into the company - it also includes all the costs to ensure continuing operations of the product that enables that company. A comprehensive diligence including technical diligence will prepare the investor to make an informed business decision - know the risks and adjust the value proposition accordingly.
Technology Risk Areas
Technology risks can be grouped into four broad areas - Architecture, Process, Organization, and Security. Each area has several subordinate themes.
Architecture - subordinate themes are availability, scalability, cost control.
• Commodity hardware - Corollas, not Carreras
• Horizontal scalability - scale out, not up
• Design for monitoring - see issues before your customers do
• N+1 design - everything fails eventually
• Design for rollback - minimize the impairment
• Asynchronous design - stateless systems
Process - subordinate themes are engineering, operations, and problem management
• Product management - a product owner should be able to add, delay, or deprecate features from an upcoming release
• Metrics - development teams should use effort estimation and velocity measurement metrics to monitor progress and performance
• Development practices - developers should conduct code reviews and be held accountable for unit testing
• Incident management - incidents should be logged with sufficient details for further follow up
• Post mortem - a structured process should be in place to review significant problems, assign action items, and track resolution
• PDLC - the Product Development Lifecycle should align with the company’s desires to be customer driven (not desirable in most cases) or market driven (resulting in the highest returns and fastest saturation of any market)
Organization - subordinate themes are PDLC (Product Development Lifecycle) structure, product alignment and team composition
• Product or Service Alignment - cross functional teams should be aligned by product or service and understand how their efforts complement business goals
• Agile or Waterfall - if “discovering” the market or choosing the best possible product for a market then Agile is appropriate - if developing to well defined contracts then waterfall may be necessary.
• Team composition - the engineer to QA tester ratio should ideally exceed 3.5:1. Significant deviations may be a sign of trouble or a harbinger of problems to come
• Goals - measurable goals aligned with business priorities should be visible to all with clear accountability
Security - subordinate themes are framework, prevention, detection and response
• Framework - use NIST, ISO, PCI, or other regulatory standards to establish the framework for a security program. The standards do overlap; think it through and avoid duplication of effort.
• Policies in place - a sound security program will have multiple security related policies such as employee acceptable use, access controls, data classification, and an incident response plan.
• Security risk matrix - security risks should be graded by their impact, probability of occurrence, and controlling measures (a brief sketch follows this list)
• Business metrics - analysis of business metrics (revenue per minute, change of address, checkout value anomalies, file saves per minute, etc) can develop thresholds for alerting to a potential security incident. Over time, the analysis can inform prevention techniques.
• Response plan - a plan must be in place and must have regular rehearsals.
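Picking up the security risk matrix bullet above, a minimal sketch of grading and ranking risks might look like the following. The risks, scales, and scores are illustrative assumptions, not a prescribed format.

```python
# A minimal sketch of a security risk matrix: grade each risk by impact and
# probability, then rank by the resulting score. Risks, scales, and scores
# below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    impact: int        # 1 = low, 3 = high: business impact if realized
    probability: int   # 1 = low, 3 = high: likelihood given current controls
    controls: str      # controlling measures in place today

    @property
    def score(self) -> int:
        return self.impact * self.probability

register = [
    Risk("Stale access for departed employees", impact=3, probability=2,
         controls="Quarterly access review"),
    Risk("Unencrypted laptop loss", impact=2, probability=2,
         controls="Disk encryption policy, partially enforced"),
    Risk("Production DB reachable from office network", impact=3, probability=1,
         controls="VPN + firewall rules"),
]

# Highest-scoring risks surface to the top of the review agenda.
for risk in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{risk.score:>2}  {risk.name}  [{risk.controls}]")
```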
Technology Cost Impact on Investment Value
Technology costs can have a significant impact on the overall investment value. Strengths and weaknesses uncovered during a technical diligence effort help the investor make the best overall business decision.
Technology costs are normally captured in 2 areas of the income statement, cost of revenue (production environment and personnel) and operating expenses (software development). Technology costs can also affect depreciation (server capital purchases) and amortization (pre-paid licensing and support). These cost areas should be reviewed for unusual patterns or abnormally high or low spend rates. It is also important to understand the term of equipment purchase, software licensing, and support contracts - spend may be committed for several years.
Cost Cautions - tales from the past
• Support for production equipment purchased from a 3rd party because the equipment is old and no longer supported by the OEM. Use equipment as long as possible, but don’t risk a production outage.
• Constant software vendor license audits - they will find revenue, but the technology team that leaves their company vulnerable on a recurring basis is likely to have other significant issues.
• Lack of an RFP or benchmarking process to periodically assess the cost effectiveness of hardware, software, hosting, and support vendors. Making a change in one of these areas is not simple, but the technology team should know at what price point a change becomes better for the company.
A technical diligence effort should also identify the level of technical debt and quantify the amount of engineering resources dedicated to servicing the technical debt.
Technical debt is a conscious choice to take a shortcut in the technology arena - the delta between the desired or intended way and the quicker way. The shortcut is usually taken for time-to-market reasons and is a sound business decision within reason. Technical debt is analogous in many ways to financial debt - a complete lack of it probably means missed business opportunities, while an excess means disaster around the corner.
Just like financial debt, technical debt must be serviced, and it is serviced by the efforts of the engineering team - the same team developing the software. AKF recommends 12% to 25% of engineering effort be spent servicing technical debt. Whether that resource allocation keeps the debt static, reduces it, or allows it to grow depends upon the amount of technical debt. It is easy to see how a company delinquent in servicing their technical debt will have to increase the resource allocation to deal with it, reducing resources for product innovation and market responsiveness.
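To make the 12% to 25% guidance tangible, here is a back-of-the-envelope sketch; the team size, sprint length, and the 20% allocation chosen are made-up numbers.

```python
# Back-of-the-envelope sketch of reserving engineering capacity for technical
# debt service. Team size, sprint length, and the 20% allocation chosen here
# are illustrative; the guidance is a 12%-25% range.
engineers = 10
workdays_per_sprint = 10                  # two-week sprint
capacity_days = engineers * workdays_per_sprint

debt_allocation = 0.20                    # pick a point in the 12%-25% range
debt_days = capacity_days * debt_allocation
feature_days = capacity_days - debt_days

print(f"Total capacity:         {capacity_days} engineer-days per sprint")
print(f"Technical debt service: {debt_days:.0f} engineer-days")
print(f"Feature development:    {feature_days:.0f} engineer-days")
# If the debt keeps growing at this allocation, the debt-service share must
# rise - squeezing the capacity left for product innovation.
```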
Put It All Together
The investor has made use of several specialists in an overall diligence effort and is digesting the information to zero in on the choice to invest and at what price. The business side looks good - revenue growth, product strategy, and marketing are solid. The legal side has some risks relating to returning a leased office space to its original condition, but the lease has 5 years to run. Now for technology:
• Tech refresh is overdue, so additional investment is needed or a move to the cloud accelerated - either choice puts pressure on thin margins.
• An expensive RDBMS is in use, but the technology team avoids stored procedures and keeps their SQL as vanilla as possible - moving to open source is doable.
• Technical debt service is constantly derailed by feature requests from sales and marketing. Additional resources, hired or contracted, will be needed and will raise the technology run rate. More margin pressure.
• Conclusion - the investment needed to address tech refresh and technical debt changes the investment value. The investor lowers the offer price.
Interested in learning more about technical due diligence? Here are some due diligence do’s and don’ts.
How AKF can help
AKF has conducted hundreds of technical due diligence studies over the last 10 years. One would want an attorney for a legal diligence effort and one would want a technologist for a technical due diligence. AKF does technology right. Read more about our technical due diligence offerings here.
Subscribe to the AKF Newsletter
Your Site is as Important as the Product You Sell - Recent Example from Saddleback Leather
February 7, 2018 | Posted By: Pete Ferguson
If you have a premium product, at a premium price, it’s unlikely you would sell it out of a rundown, poorly lighted store that smells vaguely like stale meat. Yet somehow many of us forget to apply that same reasoning when it comes to selling our products online. The availability - and look and feel of your presence online - is your store front.
I’ve long been a fan of Saddleback Leather. However, their motto: “They’ll fight over it when you’re dead” fell short in January. You see, it’s hard for your family to fight over the thing that you can’t even purchase… Saddleback Leather had a completely foreseeable, and absolutely preventable outage. From Dave Munson, the CEO:
“I’ve always dreamt of one day having a really fast and easy website for you to enjoy. So, we decided to leave our slow and clunky old website and start building one on a new and different platform. The contract expired Dec. 30th, 2017, but the new site wasn’t fully ready yet. We flipped the switch anyways and all Gehenna broke loose. The super fast, fun and easy website… wasn’t fast, fun or easy and we wasted a ton of time and irritated the heck out of our favorite people. People couldn’t check out, set up accounts or even add stuff to their carts. So, we paid a ton of money to get our old slow and clunky back again until we get this new site just right. “
To make up for it, last week I received an apology letter sent by “El Presidente” Munson with an 11% off coupon - 11% because Munson recently celebrated 11 years of marriage to his wife, Suzzette. As a side note, it’s a perfect example of how to apologize to your customers when you screw up. This guy made a mistake, is paying for it by running his old site while continuing to develop the new one, and is giving customers discounts with a coupon aptly titled: “IAMSORRY.”
Ironically, as a fan and customer, I don’t recall the old site being slow or terrible. On the contrary, when I visited early in January, their “new and improved” site felt clunky and disjointed. The wrong images were coming up for products and many items reported being “not available.”
In the world of environmental health and safety, “all accidents are preventable” is the holy grail of compliance. We believe that with the right forethought and planning, the same is true with virtually all products and storefronts online.
At AKF we are fond of saying “an accident is a terrible thing to waste.” While the exact details of what went wrong have not been disclosed, the missteps were:
- They took a concept that presumably worked great in beta testing and put it live without testing under full load.
- Munson made the decision to push out something that wasn’t yet great to save money by exiting a contract by the end of the year.
For similar content on our Growth Blog, click here
The result: lost sales from when the site was down; lost customers who may have been trying the website for the first time and won’t be back; an 11% haircut on sales for the next week; and a fan base - many of whom have been very vocal on Facebook - expressing their disdain at seeing a company they have counted on for unquestioned quality fail to put quality first this time.
The days of customers quickly forgiving their favorite retailers for not being equally as great online are waning. Make sure you have a solid strategy and the right expertise in your corner when it comes to greatly affecting your customer’s ability to purchase or better interact with your product.
Experiencing growing pains? AKF is here to help! We are an industry expert in technology scalability and due diligence. Put our 200+ years of combined experience to work for you today!
Get this article and others like it by signing up for our newsletter.
Subscribe to the AKF Newsletter
There Are Always Plenty of Incidents from Which To Learn
January 13, 2018 | Posted By: Dave Swenson
Sorry, False Alarm…
On January 13, 2018, what felt like an episode of Netflix’s “Black Mirror” unfolded in real life. Just after 8 in the morning, residents and visitors of Hawaii were woken up by a startling push notification: a ballistic missile threat alert instructing them to seek immediate shelter.
Thankfully, the notification was a false alarm, finally retracted with a second notification nearly 40 interminable minutes later.
The amazing, poignant, and sobering stories that came out of those 40 minutes included people:
- determining which children to spend their last minutes with,
- abandoning their cars on streets,
- sheltering in a lava tube,
- believing and acting as we all would if we believed the end was here.
Unfortunately, this wasn’t a Black Mirror episode; it paralyzed an entire state’s population. Thankfully, the alarm was a false one.
A Muted President
As President Trump took office, he introduced a new means for a President to reach his constituents - Twitter - averaging 6 to 7 tweets per day during his first year. On November 2, 2017, many bots that were created to closely monitor the tweets of @realDonaldTrump started reporting that the account no longer existed. Clicking through to his account took users to an error page.
For a deafening 11 minutes, the nation was unable to listen to its leader, at least via Twitter.
The Hawaiian false alarm was sent by the state’s Emergency Management Agency. Their explanation of the incident was that during a shift change, an employee clicked “the wrong button” while running a missile crisis test, then subsequently clicked through a confirmation prompt (“Are you sure you want to tell 1.5 million people this?”).
Twitter employees had reportedly tried for years to get management attention on ensuring accounts weren’t deleted without proper vetting. The company typically used contractors in the Philippines and Singapore to handle such account administration; Trump’s account was deleted by a German contract worker on his last day at Twitter. Acting on yet another Trump complaint, and believing such an important account couldn’t actually be suspended, the worker’s last action for Twitter was to click the suspend button and walk out of the building - causing the Twitterverse to read far more into the account’s disappearance than it should have.
In both of these situations, the immediate focus was on the personnel involved in the incident. “Who pushed the button?” is typically one of the first questions. Assumptions that a new employee or rogue worker was behind the incident are common, and the motive and intelligence of all involved come under inspection.
We at AKF Partners constantly preach “An incident is a terrible thing to waste”. Events such as these warp the known reality into “How the shit can that happen??”, causing enough alarm to warrant special attention and focus, if not panic. Yet, all too often we see teams searching frantically to find any cause, blame the most obvious, immediate factor, declare victory, and move on.
“Who pushed the button?” is only one of many questions.
Toyota’s Taiichi Ohno, the father of Lean Manufacturing, recognized his team’s habit of accepting the most apparent cause, ignoring (wasting) other elements revealed by an incident and potentially allowing it to be repeated. Ohno (the person, not the exclamation typically uttered during an incident) emphasized the importance of asking “5 Whys” in order to move beyond the most obvious explanation (and accompanying blame), peeling the onion and diving deeper into contributory causes.
Questions beyond the reflexive “What happened?” and “Who did it?” relevant to the false alarm and erroneous account deletion incidents include:
- Why did the system act differently than the individual expected (is there more training required, is the user interface a confusing one)?
- Why did it take so long to correct (is there no playbook for detecting / reversing such a message or key account activity)?
- Why does the system allow such an impactful event to be performed unilaterally, by a single person (what safeguards should exist requiring more than one set of hands?)
- Why does this particular person have such authorization to perform this action (should a non-employee have the ability to delete such a verified, popular and influential account)?
- Why was the possibility of this incident not anticipated and prevented (why were Twitter employee requests for better safeguards ignored for years, why wasn’t the ease of making such a mistake recognized and what other similar mistake opportunities are there)?
Both of these incidents have had an impact far beyond those directly affected (Hawaiian inhabitants or Trump Twitter followers), and have shed light on the need to recognize that the world has changed and the policies and practices of old might not be enough for today. The ballistic missile false alarm revealed that more controls need to be placed on all mass communication, but also that Hawaii (or anywhere/anyone else) is extremely unprepared for the unthinkable. The use of Twitter as a channel for the President now raises questions over its validity as a Presidential record, over who should control such a channel, and over what security surrounds the President’s account.
Ask 5 Whys, look beyond the immediate impact to find collateral learnings, and take notice of all that an incident can reveal.
AKF Partners have been brought in by over 400 companies to avoid such incidents, and when they do occur, to learn from them. Let us help you.
Subscribe to the AKF Newsletter
Hosting Lessons from Harvey and Irma
September 19, 2017 | Posted By: Greg Fennewald
Everyone was saddened to see the horrific destruction these storms caused in Houston and Florida, including deaths and extensive property damage. It seems reasonable that the impact of these hurricanes was lessened by advance notice and preparation – stockpiling supplies, evacuating the highest risk areas, and staging response resources to assist with recovery and rebuilding.
Data centers operate every day with a similar preparation mindset: diesel generators to provide power should the utility fail, batteries to keep servers running during a transition, potentially stored water or a well to replace municipal water service for cooling systems, and food and water for personnel unable to leave the location.
What happens when a “prepared” location such as a data center encounters a hurricane with strong winds, heavy rain, and extensive flooding? In some cases, the data center survives without impact, although there certainly will be outages and failures. Examples of data centers surviving Harvey in good shape can be seen here, while accounts of the service impacts caused by Hurricane Sandy can be seen here.
Data Center Points of Failure
Let’s examine what may enable a data center to survive without functional impact. Extensive risk investigation goes into site selection for data centers. Data centers are expensive to build with costs measured in the tens or even hundreds of millions of dollars. The potential business impact of a failure can be costly with liquidated damage clauses in hosting contracts. These factors lead to data centers being located outside of flood plains, away from hazardous material routes, and stoutly constructed to endure storm winds likely in the region.
Losing utility power is regarded as a “when” not an “if” in the data center industry (be that an outage or a planned maintenance activity), and diesel generators are a common solution, often with 24 hours or more of fuel on hand and multiple replenishment contracts. Data centers can survive for days/weeks without utility power, and in some cases for months. How could flooding impact power? The service entrance for a data center, where the utility power is routed, is often buried underground. Utility power is likely to be lost during flooding, either from damage due to flooding or intentional actions to prevent damage by shutting down the local grid. A data center would operate on generator if the data center itself is not flooded, although fuel replenishment is not likely. If there are two feet of water in the main electrical room(s), the data center is going dark.
Many large data centers rely on evaporating water to cool the servers they host. Evaporative cooling is generally more energy efficient than other options, but introduces an additional risk to operations – water supply. In many locations, municipal water pressure is lost during an extended power outage. Data centers can mitigate this risk by using water storage tanks or water wells onsite. Like diesel generators, these allow the data center to operate normally for hours or days without municipal water. So a data center outside the flood plain, able to operate without utility power or municipal water for hours or days, and structurally strong enough to handle the winds of a major storm – is there any other risk to mitigate? Network connectivity and bandwidth.
Most data centers need to communicate with other data centers to fulfill their OLAP or OLTP purpose. Without connectivity, services are not available. Data should be fine, but it is becoming increasingly stale. Transactions and traffic are done. Like utility power, network connections are usually buried. With distance and geographic limitations involved, network pathways may get flooded as may the facilities that aggregate and transmit the data. Telecom facilities generally have generators and other availability measures, but can be forced into less advantageous locations and may have a shorter runtime standard than a data center.
Data centers that are serious about availability generally have carrier diversity and physical pathway diversity to mitigate carrier outages and “backhoe fades”. This may help in the event of widespread flooding as well. The reality is a data center without connectivity is generally useless. All the risk mitigation going into structural design, power and cooling redundancy, and fire protection is moot if connectivity fails.
Preparing for the Inevitable
The best way to mitigate these risks is to not rely on a single data center location. One is none and two is one. Owned, colo, managed hosting, or cloud – be able to survive the loss of a single location. The RTO and RPO of the business will guide the choice of active-active, hot-cold, or data backup with an elastic compute response plan. Hurricanes can cause regional impact, such as Irma disrupting most of Florida. In years past, many companies decided to have two data centers within 20 miles of each other to support synchronous database replication – a primary site in one borough of New York City, and the DR site in a different borough. Replication options and database management techniques have advanced sufficiently to allow far greater dispersion today. Avoid a regionally impacting event by choosing data centers in diverse regions.
Operating from 3 locations can be cheaper than 2, and can also improve customer satisfaction with reduced response times produced by serving customers from the nearest location. See Rule 12 in Scalability Rules. The ability to operate from multiple locations also enables a choice to adjust the redundancy of those locations. A combination of Tier II and III locations may be a more economical choice than a pair of Tier IV locations.
Developing a hosting plan can be complicated and frustrating, particularly since the core competency of your business is likely not data centers. AKF Partners can help – not only with hosting strategy, but also the product architecture and operational processes needed to weld infrastructure, architecture, and process into a seamless vehicle that delivers services to your clients with availability the market demands.
Hurricanes aren’t the only disasters that can take down your data center. Solar flares, runaway SUVs, civil disruption, tornadoes and localized power outages have all caused data centers to fail. Natural disasters of all types trail equipment failures and human error as causes of service impacting events (source: 365DataCenters). According to FEMA, 40% of businesses that close due to a disaster don’t reopen, and of those that do only 29% are in business two years after the disaster (source: FEMA). Don’t be a statistic. AKF Partners can help you with the product architecture and data center planning necessary to survive nearly any disaster.
Reach out to AKF
Subscribe to the AKF Newsletter
Monitoring for Early Fault Detection
July 6, 2017 | Posted By: AKF
AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident. Business metric monitors will not tell you where or what the problem is, rather they tell you something appears to be abnormal and should be investigated. The early warning aspect can help reduce detection time and thus shorten overall MTTR.
At eBay, we had near real time graphs of user metrics such as bids, listings, logins, and new user registrations. The data was graphed week over week. Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys. These graphs were displayed in the network operations center, which was staffed 24x7. Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.
Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nose dive. The NOC quickly initiated the SEV1 process and technical resources checked their areas. The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower. Roughly 20 minutes into the SEV1 process, the root cause was identified. The finale episode of American Idol was being broadcast. Our site was fine. Our customers had other things on their mind. The business metric monitors worked – they gave warning of an aberrant usage pattern.
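A minimal sketch of that kind of week-over-week check appears below. The metric names, the stubbed data source, and the 20% deviation threshold are illustrative assumptions - the point is simply to flag that something looks abnormal, not to diagnose it.

```python
# A minimal sketch of week-over-week business metric monitoring: compare the
# current reading to the same time slot one week earlier and flag large
# deviations. Metric names, the stubbed data source, and the 20% threshold
# are illustrative assumptions.
import random
from datetime import datetime, timedelta

def fetch_metric(name: str, at: datetime) -> float:
    """Stand-in for your metrics store (a time-series database, warehouse query, etc.)."""
    return 1000 + random.uniform(-300, 300)  # replace with a real lookup

def check_week_over_week(name: str, now: datetime, threshold: float = 0.20) -> None:
    current = fetch_metric(name, now)
    baseline = fetch_metric(name, now - timedelta(weeks=1))
    if baseline == 0:
        return
    deviation = (current - baseline) / baseline
    if abs(deviation) > threshold:
        # The alert only says something looks abnormal - not where or why.
        print(f"ALERT {name}: {deviation:+.0%} vs. the same time last week")

for metric in ("bids_per_minute", "listings_per_minute", "logins_per_minute"):
    check_week_over_week(metric, datetime.utcnow())
```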
The World Cup is the most popular football (soccer) event in the world, arguably the most popular sporting event worldwide. Broadcast matches draw huge audiences in the UK and the broadcast is typically aired without commercials until half time. There was a documentary on the UK electrical utility system preparing for a broadcast. As soon as half time commenced, a large proportion of the viewing audience visited the loo and hit the lever on their electric tea kettles. Thankfully, the documentary was about the electric utility and not sewage! The step function increase in load would cause significant problems for the utility, straining its ability to maintain voltage and frequency. The utility had prepared for this situation by staging “peakers” – diesel generators that can be brought online to help serve the increased load. Utility grid stability is akin to a Goldilocks Zone – too much is bad, too little is bad, just right is best. The operations center for the utility did not want to bring the generators on too early or too late. They needed real time information on their customers. The solution was to have a TV tuned to the World Cup broadcast in the operations center, enabling the engineers to stage on generators immediately prior to half time and stage them off as the load increase subsided. Being paid to watch the World Cup was certainly an unintended benefit!
How could your company react in a manner like the UK power utility? A sponsored event or viral campaign could overload your systems. Consider using elastic compute in the cloud for your peak demand – the equivalent to the diesel generators use for the World Cup. Scale up for the spikes in demand, then shut it down afterwards. Own the base, rent the peak. Use business metric monitors to detect workload shifts.
Subscribe to the AKF Newsletter
April 3, 2017 | Posted By: AKF
A topic that often results in great debate is “how do you measure engineers?” I’m a pretty data-driven guy, so I’m a fan of metrics as long as they are 1) measured correctly, 2) used properly, and 3) not taken in isolation. I’ll touch on these issues with metrics later in the post; first, let’s discuss a few possible metrics you might consider using. Three of my favorites are velocity, efficiency, and cost.
- Velocity – This is a measurement that comes from the Agile development methodology. Velocity is the aggregate of story points (or any other unit of estimate that you use, e.g. ideal days) that engineers on a team complete in a sprint. As we will discuss later, there is no standard good or bad for this metric, and it is not intended to be used to compare one engineer to another. This metric should be used to help the engineers get better at estimating - that’s it. No pushing for more story points or comparing one team to another; just use it as feedback to the engineers and the team so they can become more predictable in their estimates.
- Efficiency – The amount of time a software developer spends doing development-related activities (e.g. coding, designing, discussing with the product manager) divided by their total time available (assume 8-10 hours per day) provides the Engineering Efficiency. This metric is designed to show how much time software developers are actually spending on developing software, and it often surprises people. Achieving 60% or more is exceptional; we often see dev groups below 40% efficiency. This metric is useful for identifying where else engineers are spending their time. Are there too many company meetings not directly related to getting products out the door? Are you doing too many HR training sessions? This metric is really for the management team, who should identify what is eating up the non-development time and get rid of it.
- Cost – Tech cost as a percentage of revenue is a good cost-based metric for seeing how much you are spending on technology. It is very useful because it can be compared to other tech (SaaS or other web-based) companies, and you can watch it change over time. Most startups begin with their total tech cost (engineers, hosting, etc.) at well over 50% of revenue, but this should quickly come down as revenue grows and the business scales - yes, scaling a business involves growing it cost effectively. Established companies with revenues in the tens of millions usually have this percentage below 10%, and very large companies with hundreds of millions in revenue often drive it down to 5-7%. (A quick sketch of the efficiency and cost arithmetic follows this list.)
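Here is that quick sketch of the Efficiency and Cost arithmetic; all of the figures are made up for illustration.

```python
# A quick sketch of the Efficiency and Cost calculations described above.
# All figures are made-up illustrations.

# Engineering Efficiency: development-related hours / total available hours.
dev_related_hours = 4.5    # coding, design, product discussions, per day
available_hours = 9.0      # assuming an 8-10 hour day
efficiency = dev_related_hours / available_hours
print(f"Engineering efficiency: {efficiency:.0%}")   # 50% - below the ~60% "exceptional" bar

# Tech cost as a percentage of revenue.
annual_revenue = 25_000_000   # $25M in revenue
tech_cost = 2_000_000         # engineers, hosting, licenses, etc.
print(f"Tech cost as % of revenue: {tech_cost / annual_revenue:.0%}")   # 8%
```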
Now that we know about some of the most common metrics, how should they be used? The most common way managers and executives want to use metrics is to compare engineers to each other or to compare a team over time. This works for the Efficiency and Cost metrics, which, by the way, are primarily measurements of management effectiveness: managers make most of the cost decisions, including staffing, vendor contracts, etc., so they should be on the hook to improve these metrics. Velocity - product out the door as measured by story points completed each sprint - is, as mentioned above, to be used to improve estimates, not to try to speed up developers. Using this metric incorrectly will just result in bloated estimates, not faster development.
An interesting comparison of developers comes from a 1967 article by Grant and Sackman in which they stated a ratio of 28:1 for the time required by the slowest versus the fastest programmer to complete a task. This has been a widely cited ratio, but a paper from 2000 revised the number to 4:1 at most, and more likely 2:1. While a 2x difference in speed is still impressive, it says nothing about the overall quality of the product. An engineer who is very fast and produces high-quality work but doesn’t interact with the product managers isn’t necessarily the most effective overall. My point is that there are many more factors to consider than just story points per release when comparing engineers.
Subscribe to the AKF Newsletter
PDLC or SDLC
April 3, 2017 | Posted By: AKF
As a frequent technology writer I often find myself referring to the method or process that teams use to produce software. The two terms usually given for this are software development life cycle (SDLC) and product development life cycle (PDLC). The question I have is: are these really interchangeable? I don’t think so, and here’s why.
Wikipedia, our collective intelligence, doesn’t have an entry for PDLC, but explains that the product life cycle has to do with the life of a product in the market and involves many professional disciplines. According to this definition the stages include market introduction, growth, maturity, and saturation. This really isn’t the PDLC that I’m interested in. Under new product development (NPD) we find a definition more akin to PDLC that includes the complete process of bringing a new product to market, with the following steps: idea generation, idea screening, concept development, business analysis, beta testing, technical implementation, commercialization, and pricing.
Under SDLC, Wikipedia doesn’t let us down and explains it as a structure imposed on the development of software products. The article references multiple models, including the classic waterfall as well as agile, RAD, Scrum, and others.
In my mind the PDLC is the overarching process of product development that includes the business units. The SDLC is the specific steps within the PDLC that are completed by the technical organization (product managers included). An image on HBSC’s site that doesn’t seem to have any accompanying explanation depicts this very well graphically.
Another way to explain how I think of them: all professional software projects are products, but not all product development includes software development. See the Venn diagram below. The upfront work (business analysis, competitive analysis, etc.) and the back-end work (infrastructure, support, depreciation, etc.) are part of the PDLC and are essential to get the software project created in the SDLC out the door successfully. There are non-software products that still require a PDLC to develop.
Do you use them interchangeably? What do you think the differences are?
Reach out to AKF
Subscribe to the AKF Newsletter