August 9, 2017 | Posted By: AKF.marty
We have a saying at AKF Partners that “an incident is a terrible thing to waste”. When things go poorly in a firm, stakeholders (shareholders, partners, employees) pay a price. Having already paid a price, the firm must maximize the learning opportunity the incident presents. Google wasted such a learning opportunity by failing to capitalize on an incredible teaching moment with the termination of James Damore (the author of what is sometimes called the “Anti-Diversity Manifesto”). While Google seems to have “done the right thing” by firing Damore, it is unclear that they “did it for the right reason”. The “right reason” here is that diversity is valuable to a company because it increases innovation and in so doing increases the probability of success. Further, diversity is hard to achieve, takes great effort and can easily be derailed with very little effort. Companies simply cannot allow employees to work at odds with incredibly valuable diversity initiatives.
Diversity Drives Innovation and Success
My doctoral dissertation journey introduced me to diversity and its beneficial effects on innovation, time to market, and success within technology product firms. Put simply, teams that are intentionally organized to highlight both inherent (traits with which we are born) and acquired (traits we gain from experience) diversity achieve higher levels of innovation. Research published in the Harvard Business Review confirms this, indicating that diverse teams out-innovate and outperform other teams. Diverse teams are more likely to understand the broad base of needs of the market and clients they support. Companies with very diverse management teams are 35% more likely to have financial returns above the mean for their industry. Firms with women on their board on average have a higher ROE and net income than those that do not.
Differences in perspective and skills are things we should all strive to have in our teams. As we point out in The Art of Scalability, these differences increase beneficial cognitive conflict. Increases in cognitive conflict open a range of strategic possibilities that in turn engender higher levels of success for the firm.
We have for too long allowed the struggle for diversity to be waged on the battleground of “fairness”. The problem with “fair” is that what is “fair” to one person may seem inherently unfair to another. “Fair” is subjective and “fair” is too often political. “Success” on the other hand is objective and easily measured. Let’s move this fight to where it belongs and embrace diversity because it drives innovation and success. After all, anyone who can’t get behind winning doesn’t deserve to be on a winning team.
Achieving Diversity is Hard
While the value of diversity is high, the cost to achieve it is also unfortunately high – especially within software teams. As my colleague Robin McGlothin recently wrote, the percentage of computer science degrees awarded to women has been declining over the last 25 years. Most other minorities are similarly underrepresented in the field relative to their corresponding representation in the US population.
As in any market with high demand and low supply, companies need to find innovative ways to attract, grow and retain talent. These activities may include special mentoring programs, training programs, or scholarships at local universities meant to attract the group in question. These approaches may seem “unfair” to some, but they are in truth capitalism at its best - the application of market forces to solve a supply and demand problem. When a skill or trait is under high demand and short supply, the cost for that skill goes up. The extra activities above are nothing more than an increased cost to attract and retain the skills we value.
Companies desiring to achieve success in innovation through diversity MUST approach it in a steely, single-minded fashion. Any dissent as it relates to outcomes detracts from the probability of success. How many people with diverse backgrounds will leave or have left Google because of Damore’s missive? How many candidates won’t accept offers? Losing even one great candidate is an unacceptable additional cost given the already high cost to achieve success.
The Bottom Line
Structuring organizations and building cultures that tap the power of inherent and acquired diversity pays huge dividends for firms in terms of innovation, time to market, ROE and net income. While the rewards are high, the cost to achieve these benefits is also high. Success requires a steely, single-minded pursuit of diversity excellence.
The successful company will allow no dissent on this topic, as dissent makes the firm less attractive to the ideal candidate. Given a constrained supply under high demand, the candidate can and should go to the most welcoming environment available.
Put simply, Google did the right thing in firing Damore. But they failed to fully capitalize on the unfortunate event. The right answer, when asked about the reason for firing, would look something like this: “We recognize that diversity in experiences, background, gender and race drives higher levels of innovation and greater levels of success. Our culture will not tolerate employees who are not aligned with creating stakeholder value.”
Interested in driving innovation and time to market in your product and engineering teams? AKF Partners helps companies create experientially diverse product teams aligned with business outcomes to help turbo-charge performance.
August 1, 2017 | Posted By: AKF.daveswenson
We all suffer from various cognitive biases, those mental filters or lenses that alter or warp the reality around us. With the election of 2016, one particular bias has gained widespread attention - the Dunning-Kruger Effect. Defined in Wikipedia as:
“...a cognitive bias, wherein persons of low ability suffer from illusory superiority when they mistakenly assess their cognitive ability as greater than it is.”
(If you’ve ever wondered about the behind-the-scenes process of creating Wikipedia content, look at this entertaining discussion.)
In 1999, while a professor at Cornell, David Dunning joined Justin Kruger to co-author a paper titled “Unskilled and Unaware of It: How Difficulties in Recognizing One’s Own Incompetence Lead to Inflated Self-Assessments”, based on studies indicating that people who are incompetent in an area are typically too incompetent to know they are incompetent. Or, simply put, we are often in a position where we don’t know what we don’t know, and therefore cannot judge our level of expertise in a particular area.
This effect, or bias, is also known as the ‘Lake Wobegon effect’, or ‘illusory superiority’, and is closely tied to the Peter Principle. Donald Rumsfeld put it this way:
There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns — the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.
And in John Cleese’s words, stupid people do not have the capability to realize how stupid they are.
The story of how Dunning came to posit the D-K Effect is an amusing one. He read about an unusual bank robbery that occurred in Pittsburgh. What was unusual was that the robber, McArthur Wheeler, made absolutely no effort to disguise himself, and in fact, looked and smiled directly into the security cameras. Yet, he was surprised to quickly be arrested, telling authorities “...but I used the juice!”.
Wheeler told the police that they couldn’t arrest him based on the security videos, as wearing lemon juice, he was of course invisible. He had been told coating your face with lemon juice makes you invisible to cameras, perhaps similar to using lemon juice for invisible ink. Wheeler had even gone as far as to test the theory by taking a Polaroid picture of himself after coating his face with the lemon juice, and sure enough, his face didn’t appear in the print. The police never were able to explain this, but likely Wheeler was as incompetent at photography as he was at burglary. Clearly, Wheeler was too incompetent at burglary to know he was incompetent.
So, does Dunning-Kruger exist in the technology world? Absolutely…
Just as a typical driver believes their driving skills are Formula 1 worthy, until they’re on a track getting blown past by an inferior car driven by someone who has far better braking and cornering skills, we all tend to underestimate what is possible. We live in our own bubbles and are comparing our abilities only against those who also reside in our bubbles. Therefore, we don’t know what we don’t know - we don’t know there are far better drivers outside our bubbles.
You may think your organization is at the peak of efficiency, until you bring someone in from a Google, Facebook, Amazon, etc. who reveals what the true peak really can be - what fully Agile processes and cultures can do to reduce time to market, how effective SREs and DevOps can be, how to remain innovative, what continuous delivery can do, etc.
AKF firmly believes in “Experiential Diversity” to cross-pollinate teams, injecting new DNA into a company or bubble that was grown in a different bubble. We see numerous companies with very static personnel, where the average employee tenure is over 15 years. There have been tremendous changes in the technology world in 15 years, and while reading a book or attending a conference on new processes brings some exposure to the latest and greatest, it isn’t enough. It is incredibly important to continually bring new blood into an organization, and to purposely tap into that diversity of processes, technologies, and organizational structures that comes with the new blood.
Other techniques to mitigate the effect of D-K in technology, and to eliminate our personal and organizational biases, include:
- 360 degree reviews - Dunning himself has said “The road to self-insight runs through other people”. What better way to get feedback than from periodic 360 degree reviews?
- Code reviews - The likelihood that some percentage of your developers suffer from D-K means that you’re dependent upon code reviews to flush out their incompetence. Just make sure you’re not pairing up two D-K developers to perform the review!
- Planning Poker - requiring, in true Agile fashion, a team to estimate a task or project reduces the chance of a D-K estimate torpedoing your development planning.
- Soliciting advice - the increasing utilization of open source software means there isn’t a vendor, with hopefully solid expertise, to turn to for advice. Instead of solely relying upon your own developer who only knows how to spell, say, Cassandra, leverage the appropriate OSS community. Just beware that you might not know whether that solicited advice is good advice.
- Proper interviewing - Ensure your interviewing process can weed out “confident idiots”. Consider planting bogus questions to gauge a candidate’s reactions, like Jimmy Kimmel’s “Lie Witness News”. At a minimum, require team interviewing and consensus for new candidates.
In short, Dunning-Kruger is as rampant within the Technology sector as it is anywhere else, if not even more so. Expect it to be present in your organization, and guard against it. Look at it within yourself as well. Who amongst us hasn’t experienced the shock of discovering we’ve failed a test that we actually thought we’d aced? We all have suffered at one time or another from the Dunning-Kruger effect.
July 19, 2017 | Posted By: AKF.robin
We hear every day that more and more jobs are disappearing, yet the technology job sector cannot keep up with the unprecedented demand. So why are women falling behind in this growing career track?
When we look at the percentage of STEM bachelor’s degrees awarded to female students for the last two decades, based on NSF statistics, we find there is no gender difference in the bio sciences, the social sciences, or mathematics, and not much of a difference in the physical sciences. Great news for women scientists. The only STEM fields in which men genuinely outnumber women are computer science and engineering. What? Why the stagnant numbers in computer science?
At the PhD level, women have clearly achieved equity in the bio sciences and social sciences, are nearly there (40 percent) in mathematics and the physical sciences, and are “over-represented” in psychology (78 percent). More good news. Again, the only fields in which men greatly outnumber women are computer science and engineering. Why no growth?
As I started my research for this blog post, I was pleasantly surprised to find the representation of women scientists growing in almost all aspects of STEM. And at the same time, I was disheartened to find that my major, computer science, has stagnated in growth for women over the past two decades.
What’s different in the computer science & engineering aspects of STEM that seems to hold women back? There are many conflicting reports on how our environment and upbringing are subliminally programming women away from engineering and mathematics. We were told from an early age that math and science are for boys.
My mother was a pioneer and a strong female leader. She holds a PhD in Biochemistry and served as President for Academic Affairs and Provost at Salem International University. She demanded her daughters rise to any challenge and deliver to the best of our abilities. Never once did I doubt I had amazing talents and just needed to get busy using them. So, is it nature or nurture that helped me stay with STEM? Maybe a little of both.
I saw an article recently in the WSJ on Salesforce.com, where CEO Marc Benioff is focused on ensuring women are represented fairly at every level in his company. Taking proactive steps to open doors for women, as Salesforce has, rings truer to me than the “poor little girl” theories on how to increase female participation in computer science and engineering.
The cloud-computing giant is two years into a companywide “women’s surge” in which managers must consider women when filling open positions at every level. The company is also examining salaries for every role to ensure women and men are paid equally. Finally, it is ensuring that women make up at least 30% of attendees at management summits and of onstage roles at keynote presentations.
With some nurturing at home during early years of development and progress in the corporate landscape leveling the playing field, I believe we are finally set to see an upward trajectory for the last two laggard categories in STEM.
Future women engineers can see a world where their hard work and discipline will pay off, a road-map to success if you will. We no longer need to break through the old stereotypes, running faster and jumping higher to be considered half as good as our male counterparts. Instead, there will be fair and equal opportunity for career advancement for women engineers and computer scientists.
I would submit some of the best technology leaders today are women. My personal experience afforded me the opportunity to work with several top female technology executives. One of the best leaders I worked for is a powerhouse who broke all the stereotypes and worked circles around her male counterparts. As I look back and try to understand what propelled these successful women, they all possess some classic traits that are needed in any leadership role.
Collaboration. Women are skilled collaborators, able to work with all different people. This is an important quality for any professional, as cross-departmental collaboration is key. Technology impacts every function in modern business, and those most successful will be able to collaborate with all different teams and individuals.
Communication. For many of the same reasons, technologists must also be strong communicators. Communication is an area where many women traditionally excel and it’s an important quality to have. For example, communicating with the sales department may be different from communicating with the IT department. Good technology leaders will be able to speak to everyone.
Perspective. Being able to inspire a team and see the big picture are both equally important. A technology leader must be able to not only collect and analyze data but draw meaningful insights and understand what it means for the company. The ability to holistically view a situation is a competitive differentiation for organizations as well as a positive attribute that many women possess.
In the past, women had to fight a little harder to push through the barriers that have prevented women from entering STEM, but the tide is turning. In today’s new business paradigm, with a strong technology sector jobs forecast, it’s a perfect time for young women to enter the computer science and engineering fields.
And to help drive this point home, President Donald Trump signed two laws that authorize NASA and the National Science Foundation to encourage women and girls to get into STEM fields. The Inspire Act directs NASA to promote STEM fields to women and girls, and encourage women to pursue careers in aerospace. The law gives NASA three months to present two congressional committees with its plans for getting staff—think astronauts, scientists and engineers—in front of girls studying STEM in elementary and secondary schools. The full name of the law is the Inspiring the Next Space Pioneers, Innovators, Researchers, and Explorers Women Act. The second law is the Promoting Women in Entrepreneurship Act. It authorizes the National Science Foundation to support entrepreneurial programs aimed at women.
The stage has been set – go forth future astronauts, scientists, coder girls! Let’s rock the world.
July 6, 2017 | Posted By: AKF
AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident. Business metric monitors will not tell you where or what the problem is, rather they tell you something appears to be abnormal and should be investigated. The early warning aspect can help reduce detection time and thus shorten overall MTTR.
At eBay, we had near real time graphs of user metrics such as bids, listings, logins, and new user registrations. The data was graphed week over week. Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys. These graphs were displayed in the network operations center, which was staffed 24x7. Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.
Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nose dive. The NOC quickly initiated the SEV1 process and technical resources checked their areas. The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower. Roughly 20 minutes into the SEV1 process, the root cause was identified. The finale of American Idol was being broadcast. Our site was fine. Our customers had other things on their mind. The business metric monitors worked – they gave warning of an aberrant usage pattern.
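The week-over-week comparison described above can be sketched in a few lines. This is only an illustration and not eBay's implementation: the function name `check_week_over_week`, the 30% threshold, and the sample counts are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Deviation:
    metric: str
    current: float
    baseline: float
    change: float  # fractional change vs. the same time last week


def check_week_over_week(metric, current, last_week, threshold=0.30):
    """Return a Deviation if `current` differs from `last_week` by more than `threshold`.

    `threshold` is a hypothetical tolerance (30%), not a value from the post.
    """
    if last_week <= 0:
        return None  # no usable baseline to compare against
    change = (current - last_week) / last_week
    if abs(change) > threshold:
        return Deviation(metric, current, last_week, change)
    return None


# Made-up per-minute counts: (now, same minute one week ago)
samples = {
    "bids": (4200, 9800),
    "listings": (1500, 3100),
    "logins": (7900, 8100),
}

alerts = [
    d for name, (now, prev) in samples.items()
    if (d := check_week_over_week(name, now, prev)) is not None
]
for d in alerts:
    print(f"{d.metric}: {d.change:+.0%} vs. last week, investigate")
```

With these numbers, bids and listings are flagged while logins are not. As in the American Idol story, the alert says only that something is abnormal, not what or where the cause is.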
The World Cup is the most popular football (soccer) event in the world, arguably the most popular sporting event worldwide. Broadcast matches draw huge audiences in the UK and the broadcast is typically aired without commercials until half time. There was a documentary on the UK electrical utility system preparing for a broadcast. As soon as half time commenced, a large proportion of the viewing audience visited the loo and hit the lever on their electric tea kettles. Thankfully, the documentary was about the electric utility and not sewage! The step function increase in load would cause significant problems for the utility, straining its ability to maintain voltage and frequency. The utility had prepared for this situation by staging “peakers” – diesel generators that can be brought online to help serve the increased load. Utility grid stability is akin to a Goldilocks Zone – too much is bad, too little is bad, just right is best. The operations center for the utility did not want to bring the generators on too early or too late. They needed real time information on their customers. The solution was to have a TV tuned to the World Cup broadcast in the operations center, enabling the engineers to stage on generators immediately prior to half time and stage them off as the load increase subsided. Being paid to watch the World Cup was certainly an unintended benefit!
How could your company react in a manner like the UK power utility? A sponsored event or viral campaign could overload your systems. Consider using elastic compute in the cloud for your peak demand – the equivalent to the diesel generators used for the World Cup. Scale up for the spikes in demand, then shut it down afterwards. Own the base, rent the peak. Use business metric monitors to detect workload shifts.
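As a rough sketch of “own the base, rent the peak”, the calculation below estimates how many cloud instances to rent for each hour of a demand forecast. The capacity figures, the forecast, and the function name `rented_instances` are hypothetical and chosen only for illustration.

```python
import math


def rented_instances(demand_rps, base_capacity_rps, per_instance_rps):
    """Number of cloud instances to rent so that owned + rented capacity covers demand."""
    overflow = max(0, demand_rps - base_capacity_rps)
    return math.ceil(overflow / per_instance_rps)


# Hypothetical hourly demand forecast (requests per second) around a sponsored event
forecast = [800, 950, 2400, 3100, 1200]
plan = [
    rented_instances(d, base_capacity_rps=1000, per_instance_rps=250)
    for d in forecast
]
print(plan)  # [0, 0, 6, 9, 1]
```

Nothing is rented while demand stays under the owned base of 1000 requests per second, and the rented capacity is released as soon as the spike subsides.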
April 19, 2017 | Posted By: AKF
AKF Scalability Workshops
Our workshop is designed for technology executives who are responsible for delivering highly available and highly scalable technical platforms & products. The principles we share can be applied to large organizations and start-ups alike. Our principles are technology-agnostic – we believe you can successfully scale with almost any technology if key concepts are followed. During our two-day workshop, you’ll participate in sessions that integrate our experience, research, and the work we’ve done with over 400 clients since 2007.
How is the workshop structured?
The workshop is delivered in 14 collaborative sessions over the 2-day event. While a member of the AKF team will lead the discussion in each session, much of the interaction comes from the participants themselves. We keep the session size limited (maximum of 25 attendees) so that each attendee can be an active part of the conversation, share experiences, and ask questions from other executives who have been in your shoes. You’ll leave the workshop with principles, tools, and examples that you can continuously apply to your platform and organization.
Who should attend the workshop?
Our event is designed for current CTOs, VPs of Engineering, Chief Architects, and other technology executives who want to improve their management, leadership, and technology skills. We help companies scale their technology and product platforms. Although nearly any technical organization would benefit from the lessons shared in the workshop, our sessions will provide the most value to companies that use technology to deliver their core product or service (e.g. SaaS, eCommerce).
What topics are covered in the workshop?
• The CTO Role: A discussion on the diversity of expectations and responsibilities from the 400 companies we have worked with at AKF Partners.
• The Right People & Roles: Ensuring the right talent is placed in positions for success.
• Management & Leadership: The skills of a transformational leader and highly effective manager.
• Conflict & Innovation: A discussion of good and bad conflicts in organizations and how to increase innovation.
• Multidisciplinary Agile Teams: Building innovative teams with diverse experience and skills.
• Team Goals & KPIs: Setting goals, metrics, and KPIs for Agile teams to ensure success.
• The Experiential Chasm: The widening gap between business leaders and technology leaders and how to close it.
• Service Delivery Mindset: The most successful technology organizations are structured with a service oriented mindset and we will discuss how to transform your organization and mindset.
• AKF Risk Model: Our viewpoint of risk and how to manage it successfully in your architecture, people, and processes.
• Highly Scalable Architectures: An in-depth look at creating highly scalable and available architectures
• AKF Scale Cube: Our approach to designing highly scalable architectures.
• Creating Fault Isolation: The importance of isolation for availability and time to market.
• Architecture Principles: An in-depth look at the top architecture principles and how to apply them.
• Processes for a Learning Organization: The most effective processes to put in place to create a successful learning organization.
Who teaches this workshop?
Workshops are delivered by AKF Managing Partner, Marty Abbott, as well as AKF Partner Drew Morrell. Marty, along with Mike Fisher and Tom Keeven, helped found AKF Partners nine years ago with the goal of leveraging their successes (and failures!) as technology executives to help other companies prepare for and achieve hyper-growth. To date, AKF has helped over 400 companies across 18 countries make progress towards their scalability goals (including many leaders in the internet industry). Marty and Mike have co-authored three books: “The Art of Scalability”, “Scalability Rules”, and “The Power of Customer Misbehavior.”
April 3, 2017 | Posted By: AKF
The Y axis of the AKF Scale Cube indicates that growing companies should consider splitting their products along services (verb) or resources (noun) oriented boundaries. A common question we receive is “how granular should one make a services split?” A similar question to this is “how many swim lanes should our application be split into?” To help answer these questions, we’ve put together a list of considerations based on developer throughput, availability, scalability, and cost. By considering these, you can decide if your application should be grouped into a large, monolithic codebase or split up into smaller individual services and swim lanes. You must also keep in mind that splitting too aggressively can be overly costly and have little return for the effort involved. Companies with little to no growth will be better served focusing their resources on developing a marketable product than by fine tuning their service sizes using the considerations below.
Frequency of Change – Services with a high rate of change in a monolithic codebase cause competition for code resources and can create a number of time to market impacting conflicts between teams including product merge conflicts. Such high change services should be split off into small granular services and ideally placed in their own fault isolative swim lane such that the frequent updates don’t impact other services. Services with low rates of change can be grouped together as there is little value created from disaggregation and a lower level of risk of being impacted by updates.
The diagram below illustrates the relationship we recommend between functionality, frequency of updates, and relative percentage of the codebase. Your high risk, business critical services should reside in the upper right portion being frequently updated by small, dedicated teams. The lower risk functions that rarely change can be grouped together into larger, monolithic services as shown in the bottom left.
Degree of Reuse – If libraries or services have a high level of reuse throughout the product, consider separating and maintaining them apart from code that is specialized for individual features or services. A service in this regard may be something that is linked at compile time, deployed as a shared dynamically loadable library, or operated as an independent runtime service.
Team Size – Small, dedicated teams can handle micro services with limited functionality and high rates of change, or large functionality (monolithic solutions) with low rates of change. This will give them a better sense of ownership, increase specialization, and allow them to work autonomously. Team size also has an impact on whether a service should be split. The larger the team, the higher the coordination overhead inherent to the team and the greater the need to consider splitting the team to reduce codebase conflict. In this scenario, we are splitting the product up primarily based on reducing the size of the team in order to reduce product conflicts. Ideally splits would be made based on evaluating the availability increases they allow, the scalability they enable or how they decrease the time to market of development.
Specialized Skills – Some services may need special skills in development that are distinct from the remainder of the team. You may for instance have the need to have some portion of your product run very fast. That portion in turn may require a compiled language and a great depth of knowledge in algorithms and asymptotic analysis. The engineers who build it may have a completely different skillset than those working on the remainder of your code base, which may in turn be interpreted and mostly focused on user interaction and experience. In other cases, you may have code that requires deep domain experience in a very specific area like payments. Each of these is an example of a consideration that may indicate a need to split into a service and which may inform the size of that service.
Availability and Fault Tolerance Considerations:
Desired Reliability – If other functions can afford to be impacted when the service fails, then you may be fine grouping them together into a larger service. Indeed, sometimes certain functions should NOT work if another function fails (e.g. one should not be able to trade in an equity trading platform if the solution that understands how many equities are available to trade is not available). However, if you require each function to be available independent of the others, then split them into individual services.
Criticality to the Business – Determine how important the service is to business value creation while also taking into account the service’s visibility. One way to view this is to measure the cost of one hour of downtime against a day’s total revenue. If the business can’t afford for the service to fail, split it up until the impact is more acceptable.
Risk of Failure – Determine the different failure modes for the service (e.g. a billing service charging the wrong amount), what the likelihood and severity of each failure mode occurring is, and how likely you are to detect the failure should it happen. The higher the risk, the greater the segmentation should be.
Scalability of Data – A service may already be a small percentage of the codebase, but as the data that the service needs to operate scales up, it may make sense to split again.
Scalability of Services – What is the volume of usage relative to the rest of the services? For example, one service may need to support short bursts during peak hours while another has steady, gradual growth. If you separate them, you can address their needs independently without having to over engineer a solution to satisfy both.
Dependency on Other Service’s Data – If the dependency on another service’s data can’t be removed or handled with an asynchronous call, the benefits of disaggregating the service probably won’t outweigh the effort required to make the split.
Effort to Split the Code – If the services are so tightly bound that it will take months to split them, you’ll have to decide whether the value created is worth the time spent. You’ll also need to take into account the effort required to develop the deployment scripts for the new service.
Shared Persistent Storage Tier – If you split off the new service, but it still relies on a shared database, you may not fully realize the benefits of disaggregation. Placing a readonly DB replica in the new service’s swim lane will increase performance and availability, but it can also raise the effort and cost required.
Network Configuration – Does the service need its own subdomain? Will you need to make changes to load balancer routing or firewall rules? Depending on the team’s expertise, some network changes require more effort than others. Ensure you consider these changes in the total cost of the split.
The illustration below can be used to quickly determine whether a service or function should be segmented into smaller microservices, be grouped together with similar or dependent services, or remain in a multifunctional, infrequently changing monolith.
April 3, 2017 | Posted By: AKF
A topic that often results in great debate is “how to measure engineers?” I’m a pretty data-driven guy, so I’m a fan of metrics as long as they are 1) measured correctly, 2) used properly, and 3) not taken in isolation. I’ll touch on these issues with metrics later in the post; first, let’s discuss a few possible metrics that you might consider using. Three of my favorites are velocity, efficiency, and cost.
- Velocity – This is a measurement that comes from the Agile development methodology. Velocity is the aggregate of story points (or any other unit of estimate that you use, e.g. ideal days) that engineers on a team complete in a sprint. As we will discuss later, there is no standard good or bad for this metric and it is not intended to be used to compare one engineer to another. This metric should be used to help the engineer get better at estimating, that’s it. No pushing for more story points or comparing one team to another; just use it as feedback to the engineers and team so they can get more predictable in their estimates.
- Efficiency – The amount of time a software developer spends doing development-related activities (e.g. coding, designing, discussing with the product manager, etc.) divided by their total time available (assume 8 – 10 hours per day) provides the Engineering Efficiency. This is a metric designed to see how much time software developers are actually spending on developing software. This metric often surprises people. Achieving 60% or more is exceptional. We often see dev groups below 40% efficiency. This metric is useful for identifying where else engineers are spending their time. Are there too many company meetings not directly related to getting products out the door? Are you doing too many HR training sessions, etc.? This metric is really for the management team to then identify what is eating up the non-development time and get rid of it.
- Cost – Tech cost as a percentage of revenue is a good cost-based metric to see how much you are spending on technology. This is very useful as it can be compared to other tech companies (SaaS or other web-based companies) and you can watch this metric change over time. Most startups begin with their total tech cost (engineers, hosting, etc.) at well over 50% of revenue, but this should quickly reduce as revenue grows and the business scales. Yes, scaling a business involves growing it cost effectively. Established companies with revenues in the tens of millions range usually have this percentage below 10%. Very large companies in the hundreds of millions in revenue often drive this down to 5-7%.
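The three metrics are simple ratios and sums. A minimal sketch, with function names and sample figures invented for illustration:

```python
def velocity(completed_story_points: list[int]) -> int:
    """Story points (or ideal days) the team completed in one sprint."""
    return sum(completed_story_points)


def engineering_efficiency(dev_hours: float, available_hours: float) -> float:
    """Share of available time spent on development-related activities."""
    return dev_hours / available_hours


def tech_cost_pct_of_revenue(tech_cost: float, revenue: float) -> float:
    """Total tech cost (engineers, hosting, etc.) as a percentage of revenue."""
    return 100.0 * tech_cost / revenue


# Hypothetical sprint, week, and year.
sprint_velocity = velocity([3, 5, 8, 2])
efficiency = engineering_efficiency(dev_hours=18, available_hours=45)
cost_pct = tech_cost_pct_of_revenue(tech_cost=1_200_000, revenue=15_000_000)
```

With these made-up inputs the team sits at 40% efficiency and spends 8% of revenue on technology, which would fall under the "below 10%" range mentioned for established companies.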
Now that we know about some of the most common metrics, how should they be used? The most common way managers and executives want to use metrics is to compare engineers to each other or to compare a team over time. This works for the Efficiency and Cost metrics, which, by the way, are primarily measurements of management effectiveness. Managers make most of the cost decisions, including staffing, vendor contracts, etc., so they should be on the hook to improve these metrics. Product out the door, as measured by story points completed each sprint (a.k.a. Velocity), is, as mentioned above, to be used to improve estimates, not to try to speed up developers. Using this metric incorrectly will just result in bloated estimates, not faster development.
An interesting comparison of developers comes from a 1967 article by Grant and Sackman in which they stated a ratio of 28:1 for the time required by the slowest versus the fastest programmer to complete a task. This has been a widely cited ratio, but a paper from 2000 revised this number to 4:1 at the most and more likely 2:1. While a 2x difference in speed is still impressive, speed alone says nothing about the overall quality of the product. An engineer who is very fast and produces high-quality work but doesn’t interact with the product managers isn’t necessarily the most effective overall. My point is that there are many other factors to be considered than just story points per release when comparing engineers.
April 3, 2017 | Posted By: AKF
The most common point of congestion, and therefore barrier to scale, that we see in our practice is the database. Referring back to our earlier article “Splitting Applications or Services for Scale”, it is very common for engineers to create scalability along the X axis of our cube by persisting data in a single monolithic database and having multiple “cloned” application servers retrieve and store data within that database. For young companies this is a very good approach: if done properly, it will also eliminate the need for persistence or affinity to a given application server and, as a result, will increase customer-perceived availability.
The problem, however, with this single monolithic data structure is threefold:
1. Even with clustering technology (the existence of a second physical system or database that can take the load of the first in the event of failure), failures of the primary database will result in short service outages for 100% of the user community.
2. This approach ultimately relies solely on technical improvements in CPU speed, memory access speed, memory access size, mass storage access speeds and size, etc. to ensure the company’s needs for scale.
3. Relying upon (2) above in extreme cases is not the most cost-effective solution, as the newest and fastest technologies come at a premium to older generations of technology and do not necessarily have the same processing power per dollar as older and/or smaller (fewer CPUs, etc.) systems.
As we have argued in the aforementioned post, a great engineering team will think about how to scale their platform well in advance of the need to rely solely upon partner technology advances. By making small modifications to our previously presented “Scale Cube”, the same concepts applied to the problem of splitting services for scale can be useful in addressing how to split a database for scale. As with the AKF Services Scale Cube, the AKF Database Scale Cube consists of X, Y and Z axes – each addressing a different approach to scaling transactions applied to a database. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic database – a case where all data is located in a single location and all accesses go to this single database.
The X Axis of the cube represents a means of spreading load across multiple instances of a replicated representation of the data. This is the first approach most companies use in scaling databases and is often both the easiest to implement and the least costly in both engineering time and hardware. Many third party and open source databases have native properties or functions that will allow the near real time replication of data to multiple “read databases”. The engineering cost of such an approach is low, as typically database calls only need to be identified as a “read” or “write” and sent to the appropriate write database or bank of read databases. The “bank” of read databases should have reads evenly split across it if possible, and many companies employ simple third party load balancers to perform this distribution.
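A minimal sketch of the read/write split described above, assuming each call can be classified by inspecting its SQL text. The host names are hypothetical, and the round-robin rotation stands in for the third party load balancer; a real router would also have to account for replication lag and for statements such as `SELECT ... FOR UPDATE`.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the single write database and spread reads over replicas."""

    def __init__(self, write_db: str, read_dbs: list[str]):
        self.write_db = write_db
        # Round-robin rotation splits reads evenly across the bank of replicas.
        self._read_cycle = itertools.cycle(read_dbs)

    def route(self, sql: str) -> str:
        # Naive classification (an assumption): anything starting with SELECT is a read.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._read_cycle) if is_read else self.write_db


router = ReadWriteRouter("db-write", ["db-read-1", "db-read-2"])
targets = [
    router.route(statement)
    for statement in (
        "SELECT * FROM users WHERE id = 1",
        "UPDATE users SET name = 'a' WHERE id = 1",
        "SELECT * FROM orders",
        "SELECT 1",
    )
]
```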
Included in our X axis split are third party and open source caching solutions that allow reads to be split across “cache” hosts before actually reading from a database upon a cache miss. Caching is another simple way to reduce the load on the database, but in our experience it is not sufficient for hyper growth SaaS sites. If implemented properly, this X axis split can also increase availability: if replication is near real time, a read server can be promoted to be the singular “write server” in the event of a “write server” failure. The combination of caching and read/write splits (our X axis) is sufficient for many companies, but for companies with extreme hyper growth and massive data retention needs it is often not enough.
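The cache-then-database read path can be sketched as follows. A plain dict stands in for the cache host, and `load_from_db` is a hypothetical callable representing the actual database read.

```python
class CachedReader:
    """Serve reads from a cache and fall through to the database only on a miss."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # hypothetical database read function
        self._cache = {}                   # stand-in for a cache host
        self.db_reads = 0                  # counts how often the database is hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]        # cache hit: no database load
        self.db_reads += 1                 # cache miss: read from the database
        value = self._load_from_db(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after the underlying row is written."""
        self._cache.pop(key, None)


reader = CachedReader(load_from_db=lambda key: f"row-for-{key}")
first = reader.get("user:1")
second = reader.get("user:1")
```

The second read is answered from the cache, so the database sees only one read for the two requests.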
The Y Axis of our database cube represents a split by function, service or resource, just as it did with the service cube. A service might represent a set of use cases and is most often easiest to envision by thinking of it as a verb or action like “login”, and a resource oriented split is easiest to envision by thinking of splits as nouns like “account information”. These splits help handle not only the split of transactions across multiple systems, as did the X axis, but can also be helpful in speeding up database calls by allowing more information specific to the request to be held in memory rather than needing to make a disk access. Just as with our approach in scaling services, our recommended approach to identify the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team, as very often they will require that the application be split up as well. It is possible to take a monolithic application and perform physical splits by, say, URL/URI to different service or resource oriented pools. While this approach will help spread transaction processing across multiple systems similar to our X axis implementation, it may not offer the added benefit of reducing the amount of system memory required by service / pool / resource / application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to focus on specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as “swimlaning” an application and data set, especially when both the database and applications are split to represent a “failure domain” or fault isolative infrastructure.
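As a rough illustration of a split by URL/URI, the sketch below maps the first path segment of a request to a dedicated database pool. The pool names and the path layout are assumptions made for the example.

```python
# Hypothetical mapping of service (verb) or resource (noun) to its own database pool.
Y_AXIS_POOLS = {
    "login": "db-login",
    "account": "db-account",
    "catalog": "db-catalog",
}


def pool_for_path(path: str) -> str:
    """Pick the database pool from the first segment of the request path."""
    first_segment = path.strip("/").split("/", 1)[0]
    try:
        return Y_AXIS_POOLS[first_segment]
    except KeyError:
        raise ValueError(f"no pool configured for path {path!r}") from None


login_pool = pool_for_path("/login")
account_pool = pool_for_path("/account/42/profile")
```

Because each pool only holds data for its own service or resource, more of that data can stay in memory on the hosts that serve it.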
The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (hash for instance). The most common way to view this is to consider splitting your resources by customer if your entity relationships allow that to happen. In the world of media, you might consider splitting it by article_id or media_id and in the world of commerce a split by product_id might be appropriate. In the case where you split customers from your products and perform splits within customers and products you would be implementing both a Y axis split (splitting by resource or call – customers and products) and a Z axis split (a modulus of customers and products within their functional splits).
Z axis splits tend to be the most costly for an engineering team to perform as often many functions that might be performed within the database (joins for instance) now need to be performed within the application. That said, if done appropriately they represent the greatest potential for scale for most companies.
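A minimal sketch of a Z axis split by customer, combined with a join performed in the application. In-memory dicts stand in for the customer shards and for a separate products store; the shard count, ids, and product names are invented for illustration.

```python
N_SHARDS = 4  # hypothetical number of customer shards

# One dict per shard: customer_id -> list of (order_id, product_id).
customer_shards = [dict() for _ in range(N_SHARDS)]

# Products live in their own store (a Y axis split), so no SQL join is possible.
PRODUCTS = {101: "keyboard", 102: "monitor"}


def shard_for_customer(customer_id: int, n_shards: int = N_SHARDS) -> int:
    """Z axis split: a modulus of the customer id selects the shard."""
    return customer_id % n_shards


def save_order(customer_id: int, order_id: int, product_id: int) -> None:
    shard = customer_shards[shard_for_customer(customer_id)]
    shard.setdefault(customer_id, []).append((order_id, product_id))


def orders_with_product_names(customer_id: int) -> list[tuple[int, str]]:
    """Join orders to products in the application rather than in the database."""
    shard = customer_shards[shard_for_customer(customer_id)]
    return [(order_id, PRODUCTS[product_id]) for order_id, product_id in shard.get(customer_id, [])]


save_order(customer_id=7, order_id=1, product_id=101)
save_order(customer_id=7, order_id=2, product_id=102)
save_order(customer_id=8, order_id=3, product_id=101)
joined = orders_with_product_names(7)
```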
April 3, 2017 | Posted By: AKF
Splitting Applications or Services for Scale
Most internet enabled products start their life as a single application running on an appserver or appserver/webserver combination and potentially communicating with a database. Many if not all of the functions are likely to exist within a monolithic application code base making use of the same physical and virtual resources of the system upon which the functions operate: memory, cpu, disk, network interfaces, etc. Potentially the engineers have the forethought to make the system highly available by positioning a second application server in the mix to be used in the event that the first application server fails.
This monolithic design will likely work fine for many sites that receive low levels of traffic. However, if the product is very successful and receives wide and fast adoption, user-perceived response times are likely to degrade significantly, to the point that the product is almost entirely unusable. At some point, the system will likely even fail under the load as the inbound request rate is significantly greater than the processing power of the system and the resulting departure rate of responses to requests.
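The relationship between inbound request rate and departure rate can be illustrated with a toy model: whenever requests arrive faster than the system can answer them, the backlog of waiting requests grows without bound. The rates below are invented and the model ignores timeouts and retries.

```python
def backlog_over_time(arrival_rate: int, service_rate: int, seconds: int) -> list[int]:
    """Requests still waiting at the end of each second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        # Requests arrive, the system answers what it can, the rest keep waiting.
        backlog = max(0, backlog + arrival_rate - service_rate)
        history.append(backlog)
    return history


overloaded = backlog_over_time(arrival_rate=120, service_rate=100, seconds=5)
healthy = backlog_over_time(arrival_rate=80, service_rate=100, seconds=3)
```

In the overloaded case the queue grows by 20 requests every second, which is what users experience as steadily worsening response times.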
A great engineering team will think about how to scale their platform well in advance of such a catastrophic failure. There are many ways to approach how to think about such scalability of a platform and we present several through a representation of a three dimensional cube addressing three approaches to scale that we call the AKF Scale Cube.
The AKF Scale Cube consists of X, Y and Z axes – each addressing a different approach to scaling a service. The lowest left point of the cube (coordinates X=0, Y=0 and Z=0) represents the worst case monolithic service or product identified above: a product wherein all functions exist within a single code base on a single server making use of that server’s finite resources of memory, cpu speed, network ports, mass storage, etc.
The X Axis of the cube represents a means of spreading load across multiple instances of the same application and data set. This is the first approach most companies use to scale their services and it is effective in scaling from a request per second perspective. Oftentimes it is sufficient to handle the scale needs of a moderate sized business. The engineering cost of such an approach is low compared to many of the other options, as no significant re-architecting of the code base is required unless the engineering team needs to eliminate affinity to a specific server because the application maintains state. The approach is simple: clone the system and service and allow it to exist on N servers with each server handling 1/Nth of the total requests. Ideally the method of distribution is a load balancer configured in a highly available manner with a passive peer that becomes active should the active peer fail as a result of hardware or software problems. We do not recommend leveraging round-robin DNS as a method of load balancing. If the application does maintain state, there are various ways of solving this, including a centralized state service, redesigning for statelessness, or as a last resort using the load balancer to provide persistent connections. While the X axis approach is sufficient for many companies and distributes the processing of requests across several hosts, it does not address other potential bottlenecks like memory constraints where memory is used to cache information or results.
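A small sketch of the X axis approach with a centralized state service. The class names are hypothetical, a shared dict stands in for the state service, and a simple rotation stands in for the load balancer.

```python
import itertools


class AppServer:
    """One clone of the application; it keeps no session state of its own."""

    def __init__(self, name: str, session_store: dict):
        self.name = name
        self.session_store = session_store  # shared, centralized state (a dict here)
        self.handled = 0

    def handle(self, session_id: str) -> str:
        self.handled += 1
        count = self.session_store.get(session_id, 0) + 1
        self.session_store[session_id] = count
        return f"{self.name}: request {count} for {session_id}"


class LoadBalancer:
    """Rotate through the N clones so each handles roughly 1/Nth of the requests."""

    def __init__(self, servers):
        self._servers = itertools.cycle(servers)

    def dispatch(self, session_id: str) -> str:
        return next(self._servers).handle(session_id)


session_store = {}
servers = [AppServer(f"app-{i}", session_store) for i in (1, 2, 3)]
balancer = LoadBalancer(servers)
responses = [balancer.dispatch("s1") for _ in range(6)]
```

Because the request count lives in the shared store, the session continues correctly no matter which clone receives the next request, so no affinity to a specific server is needed.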
The Y Axis of the cube represents a split by function, service or resource. A service might represent a set of use cases and is most often easiest to envision by thinking of it as a verb or action like “login”, and a resource oriented split is easiest to envision by thinking of splits as nouns like “account information”. These splits help handle not only the split of transactions across multiple systems, as did the X axis, but can also be helpful in reducing or distributing the amount of memory dedicated to any given application across several systems. A recommended approach to identify the order in which these splits should be accomplished is to determine which ones will give you the greatest “headroom” or capacity “runway” for the least amount of work. These splits often come at a higher cost to the engineering team, as very often they will require that the application be split up as well. As a quick first step, a monolithic application can be placed on multiple servers, with certain of those servers dedicated to specific “services” or URIs. While this approach will help spread transaction processing across multiple systems similar to our X axis implementation, it may not offer the added benefit of reducing the amount of system memory required by service/pool/resource/application. Another reason to consider this type of split in very large teams is to dedicate separate engineering teams to focus on specific services or resources in order to reduce your application learning curve, increase quality, decrease time to market (smaller code bases), etc. This type of split is often referred to as “swimlaning” an application.
The Z Axis represents ways to split transactions by performing a lookup, a modulus or other indiscriminate function (hash for instance). As with the Y axis split, this split aids not only fault isolation, but significantly reduces the amount of memory necessary (caching, etc.) for most transactions and also reduces the amount of stable storage to which the device/service needs to attach. In this case, you might try a modulus by content id (article), or listing id, or a hash from the received IP address, etc. The Z axis split is often the most costly of all splits and we only recommend it for clients that have hypergrowth or very high rates of transaction. It should only be used after a company has implemented a very granular split along the Y axis. That said, it also can offer the greatest degree of scalability, as the number of “swimlanes within swimlanes” that it creates is virtually limitless. For instance, if a company implements a Z axis split as a modulus of some transaction id and the implementation is a configurable number “N”, then N can be 10, 100, 1000, etc. and each order of magnitude increase in N creates nearly an order of magnitude of greater scale for the company.
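A hedged sketch of a Z axis split with a configurable N, here hashing a received IP address. CRC32 is used only because it is stable across processes (Python's built-in `hash` for strings is not); the choice of hash and the sample addresses are assumptions.

```python
import zlib
from collections import Counter


def swimlane_for(key: str, n: int) -> int:
    """Map a key (IP address, content id, ...) to one of n swimlanes."""
    return zlib.crc32(key.encode("utf-8")) % n


# Hypothetical traffic: 10,000 distinct client addresses.
ips = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]

# N is configuration: the same function serves 10, 100, or 1000 swimlanes.
load_by_lane = Counter(swimlane_for(ip, 10) for ip in ips)
```

One caveat worth keeping in mind with a plain modulus is that changing N remaps most keys to a different swimlane, so growing N usually implies a data migration.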
April 3, 2017 | Posted By: AKF
Here are a baker’s dozen of items that we feel are Best Practices for Scalability:
Use asynchronous communication when possible. Synchronous calls tie the availability of the two services together. If one has a failure or is slow the other one is affected.
2. Swim Lanes
Create fault isolated “swim lanes” of hardware by customer segmentation. This prevents problems with one customer from causing issues across all customers. This also helps with diagnosis of issues and code roll outs.
Make use of cache at multiple layers including object caches in front of databases (such as memcached), page or item caches for content (such as squid) and edge caches (such as Akamai).
Understand your application’s performance from a customer’s perspective. Monitor outside of your network and have tests that simulate a real user’s experience. Also monitor the internal working of the application in terms of query and transaction execution count and timing.
Replicate databases for recovery as well as to off load reads to multiple instances.
Split the application and databases by service and / or by customer using a modulus. While this requires slightly more complicated logic in the application it allows for massive scaling.
7. Use Few RDBMS Features
Use the OLTP database as a persistent storage device as much as possible. The more you rely on the features offered in most RDBMS for your transactions, the greater load you are putting on the hardest item in your system to scale. Remove all business logic from the database such as stored procedures and move it into the application. When significant scaling is required join in the application and not through the SQL.
8. Slow Roll
Roll out new code versions slowly, to a small subset of your servers without bringing the entire site down. This requires that all code be backwards compatible because you will have two versions of code running in production during the roll out. This method allows you to find problems that your quality and L&P testing missed while having minimal impact on customers.
9. Load & Performance Testing
Test the performance of the application version before it goes into production. This will not catch all the issues, which is why you need the ability to roll back, but it is very worthwhile.
10. Capacity Planning / Scalability Summits
Know how much capacity you have on all tiers and services in your system. Use Scalability Summits to plan for the increased capacity demands.
Always have the ability to roll back a code release.
12. Root Cause Analysis
Ensure you have a learning culture, evidenced by the use of Root Cause Analysis to find and fix the real cause of issues.
13. Quality From The Beginning
Quality can’t be tested into a product; it must be designed in from the beginning.
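The first practice, asynchronous communication, can be sketched with an in-process queue. The order and receipt services are hypothetical, and the queue stands in for whatever message bus a real system would use.

```python
import queue
import threading


def start_receipt_worker(inbox: queue.Queue, handled: list) -> threading.Thread:
    """Consume messages in the background, independently of the caller."""

    def run():
        while True:
            order_id = inbox.get()
            if order_id is None:  # sentinel: stop the worker
                inbox.task_done()
                break
            handled.append(f"emailed receipt for order {order_id}")
            inbox.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def place_order(order_id: int, inbox: queue.Queue) -> str:
    # Enqueue and return immediately; a slow or failed receipt service
    # does not slow down or fail order placement.
    inbox.put(order_id)
    return f"order {order_id} accepted"


inbox = queue.Queue()
handled = []
worker = start_receipt_worker(inbox, handled)
responses = [place_order(order_id, inbox) for order_id in (1, 2, 3)]
inbox.put(None)
inbox.join()   # wait until every queued message has been processed
worker.join()
```

The caller gets its answer as soon as the message is queued, which is what decouples the availability of the two services.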