January 17, 2018 | Posted By: AKF
What question does each of your system monitors answer? You probably think they answer questions such as “Is there a problem?” and, if so, “Where is the problem?” Most likely this is not the case: instead of telling you whether there is a problem, a monitor really only tells you “Where” or “What” the problem might be.
Before we continue, a quick detour to discuss metrics, which, while different from monitoring, are similar in many ways. Eric Ries, co-founder and CTO of IMVU, posted an article about the difference between vanity metrics and actionable metrics. The entire article and accompanying video are worth a read and a listen, but the takeaway is that most people use and look for metrics that make great soundbites but do not suggest any definable actions. One example is the total number of hits to a website. Eric asks, “Now what? Do you really know what actions you took in the past that drove those visitors to you, and do you really know which actions to take next?” This makes total sense to me, as we often see teams misusing monitoring in an attempt to determine what actions to take with their systems.
Back to our discussion of what question your monitoring is attempting to answer. We think there are five evolutionary questions that monitoring should answer:
- Is there a problem?
- Where is the problem?
- What is the problem?
- Why is there a problem?
- Will there be a problem?
Where most people fail is in taking a monitoring tool that is designed to answer “Where” or “What” and trying to use it to answer “Is”. For example, if you are monitoring all of your servers’ vitals, such as CPU, memory, and I/O, what is the appropriate action for your team to take when CPU utilization goes to 100%? The reason that might be a tough question is that you are missing the vital piece of information: “Is this affecting my customers?” The “Is there a problem?” monitor is intended to be a proxy for customer impact, in order to help determine the degree and speed of escalation of the issue.
If you have monitoring services in place now it is worthwhile to determine what question each one answers. If you are missing a monitor for a particular question, the time to remedy it is before you need that question answered.
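As a minimal sketch of an “Is there a problem?” monitor, the check below watches a customer-facing metric rather than server vitals. The metric (logins per minute), the baseline samples, and the 20% tolerance are all hypothetical assumptions chosen for illustration; they do not come from any particular monitoring tool.

```python
from statistics import mean

def is_there_a_problem(current, baseline_samples, tolerance=0.20):
    """Return True when a customer-facing metric falls well below its baseline.

    current: latest value of the metric (hypothetical: logins per minute).
    baseline_samples: values seen at the same time on comparable past days.
    tolerance: fractional drop allowed before we call it a problem (assumed).
    """
    expected = mean(baseline_samples)
    if expected == 0:
        return False  # nothing expected, so nothing to compare against
    drop = (expected - current) / expected
    return drop > tolerance

baseline = [100, 110, 90, 100]               # hypothetical history, mean is 100
healthy = is_there_a_problem(95, baseline)   # 5% below baseline
degraded = is_there_a_problem(60, baseline)  # 40% below baseline
```

A check like this says nothing about where or what the problem is. It only signals likely customer impact, which is the piece of information needed to decide how fast to escalate.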
January 13, 2018 | Posted By: Dave Swenson
Sorry, False Alarm…
On January 13, 2018, what felt like an episode of Netflix’s “Black Mirror” unfolded in real life. Just after 8 in the morning, residents and visitors of Hawaii were woken up to the following startling push notification:
Thankfully, the notification was a false alarm, finally retracted with a second notification nearly 40 interminable minutes later.
The amazing, poignant and sobering stories that came out of those 40 minutes included people:
- determining which children to spend their last minutes with,
- abandoning their cars on streets,
- sheltering in a lava tube,
- believing and acting as we all would if we believed the end was here.
Unfortunately, this wasn’t a Black Mirror episode, and the alert paralyzed an entire state’s population. Thankfully, the alarm was a false one.
A Muted President
As President Trump took office, he introduced a new means for a President to reach his constituents—Twitter, averaging 6 to 7 tweets per day during his first year. On November 2, 2017, many bots that were created to closely monitor the tweets of @realDonaldTrump started reporting that the account no longer existed. Clicking to his account took the user to the above error page.
For a deafening 11 minutes, the nation was unable to listen to its leader, at least via Twitter.
The Hawaiian false alarm was sent by the state’s Emergency Management Agency. Their explanation of the incident was that during a shift change, an employee clicked “the wrong button” while running a missile crisis test, then subsequently clicked through a confirmation prompt (“Are you sure you want to tell 1.5 million people this?”).
Twitter employees had reportedly tried for years to get management attention on ensuring accounts weren’t deleted without proper vetting. The company typically used contractors in the Philippines and Singapore to handle such account administration; Trump’s account was deleted by a German contract worker on his last day at Twitter. Acting on yet-another-Trump-complaint and believing such an important account couldn’t be suspended, the worker clicked the suspend button as his last action for Twitter and then walked out of the building, causing the Twitterverse to read far more into the account’s disappearance than it should have.
In both of these situations, the immediate focus was on the personnel involved in the incident. “Who pushed the button?” is typically one of the initial questions. Assumptions that a new employee or a rogue worker was behind the incident are common, and both the motive and the intelligence of all involved are under inspection.
We at AKF Partners constantly preach “An incident is a terrible thing to waste”. Events such as these warp the known reality into “How the shit can that happen??”, causing enough alarm to warrant special attention and focus, if not panic. Yet, all too often we see teams searching frantically to find any cause, blame the most obvious, immediate factor, declare victory, and move on.
“Who pushed the button?” is only one of many questions.
Toyota’s Taiichi Ohno, the father of Lean Manufacturing, recognized his team’s habit of accepting the most apparent cause and ignoring (wasting) other elements revealed by an incident, potentially allowing it to be repeated. Ohno (the person, not the exclamation typically uttered during an incident) emphasized the importance of asking “5 Whys” in order to move beyond the most obvious explanation (and accompanying blame) and to peel the onion, diving deeper into contributory causes.
Questions beyond the reflexive “What happened?” and “Who did it?” relevant to the false alarm and erroneous account deletion incidents include:
- Why did the system act differently than the individual expected (is there more training required, is the user interface a confusing one)?
- Why did it take so long to correct (is there no playbook for detecting / reversing such a message or key account activity)?
- Why does the system allow such an impactful event to be performed unilaterally, by a single person (what safeguards should exist requiring more than one set of hands)?
- Why does this particular person have such authorization to perform this action (should a non-employee have the ability to delete such a verified, popular and influential account)?
- Why was the possibility of this incident not anticipated and prevented (why were Twitter employee requests for better safeguards ignored for years, why wasn’t the ease of making such a mistake recognized and what other similar mistake opportunities are there)?
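The “more than one set of hands” safeguard raised in the questions above can be sketched as a simple two-person rule. This is only an illustration under assumed names (`perform_impactful_action`, `operator_a`, `supervisor_b` are hypothetical); it does not describe how either the Hawaii alert system or Twitter’s tooling actually works.

```python
class ApprovalRequired(Exception):
    """Raised when an impactful action lacks a second, independent approver."""

def perform_impactful_action(action, requested_by, approved_by=None):
    """Run `action` only if someone other than the requester approved it."""
    if approved_by is None or approved_by == requested_by:
        raise ApprovalRequired(
            f"{requested_by} cannot act alone; a second approver is required"
        )
    return action()

def send_alert():
    # Stand-in for any high-impact operation (hypothetical).
    return "alert sent"

# A single person acting alone is blocked.
try:
    perform_impactful_action(send_alert, requested_by="operator_a")
    outcome_alone = "allowed"
except ApprovalRequired:
    outcome_alone = "blocked"

# The same action goes through once a different person approves it.
outcome_pair = perform_impactful_action(
    send_alert, requested_by="operator_a", approved_by="supervisor_b"
)
```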
Both of these incidents have had an impact far beyond those directly affected (Hawaiian inhabitants or Trump Twitter followers), and have shed light on the need to recognize that the world has changed and that policies and practices of old might not be enough for today. The ballistic missile false alarm revealed that more controls need to be placed on all mass communication, but also that Hawaii (or anywhere/anyone else) is extremely unprepared for the unthinkable. The use of Twitter as a channel for the President now raises questions about its validity as a Presidential record, about who should control such a channel, and about what security surrounds the President’s account.
Ask 5 Whys, look beyond the immediate impact to find collateral learnings, and take notice of all that an incident can reveal.
AKF Partners have been brought in by over 400 companies to avoid such incidents, and when they do occur, to learn from them. Let us help you.
January 3, 2018 | Posted By: AKF
One of the most common questions we get is “What are the most common failures you see tech and product teams make?”. To answer that question we queried our database consisting of 11 years of anonymous client recommendations. Here are the top 20 most repeated failures and recommendations:
1) Failing to design for rollback
If you are developing a SaaS platform and you can only make one change to your current process, make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible, but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad release in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward” for you to agree this is of the utmost importance. The one thing that is most likely to give you an opportunity to find other work (i.e. “get fired”) is to roll a product that destroys your business. In other words, if you are new to your job DO THIS BEFORE ANYTHING ELSE; if you have been in your job for a while and have not done this DO THIS TOMORROW.
2) Confusing product release with product success
Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives, like a release increasing signups by 10%, increasing checkouts by 15%, increasing the average sale price of all checkouts by 12%, or increasing click-through rates by 22%. See #10 below on incenting a culture of excellence. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholders wealthy.
3) Insular product development/engineering
How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised about some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Organize around your outcomes and “what you produce” in cross functional teams rather than around activities and “how you work”.
4) Over engineering the solution
One of our favorite company mottos is “simple solutions to complex problems”. The simpler the solution, the lower the cost and the faster the time to market. If you get blank stares from peers or within your organization when you explain a design do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in resolving the complexity.
5) Allowing history to repeat itself
Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize the value of the team. In the operations world, a failure to correlate past site incidents and find thematically related root causes is a guarantee to continue to fight the same fires over and over. The best and easiest way to improve our future performance is to track our past failures, group them into groups of causation and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues and improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.
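A minimal sketch of the monthly or quarterly review described above: tally incidents by root-cause category and surface the ones that repeat. The log entries and category names are hypothetical, and a real incident log would carry far more detail.

```python
from collections import Counter

# Hypothetical incident log entries: (date, root-cause category).
incidents = [
    ("2018-01-03", "bad release"),
    ("2018-01-09", "capacity"),
    ("2018-01-15", "bad release"),
    ("2018-02-02", "config change"),
    ("2018-02-11", "bad release"),
    ("2018-02-20", "capacity"),
]

def repeating_causes(log, min_count=2):
    """Return (cause, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(cause for _, cause in log)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]

themes = repeating_causes(incidents)
for cause, n in themes:
    print(f"{cause}: {n} incidents")
```

The categories that keep reappearing are the ones to treat at the root cause rather than incident by incident.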
6) Scaling through third parties
Every vendor has a quick fix for your scale issues. If you are a hyper growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. This is not to say that after you design your system to scale horizontally you will not rely upon some technology to help you; rather, once you define how you can horizontally scale, you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases (and NoSQL solutions) provide for multiple types of native replication to keep hosts in sync.
7) Relying on QA to find your mistakes
You cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem and that is where the problem should be treated. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering. Consider implementing test driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. All of this is not to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering.
8) Revolutionary or “big bang” fixes
In our experience, complete re-writes or re-architecture efforts end up somewhere on the spectrum from not returning the desired ROI to complete and disastrous failure. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cut over to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.
9) The Multiplicative Effect of Failure
Every time you have one service call another service in a synchronous fashion you are lowering your theoretical availability. If each of your services is designed to be 99.999% available, where a service is a database, application server, application, webserver, etc., then the product of the availabilities of all of the service calls is your theoretical availability. Five calls gives (0.99999)^5, or roughly 99.995% availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.
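The arithmetic above can be checked with a few lines of code. This sketch assumes every service in the synchronous chain has the same availability and fails independently of the others; the function name and the chain lengths beyond five are illustrative choices, not figures from a real system.

```python
def chained_availability(per_service, calls):
    """Theoretical availability of `calls` synchronous calls in series.

    Assumes each service has the same availability and fails independently.
    """
    return per_service ** calls

PER_SERVICE = 0.99999  # "five nines" per service, as in the text

five_calls = chained_availability(PER_SERVICE, 5)
twenty_calls = chained_availability(PER_SERVICE, 20)

# Show how availability erodes as the synchronous chain grows.
for n in (1, 5, 10, 20):
    print(f"{n:>2} calls: {chained_availability(PER_SERVICE, n):.5%}")
```

Each added synchronous hop multiplies in another factor below 1, so the chain can never be more available than its weakest member.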
10) Failing to create and incent a culture of excellence
Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth.
11) Under-engineering for scale
The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now. That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.
12) “Not Built Here” Culture
We see this all the time. You may even have agreed with point (6) above because you have a “we are the smartest people in the world and we must build it ourselves” culture. The point on relying upon third parties to scale was not meant as an excuse to build everything yourselves. The real point to be made is that you have to focus on your core competencies and not dilute your engineering efforts with things that other companies or open source providers can do better than you. Unless you are building databases as a business, you are probably not the best database builder. And if you are not the best database builder, you have no business building your own databases for your SaaS platform. Focus on what you should be the best at: building functionality that maximizes your shareholder wealth and scaling your platform. Let other companies focus on the other things you need like routers, operating systems, application servers, databases, firewalls, load balancers and the like.
13) A new PDLC will fix my problems
Too often CTOs see repeated problems in their product development life cycles such as missing dates or dissatisfied customers and blame the PDLC itself.
The real problem, regardless of the lifecycle you use, is likely one of commitment and measurement. For instance in most Agile lifecycles there needs to be consistent involvement from the business or product owner. A lack of involvement leads to misunderstandings and delayed products. Another very common problem is an incomplete understanding or training on the existing PDLC. Everyone in the organization should have a working knowledge of the entire process and how their roles fit within it. Most often, the biggest problem within a PDLC is the lack of progress measurement to help understand likely dates and the lack of an appropriate “product discovery” phase to meet customer needs.
14) We cannot hire great people quickly
Often when growing an engineering team quickly the engineering managers will push back on hiring plans and state that they cannot possibly find, interview, and hire engineers that meet their high standards. We agree that hiring great people takes time and hiring decisions are some of the most important decisions managers can make. A poor hiring decision takes a lot of energy and time to fix. However, there are lots of ways to streamline the hiring process in order to recruit, interview, and make offers very quickly. A useful idea that we have seen work well in the past is interview days, where potential candidates are all invited on the same day. This should be no more than 2 - 3 weeks out from the initial phone screen, so having an interview day per month is a great way to get most of your interviewing done in a single day. Because you optimize the interview process, people are much more efficient and it is much less disruptive to the daily work that needs to get done the rest of the month. Post-interview discussions and hiring decisions should all be made that same day so that candidates get offers or letters of regret quickly; this will increase the likelihood of offers being accepted and make a professional impression on those not getting offers. The key is to start with the right answer, that “there is a way to hire great people quickly”, and the myriad ways to make it happen will be generated by a motivated leadership team.
15) It is a SPOF (Single Point of Failure) but we can recover it onto another host quickly
A SPOF is a SPOF, and even if the impact to the customer is low it still takes time away from other work to fix right away in the event of a failure. And there will be a failure, because that is what hardware and software do: they work for a long time and then eventually they fail! As you should know by now, it will fail at the most inconvenient time. It will fail when you have just repurposed the host that you were saving for it, or it will fail while you are releasing code. Plan for the worst case and have it run on two hosts (we actually recommend always deploying in pools of three or more hosts) so that when it does fail you can fix it when it is most convenient for you.
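To put rough numbers on the benefit of pools, the sketch below computes the chance that at least one host in a pool is up. It assumes hosts fail independently and that any single surviving host can carry the load; both are simplifying assumptions, and the 99% per-host figure is hypothetical.

```python
def pool_availability(host_availability, hosts):
    """Probability that at least one host in the pool is up.

    Assumes independent failures and that one surviving host can
    serve all traffic; both are simplifying assumptions.
    """
    all_down = (1.0 - host_availability) ** hosts
    return 1.0 - all_down

HOST_AVAILABILITY = 0.99  # hypothetical per-host figure

single = pool_availability(HOST_AVAILABILITY, 1)  # the SPOF case
pair = pool_availability(HOST_AVAILABILITY, 2)
trio = pool_availability(HOST_AVAILABILITY, 3)

for name, value in (("1 host", single), ("2 hosts", pair), ("3 hosts", trio)):
    print(f"{name}: {value:.6%}")
```

Correlated failures (a shared rack, a shared release) break the independence assumption, which is one more reason to isolate the hosts in a pool from each other.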
16) No Business Continuity plan
No one expects a disaster, but they happen, and if you cannot keep up normal operations of the business you will lose revenue and customers that you might never get back. Disasters can be huge like Hurricane Katrina, where it takes weeks or months to relocate and start the business back up in a new location. Disasters can also be small like a winter snow storm that keeps everyone at home for two days or a HAZMAT spill near your office that keeps employees from coming to work. A solid business continuity plan is something that is thought through ahead of time, before you need it, and explains to everyone how they will operate in the event of an emergency. Perhaps your satellite office will pick up customer questions or your tech team will open up an IRC channel to centralize communication for everyone capable of working remotely. Do you have enough remote connections through your VPN server to allow for remote work? Spend the time now to think through how you will operate in the event of a major or minor disruption of your business operations and document the steps necessary for recovery.
17) No Disaster Recovery Plan
Even worse, in our opinion, than not having a BC plan is not having a disaster recovery plan. If your company is a SaaS based company, the site and services provided are the company’s sole source of revenue. Moreover, with a SaaS company, you hold all the data for your customers that allows them to operate. When you are down they are more than likely seriously impaired in attempting to conduct their own business. When your colocation facility has a power outage that takes you completely down (think 365 Main datacenter in San Francisco), how many customers of yours will leave and never return? Our preference is to provide your own disaster recovery through multiple colocation facilities, but if that is not yet technically feasible or in the budget, at a minimum you need your code, executables, configurations, loads, and data offsite and an agreement in place for both colocation services and hosts. Lots of vendors offer such packages and they should be thought of as necessary business insurance.
18) No Product Management team or person
In a similar vein to #13 above, there needs to be someone or a team of people in the organization who have responsibility for the product lines. They need to have authority to make decisions about what features get added, which get delayed, and which get deprecated (yes, we know, nothing ever gets deprecated but we can always hope!). Ideally these people have ownership of business goals (see #10) so they feel the pressure to make great business decisions.
19) It is okay to bring the site down to roll code
Just because you call it scheduled maintenance does not mean that it does not count against your uptime. While some of your customers might be willing to endure the frustration of having the site down when they want to access it in order to get some new features, most care much more about the site being available when they want it. They are on the site because the existing features serve some purpose for them; they are not there in the hopes that you will roll out a certain feature that they have been waiting on. They might want new features, but they rely on existing features. There are ways to roll code, even with database changes, without bringing the site down. It is important to put these techniques and processes in place so that you plan for 100% availability instead of planning for much less because of planned downtime.
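A quick way to see how much planned downtime costs is to convert it into an availability figure. The numbers below are hypothetical: a 30-day month, four hours of scheduled maintenance, and a 99.9% target chosen only for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumes a 30-day month

def availability_with_downtime(downtime_hours, period_hours=HOURS_PER_MONTH):
    """Best-case availability when the site is down for downtime_hours per period."""
    return 1.0 - downtime_hours / period_hours

def allowed_downtime_minutes(target, period_hours=HOURS_PER_MONTH):
    """Total downtime budget, in minutes, for a given availability target."""
    return (1.0 - target) * period_hours * 60

# Four hours of scheduled maintenance per month (hypothetical).
with_maintenance = availability_with_downtime(4)

# Monthly downtime budget for a 99.9% target (hypothetical).
budget = allowed_downtime_minutes(0.999)

print(f"Availability with 4h maintenance: {with_maintenance:.3%}")
print(f"Monthly budget at 99.9%: {budget:.1f} minutes")
```

Under these assumptions a single four-hour window uses several months of a 99.9% budget before any unplanned outage is counted.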
20) Firewalls, Firewalls, Everywhere!
We often see technology teams that have put all public facing services behind firewalls while many go so far as to put firewalls between every tier of the application. Security is important because there are always people trying to do malicious things to your site, whether through directed attacks or random scripts port scanning your site. However, security needs to be balanced with the increased cost as well as the degradation in performance. It has been our experience that too often tech teams throw up firewalls instead of doing the real analysis to determine how they can mitigate risk in other ways such as through the use of ACLs and LAN segmentation. You as the CTO ultimately have to make the decision about what are the best risks and benefits for your site.
Whatever you do, don’t make the mistakes above! AKF Partners helps companies avoid costly product and technology mistakes - and we’ve seen most of them. Give us a call or shoot us an email. We’d love to help you achieve the success you desire.
December 14, 2017 | Posted By: Marty Abbott
The Law that Almost Wasn’t
Conway’s law had a rather precarious beginning. Harvard Business Review rejected Conway’s thesis, buried as it was in the 43rd paragraph of a 45-paragraph paper, on the grounds that he had not proven it.
But Mel had a PhD in Mathematics (from Case Western Reserve University – Go Spartans!), and like most PhDs he was accustomed to journal rejections. Mel resubmitted the paper to Datamation, a well-respected IT journal of the time, and his paper “How Do Committees Invent?” was published in 1968.
It wasn’t until 1975, however, that the moniker “Conway’s Law” came to be. Fred Brooks both coined the term and popularized Conway’s thesis in the first edition of The Mythical Man-Month. It has since been one of the most widely cited and important, yet frequently misunderstood and misapplied, notions in the domain of product development.
Cliff’s Notes to “How Do Committees Invent?” (the article in which the law resides)
Conway’s thesis, in his words:
… organizations which design systems (in the broad sense used here) are constrained to produce designs which are copies of the communication structures of these organizations.
Conway calls this self-similarity between organizations and designs homomorphism. Preamble to the thesis helps explain the breadth and depth:
… the very act of organizing a design team means that certain design decisions have already been made, explicitly or otherwise
Every time a delegation is made … the class of design alternatives which can be effectively pursued is also narrowed.
Because the design which occurs first is almost never the best possible, the prevailing system concept may need to change. Therefore, flexibility of organization is important to effective design.
Specifically, each individual must have at most one superior and at most approximately seven subordinates
Examples. A contract research organization had eight people who were to produce a COBOL and an ALGOL compiler. After some initial estimates of difficulty and time, five people were assigned to the COBOL job and three to the ALGOL job. The resulting COBOL compiler ran in five phases, the ALGOL compiler ran in three.
There are 4 very important points, and one very good example, in the quotes above:
1) Organizations and design/architecture are intrinsically linked. The organization affects and constrains the architecture - the opposite is not true.
2) Depth of an organization negatively affects design flexibility. The deeper the hierarchy of an organization, the less flexible (or alternatively more constrained) the resulting architecture.
3) We will make mistakes and must organize to quickly fix these.
4) Team size should always be small, which also has an implication for the size of the solution part a team can own (think of Amazon’s re-branding of this point as the “2 Pizza Team”; author’s side note: read Scalability Rules for how this came about).
Important corollaries to Conway’s law suggest that if either an organization or a design changes without a corresponding change to the other, the product will be at risk.
Common Failures in Application of Conway’s Law and How to Fix Them
There are five very common failures in organization and architecture within our clients, the first four of which relate directly to Conway’s points above:
1) Organizations and architectures designed separately. Given the homomorphism that Conway describes, you simply CANNOT do this.
2) Deep, hierarchical organizations. Again – this will constrain design.
3) Lack of flexibility. Companies tend to plan for success. Instead, assume failure, learning, and adaptation (think “discovery” and “Agile” instead of “requirements” and “Waterfall”).
4) Large teams. Forget about these. Small teams, each owning a service or services that the team can support in isolation.
There is a fifth violation that is harder to see in Conway’s paper. Too often, our clients don’t build properly experienced teams around the solutions they deploy. Success in low-overhead organizations requires that teams be cross functional. Whatever a team needs to be successful should be within that team. If you deploy on your own hardware, you should have hardware experience. If you need DBA talent, the team should have direct access to that talent. QA folks should be embedded within the team, etc. Product managers or owners should also be embedded in the team. This creates our fifth failure:
5) Functional teams. Don’t build teams around “a skill” – build them around the breadth of skills necessary to accomplish the task handed to the team.
Conway’s Parting Shot and Food for Thought
Noodle on this: Conway identified a problem early in the life of a new domain. Yet what was true in Conway’s time as a contributor to the art is still true today, over 50 years after his first attempt to forewarn us:
Probably the greatest single common factor behind many poorly designed systems now in existence has been the availability of a design organization in need of work.
AKF Partners helps companies ensure that their organizations and architectures are aligned to the outcomes they desire. We help companies develop better, more highly available and more highly scalable products with faster time to market and lower cost. Give us a call or shoot us an email. We’d love to help you achieve the success you desire.
November 8, 2017 | Posted By: AKF
In 1965, psychologist Bruce Tuckman published his theory of group dynamics. This theory describes the stages (or phases) through which a team progresses en route to optimal productivity. While generally useful for any organization, and prescriptive as to what leaders should do, and when, to boost performance, it has profound implications for Agile development practices and how we build organizations around these Agile practices.
The first stage is forming. This is where the team first comes together. Here, the individuals are trying to get to know each other. They tend to be polite and cordial, but they do not fully trust each other.
In this stage, team productivity and team conflict are both low. The team spends time agreeing on what the team is supposed to do. A lack of agreement on the team’s purpose can cause members to miss goals because they are individually targeting different things. Team members rely on patterned behavior and look to the team leader for guidance and direction. The team members want to be accepted by the group. Cautious behavior on the part of the team starts to depress overall team outcomes. Good leadership, emphasizing goals and outcomes, is important to set the stage for future team behaviors and outcomes.
Once the team’s goals are clear, they move into the next stage, storming. Here, the team starts to develop a plan to achieve the goal and defines what to do and who does it. Friction starts to occur as members propose different ideas. Trust within the team remains low and affective conflict rises as people vie for control. Cliques can form. Productivity drops even lower than in the first stage.
Once the team agrees on the plan and the roles and responsibilities, it can move to the next stage. Without agreement, the team can get stuck. Symptoms include poor coordination, people doing the wrong things and missing deadlines, to name a few. Good leadership here focuses on fast affective conflict resolution, and serves to help reinforce team goals and outcomes in order to quickly move to more productive phases.
Once team members agree to the plan and understand their roles, they enter the norming phase. Affective conflict goes down, cognitive (beneficial) conflict and trust increase. The team focuses on how to get things done and productivity begins to increase. The team develops “norms” about how to work together and collaborate. A lack of these norms can cause issues such as low quality and missed deadlines.
Leadership within the team becomes clear and cliques dissolve. Members begin to identify with one another and the level of trust in their personal relationships contributes to the development of group cohesion. The team begins to experience a sense of group belonging and a feeling of relief from resolving interpersonal conflicts. Team identity starts to take hold and innovation and creativity within the team increases. The members feel an openness and cohesion on both a personal and task level. They feel good about being part of the team.
The final stage, performing, is not achieved by all teams. This stage is marked by an interdependence in personal relations and problem solving within the realm of the team’s tasks. Team members share a common goal, understand the plan to achieve it, know their roles and how to work together. The team is firing on all cylinders. At this point, the team is highly productive and collaborates well. They are trusting of each other and “have each other’s back.” Healthy conflict is encouraged. There is unity: group identity is complete, group morale is high, and group loyalty is intense.
Not all teams get to this phase. They can get stuck in a previous phase or slide back into them from a higher phase. Leadership that focuses on affective conflict resolution, team identity creation, a compelling vision and goals to achieve that vision is critical to reaching the Performing phase. It is usually not easy for teams to quickly progress through these stages, and it often takes 6 months or more for a team to reach the Performing phase.
Impact to Agile Development
We often see companies make the mistake of coalescing teams around initiatives. Sometimes called “virtual teams” or “matrixed teams”, these teams suffer the underperforming phases of Tuckman’s curve repeatedly, especially when these initiatives are of durations shorter than 6 months. But even with durations of a year, six months of that time is spent getting the team to an optimum level of performance.
Tuckman’s analysis indicates that teams should be together for no less than a year (giving a 6-month return on a 6-month investment) and ideally for about 3 years. The upper limit is informed by the research on groupthink and its implications for creativity, performance and innovation within teams. Teams then should become semi-permanent and we should seek to move work to teams rather than form teams around work. To be successful here, we need multi-disciplinary teams capable of handling all the work they may get assigned. Further, the team needs to be familiar with and “own” the outcomes associated with the solution (or architectural components) with which they work. More on that in future articles discussing Conway’s Law and Empathy Groups.
AKF Partners helps companies understand and apply the extant theory around organizational development in order to turbo-charge engineering performance. Wondering if your engineering productivity decreases as you grow your engineering and product teams? We can help you fix that and get your productivity back to the level it was as a startup!
October 24, 2017 | Posted By: Marty Abbott
North Korea’s recent antics involving ballistic missiles and nuclear weapons are scary. While we seem to be edging ever closer to nuclear war – closer perhaps than any time since the Cuban Missile Crisis – the probability of such an occurrence remains relatively low. Even an apparently irrational head of state such as Kim Jong Un must understand that the use of a nuclear device will turn nearly the entire world against him. The use of a device against any nation would end his reign in relatively short order and end the Democratic People’s Republic of Korea (DPRK) as we know it today. This then raises the question of why Kim Jong Un would participate in such brinkmanship. Many politicians and strategists seem to think it is a strategy to force other nations to recognize the DPRK and reduce the onerous sanctions currently levied against it by the United Nations. Perhaps, but maybe in addition to or instead, Kim is trying to take our eyes off the war he has been waging for many years: a cyber war against many nations.
Both cyber warfare on the part of a nation state and cyber terrorism waged by stateless entities aim to attack our economic infrastructure. Both North Korea and terrorists understand that attacking our economy, our businesses and our personal wealth is the most effective method of causing harm to our nations and their citizens. North Korea is likely behind many recent attacks on financial institutions, has ties to the WannaCry ransomware outbreak, was behind the attack on Sony Pictures and was involved in a heist of $81M from the Bangladesh Central Bank. Each of these was likely perpetrated by the formidable DPRK cyber warfare group “Unit 180”.
When not engaging in direct attacks to steal money from or otherwise harm business operations, both terrorists and nation states seek to use the products of a company for nefarious purposes. Recent examples include ISIS using eBay’s marketplace to funnel money to an operative in the US, and Russia purchasing advertising on Facebook in an attempt to influence the US election. Cyber warfare and terrorism are not just threats – they are daily occurrences. The foregoing examples illustrate how the game has changed. The question for you is: has your company changed enough to successfully protect itself against this growing and evolving threat?
The answer for most companies with which we work is “No”. Security organizations seem oblivious to the changing cyber threat. They continue to focus almost exclusively on barrier protection systems and cyber response processes. Few companies outside of the financial sector have developed analytics systems to help identify emerging threats and nefarious activity. Fewer still practice aggressive “patrolling” to identify threats outside of the perimeter of their digital operations. Here are a few questions to help you evaluate whether your company has the mindset necessary to be successful in the world of cyberwarfare and terrorism:
Who means you harm and how do they intend to perpetrate it?
Military veterans know that a successful defense requires more than just “Alamo’ing Up” behind a wall and hunkering down. You must patrol and reconnoiter the surrounding area to understand whence the enemy will come, in what numbers and with what capabilities. If your security team isn’t actively attempting to identify threats outside of your organization - and by this I mean beyond your walls - you are most certainly going to be surprised.
How do you find new and emerging behavior within your product and operations?
Given the threat of your product being used for nefarious purposes, how do you identify when new behaviors or trends emerge? What analytics systems do you have to identify that existing personas or users are acting in new or odd ways? How do you keep an eye on new patterns or trends of usage by both existing and new users? In very high transaction environments, how do you identify the less than 1 basis point of activity that may be nefarious in nature buried within 99.99% of valid transactions? These questions aren’t likely to be answered by a “traditional” security team – they require teams with deep analytic skills and systems dedicated to analytics and machine learning. Similarly, traditional analytics teams may not have the right mindset to seek out nefarious transactions.
Do you have the right people?
This is the most important question of all. You don’t need to fire your CISSP folks – you still have a need for them within your security team. But you also want folks with a proven record of being able to think like and use the tools of cyber criminals, terrorists, and warfare-focused nation-states. These folks are unlikely to be willing to wear suits and ties to work, preferring instead to wear shorts and Birkenstocks. The traditional corporate mindset and tools will stand in the way of them being successful on your behalf. They need to use Tor browsers and have access to sites to which you are unlikely to want the remainder of your employees going. The biggest barrier to success here with most companies is fit with a company’s culture – but I can guarantee you that if you don’t have some of these folks on staff you are not going to be successful in this new era of cyber warfare.
How do you fare against the above questions? Are you properly set up to defend your company and your shareholders against today’s cyber threat? If you are uncertain, reach out to AKF Partners – we’ll evaluate your security infrastructure and approach and help ensure that you can properly defend yourself against the growing threat.
October 15, 2017 | Posted By: Marty Abbott
I’m a huge Malcolm Gladwell fan. Gladwell’s ability to convey complex concepts and virtually incomprehensible academic research in easily understood prose is second to none within his field of journalism. A perfect example of his skill is on display in The Tipping Point, where Gladwell wrestles the topic of Complexity Theory (aka Chaos Theory) into submission, making it accessible to all of us. In The Tipping Point, Gladwell also introduces us to The Broken Windows Theory.
The Broken Windows Theory gets its name from a 1982 article in The Atlantic Monthly. This article asked the reader to imagine a building with a few broken windows. The authors claim that the existence of these windows invites vandals to break still more windows. A continuous cycle of expanding vandalism ensues, with squatters moving in, nearby buildings getting vandalized, etc. Subsequent authors expanded upon the theory, claiming that the presence of vandalism invites other crimes and that crime rates soar in communities where unhandled vandalism is present. A corollary to the Broken Windows Theory is that cities can reduce crime rates by focusing law enforcement on petty crimes. Several high profile examples seem to illustrate the power and correctness of this theory, such as New York Mayor Giuliani’s “Zero Tolerance Program”. The program focused on vandalism, public drinking, public urination, and subway fare evasion. Crime rates dropped over a 10 year period, corresponding with the initiation of the program. Several other cities and other experiments showed similar effects. Proof that the hypothesis underpinning the theory is correct.
Not So Fast…
Enter the self-described “Rogue Economist” Steven Levitt and his co-author Stephen Dubner - both of Freakonomics fame. While the two authors don’t deny that the Broken Windows theory may explain some drop in crime, they do cast significant doubt on the approach as the primary explanation for crime rates dropping. Crime rates dropped nationally during the same 10 year period in which New York pursued its Zero Tolerance Program. This national drop in crime occurred both in cities that practiced Broken Windows and in those that did not. Further, crime rates dropped irrespective of either an increase or decrease in police spending. The explanation therefore, argue the authors, cannot primarily be Broken Windows. The most likely explanation and most highly correlated variable is a reduction in a pool of potential criminals. Roe v. Wade legalized abortion, and as a result there was a significant decrease in the number of unwanted children, a disproportionately high percentage of whom would grow up to be criminals.
Gladwell isn’t therefore incorrect in proffering Broken Windows as an explanation for reduction in crime. But the explanation is not the best one available and as a result it holds residence somewhere between misleading (worst case) and incomplete (best case).
To be fair, it’s hard to hold Gladwell accountable for this oversight. Gladwell is not a scientist and therefore not trained in how to scientifically evaluate the research he reported. Furthermore, his is an oft repeated mistake even among highly trained researchers. And what exactly is that mistake? The mistake made here is illustrated by the difference in approach between the Broken Windows researchers and the Freakonomics authors. The Broken Windows researchers started with something like the following question “Does the presence of vandalism invite additional vandalism and escalating crime?” Levitt and Dubner first asked the question “What variables appear to explain the rate of crime?”
Broken Windows started with a question focused on deductive analysis. Deduction starts with a hypothesis - “Evidence of vandalism and/or other petty crimes invites similar and more egregious crimes”. The process continues to attempt to confirm or disprove the hypothesis. Deduction starts with a broad and abstract view of the data – a generalization or hypothesis as to relationships – and attempts to move to show specific relationships between data elements. The Broken Windows folks started with a hypothesis, developed a series of experiments to test the hypothesis and then ultimately evaluated time series data in cities with various Broken Windows approaches to policing. What they lacked was a broad question that may have developed a range of options indicating possible causes.
The Freakonomics authors started with an inductive question. Induction is the process of moving from specific observations about data into generalizations. These generalizations are often in the form of hypotheses or models as to how data interacts. Induction helps to inform what questions should be asked of the data. Induction is the asking of “What change in what independent variables seems to correspond with a resulting change in some dependent variable?” Whereas deduction works from independent variable to dependent variable, induction attempts to work backwards from dependent variable to identify independent variable relationships.
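As a rough illustration of the inductive question, the sketch below ranks candidate independent variables by how strongly each one correlates with a dependent variable. The numbers and variable names are invented for the example and are not real crime data, and correlation is only one of many exploratory measures. It suggests which hypotheses deserve deductive testing; it does not establish cause.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_candidates(dependent, candidates):
    """Order candidate variables by absolute correlation with the dependent variable."""
    scores = {name: pearson(values, dependent) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up observations for illustration only (an assumption, not real data).
crime_rate = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0]
candidates = {
    "petty_crime_arrests": [3.0, 5.0, 2.0, 6.0, 4.0, 5.0],
    "at_risk_population": [20.0, 18.5, 16.0, 14.5, 12.0, 10.0],
    "police_spending": [5.0, 5.0, 6.0, 4.0, 5.0, 6.0],
}

ranking = rank_candidates(crime_rate, candidates)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Starting from the dependent variable and scanning many candidates is what distinguishes this from testing a single hypothesis chosen in advance.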
The jump to deduction, without forming the right questions and hypotheses through induction, is the biggest mistake we see in developing Big Data programs and implementing Big Data solutions. We all approach problems with unique experiences and unique biases. The combination of these often causes us to race to hypotheses and want to test them. The issue here is two-fold. The best case is that we develop an incomplete (and as a result partially or mostly incorrect) answer similar to that of The Broken Windows researchers. The worst case is that we suffer what statisticians call a Type 1 error – confirming an incorrect answer. The probability of Type 1 errors increases when we don’t look for alternative or better answers for outcomes within our data sets. Induction helps to uncover those alternative or supporting explanations. Exploring the data to discover potential relationships helps us to ask the right questions and form better hypotheses and better models. Skipping induction makes it highly probable that we will get an incorrect, misleading or substandard answer.
But it is not enough to simply ensure that we practice both induction and deduction. We must also recognize that the solutions that support induction are different from those that support deduction. Further, we must understand that the two processes, while complementary, can actually interfere with each other when performed on the same system. Induction is necessarily a very broad and as a result slow and tedious process. Deduction, on the other hand, needs significantly less data and “prefers” to be faster in implementation. Inductive systems are best supported by solutions that impose very few relations or structure on the data we observe. Systems that support deduction, in order to allow for faster response times, impose increased structure relative to inductive systems. While the two phases of discovery (Induction and Deduction) support each other, their differences suggest that they should be performed on solutions purpose built to their specific needs.
Similarly, not everyone is equally qualified to perform both induction and deduction. Our experience is that the folks who tend to be good at determining how to prove relationships between variables are often not as good at identifying patterns and vice versa.
These two observations, that the systems that support induction and deduction should be separated and that the people performing these tasks may need to be different, have ramifications to how we develop our analytics systems and organize our Big Data teams. We’ll discuss these ramifications and more in our next post, “10 Anti-Patterns within Big Data”.
September 19, 2017 | Posted By: Greg Fennewald
Everyone was saddened to see the horrific destruction storms caused to Houston and Florida, including deaths and extensive property damage. It seems reasonable that the impact of these hurricanes was lessened by advance notice and preparation – stockpiling supplies, evacuating the highest risk areas, and staging response resources to assist with recovery and rebuilding.
Data centers operate every day with a similar preparation mindset: diesel generators to provide power should the utility fail, batteries to keep servers running during a transition, potentially stored water or a well to replace municipal water service for cooling systems, and food and water for personnel unable to leave the location.
What happens when a “prepared” location such as a data center encounters a hurricane with strong winds, heavy rain, and extensive flooding? In some cases, the data center survives without impact, although there certainly will be outages and failures. Examples of data centers surviving Harvey in good shape can be seen here, while accounts of the service impacts caused by Hurricane Sandy can be seen here.
Data Center Points of Failure
Let’s examine what may enable a data center to survive without functional impact. Extensive risk investigation goes into site selection for data centers. Data centers are expensive to build with costs measured in the tens or even hundreds of millions of dollars. The potential business impact of a failure can be costly with liquidated damage clauses in hosting contracts. These factors lead to data centers being located outside of flood plains, away from hazardous material routes, and stoutly constructed to endure storm winds likely in the region.
Losing utility power is regarded as a “when” not an “if” in the data center industry (be that an outage or a planned maintenance activity), and diesel generators are a common solution, often with 24 hours or more of fuel on hand and multiple replenishment contracts. Data centers can survive for days/weeks without utility power, and in some cases for months. How could flooding impact power? The service entrance for a data center, where the utility power is routed, is often buried underground. Utility power is likely to be lost during flooding, either from damage due to flooding or intentional actions to prevent damage by shutting down the local grid. A data center would operate on generator if the data center itself is not flooded, although fuel replenishment is not likely. If there are two feet of water in the main electrical room(s), the data center is going dark.
Many large data centers rely on evaporating water to cool the servers they host. Evaporative cooling is generally more energy efficient than other options, but introduces an additional risk to operations – water supply. In many locations, municipal water pressure is lost during an extended power outage. Data centers can mitigate this risk by using water storage tanks or water wells onsite. As with diesel generators, data centers can then operate normally for hours or days without municipal water. A data center should be outside the flood plain, able to operate without utility power or municipal water for hours or days, and structurally strong enough to handle the winds of a major storm – is there any other risk to mitigate? Network connectivity and bandwidth.
Most data centers need to communicate with other data centers to fulfill their OLAP or OLTP purpose. Without connectivity, services are not available. Data should be fine, but it is becoming increasingly stale. Transactions and traffic are done. Like utility power, network connections are usually buried. With distance and geographic limitations involved, network pathways may get flooded as may the facilities that aggregate and transmit the data. Telecom facilities generally have generators and other availability measures, but can be forced into less advantageous locations and may have a shorter runtime standard than a data center.
Data centers that are serious about availability generally have carrier diversity and physical pathway diversity to mitigate carrier outages and “backhoe fades”. This may help in the event of widespread flooding as well. The reality is a data center without connectivity is generally useless. All the risk mitigation going into structural design, power and cooling redundancy, and fire protection is moot if connectivity fails.
Preparing for the Inevitable
The best way to mitigate these risks is to not rely on a single data center location. One is none and two is one. Owned, colo, managed hosting, or cloud – be able to survive the loss of a single location. The RTO and RPO of the business will guide the choice of active – active, hot – cold, or data backup with an elastic compute response plan. Hurricanes can cause regional impact, such as Irma disrupting most of Florida. In years past, many companies decided to have two data centers within 20 miles of each other to support synchronous database replication: a primary site in one borough of New York City, and the DR site in a different borough. Replication options and database management techniques have advanced sufficiently to allow far greater dispersion today. Avoid a regionally impacting event by choosing data centers in diverse regions.
Operating from 3 locations can be cheaper than 2, and can also improve customer satisfaction with reduced response times produced by serving customers from the nearest location. See Rule 12 in Scalability Rules. The ability to operate from multiple locations also enables a choice to adjust the redundancy of those locations. A combination of Tier II and III locations may be a more economical choice than a pair of Tier IV locations.
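The cost claim follows from simple capacity arithmetic. As a sketch, assume peak load is spread evenly across all active sites and that the surviving sites must carry the full peak after any one site is lost. The function name and the even-spread assumption are illustrative and are not taken from Scalability Rules.

```python
def total_capacity_needed(sites: int) -> float:
    """Total capacity, as a multiple of peak load, needed to survive losing one site.

    Assumes every site is active and load is spread evenly (an illustrative
    simplification). After one site fails, the remaining (sites - 1) locations
    must each carry 1 / (sites - 1) of peak, and every site is built to that size.
    """
    if sites < 2:
        raise ValueError("need at least two sites to survive the loss of one")
    per_site = 1.0 / (sites - 1)
    return sites * per_site


for n in (2, 3, 4):
    print(f"{n} sites: {total_capacity_needed(n):.0%} of peak capacity")
```

Under these assumptions two sites require 200% of peak capacity in total while three require 150%. The sketch ignores the fixed cost of running an additional location and the cost of replicating data among sites, both of which would need to be weighed in a real plan.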
Developing a hosting plan can be complicated and frustrating, particularly since the core competency of your business is likely not data centers. AKF Partners can help – not only with hosting strategy, but also the product architecture and operational processes needed to weld infrastructure, architecture, and process into a seamless vehicle that delivers services to your clients with availability the market demands.
Hurricanes aren’t the only disasters that can take down your data center. Solar flares, runaway SUVs, civil disruption, tornadoes and localized power outages have all caused data centers to fail. Natural disasters of all types trail equipment failures and human error as causes of service impacting events (source: 365DataCenters). According to FEMA, 40% of businesses that close due to a disaster don’t reopen, and of those that do only 29% are in business two years after the disaster (source: FEMA). Don’t be a statistic. AKF Partners can help you with the product architecture and data center planning necessary to survive nearly any disaster.
September 5, 2017 | Posted By: Roger Andelin
Last month, a bot developed by OpenAI (co-founded by Elon Musk) beat the world’s best, pro Dota 2 players. This is another milestone accomplishment in the field of artificial intelligence and machine learning and more fuel for the fire of concerns surrounding the AI debate. However, before we jump into that debate, here is some background you should understand about the technology fueling this debate.
The Evolution of Traditional Programming
A lot of what computer programming is can be simplified into three steps. First step, read in some data. Second step, do something with that data. Third step, output some result.
For example, imagine you want to fly somewhere for the weekend. You may first go to your travel app and input some dates, times, number of people traveling, airports, etc. Second, the app uses that information to search its database of available flights. Third, it returns a list of available flights for you to see.
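The three steps can be sketched in a few lines of Python. Everything here is hypothetical: the `Flight` record, the in-memory flight list and the hard-coded query stand in for a real travel app's user input and database.

```python
from dataclasses import dataclass


@dataclass
class Flight:
    origin: str
    destination: str
    date: str
    price: float


# Hypothetical in-memory "database" of flights; a real app would query a service.
FLIGHTS = [
    Flight("SFO", "JFK", "2017-09-08", 320.0),
    Flight("SFO", "JFK", "2017-09-09", 280.0),
    Flight("SFO", "SEA", "2017-09-08", 120.0),
]


def search_flights(origin, destination, date):
    """Step 2: do something with the input (filter the flight list)."""
    return [
        f for f in FLIGHTS
        if f.origin == origin and f.destination == destination and f.date == date
    ]


# Step 1: read in some data (hard-coded here in place of user input).
query = {"origin": "SFO", "destination": "JFK", "date": "2017-09-08"}

# Step 2: process it.
results = search_flights(**query)

# Step 3: output some result.
for flight in results:
    print(f"{flight.origin} -> {flight.destination} on {flight.date}: ${flight.price:.2f}")
```

Every decision the program makes is visible in `search_flights`, which is the property the rest of this article contrasts with machine learning.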
This approach to software design has been the norm since the earliest days of programming. Artificial intelligence, in particular machine learning, has changed that approach. The first step is still the same: Read in some data. The third step is the same: Output some result.
However, with artificial intelligence technologies like machine learning, the second step, doing something with the data, is very different. In the example of finding a flight, a programmer easily can read the software code to understand the sequence of steps the computer has been programmed to do to produce the output data. If the programmer wants to change or improve the program’s behavior she can do that by writing new code or by altering the existing code. For example, if you wanted to compare the prices for available flights near the dates you have selected, a programmer can easily change several lines of code in the program to do just that. The programming code identifies every step the computer takes to arrive at its output. Said another way, the program only does what it’s specifically told to do in the code, nothing more or nothing less.
By contrast, the output of today’s most common machine learning programs is not determined by instructions written in computer code. There is no code for a programmer to read or modify when a change is desired. The output is determined by the program’s neural network.
Neural Networks in Action
What is a neural network? At the core of a neural network is a neuron. Similar to a traditional computer program, a neuron takes some input data, does a mathematical calculation on that data and then outputs some data. A typical neuron in a neural network will receive as input hundreds to thousands of numbers, typically between 0 and 1. A neuron will then multiply each number by a weight and sum the result of all the numbers. Many neurons will then convert the result into a number between 0 and 1. That result is then sent to the next neuron in sequence until the final output neuron is reached.
Here is an example of the math a typical neuron will do: If “x1, x2, x3…” represents input data and “w1, w2, w3…” represent the weights stored in the neuron, the calculation done by the neuron in a neural network looks like this: x1*w1 + x2*w2 + x3*w3 and so forth.
You can think of the calculation inside the neuron in a different way: The neuron is reading in a bunch of numbers and the weights in the neuron determine the importance, or “weight”, of each input in producing the output. If the input is not important, the weight for the input will be near zero and the input is not passed along to the next neuron. Therefore, the weights in a neuron effectively decide what input is valuable and what input should be ignored.
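A minimal sketch of a single neuron, following the weighted-sum description above, looks like this. The sigmoid function used to squash the sum into the range 0 to 1 is an assumption (it is one common choice), and the input and weight values are made up.

```python
import math


def neuron(inputs, weights):
    """Weighted sum of the inputs, squashed to a value between 0 and 1."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation (an assumption)


inputs = [0.9, 0.2, 0.5]
weights = [2.0, 0.0, -1.0]  # a weight of zero means the second input is ignored

output = neuron(inputs, weights)
print(round(output, 3))
```

Because the second weight is zero, changing the second input has no effect on the output, which is the sense in which the weights decide which inputs matter.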
In a neural network, neurons like the one I described above are connected in parallel and in series to create a matrix of neurons. The input data to a neural network will go into hundreds or thousands of neurons in parallel, all with different weights. The output of those neurons is then sent to another layer of neurons and so forth, usually multiple layers deep. This is called a deep neural network. Another way to look at this is the neurons are grouped into a matrix of rows and columns, all interconnected. The final layer of the neural network is the output layer. Therefore, the final output of a neural network is the result of millions of calculations done by the neurons of that network.
When a programmer creates a neural network in software, the weights for each neuron are initially just random numbers. In other words, the weights arbitrarily decide to either diminish, increase or leave the input data alone, and output from the network is random. However, through a process called training, the weights move from randomly assigned values to values that can produce very useful outputs.
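To make the layering concrete, here is a small sketch of an untrained network: each layer is a list of neurons, each neuron is a list of random weights, and the output of one layer becomes the input to the next. The layer sizes, the weight range and the sigmoid activation are assumptions chosen for illustration, and real networks are far larger.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_inputs, n_neurons, rng):
    """One list of random weights per neuron in the layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(n_neurons)]


def forward(network, inputs):
    """Feed the inputs through each layer in turn and return the last layer's outputs."""
    activations = inputs
    for layer in network:
        activations = [
            sigmoid(sum(x * w for x, w in zip(activations, neuron_weights)))
            for neuron_weights in layer
        ]
    return activations


rng = random.Random(0)  # fixed seed so the run is repeatable
layer_sizes = [4, 5, 3, 1]  # 4 inputs, two hidden layers, 1 output neuron
network = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

result = forward(network, [0.1, 0.7, 0.3, 0.9])
print(result)  # untrained, so the value is essentially arbitrary
```

Nothing in the output is meaningful yet, because the weights have not been trained.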
Training is both a time consuming and complicated mathematical process. However, it is much like the training you and I would do to get better at something. For example, let’s say I wanted to learn how to shoot an arrow with a bow. I might pick up the bow and arrow, point it at the target, pull back the string and release. In my case, I know the arrow would miss the target. Therefore, I would try again and again, making corrections to my aim based on how far and in which direction I was off from the target.
During the training process for a neural network, the weights in each of the neurons are changed slightly to improve the output, or “aim.” The most common approach for making those changes is called backpropagation. Backpropagation is a mathematical approach for applying corrections to every weight in every neuron of the network. During training, input is fed into the network and output is generated. The output is compared to the desired target and the difference between the output and the target is the error. Using the error, backpropagation makes changes to the weights in each neuron to reduce the error. If all goes well during training and backpropagation, the output error diminishes until the network performs at expert or better than expert level.
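The correction loop can be sketched on the smallest possible case: one neuron learning the logical OR of two inputs. This is plain gradient descent on a single neuron, not full backpropagation, which uses the chain rule to pass the error back through every layer of a deep network. The update shown is the standard gradient step for a sigmoid neuron with a cross-entropy loss. The training data, learning rate, epoch count and the constant bias input are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, inputs):
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)))


# Training data for logical OR. The trailing 1.0 is a constant "bias" input,
# an addition the article does not discuss; without it the neuron could never
# output less than 0.5 when both real inputs are 0.
samples = [
    ([0.0, 0.0, 1.0], 0.0),
    ([0.0, 1.0, 1.0], 1.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]

rng = random.Random(1)
weights = [rng.uniform(-1.0, 1.0) for _ in range(3)]  # start from random weights
learning_rate = 0.5


def total_error(current_weights):
    """Sum of the absolute differences between outputs and targets."""
    return sum(abs(predict(current_weights, x) - target) for x, target in samples)


error_before = total_error(weights)

for _ in range(2000):
    for inputs, target in samples:
        error = predict(weights, inputs) - target  # how far off the "aim" was
        # Nudge each weight against the error, in proportion to its input.
        weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]

error_after = total_error(weights)
print(f"error before: {error_before:.3f}, after: {error_after:.3f}")
```

Each pass is the archer's correction in miniature: shoot, measure the miss, adjust, and repeat until the miss is small.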
AI vs Humans
In the case of the OpenAI Dota bot that recently beat the world’s best Dota 2 player, the outputs, which were a sequence of steps, strategies and decisions, went from random moves to moves that were so good the bot was able to easily defeat the best pro players in the world. The critical information that enabled the bot to win its matches was stored in the weights of the neurons and the neural network architecture itself.
A good question at this point is to ask if a programmer looking at the Dota 2 bot’s neural network could understand the steps taken by the bot to beat the human player. The answer is no. A programmer can see areas of the neural network that influence an output but it is not possible to explain why the bot took specific steps to formulate its moves and strategies. All the programmer would see is a huge matrix of weights that would be quite overwhelming to interpret.
Another good question to ask is whether or not a program written traditionally by a programmer with step by step instructions could beat the best Dota 2 player. The answer is no. Step by step programs where the programmer specifically instructs the computer to do something would easily be defeated by a professional player. However, a neural network can learn from training things that a programmer would never have the knowledge to program, store that learning in its neurons and use that learning to do things like defeat a human pro.
What makes the Dota 2 bot special is that it learned to beat the best pro players by playing against itself, whereas most machine learning programs learn from training on data given to them by a programmer. In machine learning, good training data is like gold. It’s scarce and valuable. (note: This is one reason why Google and other big tech companies want to collect so much data.) Data is used to train neural networks to do useful things like recognizing people and places in your pictures or recognizing your voice from others in your family. OpenAI built a bot that learned almost entirely by playing against itself with the exception of some coaching provided by the OpenAI team. OpenAI has shown clearly that learning can occur without having tons of training data. It’s a little like being able to make gold.
Does the development of the OpenAI Dota bot mean bots can now decide to train against themselves and become super bots? No. But it does say that humans can now program two bots to train against each other to become super bots. The key enabler is us. It’s anyone’s guess what type of bot can be imagined and developed in this way, useful or harmful. To most, a gaming super bot seems pretty innocuous, except of course to the gamer who may unexpectedly run into one during a match. However, it’s not hard to imagine super bots that are not so harmless. Or perhaps you can imagine a time when someone trains a bot to play football against itself until the bot becomes better at calling plays and strategy than every coach in the NFL. What happens then? The answer is disruption. Are you ready for it?
AKF Partners recommends that boards and executives direct their teams to identify sources of innovation and patterns of disruption that AI techniques may represent within their respective markets. Walmart is already working on facial recognition technology in their stores to determine whether or not shoppers are satisfied at checkout. Will this give them a potential advantage over Amazon? How can machine learning and AI help you prevent fraud in your payment systems or the use of your commerce system to launder money?
AKF is prepared to help answer that question and others you may be facing. We will help you craft your AI strategy, sort through the hype, help you find the opportunities, and identify the potential threats of AI technology to your business.
Reach out to AKF
August 9, 2017 | Posted By: Marty Abbott
We have a saying in AKF Partners that “an incident is a terrible thing to waste”. When things go poorly in a firm, stakeholders (shareholders, partners, employees) pay a price. Having already paid a price, the firm must maximize the learning opportunity the incident presents. Google wasted such a learning opportunity by failing to capitalize on an incredible teaching moment with the termination of James Damore (the author of what is sometimes called the “Anti-Diversity Manifesto”). While Google seems to have “done the right thing” by firing Damore, it is unclear that they “did it for the right reason”. The “right reason” here is that diversity is valuable to a company because it increases innovation and in so doing increases the probability of success. Further, diversity is hard to achieve, takes great effort and can easily be derailed with very little effort. Companies simply cannot allow employees to work at odds with incredibly valuable diversity initiatives.
Diversity Drives Innovation and Success
My doctoral dissertation journey introduced me to diversity and its beneficial effects on innovation, time to market, and success within technology product firms. Put simply, teams that are intentionally organized to highlight both inherent (traits with which we are born) and acquired (traits we gain from experience) diversity achieve higher levels of innovation. Research published in the Harvard Business Review confirms this, indicating that diverse teams out-innovate and out-perform other teams. Diverse teams are more likely to understand the broad base of needs of the market and clients they support. Companies with very diverse management teams are 35% more likely to have financial returns above the mean for their industry. Firms with women on their board on average have a higher ROE and net income than those that do not.
Differences in perspective and skills are things we should all strive to have in our teams. As we point out in The Art of Scalability, these differences increase beneficial cognitive conflict. Increases in cognitive conflict open a range of strategic possibilities that in turn engender higher levels of success for the firm.
We have for too long allowed the struggle for diversity to be waged on the battleground of “fairness”. The problem with “fair” is that what is “fair” to one person may seem inherently unfair to another. “Fair” is subjective and “fair” is too often political. “Success” on the other hand is objective and easily measured. Let’s move this fight to where it belongs and embrace diversity because it drives innovation and success. After all, anyone who can’t get behind winning doesn’t deserve to be on a winning team.
Achieving Diversity is Hard
While the value of diversity is high, the cost to achieve it is also unfortunately high – especially within software teams. As my colleague Robin McGlothin recently wrote, the percentage of computer science degrees awarded to women has been declining over the last 25 years. Most other minorities are similarly underrepresented in the field relative to their corresponding representation in the US population.
As in any market with high demand and low supply, companies need to find innovative ways to attract, grow and retain talent. These activities may include special mentoring programs, training programs, or scholarships at local universities meant to attract the group in question. These approaches may seem “unfair” to some, but they are in truth capitalism at its best - the application of market forces to solve a supply and demand problem. When a skill or trait is under high demand and short supply, the cost for that skill goes up. The extra activities above are nothing more than an increased cost to attract and retain the skills we value.
Companies desiring to achieve success in innovation through diversity MUST approach it in a steely, single-minded fashion. Any dissent as it relates to outcomes detracts from the probability of success. How many people with diverse backgrounds will leave or have left Google because of Damore’s missive? How many candidates won’t accept offers? Losing even one great candidate is an unacceptable additional cost given the already high cost to achieve success.
The Bottom Line
Structuring organizations and building cultures that tap the power of inherent and acquired diversity pays huge dividends for firms in terms of innovation, time to market, ROE and net income. While the rewards are high, the cost to achieve these benefits is also high. Success requires a steely, single-minded pursuit of diversity excellence.
The successful company will allow no dissent on this topic, as dissent makes the firm less attractive to the ideal candidate. Given a constrained supply under high demand, the candidate can and should go to the most welcoming environment available.
Put simply, Google did the right thing in firing Damore. But they failed to fully capitalize on the unfortunate event. The right answer, when asked about the reason for firing, would look something like this: “We recognize that diversity in experiences, background, gender and race drives higher levels of innovation and greater levels of success. Our culture will not tolerate employees who are not aligned with creating stakeholder value.”
Interested in driving innovation and time to market in your product and engineering teams? AKF Partners helps companies create experientially diverse product teams aligned with business outcomes to help turbo-charge performance.