February 13, 2018 | Posted By: Marty Abbott
Necessary But Insufficient Security Reviews
From a security perspective, tech product companies far too often focus solely on various ISO and/or NIST audits to help inform their view of how they manage risk within their company and their products. The problem with the standards that exist today is that none of them tread deeply enough into the waters of detection and prevention of malicious activities within products. Instead, they focus more on the processes of response, identification, notification, employee access, etc.
While these activities (and audits) are necessary, they are insufficient to ensure that we properly manage risk (and prevent malicious activities) in our products. As we’ve written previously, erecting barriers and hiding behind big walls may make you feel better and help you sleep at night – but it’s not going to keep the bad guys from scaling your walls and taking your stuff.
The Online World is Getting Scarier
Consider the following secular trends for online products:
• A continuing mix-shift of commerce from retail to online. Within the US today, excluding certain goods, this number stands at a meager 9% of total commerce in 2017 up from 1% in 2002. If one excludes extremely high dollar items (vehicles, etc) the percentage of sales is significantly higher. Growing at a slightly higher than linear rate since 2002, this number should easily double within the next 7 years. From the perspective of a malicious hacker, this is a growth in opportunity.
• Developing and established nations outside of N. America and Western Europe continue to invest heavily in STEM-based education.
• Overall employment in many of these countries is comparatively low outside of what Western Nations provide through off-shore contracting opportunities. Combined with recent nationalistic trends and a desire to “keep jobs at home” or not “offshore jobs” there is a strong possibility that demand for offshore agencies will decrease over time.
• Some nations within the set of nations spending heavily on STEM education, have created cyber-institutes promoting cyber and security related warfare capabilities.
• A smaller set of the nations described above have heavily promoted state sponsored cyber warfare initiatives, setting these teams (e.g. the PRNK’s Unit 180) against corporate infrastructure within the United States.
• The barrier to entry for malicious actors to be effective in attacking corporate assets has declined. Hacker communities commonly share exploits and malware, and certain nation-states (e.g. Russia and N. Korea) have contributed to hacking toolsets, thereby decreasing the barrier to entry for a malicious actor and resultingly increasing the supply of said malicious actors.
• Extradition from other countries for crimes committed, especially those with which the US is not allied, is difficult to impossible. View this as a low perceived cost of committing a crime. If you cannot be prosecuted, there is no to low perceived cost of committing the crime.
• Crypto-currency (e.g. Bitcoin) provide a near untraceable means of selling stolen data, or holding systems for ransom.
The resulting forces of these meta or secular trends are clear:
1) The value of being a malicious actor has increased as the supply (in terms of sales/value) continues to increase. View this economically as an increasing opportunity for crime.
2) The barrier to entry to become a malicious actor is decreasing.
3) The cost in terms of prosecution, if performed outside the US is low to zero.
These points combine to make one clear outcome: Cybercrime and cyberterrorism (fraud, malicious use, etc) will rise as a percentage of revenue transacted online.
To help combat this rising malicious activity, we need new models and approaches to help us think about how to Identify and Prevent bad actors from doing horrible things.
Enter the AKF Security Insights Cube.
If It Isn’t Real Time It Is Worthless
The AKF Partners Security Insights Cube is predicated on the notion that all the data it addresses is accessible in near-real-time. This alone is a considerable barrier for many companies. Identifying fraudulent activity after credit cards are processed, for instance, is simply too late. We want to know that bad people are entering our neighborhood and at our door – not that they stole something from our house yesterday.
The lower left corner of the cube is the starting point for any solution – the point at which you are flying blind and have no real time data. Again – getting data from 15 minutes ago or 24 hours ago is as useless in driving a product as it is in driving a car or flying a plane; you simply have no idea what is going on.
The X axis of the cube evaluates the breadth of data available to an organization in real time. The far left is “zero real time data”. Progressing to the right of the axes are increasingly valuable risk related data points from real time key performance indicators like logins, add-to-carts, check-outs, auth activity (and failures), searches, etc. Moving further right, we may keep all session data such that we can interrogate and perform behavioral analysis and pattern matching. The far right of the axis is the point at which we keep absolutely everything, increasing the optionality of how we may interrogate the data for risk management and malicious activity prevention purposes.
The Y axis of the cube evaluates the activities performed upon the X axis data by an organization. Clearly here the X axis sets an upper bound on what’s possible on the Y axis. For instance, it would be hard to understand “Who, What or How” something happened if we didn’t first store session data to be analyzed. From a GDPR perspective, PII can be anonymized if necessary in session information. As with most analytics oriented system, maturity progresses from doing nothing, to “reporting” capabilities that illuminate “what is happening” (typically employing performance indicators), to answering “Who, Why and How” to finally predicting what will happen and preventing malicious activities in real time.
The Z axis of the cube deals simply with the depth, or duration, that data is kept. We rarely suggest that data be kept forever, but there is great value in ensuring that past patterns can be analyzed to create behavior models for scoring risk and blocking activities. A handful of years is typically appropriate for most commerce solutions, slightly more data for fintech solutions.
AKF Partners performs security reviews of technology products. Our approach evaluates security among several dimensions and includes components of NIST and ISO standards, but is tailored to the needs of online product companies.
Subscribe to the AKF Newsletter
February 7, 2018 | Posted By: Pete Ferguson
If you have a premium product, at a premium price, it’s unlikely you would sell it out of a rundown, poorly lighted store that smells vaguely like stale meat. Yet somehow many of us forget to apply that same reasoning when it comes to selling our products online. The availability - and look and feel of your presence online - is your store front.
I’ve long been a fan of Saddleback Leather. However, their motto: “They’ll fight over it when you’re dead” fell short in January. You see, it’s hard for your family to fight over the thing that you can’t even purchase… Saddleback Leather had a completely foreseeable, and absolutely preventable outage. From Dave Munson, the CEO:
“I’ve always dreamt of one day having a really fast and easy website for you to enjoy. So, we decided to leave our slow and clunky old website and start building one on a new and different platform. The contract expired Dec. 30th, 2017, but the new site wasn’t fully ready yet. We flipped the switch anyways and all Gehenna broke loose. The super fast, fun and easy website… wasn’t fast, fun or easy and we wasted a ton of time and irritated the heck out of our favorite people. People couldn’t check out, set up accounts or even add stuff to their carts. So, we paid a ton of money to get our old slow and clunky back again until we get this new site just right. “
To make up for it, last week I received an apology letter sent by “El Presidente” Munson with an 11% off coupon. 11 % because Munson has recently celebrated 11 years of marriage to his wife, Suzzette. As a side note, it’s a perfect example of how to apologize to your customers when you screw up. This guy made a mistake, is paying for it by paying for his old site while continuing to develop the new, and is giving customers discounts with a coupon aptly titled: “IAMSORRY.”
Ironically, as a fan and customer, I don’t recall the old site being slow or terrible. On the contrary, when I visited early in January, their “new and improved” site felt clunky and disjointed. The wrong images were coming up for products and many items reported being “not available.”
In the world of environmental health and safety, “all accidents are preventable” is the holy grail of compliance. We believe that with the right forethought and planning, the same is true with virtually all products and storefronts online.
At AKF we are fond of saying “an accident is a terrible thing to waste.” While the exact details of what went wrong are not disclosed, the motives were:
- They took a concept that presumably worked great in beta testing live without testing under full load.
- Munson made the decision to push out something that wasn’t yet great to save money by exiting a contract by the end of the year.
For similar content on our Growth Blog, click here
The result is lost sales from when the site was down, lost customers who may have been trying the website for their first time and won’t be back, an 11% haircut of sales for the next week, and a fan base - many of whom have been very vocal on FaceBook - that is verbally expressing their disdain to see the company they have counted on for unquestioned quality in the past didn’t settle for quality first this time.
The days of customers quickly forgiving their favorite retailers for not being equally as great online are waning. Make sure you have a solid strategy and the right expertise in your corner when it comes to greatly affecting your customer’s ability to purchase or better interact with your product.
Experiencing growing pains? AKF is here to help! We are an industry expert in technology scalability and due diligence. Put our 200+ years of combined experience to work for you today!
Get this article and others like it by signing up for our newsletter.
Subscribe to the AKF Newsletter
January 29, 2018 | Posted By: Robin McGlothin
The Scale Cube - Architecting for Scale
In every Industry, companies with similar strategies grow differently. Some grow with metronomic regularity to become leaders in their segment. Walmart, Dell, Amazon exemplify this trend. Others grow in fits and starts, often languishing, or at best, get acquired. While several attributes define successful companies, one is often overlooked – their ability to scale.
The Scale Cube is a model for building resilient architectures using patterns and practices that apply broadly to any solution. We developed the cube 11 years ago for our practice and included it in the first edition of “The Art of Scalability” (AKF Partners ).
The Scale Cub” (sometimes known as the“AKF Scale Cube or AKF Cube) is comprised of an X-axis, Y-axis, and Z-axis.
• Horizontal Duplication and Cloning (X-Axis )
• Functional Decomposition and Segmentation - Microservices (Y-Axis)
• Horizontal Data Partitioning - Shards (Z-Axis)
The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed. Scalability is all about the capability of a design to support ever growing client traffic without compromising performance. It is important to understand there are no “silver bullets” to designing scalable solutions. An architecture is scalable if each component is scalable. For example, a well-designed solution should be able to scale seamlessly as demand increases and decreases, and be resilient enough to withstand the loss of one or more compute resources.
Most internet enabled products start their life as a single application running on an appserver or appserver/webserver combination and likely communicate with a database. This monolithic design will work fine for relatively small applications that receive low levels of client traffic. However, this monolithic architecture becomes a kiss of death for complex applications.
A large monolithic application can be difficult for developers to understand and maintain. It is also an obstacle to frequent deployments. To deploy changes to one application component you need to build and deploy the entire monolith, which can be complex, risky, time consuming, require the coordination of many developers and result in long test cycles.
Consequently, you are often stuck with the technology choices that you made at the start of the project. In other words, the monolithic architecture doesn’t scale to support large, long-lived applications.
Scaling Solutions with the Scale Cube
The most commonly used approach of scaling an solution is by running multiple identical copies of the application behind a load balancer also known as X-axis scaling. That’s a great way of improving the capacity and the availability of an application.
When using X-axis scaling each server runs an identical copy of the service (if disaggregated) or monolith. One benefit of the X axis is that it is typically intellectually easy to implement and it scales well from a transaction perspective. Impediments to implementing the X axis include heavy session related information which is often difficult to distribute or requires persistence to servers – both of which can cause availability and scalability problems. Comparative drawbacks to the X axis is that while intellectually easy to implement, data sets have to be replicated in their entirety which increases operational costs. Further, caching tends to degrade at many levels as the size of data increases with transaction volumes. Finally, the X axis doesn’t engender higher levels of organizational scale.
Y-axis scaling (think services oriented architecture, micro services or functional decomposition of a monolith) focuses on separating services and data along noun or verb boundaries. These splits are “dissimilar” from each other. Examples in commerce solutions may be splitting search from browse, checkout from add-to-cart, login from account status, etc. In implementing splits, Y-axis scaling splits a monolithic application into a set of services. Each service implements a set of related functionalities such as order management, customer management, inventory, etc. Further, each service should have its own, non-shared data to ensure high availability and fault isolation. Y axis scaling shares the benefit of increasing transaction scalability with all the axes of the cube.
Further, because the Y axis allows segmentation of teams and ownership of code and data, organizational scalability is increased. Cache hit ratios should increase as data and the services are appropriately segmented and similarly sized memory spaces can be allocated to smaller data sets accessed by comparatively fewer transactions. Operational cost often is reduced as systems can be sized down to commodity servers or smaller IaaS instances can be employed.
Whereas the Y axis addresses the splitting of dissimilar things (often along noun or verb boundaries), the Z-axis addresses segmentation of “similar” things. Examples may include splitting customers along an unbiased modulus of customer_id, or along a somewhat biased (but beneficial for response time) geographic boundary. Product catalogs may be split by SKU, and content may be split by content_id. Z-axis scaling, like all of the axes, improves the solution’s transactional scalability and if fault isolated it’s availability. Because the software deployed to servers is essentially the same in each Z axis shard (but the data is distinct) there is no increase in organizational scalability. Cache hit rates often go up with smaller data sets, and operational costs generally go down as commodity servers or smaller IaaS instances can be used.
Like Goldilocks and the three bears, the goal of decomposition is not to have services that are too small, or services that are too large but to ensure that the system is “just right” along the dimensions of scale, cost, availability, time to market and response times.
AKF Partners has helped hundreds of companies, big and small, employ the AKF Scale Cube to scale their technology product solutions. Let us help you succeed and thrive!
Subscribe to the AKF Newsletter
January 23, 2018 | Posted By: Marty Abbott
Technical due diligence of products is about more than the solution architecture and the technologies employed. Performing diligence correctly requires that companies evaluate the solution against the investment thesis, and evaluate the performance and relationship of the engineering and product management teams. Here we present the best practices for technology due diligence in the format of things to do, and things not to do:
1. Understand the Investment/Acquisition Thesis
One cannot perform any type of diligence without understanding the investment/acquisition thesis and equally as important, the desired outcomes. Diligence is meant to not only uncover “what is” or “what exists”, but also identify the obstacles to achieve “what may or can be”. The thesis becomes the standard by which the diligence is performed.
2. Evaluate the Team against the Desired Outcomes
The technology product landscape is littered with the carcasses of great ideas run into the ground with the wrong leadership or the wrong team. Disagree? We ask you to consider the Facebook and Friendster battle. We often joke that the robot apocalypse hasn’t happened yet, and technology isn’t building itself. Great teams are the reasons solutions succeed and substandard teams behind those solutions that fail technically. Make sure your diligence is identifying whether you are getting the right team along with the product/company you acquire.
3. Understand the Tech/Product Relationship
Product Management teams are the engines of products, and engineering teams are the transmission. Evaluating these teams in isolation is a mistake – as regardless of the PDLC (product development lifecycle) these teams must have an effective working relationship to build great products. Make sure your diligence encompasses an evaluation of how these teams work together and the lifecycle they use to maximize product value and minimize time to market.
4. Evaluate the Security Posture
Cyber-crime and fraud is going to increase at a rate higher than the adoption of online solutions pursuant to a number of secular forces that we will enumerate in a future post. As such, it is in your best interest as an investor to understand the degree to which the company is focused on increasing the perceived cost of malicious activity and decreasing the perceived value of said malicious activity. Ensure that your diligence includes evaluating the security focus, spending, approach and mindset of the target company. This need not be a separate diligence for small investments – just ensure that you are comfortable with the spend, attention and approach.
1. Don’t Waste Too Much Time (or money) on Code Reviews
The one thing I know from years of running engineering teams is that anytime an engineer reviews code for the first time she is going to say, “This code is crap and needs to be rewritten.” Code reviews are great to find potential defects and to ensure that code conforms to the standards set forth by the company. But you are unlikely to have the time or resources to review everything. The company is also unlikely to give you unfettered access to all of their code (Google “Sybase Microsoft SQLServer” for reasons why). That leaves you at the whims of the company to cherry-pick what you review, which in turn means you aren’t getting a good representative sample.
Further, your standards likely differ from those of the target company. As such, a review of the software is simply going to indicate that you have different standards.
Lastly, we’ve seen great architecture and terrible code succeed whereas terrible architecture and great code rarely is successful. You may find small code reviews enlightening, but we urge you to spend a majority of your time on the architecture, people and process of the acquisition or investment.
2. Don’t Start a Fight
Far too often technology diligence sessions start in discussion and end in a fight. The people performing the diligence start asking questions in a way that may seem judgmental to the target company. Then the investing/acquiring team shifts from questions to absolute statements that can only be taken as judgmental. There’s simply no room for this. Diligence is clinical – not personal. It’s not a place to prove who is smarter than whom. This dynamic is one of the many reasons it is often a good idea to have a third party perform your diligence: The target company is less likely to feel threatened by the acquiring product team, and the third party is oftentimes more experienced with establishing a non-threatening environment.
3. Don’t Be Religious
In a services oriented world, it really doesn’t matter what code or what data persistence platform comprises a service you may be calling. Assuming that you are acquiring a solution and its engineers, you need not worry about supporting the solution with your existing skillsets. Debates around technology implementations too often come from a place of what one knows (“I know Java, Java rocks, and everything else is substandard”) than what one can prove. There are certainly exceptions, like aging and unsupported technology – but stay focused on the architecture of a solution, not the technology that implements that architecture.
4. Don’t Do Diligence Remotely
As we’ve indicated before, diligence is as much about teams as it is the technology itself. Performing diligence remotely without face to face interaction makes it difficult to identify certain cues that might otherwise be indicators that you should dig more deeply into a certain space or set of questions. Examples are a CTO giving an authoritative answer to a certain question while members her team roll their eyes or slightly shake or bow their heads.
AKF Partners performs diligence on behalf of a number of venture capital and private equity firms as well as on behalf of strategic acquirers. Whether for a third party view, or because your team has too much on their plate, we can help.
Subscribe to the AKF Newsletter
January 13, 2018 | Posted By: Dave Swenson
Sorry, False Alarm…
On January 13, 2018, what felt like an episode of Netflix’s “Black Mirror” unfolded in real life. Just after 8 in the morning, residents and visitors of Hawaii were woken up to the following startling push notification:
Thankfully, the notification was a false alarm, finally retracted with a second notification nearly 40 interminable minutes later.
The amazing, poignant and sobering stories that occurred from those 40 minutes, included people:
- determining which children to spend their last minutes with,
- abandoning their cars on streets,
- sheltering in a lava tube,
- believing and acting as we all would if we believed the end was here.
Unfortunately, this wasn’t a Black Mirror episode and paralyzed an entire state’s population. Thankfully, the alarm was a false one.
A Muted President
As President Trump took office, he introduced a new means for a President to reach his constituents—Twitter, averaging 6 to 7 tweets per day during his first year. On November 2, 2017, many bots that were created to closely monitor the tweets of @realDonaldTrump started reporting that the account no longer existed. Clicking to his account took the user to the above error page.
For a deafening 11 minutes, the nation was unable to listen to its leader, at least via Twitter.
The Hawaiian false alarm was sent by the state’s Emergency Management Agency. Their explanation of the incident was that during a shift change, an employee clicked “the wrong button” while running a missile crisis test, then subsequently clicked through a confirmation prompt (“Are you sure you want to tell 1.5 million people this?”).
Twitter employees had reportedly tried for years to get management attention on ensuring accounts weren’t deleted without proper vetting. The company typically used contractors in the Philippines and Singapore to handle such account administration; Trump’s account was deleted by a German contract worker on his last day at Twitter. Acting on yet-another-Trump-complaint, believing such an important account couldn’t be suspended, the worker’s last action for Twitter was to click the suspend button, and then walked out of the building causing the Twitterverse to read far more into the account’s disappearance than they should have.
In both of these situations, the immediate focus was on the personnel involved in the incident. “Who pushed the button?” is typically always one of the initial questions. Assumptions that a new employee, or rogue worker were behind the incident are common, and both motive and intelligence of all involved are under inspection.
We at AKF Partners constantly preach “An incident is a terrible thing to waste”. Events such as these warp the known reality into “How the shit can that happen??”, causing enough alarm to warrant special attention and focus, if not panic. Yet, all too often we see teams searching frantically to find any cause, blame the most obvious, immediate factor, declare victory, and move on.
“Who pushed the button?” is only one of many questions.
Toyota’s Taichi Ohno, the father of Lean Manufacturing, recognized his team’s habit of accepting the most apparent cause, ignoring (wasting) other elements revealed by an incident, potentially allowing it to be eventually repeated. Ohno (the person, not the exclamation typically uttered during an incident) emphasized the importance of asking “5 Why’s” in order to move beyond the most obvious explanation (and accompanying blame), to peel the onion diving deeper into contributory causes.
Questions beyond the reflexive “What happened?” and “Who did it?” relevant to the false alarm and erroneous account deletion incidents include:
- Why did the system act differently than the individual expected (is there more training required, is the user interface a confusing one)?
- Why did it take so long to correct (is there no playbook for detecting / reversing such a message or key account activity)?
- Why does the system allow such an impactful event to be performed unilaterally, by a single person (what safeguards should exist requiring more than one set of hands?)
- Why does this particular person have such authorization to perform this action (should a non-employee have the ability to delete such a verified, popular and influential account)?
- Why was the possibility of this incident not anticipated and prevented (why were Twitter employee requests for better safeguards ignored for years, why wasn’t the ease of making such a mistake recognized and what other similar mistake opportunities are there)?
Both of these incidents have had an impact far beyond those directly affected (Hawaiian inhabitants or Trump Twitter followers), and have shed light on the need to recognize the world has changed and policies and practices of old might not be enough for today. The ballistic missile false alarm revealed that more controls need to be placed on all mass communication, but also that Hawaii (or anywhere/anyone else) is extremely unprepared for the unthinkable. The use of Twitter as a channel for the President now raises questions over the validity of it as a Presidential record, asks who should control such a channel, and raises concerns on what security is around the President’s account?
Ask 5 Whys, look beyond the immediate impact to find collateral learnings, and take notice of all that an incident can reveal.
AKF Partners have been brought in by over 400 companies to avoid such incidents, and when they do occur, to learn from them. Let us help you.
Subscribe to the AKF Newsletter
January 3, 2018 | Posted By: AKF
One of the most common questions we get is “What are the most common failures you see tech and product teams make?”. To answer that question we queried our database consisting of 11 years of anonymous client recommendations. Here are the top 20 most repeated failures and recommendations:
1) Failing to design for rollback
If you are developing a SaaS platform and you can only make one change to your current process make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad release in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward” for you to agree this is of the utmost importance. The one thing that is most likely to give you an opportunity to find other work (i.e. “get fired”) is to roll a product that destroys your business. In other words, if you are new to your job DO THIS BEFORE ANYTHING ELSE; if you have been in your job for awhile and have not done this DO THIS TOMORROW.
2) Confusing product release with product success
Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives like a release increasing signups by 10%, or increasing checkouts by 15% or increasing the average sale price of a all checkouts by 12% or increasing click-through-rates by 22%. See #10 below on incenting a culture of excellence. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholder’s wealthy.
3) Insular product development/engineering
How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised about some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Organize around your outcomes and “what you produce” in cross functional teams rather than around activities and “how you work”.
4) Over engineering the solution
One of our favorite company mottos is “simple solutions to complex problems”. The simpler the solution, the lower the cost and the faster the time to market. If you get blank stares from peers or within your organization when you explain a design do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in resolving the complexity.
5) Allowing history to repeat itself
Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize the value of the team. In the operations world, a failure to correlate past site incidents and find thematically related root causes is a guarantee to continue to fight the same fires over and over. The best and easiest way to improve our future performance is to track our past failures, group them into groups of causation and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues and improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.
6) Scaling through 3d parties
Every vendor has a quick fix for your scale issues. If you are a hyper growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. This is not to say that after you design your system to scale horizontally that you will not rely upon some technology to help you; rather, once you define how you can horizontally scale you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases (and NoSQL solutions) provide for multiple types of native replication to keep hosts in synch.
7) Relying on QA to find your mistakes
You cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem and that is where the problem should be treated. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering. Consider implementing test driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. All of this is not to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering.
8) Revolutionary or “big bang” fixes
In our experiences, complete re-writes or re-architecture efforts end up somewhere on the spectrum of not returning the desired ROI to complete and disastrous failures. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cutover to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.
9) The Multiplicative Effect of Failure
Every time you have one service call another service in a synchronous fashion you are lowering your theoretical availability. If each of your services are designed to be 99.999% available, where a service is a database, application server, application, webserver, etc then the product of all of the service calls is your theoretical availability. 5 calls is (.99999)^5 or 99.995 availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.
10) Failing to create and incent a culture of excellence
Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth.
11) Under-engineering for scale
The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now. That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.
12) “Not Built Here” Culture
We see this all the time. You may even have agreed with point (6) above because you have a “we are the smartest people in the world and we must build it ourselves” culture. The point on relying upon third parties to scale was not meant as an excuse to build everything yourselves. The real point to be made is that you have to focus on your core competencies and not dilute your engineering efforts with things that other companies or open source providers can do better than you. Unless you are building databases as a business, you are probably not the best database builder. And if you are not the best database builder, you have no business building your own databases for your SaaS platform. Focus on what you should be the best at: building functionality that maximizes your shareholder wealth and scaling your platform. Let other companies focus on the other things you need like routers, operating systems, application servers, databases, firewalls, load balancers and the like.
13) A new PDLC will fix my problems
Too often CTO’s see repeated problems in their product development life cycles such as missing dates or dissatisfied customers and blame the PDLC itself.
The real problem, regardless of the lifecycle you use, is likely one of commitment and measurement. For instance in most Agile lifecycles there needs to be consistent involvement from the business or product owner. A lack of involvement leads to misunderstandings and delayed products. Another very common problem is an incomplete understanding or training on the existing PDLC. Everyone in the organization should have a working knowledge of the entire process and how their roles fit within it. Most often, the biggest problem within a PDLC is the lack of progress measurement to help understand likely dates and the lack of an appropriate “product discovery” phase to meet customer needs.
14) We cannot hire great people quickly
Often when growing an engineering team quickly the engineering managers will push back on hiring plans and state that they cannot possibly find, interview, and hire engineers that meet their high standards. We agree that hiring great people takes time and hiring decisions are some of the most important decisions managers can make. A poor hiring decision takes a lot of energy and time to fix. However, there are lots of ways to streamline the hiring process in order to recruit, interview, and make offers very quickly. A useful idea that we have seen work well in the past are interview days, where potential candidates are all invited on the same day. This should be no more than 2 - 3 weeks out from the initial phone screen, so having an interview day per months is a great way to get most of your interviewing in a single day. Because you optimize the interview process people are much more efficient and it is much less disruptive to the daily work that needs to get done the rest of the month. Post interview discussions and hiring decisions should all be made that same day so that candidates get offers or letters of regret quickly; this will increase the likelihood of offers being accepted or make a professional impression on those not getting offers. The key is to start with the right answer that “there is a way to hire great people quickly” and the myriad of ways to make it happen will be generated by a motivated leadership team.
15) It is a SPOF (Single Point of Failure) but we can recover it onto another host quickly
A SPOF is a SPOF and even if the impact to the customer is low it still takes time away from other work to fix right away in the event of a failure. And there will be a failure…because that is what hardware and software does, it works for a long time and then eventually it fails! As you should know by now, it will fail at the most inconvenient time. It will fail when you have just repurposed the host that you were saving for it or it will fail while you are releasing code. Plan for the worst case and have it run on two hosts (we actually recommend to always deploy in pools of three or more hosts) so that when it does fail you can fix it when it is most convenient for you.
16) No Business Continuity plan
No one expects a disaster but they happen and if you cannot keep up normal operations of the business you will lose revenue and customers that you might never get back. Disasters can be huge like Hurricane Katrina, where it take weeks or months to relocate and start the business back up in a new location. Disasters can also be small like a winter snow storm that keeps everyone at home for two days or a HAZMAT spill near your office that keeps employees from coming to work. A solid business continuity plan is something that is thought through ahead of time, before you need it, and explains to everyone how they will operate in the event of an emergency. Perhaps your satellite office will pick up customer questions or your tech team will open up an IRC channel to centralize communication for everyone capable of working remotely. Do you have enough remote connections through your VPN server to allow for remote work? Spend the time now to think through what and how you will operate in the event of a major or minor disruption of your business operations and document the steps necessary for recovery.
17) No Disaster Recovery Plan
Even worse, in our opinion, than not having a BC plan is not having a disaster recovery plan. If your company is a SaaS based company, the site and services provided is the company’s sole source of revenue. Moreover, with a SaaS company, you hold all the data for your customers that allow them to operate. When you are down they are more than likely seriously impaired in attempting to conduct their own business. When your collocation facility has a power outage that takes you completely down, think 365 Main datacenter in San Francisco, how many customers of yours will leave and never return? Our preference is to provide your own disaster recovery through multiple collocation facilities but if that is not yet technically feasible nor in the budget, at a minimum you need your code, executables, configurations, loads, and data offsite and an agreement in place for both collocation services as well as hosts. Lots of vendors offer such packages and they should be thought of as necessary business insurance.
18) No Product Management team or person
In a similar vein to #13 above, there needs to be someone or a team of people in the organization who have responsibility for the product lines. They need to have authority to make decisions about what features get added, which get delayed, and which get deprecated (yes, we know, nothing ever gets deprecated but we can always hope!). Ideally these people have ownership of business goals (see #10) so they feel the pressure to make great business decisions.
19) It is okay to bring the site down to roll code
Just because you call it scheduled maintenance does not mean that it does not count against your uptime. While some of your customers might be willing to endure the frustration of having the site down when they want to access it in order to get some new features, most care much more about the site being available when they want it. They are on the site because the existing features serve some purpose for them; they are not there in the hopes that you will rollout a certain feature that they have been waiting on. They might want new features, but they rely on existing features. There are ways to roll code, even with database changes, without bringing the site down. It is important to put these techniques and processes in place so that you plan for 100% availability instead of planning for much less because of planned down time.
20) Firewalls, Firewalls, Everywhere!
We often see technology teams that have put all public facing services behind firewalls while many go so far as to put firewalls between every tier of the application. Security is important because there are always people trying to do malicious things to your site, whether through directed attacks or random scripts port scanning your site. However, security needs to be balanced with the increased cost as well as the degradation in performance. It has been our experience that too often tech teams throw up firewalls instead of doing the real analysis to determine how they can mitigate risk in other ways such as through the use of ACLs and LAN segmentation. You as the CTO ultimately have to make the decision about what are the best risks and benefits for your site.
Like this article? Share it with friends here, and subscribe to the newsletter here.
Whatever you do, don’t make the mistakes above! AKF Partners helps companies avoid costly product and technology mistakes - and we’ve seen most of them. Give us a call or shoot us an email. We’d love to help you achieve the success you desire.
Subscribe to the AKF Newsletter
December 14, 2017 | Posted By: Marty Abbott
The Law that Almost Wasn’t
Conway’s law had a rather precarious beginning. Harvard Business Review rejected Conway’s thesis, buried as it was in the 43d paragraph of a 45-paragraph paper, on the grounds that he had not proven it.
But Mel had a PhD in Mathematics (from Case Western Reserve University – Go Spartans!), and like most PhDs he was accustomed to journal rejections. Mel resubmitted the paper to Datamation, a well-respected IT journal of the time, and his paper “How Do Committees Invent” was published in 1968.
It wasn’t until 1975, however, that the moniker “Conway’s Law” came to be. Fred Brooks both coined the term and popularized Conway’s thesis in his first edition of the Mythical Man Month. It has since been one of the most widely cited, important but nevertheless incorrectly understood and applied notions in the domain of product development.
Cliff’s Notes to “How Do Committees Invent” (the article in which the law resides)
Conway’s thesis, in his words:
… organizations which design systems (in the broad sense used here) are constrained to produce designs which are copies of the communication structures of these organizations.
Conway calls this self-similarity between organizations and designs homomorphism. Preamble to the thesis helps explain the breadth and depth:
… the very act of organizing a design team means that certain design decisions have already been made, explicitly or otherwise
Every time a delegation is made … the class of design alternatives which can be effectively pursued is also narrowed.
Because the design which occurs first is almost never the best possible, the prevailing system concept may need to change. Therefore, flexibility of organization is important to effective design.
Specifically, each individual must have at most one superior and at most approximately seven subordinates
Examples. A contract research organization had eight people who were to produce a COBOL and an ALGOL compiler. After some initial estimates of difficulty and time, five people were assigned to the COBOL job and three to the ALGOL job. The resulting COBOL compiler ran in five phases, the ALG0L compiler ran in three.
There are 4 very important points, and one very good example, in the quotes above:
1) Organizations and design/architecture and intrinsically linked. The organization affects and constrains the architecture - the opposite is not true.
2) Depth of an organization negatively effects design flexibility. The deeper the hierarchy of an organization, the less flexible (or alternatively more constrained) the resulting architecture.
3) We will make mistakes and must organize to quickly fix these.
4) Team size should always be small – which also has an implication to the size of the solution part a team can own (think Amazon’s re-branding of this point of the “2 Pizza Team” (author’s side note – read Scalability Rules for how this came about).
Important corollaries to Conway’s law suggest that if either an organization or a design change, without a corresponding change to the other, the product will be at risk.
Common Failures in Application of Conway’s Law and How to Fix Them
There are five very common failures in organization and architecture within our clients, the first four of which relate directly to Conway’s points above:
1) Organizations and architectures designed separately. Given the homomorphism that Conway describes, you simply CANNOT do this.
2) Deep, hierarchical organizations. Again – this will constrain design.
3) Lack of flexibility. Companies tend to plan for success. Instead, assume failure, learning, and adaptation (think “discovery” and “Agile” instead of “requirements” and “Waterfall”).
4) Large teams. Forget about these. Small teams, each owning a service or services that the team can support in isolation.
There is a fifth violation that is harder to see in Conway’s paper. Too often, our clients don’t build properly experienced teams around the solutions they deploy. Success in low-overhead organizations requires that teams be cross functional. Whatever a team needs to be successful should be within that team. If you deploy on your own hardware, you should have hardware experience. If you need DBA talent, the team should have direct access to that talent. QA folks should be embedded within the team, etc. Product managers or owners should also be embedded in the team. This creates our fifth failure:
5) Functional teams. Don’t build teams around “a skill” – build them around the breadth of skills necessary to accomplish the task handed to the team.
Conway’s Parting Shot and Food for Thought
Noodle on this: Conway identified a problem early in the life of a new domain. Yet what was true in Conway’s time as a contributor to the art is still true today, over 50 years after his first attempt to forewarn us:
Probably the greatest single common factor behind many poorly designed systems now in existence has been the availability of a design organization in need of work.
Like this article? Share it with friends here, and subscribe to the newsletter here.
AKF Partners helps companies ensure that their organizations and architectures are aligned to the outcomes they desire. We help companies develop better, more highly available and more highly scalable products with faster time to market and lower cost. Give us a call or shoot us an email. We’d love to help you achieve the success you desire.
Reach out to AKF
Subscribe to the AKF Newsletter
November 8, 2017 | Posted By: AKF
In 1965, psychologist Bruce Tuckman published his theory of group dynamics. This theory describes the stages (or phases) through which a team progresses enroute to optimal productivity. While generally useful for any organization, and prescriptive as to what leaders should do when to boost performance, it has profound impacts to Agile development practices and how we build organizations around these Agile practices.
The first stage is forming. This is where the team first comes together. Here, the individuals are trying to get to know each other. They tend to be polite and cordial, but they do not fully trust each other.
In this stage, the team productivity and team conflict are low. The team spends time agreeing to what the team is supposed to do. This lack of agreement of the team’s purpose can cause members to miss goals because they are individually targeting different things. Team members rely on patterned behavior and look to the team leader for guidance and direction. The team members want to be accepted by the group. Cautious behavior on the part of the team starts to depress overall team outcomes. Good leadership, emphasizing goals and outcomes is important to set the stage for future team behaviors and outcomes.
Once the team’s goals are clear, they move into the next stage, storming. Here, the team starts to develop a plan to achieve the goal and defines what to do and who does it. Friction starts to occur as members propose different ideas. Trust within the team remains low and affective conflict rises as people vie for control. Cliques can form. Productivity drops even lower than in the first stage.
Once the team agrees on the plan and the roles and responsibilities, it can move to the next stage. Without agreement, the team can get stuck. Symptoms include poor coordination, people doing the wrong things and missing deadlines, to name a few. Good leadership here focuses on fast affective conflict resolution, and serves to help reinforce team goals and outcomes in order to quickly move to more productive phases.
Once team members agree to the plan and understand their roles, they enter the norming phase. Affective conflict goes down, cognitive (beneficial) conflict and trust increase. The team focuses on how to get things done and productivity begins to increase. The team develops “norms” about how to work together and collaborate. A lack of these norms can cause issues such as low quality and missed deadlines.
Leadership within the team becomes clear and cliques dissolve. Members begin to identify with one another and the level of trust in their personal relationships contributes to the development of group cohesion. The team begins to experience a sense of group belonging and a feeling of relief from resolving interpersonal conflicts. Team identity starts to take hold and innovation and creativity within the team increases. The members feel an openness and cohesion on both a personal and task level. They feel good about being part of the team.
The final stage, preforming, is not achieved by all teams. This stage is marked by an interdependence in personal relations and problem solving within the realm of the team’s tasks. Team members share a common goal, understand the plan to achieve it, know their roles and how to work together. The team is firing on all cylinders. At this point, the team is highly productive and collaborates well. They are trusting of each other and “have each other’s back.” Healthy conflict is encouraged. There is unity: group identity is complete, group morale is high, and group loyalty is intense.
Not all teams get to this phase. They can get stuck in a previous phase or slide back into them from a higher phase. Leadership that focuses on affective conflict resolution, team identity creation, a compelling vision and goals to achieve that vision is critical to reaching the Performing phase. It is usually not easy for teams to quickly progress through these stages, and it often takes 6 months or more for a team to reach the Performing phase.
Impact to Agile Development
We often see companies make the mistake of coalescing teams around initiatives. Sometimes called “virtual teams” or “matrixed teams”, these teams suffer the underperforming phases of Tuckman’s curve repeatedly, especially when these initiatives are of durations shorter than 6 months. But even with durations of a year, six months of that time is spent getting the team to an optimum level of performance.
Tuckman’s analysis indicates that teams should be together for no less than a year (giving a 6 month return on a 6-month investment) and ideally for about 3 years. The upper limit being informed by the research on group think and its implications to creativity, performance and innovation within teams. Teams then should become semi-permanent and we should seek to move work to teams rather than form teams around work. To be successful here, we need multi-disciplinary teams capable of handling all the work they may get assigned. Further, the team needs to be familiar and “own” the outcomes associated with the solution (or architectural components) with which they work. More on that in future articles discussing Conway’s Law and Empathy Groups.
AKF Partners helps companies understand and apply the extant theory around organizational development in order to turbo-charge engineering performance. Wondering if your engineering productivity decreases as you grow your engineering and product teams? We can help you fix that and get your productivity back to the level it was as a startup!
Reach out to AKF
Subscribe to the AKF Newsletter
October 24, 2017 | Posted By: Marty Abbott
North Korea’s recent antics involving ballistic missiles and nuclear weapons are scary. While we seem to be edging ever closer to nuclear war – closer perhaps than any time since the Cuban Missile Crisis – the probability of such an occurrence remains relatively low. Even an apparently irrational head of state such as Kim Jong Un must understand that the use of a nuclear device will turn nearly the entire world against him. The use of a device against any nation would end his reign in relatively short order and end the People’s Republic of North Korea as we know it today. This then begs the question of why Kim Jong Un would participate in such brinkmanship? Many politicians and strategists seem to think it is a strategy to force other nations to recognize the PRNK and reduce the onerous sanctions currently levied against it by the United Nations. Perhaps, but maybe in addition to or instead, Jong Un is trying to take our eyes off the war he has been waging for many years: a cyber war against many nations.
Both cyber warfare on the part of a nation state and cyber terrorism waged by stateless entities aim to attack our economic infrastructure. Both North Korea and terrorists understand that attacking our economy, our businesses and our personal wealth are the most effective methods of causing harm to our nations and their citizens. North Korea is likely behind many recent attacks on financial institutions, has ties to the WannaCry ransomware outbreak, was behind the attack on Sony pictures and was involved in a heist of $81M from the Bangladesh Central Bank. Each of these were likely perpetrated by the formidable PRNK cyber warfare group “Unit 180”.
When not engaging in direct attacks to steal money from or otherwise harm business operations, both terrorists and nation states seek to use the products of a company for nefarious purposes. Recent examples include ISIS using eBay’s marketplace to funnel money to an operative in the US, and Russia purchasing advertising on Facebook in an attempt to influence the US election. Cyber warfare and terrorism are not just threats– they are daily occurrences. The foregoing examples illustrate how the game has changed. The question for you is - has your company changed enough to successfully protect itself against this growing and evolving threat?
The answer for most companies with which we work is “No”. Security organizations seem oblivious to the changing cyber threat. They continue to focus almost exclusively on barrier protection systems and cyber response processes. Few companies outside of the financial sector have developed analytics systems to help identify emerging threats and nefarious activity. Fewer still practice aggressive “patrolling” to identify threats outside of the perimeter of their digital operations. Here are a few questions to help you evaluate whether your company has the mindset necessary to be successful in the world of cyberwarfare and terrorism:
Who means you harm and how do they intend to perpetrate it?
Military veterans know that a successful defense requires more than just “Alamo’ing Up” behind a wall and hunkering down. You must patrol and reconnoiter the surrounding area to understand whence the enemy will come, in what numbers and with what capabilities. If your security team isn’t actively attempting to identify threats outside of your organization - and by this I mean beyond your walls - you are most certainly going to be surprised.
How do you find new and emerging behavior within your product and operations?
Given the threat of using your product for nefarious purposes, how do you identify when new and behaviors or trends emerge? What analytics systems do you have to identify that existing personas or users are acting in new or odd ways? How do you keep an eye on new patterns or trends of usage by both existing and new users? In very high transaction environments, how do you identify the less than 1 basis point of activity that may be nefarious in nature buried within 99.99% of valid transactions? These questions aren’t likely to be answered by a “traditional” security team – they require teams with deep analytic skills and systems dedicated to analytics and machine learning. Similarly, traditional analytics teams may not have the right mindset to seek out nefarious transactions.
Do you have the right people?
This is the most important question of all. You don’t need to fire your CSSP folks – you still have a need for them within your security team. But you also want folks with a proven record of being able to think like and use the tools of cyber criminals, terrorists, and warfare focused nation-states. These folks are unlikely to be willing to wear suits and ties to work, preferring instead to wear shorts and Birkenstocks. The traditional corporate mindset and tools will stand in the way of them being successful on your behalf. They need to use TOR browsers and have access to sites to which you are unlikely to want the remainder of your employees going. The biggest barrier to success here with most companies is fit with a company’s culture – but I can guarantee you that if you don’t have some of these folks on staff you are not going to be successful in this new era of cyber warfare.
How do you fare against the above questions? Are you properly set up to defend your company and your shareholders against the today’s cyber threat? If you are uncertain, reach out to AKF Partners – we’ll evaluate your security infrastructure and approach and help ensure that you can properly defend yourself against the growing threat.
Subscribe to the AKF Newsletter
October 15, 2017 | Posted By: Marty Abbott
I’m a huge Malcolm Gladwell fan. Gladwell’s ability to convey complex concepts and virtually incomprehensible academic research in easily understood prose is second to none within his field of journalism. A perfect example of his skill is on display in the Tipping Point, where Gladwell wrestles the topic of Complexity Theory (aka Chaos Theory) into submission, making it accessible to all of us. In the Tipping Point, Gladwell also introduces us to The Broken Windows Theory.
The Broken Windows Theory gets its name from a 1982 The Atlantic Monthly article. This article asked the reader to imagine a building with a few broken windows. The authors claim that the existence of these windows invite vandals to break still more windows. A continuous cycle of expanding vandalism ensues, with squatters moving in, nearby buildings getting vandalized, etc. Subsequent authors expanded upon the theory, claiming that the presence of vandalism invites other crimes and that crime rates soar in communities where unhandled vandalism is present. A corollary to the Broken Windows Theory is that cities can reduce crime rates by focusing law enforcement on petty crimes. Several high profile examples seem to illustrate the power and correctness of this theory, such as New York Mayor Giuliani’s “Zero Tolerance Program”. The program focused on vandalism, public drinking, public urination, and subway fare evasion. Crime rates dropped over a 10 year period, corresponding with the initiation of the program. Several other cities and other experiments showed similar effects. Proof that the hypothesis underpinning the theory is correct.
Not So Fast…
Enter the self-described “Rogue Economist” Stephen Levitt and his co-author Stephen Dubner - both of Freakonomics fame. While the two authors don’t deny that the Broken Windows theory may explain some drop in crime, they do cast significant doubt on the approach as the primary explanation for crime rates dropping. Crime rates dropped nationally during the same 10 year period in which New York pursued its Zero Tolerance Program. This national drop in crime occurred in cities that both practiced Broken Windows and those that did not. Further, crime rate dropped irrespective of either an increase or decrease in police spending. The explanation therefore, argue the authors, cannot primarily be Broken Windows. The most likely explanation and most highly correlated variable is a reduction in a pool of potential criminals. Roe v. Wade legalized abortion, and as a result there was a significant decrease in the number of unwanted children, a disproportionately high percentage of whom would grow up to be criminals.
Gladwell isn’t therefore incorrect in proffering Broken Windows as an explanation for reduction in crime. But the explanation is not the best one available and as a result it holds residence somewhere between misleading (worst case) and incomplete (best case).
To be fair, it’s hard to hold Gladwell accountable for this oversight. Gladwell is not a scientist and therefore not trained in how to scientifically evaluate the research he reported. Furthermore, his is an oft repeated mistake even among highly trained researchers. And what exactly is that mistake? The mistake made here is illustrated by the difference in approach between the Broken Windows researchers and the Freakonomics authors. The Broken Windows researchers started with something like the following question “Does the presence of vandalism invite additional vandalism and escalating crime?” Levitt and Dubner first asked the question “What variables appear to explain the rate of crime?”
Broken Windows started with a question focused on deductive analysis. Deduction starts with a hypothesis - “Evidence of vandalism and/or other petty crimes invites similar and more egregious crimes”. The process continues to attempt to confirm or disprove the hypothesis. Deduction starts with a broad and abstract view of the data – a generalization or hypothesis as to relationships – and attempts to move to show specific relationships between data elements. The Broken Windows folks started with a hypothesis, developed a series of experiments to test the hypothesis and then ultimately evaluated time series data in cities with various Broken Windows approaches to policing. What they lacked was a broad question that may have developed a range of options indicating possible causes.
The Freakonomics authors started with an inductive question. Induction is the process of moving from specific observations about data into generalizations. These generalizations are often in the form of hypothesis or models as to how data interacts. Induction helps to inform what questions should be asked of the data. Induction is the asking of “What change in what independent variables seem to correspond with a resulting change in some dependent variable?” Whereas deduction works from independent variable to dependent variable, induction attempts to work backwards from dependent variable to identify independent variable relationships.
The jump to deduction, without forming the right questions and hypotheses through induction, is the biggest mistake we see in developing Big Data programs and implementing Big Data solutions. We all approach problems with unique experiences and unique biases. The combination of these often cause us to race to hypotheses and want to test them. The issue here is two-fold. The best case is that we develop an incomplete (and as a result partially or mostly incorrect) answer similar to that of The Broken Windows researchers. The worst case is that we suffer what statisticians call a Type 1 error – confirming an incorrect answer. The probability of type 1 errors increases when we don’t look for alternative or better answers for outcomes within our data sets. Induction helps to uncover those alternative or supporting explanations. Exploring the data to discover potential relationships helps us to ask the right questions and form better hypotheses and better models. Skipping induction makes it highly probable that we will get an incorrect, misleading or substandard answer.
But it is not enough to simply ensure that we practice both induction and deduction. We must also recognize that the solutions that support induction are different from those that support deduction. Further, we must understand that the two processes while complimentary can actually interfere with each other when performed on the same system. Induction is necessarily a very broad and as a result slow and tedious process. Deduction, on the other hand, needs significantly less data and “prefers” to be faster in implementation. Inductive systems are best supported by solutions that impose very few relations or structure on the data we observe. Systems that support deduction, in order to allow for faster response times, impose increased structure relative to inductive systems. While the two phases of discovery (Induction and Deduction) support each other, their differences suggest that they should be performed on solutions purpose built to their specific needs.
Similarly, not everyone is equally qualified to perform both induction and deduction. Our experience is that the folks who tend to be good at determining how to prove relationships between variables are often not as good at identifying patterns and vice versa.
These two observations, that the systems that support induction and deduction should be separated and that the people performing these tasks may need to be different, have ramifications to how we develop our analytics systems and organize our Big Data teams. We’ll discuss these ramifications and more in our next post, “10 Anti-Patterns within Big Data”.
Subscribe to the AKF Newsletter
1 2 3 >