I’m a huge Malcolm Gladwell fan.  Gladwell’s ability to convey complex concepts and virtually incomprehensible academic research in easily understood prose is second to none within his field of journalism.  A perfect example of his skill is on display in The Tipping Point, where Gladwell wrestles the topic of Complexity Theory (aka Chaos Theory) into submission, making it accessible to all of us.  In The Tipping Point, Gladwell also introduces us to the Broken Windows Theory.

The Broken Windows Theory gets its name from a 1982 article in The Atlantic Monthly.  The article asked the reader to imagine a building with a few broken windows.  The authors claimed that the existence of these windows invites vandals to break still more windows; a continuous cycle of expanding vandalism ensues, with squatters moving in, nearby buildings getting vandalized, and so on.  Subsequent authors expanded upon the theory, claiming that the presence of vandalism invites other crimes and that crime rates soar in communities where vandalism goes unaddressed.  A corollary to the Broken Windows Theory is that cities can reduce crime rates by focusing law enforcement on petty crimes.  Several high-profile examples seem to illustrate the power and correctness of this theory, such as New York Mayor Giuliani’s “Zero Tolerance Program”, which focused on vandalism, public drinking, public urination, and subway fare evasion.  Crime rates dropped over the 10-year period corresponding with the initiation of the program, and several other cities and other experiments showed similar effects.  Proof, it would seem, that the hypothesis underpinning the theory is correct.

Not So Fast…

Enter the self-described “Rogue Economist” Steven Levitt and his co-author Stephen Dubner, both of Freakonomics fame.  While the two authors don’t deny that the Broken Windows theory may explain some drop in crime, they cast significant doubt on the approach as the primary explanation for falling crime rates.  Crime rates dropped nationally during the same 10-year period in which New York pursued its Zero Tolerance Program.  This national drop occurred both in cities that practiced Broken Windows policing and in those that did not, and it occurred irrespective of whether police spending increased or decreased.  The explanation, the authors argue, therefore cannot primarily be Broken Windows.  The most likely explanation, and the most highly correlated variable, is a reduction in the pool of potential criminals: Roe v. Wade legalized abortion, and as a result there was a significant decrease in the number of unwanted children, a disproportionately high percentage of whom would have grown up to become criminals.

This doesn’t make Gladwell incorrect in proffering Broken Windows as an explanation for the reduction in crime.  But it is not the best explanation available, and as a result it sits somewhere between misleading (worst case) and incomplete (best case).

What Happened?

To be fair, it’s hard to hold Gladwell accountable for this oversight.  Gladwell is not a scientist and therefore not trained in how to scientifically evaluate the research he reported.  Furthermore, his is an oft-repeated mistake, one made even by highly trained researchers.  And what exactly is that mistake?  It is illustrated by the difference in approach between the Broken Windows researchers and the Freakonomics authors.  The Broken Windows researchers started with a question something like “Does the presence of vandalism invite additional vandalism and escalating crime?”  Levitt and Dubner first asked “What variables appear to explain the rate of crime?”

Broken Windows started with a question aimed at deductive analysis.  Deduction starts with a hypothesis – “evidence of vandalism and/or other petty crimes invites similar and more egregious crimes” – and then attempts to confirm or disprove that hypothesis.  It begins with a broad, abstract view of the data – a generalization or hypothesized relationship – and works toward demonstrating specific relationships between data elements.  The Broken Windows researchers started with a hypothesis, developed a series of experiments to test it, and ultimately evaluated time-series data in cities with various Broken Windows approaches to policing.  What they lacked was a broader question that might have surfaced a range of possible causes.
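To make the deductive workflow concrete, here is a minimal sketch in Python, using entirely hypothetical city numbers: one pre-formed hypothesis, one confirmatory test, and no search for alternative explanations.

```python
# A minimal sketch of a purely deductive workflow: start with one
# hypothesis ("Broken Windows policing reduces crime") and test only
# that hypothesis. The figures below are entirely hypothetical.
import numpy as np
from scipy import stats

# Hypothetical 10-year percentage changes in crime rate.
broken_windows_cities = np.array([-12.0, -9.5, -15.2, -8.1, -11.3])
other_cities = np.array([-10.4, -7.8, -13.0, -9.9, -6.5])

# A single confirmatory test of the pre-formed hypothesis.
t_stat, p_value = stats.ttest_ind(broken_windows_cities, other_cities)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# If p looks "significant" we declare victory -- but we never asked
# whether some other variable explains the drop better, which is
# exactly the gap the Freakonomics authors exposed.
```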

The Freakonomics authors started with an inductive question.  Induction is the process of moving from specific observations about data to generalizations.  These generalizations often take the form of hypotheses or models of how the data interact, and they help inform what questions should be asked of the data.  Induction asks, “What changes in which independent variables seem to correspond with a change in some dependent variable?”  Whereas deduction works from independent variable to dependent variable, induction works backwards from the dependent variable to identify relationships with independent variables.
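A first inductive pass, in other words, works backwards from the dependent variable and ranks candidate explanations by strength of association.  A minimal sketch, again with hypothetical, randomly generated data and purely illustrative variable names:

```python
# A minimal sketch of an inductive pass: scan many candidate
# independent variables and let the data suggest which relationships
# deserve a hypothesis. All columns and values are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200  # hypothetical city-year observations

df = pd.DataFrame({
    "police_spending": rng.normal(100, 15, n),
    "unemployment_rate": rng.normal(6.0, 1.5, n),
    "median_age": rng.normal(35, 4, n),
    "lagged_abortion_rate": rng.normal(20, 5, n),
})
# Synthetic dependent variable with one built-in relationship.
df["crime_rate"] = 50 - 0.8 * df["lagged_abortion_rate"] + rng.normal(0, 3, n)

# Work backwards from the dependent variable: rank candidate
# explanations by absolute correlation with crime_rate.
correlations = (
    df.corr()["crime_rate"]
    .drop("crime_rate")
    .sort_values(key=abs, ascending=False)
)
print(correlations)
```

The output of a pass like this is not an answer; it is a shortlist of better questions and hypotheses to carry into the deductive step.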

So What?

The jump to deduction, without forming the right questions and hypotheses through induction, is the biggest mistake we see in developing Big Data programs and implementing Big Data solutions.  We all approach problems with unique experiences and unique biases, and the combination of the two often causes us to race to hypotheses that we then want to test.  The issue here is two-fold.  The best case is that we develop an incomplete (and as a result partially or mostly incorrect) answer, similar to that of the Broken Windows researchers.  The worst case is that we suffer what statisticians call a Type I error – a false positive, in which we confirm a hypothesis that is in fact incorrect.  The probability of Type I errors increases when we don’t look for alternative or better explanations for the outcomes within our data sets.  Induction helps to uncover those alternative or supporting explanations.  Exploring the data to discover potential relationships helps us ask the right questions and form better hypotheses and better models.  Skipping induction makes it highly probable that we will arrive at an incorrect, misleading, or substandard answer.
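A rough back-of-the-envelope illustration of why this matters: assuming independent tests at a conventional 5% significance threshold, the chance of at least one spurious “confirmation” grows quickly with the number of pre-conceived hypotheses we race to test.

```python
# False-positive risk when testing many hypotheses, assuming
# independent tests at a 5% significance threshold.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:>2} hypotheses tested -> "
          f"P(at least one false positive) = {p_any_false_positive:.2f}")
```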

But it is not enough to simply practice both induction and deduction.  We must also recognize that the solutions that support induction differ from those that support deduction, and that the two processes, while complementary, can actually interfere with each other when performed on the same system.  Induction is necessarily broad and, as a result, slow and tedious.  Deduction, on the other hand, needs significantly less data and “prefers” to be faster in implementation.  Inductive work is best supported by systems that impose very little structure or few relations on the data we observe.  Systems that support deduction, in order to allow faster response times, impose more structure than inductive systems do.  While the two phases of discovery (induction and deduction) support each other, their differences suggest that they should be performed on solutions purpose-built to their specific needs.
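As a rough sketch of the contrast, using a handful of hypothetical records: the inductive store imposes almost no structure up front (schema-on-read), while the deductive store trades that flexibility for speed by loading only the fields a specific hypothesis needs into a typed, indexed table.

```python
# Hypothetical contrast between an inductive (schema-on-read) store
# and a deductive (schema-on-write) store for the same event feed.
import sqlite3
import pandas as pd

# Hypothetical semi-structured event records.
raw_events = [
    {"city": "A", "year": 2001, "crime_rate": 42.1, "extras": {"fare_evasion": 10}},
    {"city": "B", "year": 2001, "crime_rate": 38.7},
]

# Inductive side: keep every field, nested or not, so any of them can
# be explored later.
exploratory = pd.json_normalize(raw_events)
print(exploratory.columns.tolist())

# Deductive side: a narrow, typed, indexed table built to answer one
# specific question quickly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crime (city TEXT, year INTEGER, crime_rate REAL)")
conn.execute("CREATE INDEX idx_city_year ON crime (city, year)")
conn.executemany(
    "INSERT INTO crime VALUES (?, ?, ?)",
    [(e["city"], e["year"], e["crime_rate"]) for e in raw_events],
)
print(conn.execute(
    "SELECT crime_rate FROM crime WHERE city = 'A' AND year = 2001"
).fetchone())
```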

Similarly, not everyone is equally qualified to perform both induction and deduction.  Our experience is that the folks who tend to be good at determining how to prove relationships between variables are often not as good at identifying patterns and vice versa.

These two observations – that the systems supporting induction and deduction should be separated, and that the people performing these tasks may need to be different – have ramifications for how we develop our analytics systems and organize our Big Data teams.  We’ll discuss these ramifications and more in our next post, “10 Anti-Patterns within Big Data”.