
7 Traps to Avoid Being Fooled by Statistical Randomness

Randomness is all around us. Its existence sends fear into the hearts of predictive analytics specialists everywhere -- if a process is truly random, then it is not predictable, in the analytic sense of that term.  Randomness refers to the absence of patterns, order, coherence, and predictability in a system. 

Unfortunately, we are often fooled by random events whenever apparent order emerges in the system. In moments of statistical weakness, some folks even develop theories to explain such "ordered" patterns. However, if the events are truly random, then any correlation is purely coincidental and not causal. I remember learning in graduate school a simple joke about erroneous scientific data analysis related to this concept: "Two points in a monotonic sequence display a tendency. Three points in a monotonic sequence display a trend. Four points in a monotonic sequence define a theory." The message was clear -- beware of apparent order in a random process, and don't be tricked into developing a theory to explain random data.

Randomness is especially likely to undermine rational thinking in small-numbers phenomena. For example, suppose that I ask 12 people which NFL football team they like the most, and they all say Baltimore Ravens. Is that a statistical fluke, a fair statement about the national sentiment, or a selection effect (since all 12 people that I asked actually live in Baltimore)? The answer is probably the last of these. Okay, this example may be too obvious. So, consider the following less obvious example:

Suppose I have a fair coin (with a head or a tail being equally likely to appear when I toss the coin).  Of the following 3 sequences (each representing 12 sequential tosses of the fair coin), which sequence corresponds to a bogus sequence (i.e., a sequence that I manually typed on the computer)?

(a) HTHTHTHTHTHH

(b) TTTTTTTTTTTT

(c) HHHHHHHHHHHT

(d) None of the above.

In each case, a coin toss of head is listed as "H", and a coin toss of tail is listed as "T". 

The answer is "(d) None of the above."

None of the above sequences was generated manually. They were all actual subsequences extracted from a larger sequence of random coin tosses. I admit that I selected these 3 subsequences non-randomly (which induces a statistical bias known as a selection effect) in order to try to fool you. The small-numbers phenomenon is evident here -- when only 12 coin tosses are considered, the occurrence of any "improbable result" may lead us (incorrectly) to believe that it is statistically significant. Conversely, if we saw answer (b) continuing for dozens more coin tosses (nothing but Tails, all the way down), then that would be truly significant.

So, let's try again with another sample problem (#2) in which I truly did invent one of the three sequences (i.e., a bogus sequence that I manually typed on the computer, attempting to create my own example of a random sequence).  Which one of these 50-coin toss sequences is the bogus sequence?

(a) HTHHTHHTTHHTTTHTHTHHHTHTHTHHHTTHTTTHTHTHHTTHTHTHTT

(b) HHHHHHTHTHHHHHTTTHTTTTHTTHHHHTHHHHHTHTTHHHTHHHHHHH

(c) THTTTTTTHTTTTTTTTHHHTTTTHHTTTTHHHTHHTTHHTTTTTHTTHH

For the two real (non-bogus) sequences, I used a random number generator to generate the 50-coin sequence. The random number generator (common to nearly all scientific programming environments) produces a random number between 0 and 1. I simply labeled the event as "H" when the number was 0.5 or greater, and labeled the event as "T" whenever the number was less than 0.5.
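The labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the code that produced the sequences in this post; the function name toss_sequence and the seed value are my own choices.

```python
import random


def toss_sequence(n_tosses=50, rng=None):
    """Return n_tosses fair coin tosses as a string of 'H' and 'T'."""
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform on [0, 1): label 'H' when the draw is
    # 0.5 or greater, and 'T' when it is less than 0.5.
    return "".join("H" if rng.random() >= 0.5 else "T" for _ in range(n_tosses))


if __name__ == "__main__":
    # The seed is arbitrary; it only makes the printed sequence repeatable.
    print(toss_sequence(50, random.Random(2015)))
```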

The answer to sample problem #2 is ... posted at the bottom of this post (by which point you will have probably guessed it).

This topic of "fooled by randomness" came up recently when I was reading an article on the Turing Award winners from 1966 through 2013.

This article lists many interesting statistical facts about the 61 winners of the award. The article provides a fun, interactive data visualization built with Tableau tools in which you can explore these statistical data, which include: each winner's birth year, age at time of award, nationality, gender, and... astrological sign! Being a data scientist and astrophysicist, I found the inclusion of Zodiac sign to be disconcerting. However, the author of the original post does admit that this was included jokingly. 

As you look at the data, you will see that 10 of the 61 Turing Award winners were born under one specific sign of the Zodiac, and only 2 of the 61 winners were born under another sign (in fact, two different signs have only 2 winners each). These questions then arise: Is there significance to this apparent correlation? Is there true order here, and not randomness? Are Capricorns really five times more likely to win future Turing Awards than Scorpios?

Of course, the answer to these questions is that the distribution of astrological birth signs is consistent with a purely random process, with no astrological (or astronomical) significance whatsoever. Demonstrating this looked like another fun exercise for my random number generator.

So, I generated random birth months (1 through 12, corresponding equivalently to the 12 signs of the Zodiac) for 61 individuals.  (For simplicity, I assumed that all birth months are equally likely, thus ignoring the variable length of the various months.)  I repeated this simulation 100,000 times (which almost certainly falls into that scientific data analysis category of "overkill"). I then counted in how many of the 100,000 simulations each of the following apparent correlations appeared:

(1) We find 10 or more of the 61 individuals with the same birth month (astrological sign):

Answer: in 32% of the simulations

(2) We find 2 or fewer of the 61 individuals in any one of the birth months:

Answer: in 80% of the simulations

(3) We see the ratio of "maximum number of birthdays in one of the months" to "minimum number of birthdays in another month" equal to 5 or greater:

Answer: in 40% of the simulations

(4) We see the ratio of "maximum number of birthdays in one of the months" to "minimum number of birthdays in another month" equal to 4.5 or greater:

Answer: in 49% of the simulations

Therefore, it is statistically reasonable and totally expected that we would see 1 or 2 birth months that contain only two award winners.  It is also statistically reasonable that we could see 5 times as many winners in the most populous month as in the least populous month. Regarding the first correlation (10 or more of the 61 individuals with the same birth month), 32% of the simulations is a non-trivial percentage, so it is not surprising that we see it occur in real life.
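For readers who want to try the experiment themselves, here is a minimal sketch of such a simulation in Python. It is a reconstruction from the description above, not the original code. The function and key names are hypothetical, and the treatment of an empty month (counted here as an infinite max/min ratio) is an assumption, so the two ratio percentages may differ somewhat from the figures quoted above.

```python
import random
from collections import Counter

N_WINNERS = 61  # number of award winners in the data set
N_MONTHS = 12   # equally likely birth months (a simplifying assumption)


def simulate(n_trials=100_000, seed=None):
    """Return the fraction of trials in which each apparent pattern shows up."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(n_trials):
        # Assign each winner a uniformly random birth month.
        counts = [0] * N_MONTHS
        for _ in range(N_WINNERS):
            counts[rng.randrange(N_MONTHS)] += 1
        most, least = max(counts), min(counts)
        # Assumption: a month with no birthdays gives an infinite ratio.
        ratio = most / least if least > 0 else float("inf")
        hits["max_ge_10"] += most >= 10       # (1) some month has 10 or more
        hits["min_le_2"] += least <= 2        # (2) some month has 2 or fewer
        hits["ratio_ge_5"] += ratio >= 5      # (3) max/min ratio of 5 or more
        hits["ratio_ge_4.5"] += ratio >= 4.5  # (4) max/min ratio of 4.5 or more
    names = ("max_ge_10", "min_le_2", "ratio_ge_5", "ratio_ge_4.5")
    return {name: hits[name] / n_trials for name in names}


if __name__ == "__main__":
    # Fewer trials than the 100,000 used above, to keep the run short.
    for name, fraction in simulate(n_trials=20_000, seed=61).items():
        print(f"{name}: {fraction:.1%}")
```

Pure Python is slow but dependency-free; a vectorized version with an array library would run the full 100,000 trials much faster.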

What conclusions can we draw from all of this discussion of "fooled by randomness"? What are the traps that we can fall into?

  • We often tend to pick out and focus on the "most interesting" results in our data, and ignore the uninteresting cases. This is selection bias, and also is an example of "a posteriori" statistics (derived from observed facts, not from logical principles).
  • It is easy to be fooled by randomness, especially in our rush to build predictive analytics models that actually predict interesting outcomes.
  • This is similar to the birthday paradox (in which the likelihood that two people in a crowd have the same birthday is approximately 50% when there are only 23 people in the group). This 50-50 break point occurs at such a small number because, as you increase the sample size, it becomes less and less likely that every birthday is distinct (i.e., that no repeating pattern appears in random data).
  • Humans are good at seeing patterns and correlations in data, but correlation does not imply causation.
  • The bigger the data set, the more likely you will see an "unlikely" pattern!
  • What we see in the Turing Awards data is evidence of a "small-numbers phenomenon."
  • When asked to pick the "random" statistical distribution that is generated by a human (versus a distribution generated by an algorithm), we tend to confuse "randomness" with the "appearance of randomness". A distribution may appear to be more random, but in fact it is less random, since it has an unrealistically small variance in behavior: lots of non-repeating values, but few large repetitions (i.e., we forget to take into account the long tail of the distribution). For example, in sample problem #1 above, the answer (b) sequence of 11 T's after the initial T has a statistical likelihood of 1 part in 2^11 (once in 2048 twelve-toss subsequences), which is rare, but it still occurred in my real experiment!
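Two of the probabilities quoted in the list above are easy to check numerically. The sketch below is my own illustration, with hypothetical function names: it computes the chance of a shared birthday under the usual assumption of 365 equally likely days, and the expected number of 12-toss stretches that are all heads or all tails in a longer run of fair tosses (each stretch has probability 2/2^12 = 1/2048).

```python
def shared_birthday_probability(n_people, n_days=365):
    """Probability that at least two of n_people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n_people):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (n_days - i) / n_days
    return 1.0 - p_all_distinct


def expected_uniform_windows(n_tosses, window=12):
    """Expected count of length-`window` stretches that are all H or all T."""
    n_windows = max(n_tosses - window + 1, 0)
    # Each window is uniform with probability 2 / 2**window
    # (by linearity of expectation, overlaps do not matter).
    return n_windows * 2 / 2 ** window


if __name__ == "__main__":
    print(shared_birthday_probability(23))   # about 0.507
    print(expected_uniform_windows(10_000))  # several such stretches expected
```

So in a sequence of a few thousand tosses, at least one "impossible-looking" 12-toss stretch is to be expected rather than feared.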

So, this brings us back to our sample problem #2, whose correct answer is: (a).

If that answer surprises us, it is because when we generate random sequences manually (without the aid of an objective unbiased algorithm), or when we try to judge if a data string is a random sequence, we are prone to falling into some of the traps listed above.
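One simple diagnostic for this kind of question, which I offer as a heuristic rather than a formal test of randomness, is the length of the longest run of identical tosses. The sketch below (with hypothetical function names) measures the longest run in each sequence from sample problem #2 and then estimates by simulation how often a fair 50-toss sequence has runs that short.

```python
import random
from itertools import groupby

# The three sequences from sample problem #2, copied from above.
SEQUENCES = {
    "a": "HTHHTHHTTHHTTTHTHTHHHTHTHTHHHTTHTTTHTHTHHTTHTHTHTT",
    "b": "HHHHHHTHTHHHHHTTTHTTTTHTTHHHHTHHHHHTHTTHHHTHHHHHHH",
    "c": "THTTTTTTHTTTTTTTTHHHTTTTHHTTTTHHHTHHTTHHTTTTTHTTHH",
}


def longest_run(tosses):
    """Length of the longest stretch of identical consecutive tosses."""
    return max(len(list(group)) for _, group in groupby(tosses))


def fraction_with_short_runs(max_run, n_tosses=50, n_trials=20_000, seed=None):
    """Estimate how often a fair sequence has no run longer than max_run."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        tosses = "".join(rng.choice("HT") for _ in range(n_tosses))
        hits += longest_run(tosses) <= max_run
    return hits / n_trials


if __name__ == "__main__":
    for label, tosses in SEQUENCES.items():
        print(label, longest_run(tosses))  # a 3, b 7, c 8
    print(fraction_with_short_runs(3, seed=0))
```

Sequence (a) never repeats a toss more than three times in a row, which only a small fraction of genuinely random 50-toss sequences manage. That is the unrealistically small variance mentioned in the list of traps.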

If you want to read more about this interesting topic, and to avoid being fooled by randomness in your data science and data analysis activities, then I recommend the following three books: 
  1. "Fooled By Randomness", by Nassim Nicholas Taleb.
  2. "The Flaw of Averages", by Sam L. Savage.
  3. "The Drunkard's Walk - How Randomness Rules Our Lives", by Leonard Mlodinow.


Comment by Kirk Borne on January 21, 2015 at 1:18pm

@Dimitrios, That is a great comment! Thank you for adding your insights and discussion of hidden statistical bias (hidden variables) that can adversely induce unscientific conclusions from some individuals.

Comment by Dimitrios Geromichalos on January 21, 2015 at 3:53am

Very interesting article. I wanted only to add that there can be indeed some real Zodiac correlations which have nothing to do with astrology, of course.

One example is the birth date of e.g. German soccer players who tend to be born more at the beginning than at the end of the year (i.e. more likely "Aquarius" or "Pisces" than "Scorpio" or "Sagittarius"). The reason for that is that talented children are trained in groups according to their age with a cut-off date of January the first. The developmental edge of some months is quite significant for children. So, children born at the beginning of the year appear to be more talented and are more supported. This has an effect for their whole following career and is known as "Relative Age Effect".

I think it is very important to know about such effects since astrologers tend to misuse them for their own "theories".

(See article (in German): http://www.zeit.de/sport/2013-06/dfb-u21-nachwuchsfussball-dezember...)

Comment by Kirk Borne on January 12, 2015 at 1:30pm

@William, Thanks for your comments. My point is that astrology is a diversionary pastime (at best), not science. And science (through modeling and simulation) can reproduce the distribution of birth months for the Turing Award winners. Nevertheless, I do agree with you that one's age relative to one's peer group when starting school is important in early development, but that difference fades with time, especially as inherent aptitudes (for sports, science, math, art, languages, innovation, etc.) start to emerge.

Comment by William Shearin on January 12, 2015 at 1:11pm

Sadly, you have missed a point on the "randomness" of astrological signs.  It has been well documented that your age relative to your peer group on starting school is related to sports performance.  The determinant of your relative age in your grade is (unsurprisingly) when in the calendar year you were born.  It may be causally true that certain Zodiac signs are over- or under-represented in the Turing Awards, as the birth date effect (although possibly for different months) is highly present in the NFL.  I don't claim astrology is useful; I'm just pointing out that there is an impact of the month of your birth.

Comment by Donald Costello on January 10, 2015 at 6:55pm

You capture some of the traps, all of them traps in logical statistical thinking, but there is at least one other. When we want a random number from a distribution, we are often dealing with long-tailed distributions. This roughly means that certain events can be very rare.

However, when we are sampling on HPC machines, what is rare can occur, and after waiting long enough (which is a short time in the HPC arena), it does occur.

This kind of sampling was very rare in slower CPU environments.

One has to plan for the fact that rare random events can and do occur.
