
This seminal article highlights the dangers of recklessly applying and scaling data science techniques that have worked well on small, medium-size, and large data sets. We illustrate the problem with flaws in big data trading, and propose solutions. We also believe that expert data scientists are more abundant (though very different) than hiring companies claim: see the "related articles" section at the bottom for details. The article is written in plain English and is quite short; it contains high-level material for decision makers as well as deeper technical explanations where needed.

In short, the curse of big data is the fact that when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power - even worse, the strongest patterns might be

  • entirely caused by chance (just as a lottery winner wins purely by chance),
  • not replicable,
  • devoid of predictive power,
  • and yet obscuring weaker patterns that are ignored despite having strong predictive power.

The question is: how do you discriminate between a real signal and an accidental one in vast amounts of data?

Let's focus on one example: identifying strong correlations or relationships between time series. If you have 1,000 metrics (time series), you can compute 499,500 = 1,000*999/2 correlations. If you include cross-correlations with time lags (e.g. stock prices for IBM today with stock prices for Google 2 days ago), then we are dealing with many, many millions of correlations. Out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will lose. Keep in mind that analyzing cross-correlations on all metrics is one of the very first steps statisticians perform at the beginning of any project - it's part of the exploratory analysis step. However, a spectral analysis of normalized time series (instead of correlation analysis) provides a much more robust mechanism to identify true relationships.
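
To get a feel for the scale involved, here is a quick back-of-the-envelope count; the k = 1,000 metrics and the ten lags are illustrative values, not figures taken from any real trading system:

    # Counting how many correlations end up being tested.
    k = 1000                 # number of metrics (e.g. stock symbols) -- illustrative
    max_lag = 10             # lags of 1 to 10 time units -- illustrative

    pairs = k * (k - 1) // 2             # plain (lag 0) correlations: 499,500
    lagged = max_lag * k * (k - 1)       # for lag > 0, "A now vs B lagged" and
                                         # "B now vs A lagged" are different tests
    print(pairs + lagged)                # roughly 10.5 million correlations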

To illustrate the issue, let's say that you have k time series, each with n observations: for instance, price deltas (price increases or decreases) computed for k different stock symbols with various time lags over the same time period consisting of n days. For instance, you want to detect patterns such as "when Google stock price goes up, Facebook goes down one day later". In order to detect such profitable patterns, you must compute cross-correlations over thousands of stocks, with various time lags: one day, two days, or maybe one second, two seconds, depending on whether you do daily trading or extremely fast intraday, high frequency trading. Typically, you keep a small number of observations - e.g. n=10 days or n=10 milliseconds - as these patterns evaporate very fast (once your competitors detect the pattern in question, it stops being profitable). In other words, you can assume that n=10, or maybe n=20. In other cases based on monthly data (environmental statistics, emergence of a new disease), maybe n=48 (monthly data collected over a 4-year time period). In some cases n might be much larger, but then the curse of big data is no longer a problem. The curse of big data is very acute when n is smaller than 200 and k is moderately large, say k=500. However, instances where both n is large (> 1,000) and k is large (> 5,000) are rather rare.
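
To make the "Google up today, Facebook down tomorrow" type of pattern concrete, here is a minimal sketch of how one such lagged correlation would be computed; it uses numpy and simulated price deltas rather than real market data:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20                            # number of days kept, as discussed above
    deltas_a = rng.normal(size=n)     # simulated daily price deltas for stock A
    deltas_b = rng.normal(size=n)     # simulated daily price deltas for stock B

    def lagged_corr(x, y, lag):
        """Correlation between x[t] and y[t + lag], i.e. y reacting `lag` periods after x."""
        if lag == 0:
            return np.corrcoef(x, y)[0, 1]
        return np.corrcoef(x[:-lag], y[lag:])[0, 1]

    # "When A goes up, B goes down one day later" would show up as a strongly
    # negative value here -- but with n this small, a strong value can easily
    # be pure chance, which is exactly the curse discussed in this article.
    print(lagged_corr(deltas_a, deltas_b, lag=1))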

Now let's review a bit of mathematics to estimate the chance of being wrong when detecting a very high correlation. We could have done Monte Carlo simulations to compute the chance in question, but here we use plain old-fashioned statistical modeling.

Let's introduce a new parameter, denoted as m, representing the number of paired (bi-variate), independent time series selected out of the set of k time series at our disposal: we want to compute correlations for these m pairs of time series. Theoretical question: assuming you have m independent paired time series, each consisting of n numbers generated via a random number generator (an observation being e.g. a simulated normalized stock price at a given time for two different stocks), what are the chances that among the m correlation coefficients, at least one is higher than 0.80?
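
Before going through the analytical answer, here is what the Monte Carlo route mentioned above might look like; a minimal sketch assuming numpy, with illustrative parameter values n=20, m=10,000, a=0.80 and 200 simulation runs:

    import numpy as np

    rng = np.random.default_rng(42)
    n, m, a = 20, 10_000, 0.80    # observations per series, number of pairs, threshold
    n_runs = 200                  # Monte Carlo repetitions -- illustrative value

    hits = 0
    for _ in range(n_runs):
        x = rng.normal(size=(m, n))   # m independent white-noise series of length n
        y = rng.normal(size=(m, n))   # m more, paired with the first set
        xc = x - x.mean(axis=1, keepdims=True)
        yc = y - y.mean(axis=1, keepdims=True)
        # Pearson correlation of each pair, computed row by row
        r = (xc * yc).sum(axis=1) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
        hits += int(r.max() > a)

    # Fraction of runs in which at least one of the m pairs correlates above a.
    print(hits / n_runs)

The fraction printed here should land close to the analytical figure derived below.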

Under this design, the theoretical correlation coefficient (as opposed to the estimated correlation) is 0. To answer the question, let's assume (without loss of generality) that the time series (after a straightforward normalization) are Gaussian white noise. Then the estimated correlation coefficient, denoted as r, is (asymptotically, that is, approximately when n is not small) normal with mean = 0 and variance = 1/(n-1). The probability that r is larger than a given large number a (say a=0.80, meaning a strong correlation) is p=P(r>a), with P computed under a normal distribution with mean = 0 and variance = 1/(n-1). The probability that, among the m bivariate (paired) time series, at least one has a correlation above a=0.80 is thus equal to 1-[(1-p)^m], that is, 1 minus (1-p) raised to the power m.

For instance,

  • If n=20 and m=10,000 (10,000 paired time series, each with 20 observations), then the chance that your conclusion is wrong (that is, that at least one correlation exceeds a=0.80 purely by chance) is 90.93%.
  • If n=20 and m=100,000 (still a relatively small value for m), then the chance that your conclusion is VERY wrong (that is, that at least one correlation exceeds a=0.90 purely by chance) is 98.17%.
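
Readers who want to check these figures can reproduce the calculation with scipy instead of the stattrek calculator; a minimal sketch is shown below (the last digits may differ slightly from the figures above, since the result is sensitive to how the normal tail probability is rounded):

    from math import sqrt
    from scipy.stats import norm

    def prob_at_least_one_spurious(n, m, a):
        """P(at least one of m white-noise correlations exceeds a), with n observations per series."""
        p = norm.sf(a, loc=0, scale=1 / sqrt(n - 1))   # P(r > a) with r ~ N(0, 1/(n-1))
        return 1 - (1 - p) ** m

    print(prob_at_least_one_spurious(n=20, m=10_000, a=0.80))    # roughly 0.91
    print(prob_at_least_one_spurious(n=20, m=100_000, a=0.90))   # roughly 0.98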

Now, in practice, the way it works is as follows: you have k metrics or variables, each one providing a time series computed at n different time intervals. You compute all cross-correlations, that is, m = k*(k-1)/2. However, the assumption of independence between the m paired time series is now violated, which pulls the observed correlations away from very high values such as a=0.90. But also, your data is not random numbers, it's not white noise, so the theoretical correlations are well above 0, maybe around 0.15 when n=20. Also, m will be much higher than (say) 10,000 or 100,000 even when you have as few as k=1,000 time series (say one time series for each stock price). These three factors (non-independence, theoretical r different from 0, very large m) balance out and make the above computations still quite accurate when applied to a typical real big data problem. Note that I computed the probabilities using the online stattrek calculator.

Conclusion: hire the right data scientist before attacking big data problems. He or she does not need to be highly technical, but must be able to reason along the lines of the argument above and identify possible causes of model failure before even writing down a statistical or computer science algorithm. Being a statistician helps, but you don't need advanced knowledge of statistics. Being a computer scientist also helps, to scale your algorithms and keep them simple and efficient. Being an MBA analyst also helps, to understand the business problem that needs to be solved. Being all three at once is even better. And yes, such people do exist and are not that rare.

Exercise:

Let's say you have 3 random variables X, Y, Z with corr(X,Y)=0.70 and corr(X,Z)=0.80. What is the minimum possible value of corr(Y,Z)? Can this correlation be negative?
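
One way to explore this exercise numerically (a brute-force sketch, not a substitute for working it out by hand) is to scan for the values of corr(Y,Z) that keep the 3x3 correlation matrix positive semi-definite, which any valid correlation matrix must be:

    import numpy as np

    r_xy, r_xz = 0.70, 0.80

    def is_valid_corr_matrix(r_yz):
        """A correlation matrix must be positive semi-definite."""
        c = np.array([[1.0,  r_xy, r_xz],
                      [r_xy, 1.0,  r_yz],
                      [r_xz, r_yz, 1.0]])
        return np.linalg.eigvalsh(c).min() >= -1e-12   # tolerate tiny numerical error

    feasible = [t for t in np.linspace(-1, 1, 20001) if is_valid_corr_matrix(t)]
    print(min(feasible), max(feasible))   # the feasible range for corr(Y,Z)

The printed range immediately answers whether corr(Y,Z) can be negative.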

Related articles:


Comment by Mark L. Stone on December 21, 2013 at 9:27pm

Quite apart from whether variance or standard deviation is sensitive to outliers, there are very specific calculations which require it, and when they do, it should be calculated in a numerically stable manner.

Whoever wrote that variance computation code in Mahout is like a grade school child using his lunch knife to perform brain surgery - it may not come out well, but (s)he does know how to use a knife after all, so (s)he is qualified - or so (s)he thinks.

Comment by Vincent Granville on December 21, 2013 at 8:47pm

@Mark: I would also question the use of E(X^2) - E(X)^2, no matter how stable the computation is. Variance uses squares, which tends to make this metric sensitive to outliers, essentially squaring big deviations. In large data sets, there will be some big outliers, and they will render this metric useless.

Comment by Mark L. Stone on December 21, 2013 at 6:26pm

Following up on Mirko's comment, has the numerically unstable computation of sample variance in Hadoop/Mahout via E(X^2) - E(X)^2 been replaced by a numerically stable computation yet? Numerically stable one-pass methods for computing the sample variance have been known for 51 years, see http://webmail.cs.yale.edu/publications/techreports/tr222.pdf , and are readily parallelizable. Nevertheless, Hadoop/Mahout is undoubtedly an excellent tool if your goal is to very rapidly calculate the wrong answer on massive data sets.

The numerical method used to calculate sample variance is THE first thing I check when given access to source code. And hey sports fans, if the program authors don't get that right, it's usually just the tip of the iceberg, and symptomatic of software written by person(s) illiterate in numerical mathematical computation. The finite precision nature of floating point computation must be taken into account if effective, numerically stable, and reliable software is to be produced. If the computations were done in higher precision, such as quad precision, the numerical instability "day of reckoning" can be pushed out a ways, but eventually it comes, and when it does, things go downhill in a hurry.

For a given number of floating point digits, all else being equal, the larger the data set, the less you can get away with using numerically unstable algorithms. Ideally, calculations on big data should be done in quad precision using numerically stable algorithms, and barring that, performed in double precision using numerically stable algorithms. Otherwise, people are just maximizing the speed at which they produce unreliable answers. You might think I'm just being a nitpicker, as real data is only an approximation anyhow. Really? What would you think if numerical variance came out negative? That's significant, and why to this day, I still see codes, some VV&A'd (verified, validated and accredited) by the government for use in safety critical applications, which calculate standard deviation as sqrt(abs(variance)) or sqrt(max(variance, 0)), which is the "fix" someone put in after getting a sqrt-of-a-negative-argument error message.
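
To make the catastrophic cancellation concrete, here is a minimal sketch (not the actual Mahout code) contrasting the naive E(X^2) - E(X)^2 formula with Welford's numerically stable one-pass method; on data with a huge mean and a tiny spread, such as timestamps, the naive formula can even go negative:

    import numpy as np

    def naive_variance(xs):
        """One-pass E(X^2) - E(X)^2: fast and parallelizable, but numerically unstable."""
        n = s = s2 = 0.0
        for x in xs:
            n += 1
            s += x
            s2 += x * x
        return s2 / n - (s / n) ** 2

    def welford_variance(xs):
        """Welford's one-pass method: same cost, numerically stable."""
        n, mean, m2 = 0, 0.0, 0.0
        for x in xs:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        return m2 / n

    # Simulated timestamps: mean around 1e9 seconds, spread of about 1 second,
    # so the true variance is close to 1.
    data = 1.0e9 + np.random.default_rng(1).normal(scale=1.0, size=100_000)
    print(naive_variance(data))     # typically far off, and can even come out negative
    print(welford_variance(data))   # close to 1.0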

Comment by BR Deshpande on August 16, 2013 at 1:38pm

Great article!

I agree totally that in a larger dataset the probabilities of finding spurious/accidental correlations are higher than in a smaller dataset. In this context, "larger" implies higher k. But when you state "curse of big data is very acute when n is smaller than 200", are we even working with big data? 

Comment by Vincent Granville on July 31, 2013 at 9:18pm

 I think the problem is two-fold: 

1) Statisticians have not been involved in the big data revolution. Some have written books such as applied data science, but it's just a repackaging of very old stuff, and has nothing to do with data science. Read my article on fake data science, at http://www.analyticbridge.com/profiles/blogs/fake-data-science

2) Methodologies that work for big data sets - as big data was defined back in 2005 (20 million rows would qualify back then) - miserably fail on post-2010 big data (terabytes). Read my article on the curse of big data, at http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data

As a result, people think that data science is just statistics, with a new name. They are totally wrong on two points: they confuse data science and fake data science, and they confuse big data 2005 and big data 2013.

Comment by Mirko Krivanek on January 6, 2013 at 7:20pm

Other examples of misuses:

  1. Flaw in Google's keyword relevancy algorithms due to the complexity of text data. When you enter a search query such as "mining data" on Google (that is, data about mining companies), it returns search results about "data mining": the search relevancy algorithm is flawed. This Google algorithm maps the user query to an internal Google indexed keyword (the keyword index is used to associate URLs with their relevant keywords). To create the mapping, the user query is first standardized: typos are fixed, unimportant words (or, the, and, etc.) are removed, plurals are ignored, -ing suffixes are removed ("booking" could become "book" unless Google uses a table of words where -ing can't be removed), and finally only one combination among all n! potential orderings of the indexed keyword's tokens (where n = number of tokens in the indexed keyword) is stored in the keyword index table: the combination where all tokens are listed in alphabetical order (see the sketch after this list). The solution consists of keeping all of the combinations with large volume (usually 1, 2 or 3 at most, out of n!): in the case of "data mining", both "data mining" and "mining data" should be kept in the keyword index, and thus treated separately.
  2. Highly unstable numerical computations in Hadoop/Mahout, reminding me of the inaccuracies in Excel statistical computations. If you look up the way Hadoop/Mahout computes the variance, it uses the naive E(X^2) - E(X)^2 approach, which is fundamentally flawed because it suffers from catastrophic cancellation (see the numerical stability presentation at ICDE 2012). Nobody has noticed or fixed this yet! It's a serious flaw that the variance is numerically unstable if your data is not centered around 0 (such as timestamps). Heck, it can even become negative. There has been too much attention to scalability and "volume", and no "verification".
  3. Spam detection, user reviews and recommendations based on crowd sourcing. Most of these technologies lack "fake review" and "bogus spam report" detection algorithms, resulting in tons of false positives and false negatives. A bogus spam report is (for instance) a Gmail user flagging an email message as spam, by error or on purpose (to hurt a competitor). The reverse also happens: a Gmail spammer creating dozens of Gmail accounts, and flagging all his spam messages as "not spam" (that is, moving his spam from SpamBox to InBox in dozens of bogus Gmail accounts created for that very purpose, hoping it will clear spam flags on all Gmail accounts). Big data needs to be much smarter about detecting these simple tricks.
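
Here is a minimal sketch of the indexing behavior described in the first item above; it is a simplified caricature, not Google's actual pipeline, but it shows why "data mining" and "mining data" collapse onto the same index key once tokens are sorted alphabetically:

    def index_key(query):
        """Simplified caricature of the standardization described above:
        lowercase, drop unimportant words, then keep only the alphabetical
        ordering of the remaining tokens."""
        stopwords = {"or", "the", "and", "a", "an", "of"}
        tokens = [t for t in query.lower().split() if t not in stopwords]
        return " ".join(sorted(tokens))

    print(index_key("data mining"))   # 'data mining'
    print(index_key("mining data"))   # 'data mining' -- the two queries become indistinguishable

    # The proposed fix: keep the few high-volume orderings as separate index keys
    # instead of collapsing everything onto the alphabetical one.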

Good news: eventually these issues will be fixed. It's easy to fix them.
