Subscribe to Vincent Granville's Weekly Digest:

This seminal article highlights the dangers of reckless applications and scaling of data science techniques that have worked well for small, medium-size and large data. We illustrate the problem with flaws in big data trading, and propose solutions. Also, we believe expert data scientists are more abundant (but very different) than what hiring companies claim: read our "related articles" section at the bottom for more details. This article is written in simple English, is very short and contains both high level material for decision makers, as well as deep technical explanations when needed.

In short, the curse of big data is the fact that when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power - even worse, the strongest patterns might be

  • entirely caused by chance (just like someone who wins at the lottery wins purely by chance) and
  • not replicable,
  • having no predictive power,
  • but obscuring weaker patterns that are ignored yet have a strong predictive power.

The questions is: how do you discriminate between a real and accidental signal in vast amounts of data?

Let's focus on one example: identifying strong correlations or relationships between time series. If you have 1,000 metrics (time series), you can compute 499,500 = 1,000*999/2 correlations. If you include cross-correlations with time lags (e.g. stock prices for IBM today with stock prices for Google 2 days ago), then we are dealing with many, many millions of correlations. Out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will loose. Keep in mind that analyzing cross-correlations on all metrics is one of the very first step statisticians do at a beginning of any project - it's part of the exploratory analysis step. However, a spectral analysis of normalized time series (instead of correlation analysis) provide a much more robust mechanism to identify true relationships.

To illustrate the issue, let's say that you have k time series, each with n observations, for instance, price deltas (price increases or decreases) computed for k different stock symbols with various time lags over a same time period consisting of n days. For instance, you want to detect patterns such as "When Google stock price goes up, Facebook goes down one day later". In order to detect such profitable patterns, you must compute cross-correlations over thousands of stocks, with various time lags: one day, two days, or maybe one second, two seconds depending on whether you do daily trading or extremely fast intraday, high frequency trading. Typically, you keep a small number of observations - e.g. n=10 days or n=10 milliseconds - as these patterns evaporate very fast (when your competitors detect the patterns in question, it stops becoming profitable). In other words, you can assume that n=10, or maybe n=20. In other cases based on monthly data (environmental statistics, emergence of a new disease), maybe n=48 (monthly data collected over a 2-year time period). In some cases n might be much larger, but in that case the curse of big data is no longer a problem. The curse of big data is very acute when n is smaller than 200 and k moderately large, say k=500. However, instances where both n is large (> 1,000) and k is large (> 5,000) are rather rare.

Now let's review a bit of mathematics to estimate the chance of being wrong when detecting a very high correlation. We could have done Monte Carlo simulations to compute the chance in question, but here we use plain old-fashioned statistical modeling.

Let's introduce a new parameter, denoted as m, representing the number of paired (bi-variate), independent time series selected out of the set of k time series at our disposal: we want to compute correlations for these m pairs of time series. Theoretical question: assuming you have m independent paired time series, each consisting of n numbers generated via a random number generator (an observation being e.g. a simulated normalized stock price at a given time for two different stocks), what are the chances that among the m correlation coefficients, at least one is higher than 0.80?

Under this design, the theoretical correlation coefficient (as opposed to the estimated correlation) is 0. To answer the question, let's assume (without loss of generality) that the time series (after a straightforward normalization) are Gaussian white noise. Then the estimated correlation coefficient, denoted as r, is (asymptotically, that is approximately when n is not small) normal with mean = 0 and variance = 1/(n-1). The probability that r is larger than a given large number a (say a=0.80, meaning a strong correlation) is p=P(r>a) with P representing a normal distribution with mean = 0 and variance = 1/(n-1). The probability that, among the m bivariate (paired) time series, at least one has of correlation above a=0.80 is thus equal to 1-[(1-p)^m] (1-p at power m).

For instance,

  • If n=20, m=10,000 (10,000 paired time series each with 20 observations), then the chance that your conclusion is wrong (that is, a=0.80) is 90.93%.
  • If n=20, m=100,000 (still a relatively small value for m), then the chance that your conclusion is VERY wrong (that is, a=0.90) is 98.17%.

Now in practice the way it works is as follows: you have k metrics or variables, each one providing a time series computed at n different time intervals. You compute all cross-correlations, that is m = k*(k-1)/2. However the assumption of independence between the m paired time series is now violated, thus concentrating correlations further away from a very high value such as a=0.90. But also, your data is not random numbers, it's not white noise. So the theoretical correlations are much above absolute 0, maybe around 0.15 when n=20. Also m will be much higher than (say) 10,000 or 100,000 even when you have as few as k=1,000 time series (say one time series for each stock price). These three factors (non independence, theoretical r different from 0, very large m) balance out and make my above computations still quite accurate when applied to a real typical big data problem. Note that I computed my probabilities using the online calculator stattrek.

Conclusion: hire the right data scientist before attacking big data problems. He/she does not need to be highly technical, but able to think in a way similar to my above argumentation, to identify possible causes of model failures before even writing down a statistical or computer science algorithm. Being a statistician helps, but you don't need to have advanced knowledge of stats. Being a computer scientist also helps to scale your algorithms and make them simple and efficient. Being a MBA analyst also helps to understand the problem that needs to be solved. Being the three types of guy at the same time is even far better. And yes these guys do exist and are not that rare.

Exercise:

Let's say you have 3 random variables X, Y, Z with corr(X,Y)=0.70, corr(X,Z)=0.80. What is the minimum value for corr(Y,Z). Can this correlation be negative?

Related articles:

Views: 9887

Comment

You need to be a member of AnalyticBridge to add comments!

Join AnalyticBridge

Comment by Mirko Krivanek on January 6, 2013 at 7:20pm

Other examples of misuses:

  1. Flaw in Google's keyword relevancy algorithms due to complexity of text data. When you enter a search query such as "mining data" on Google (that is, data about mining companies), it returns search results about "data mining": the search relevancy algorithm is flawed. This Google algorithm maps the user query to an internal Google indexed keyword (the keyword index is used to associate URLs with their relevant keywords). To create the mapping, the user query is first standardized: typos are fixed, unimportant words (or, the, and etc.) removed, plurals ignored, -ing are removed ("booking" could become "book" unless Google uses a table of words where -ing can't be removed), and finally only one combination among all n! potential n-grams of the indexed keyword (where n = number of tokens in indexed keyword) is stored in the keyword index table: it is the combination where all tokens are listed in alphabetical order. The solution consists of keeping all of the combinations (usually, 1, 2 or 3 at most, out of n!) with large volume: in the case of "data mining", both "data mining" and "mining data" should be kept in the keyword index, and thus treated separately.
  2. Highly unstable numerical computations in Hadoop/Mahout, reminding me of the inaccuracies in Excel statistical computations. If you look up the way Hadoop/Mahout computes the variance, it uses the naive E(X)^2-E(X^2) approach which is fundamentally flawed because it suffers from catastrophic cancellation. (See the numerical stability presentation at ICDE 2012). Nobody noticed or fixed this yet! It's  a serious flaw that the variance is numerically unstable if your data is not centered around 0 (such as in timestamps). Heck, it can even become negative.  There was too much attention to scalability and "volume", no "verification".
  3. Spam detection, user reviews and recommendations based on crowd sourcing. Most of these technologies lack "fake review" and "bogus spam report" detection algorithms, resulting in tons of false positives and false negatives. A bogus spam report is (for instance) a Gmail user flagging an email message as spam, by error or on purpose (to hurt a competitor). The reverse also happens: a Gmail spammer creating dozens of Gmail accounts, and flagging all his spam messages as "not spam" (that is, moving his spam from SpamBox to InBox in dozens of bogus Gmail accounts created for that very purpose, hoping it will clear spam flags on all Gmail accounts). Big data needs to be much smarter about detecting these simple tricks.

Good news: eventually these issues will be fixed. It's easy to fix them.

Follow us

© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC

Badges  |  Report an Issue  |  Terms of Service