# AnalyticBridge


# To sample or not to sample, what do you think?

Here are my thoughts:

In combinatorial problems, sampling is necessary. If you try to find the optimum vector of attributes (e.g. the one with the best fraud discriminative power) in a data set that has 40 attributes, you must sample: to compute the discriminative power of an attribute vector, you need to process (say) 50 million observations (your data set), and the total number of potential vectors is 2 to the power 40. In short, you would need to process 2^40 * 50MM ≈ 5.5 * 10^19 data points, ideally in a few hours. Of course, there are algorithms that significantly reduce the amount of computation by testing multiple vectors at once; that's what I designed when I was working with Visa to detect credit card fraud. Yet sampling the vector space is still necessary.
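To make the scale concrete, here is a minimal Python sketch that computes the size of the exhaustive search and then draws a random sample of attribute vectors instead. It is only an illustration, not the algorithm used at Visa: the names `sample_attribute_vectors`, `best_vector` and `toy_score` are hypothetical, and the scoring function is a placeholder for a real discriminative-power computation over the data set.

```python
import random

N_ATTRIBUTES = 40
N_OBSERVATIONS = 50_000_000

# Size of the exhaustive search: every subset of attributes, scored on every observation.
total_vectors = 2 ** N_ATTRIBUTES               # 1,099,511,627,776 subsets
total_work = total_vectors * N_OBSERVATIONS     # about 5.5e19 data points


def sample_attribute_vectors(n_attributes, n_samples, seed=0):
    """Draw distinct, non-empty random attribute subsets (uniform over bit masks)."""
    rng = random.Random(seed)
    masks = set()
    while len(masks) < n_samples:
        mask = rng.getrandbits(n_attributes)
        if mask:  # skip the empty subset
            masks.add(mask)
    # Decode each bit mask into a tuple of attribute indices.
    return [tuple(i for i in range(n_attributes) if (m >> i) & 1) for m in sorted(masks)]


def best_vector(candidates, score):
    """Return the sampled vector with the highest score."""
    return max(candidates, key=score)


def toy_score(vector):
    """Hypothetical stand-in for discriminative power: prefers small subsets with low indices."""
    return -len(vector) - sum(vector) / 100


candidates = sample_attribute_vectors(N_ATTRIBUTES, n_samples=1000)
winner = best_vector(candidates, toy_score)
print(f"{total_work:.2e} data points for exhaustive search")
print("best sampled vector:", winner)
```

Uniform sampling of bit masks is the simplest possible choice; in practice you would probably bias the sampling toward small vectors or build on vectors that already scored well.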

For more general types of problems (computing averages, maxima, parameters, etc.), sampling works well as long as it is done correctly, with proper cross-validation. More on this later.
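As a rough illustration for the simplest case (an average), the sketch below estimates a mean from several independent random samples and reports how much the estimates disagree. This is a basic stability check in the spirit of cross-validation, not a full cross-validation procedure; the function name `sample_mean_with_check` and the synthetic population are assumptions made for the example.

```python
import random
import statistics


def sample_mean_with_check(population, sample_size, n_repeats=5, seed=0):
    """Estimate the mean from n_repeats independent simple random samples.

    Returns (average of the estimates, standard deviation of the estimates).
    A large spread suggests the sample size is too small for the question asked.
    """
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_repeats)
    ]
    return statistics.fmean(estimates), statistics.stdev(estimates)


# Synthetic, skewed population (for illustration only).
data_rng = random.Random(42)
population = [data_rng.expovariate(1 / 100) for _ in range(200_000)]

estimate, spread = sample_mean_with_check(population, sample_size=5_000)
print(f"estimated mean: {estimate:.2f}, spread across samples: {spread:.2f}")
print(f"full-data mean: {statistics.fmean(population):.2f}")
```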

Now if you want to compute highly granular data (e.g. the value of every single home in the US), you might keep all your data. Still, you will need to perform some sound statistical inference for homes with little historical data. More on this later.

In short, you need a statistician involved in almost all these situations, and not just computer scientists; otherwise you will get poor predictive power.


### Replies to This Discussion

On the first point, there is a difference between sampling the model space (which seems to be what you are discussing) and sampling the data. The latter is much more problematic: data is precious.

On the second point, if you are looking for the mean, the typical, or anything like that, you can indeed sample the data, though I am not sure why you would particularly want to these days when 'Big Data' is a well-understood problem.

You can't, however, sample “when you are looking at the whole rather than the parts, and when that whole is a different ‘thing’ from the parts. A nation is more than a group of people, a city more than a bunch of houses. When you have data on people and houses but want to know about nations and cities then you are into Big Data territory”, as we wrote at When Big Data Matters where we also have a practical business example of this and touch on your last point. You might like the little model we use there.

Some other settings where sampling is necessary:

• clinical trials
• testing whether a product (e.g. a soda drink) meets specific criteria (e.g. the proportion of sugar is less than 8%)
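For the soda example, a minimal sketch of what such a test could look like: measure the sugar proportion in a random sample of bottles and run a one-sided test of whether the mean is below 8%. The function name `one_sided_z_test` and the measurements are made up for illustration, and the normal approximation is an assumption; with a small sample, a t-test would be the more careful choice.

```python
import math
import statistics
from statistics import NormalDist


def one_sided_z_test(measurements, threshold):
    """Test H0: mean >= threshold against H1: mean < threshold.

    Uses a normal approximation, so it assumes a reasonably large random sample.
    Returns (sample mean, z statistic, p-value); a small p-value supports H1.
    """
    n = len(measurements)
    mean = statistics.fmean(measurements)
    std_error = statistics.stdev(measurements) / math.sqrt(n)
    z = (mean - threshold) / std_error
    p_value = NormalDist().cdf(z)  # lower-tail probability
    return mean, z, p_value


# Hypothetical sugar proportions measured in 30 sampled bottles.
sugar = [0.070, 0.072, 0.068, 0.071, 0.069] * 6

mean, z, p_value = one_sided_z_test(sugar, threshold=0.08)
print(f"mean = {mean:.4f}, z = {z:.1f}, p-value = {p_value:.3g}")
```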

It's much easier to remove outliers from a sample.

To my mind there are no entirely satisfactory automated methods for identifying outliers, so "manual" inspection is needed.
