A Data Science Central Community
Statistics textbooks focus on centrality (median, mean) and volatility (variance). None mention the third fundamental class of metrics: bumpiness.
Here we introduce the concept of bumpiness and show how it can be used. Two different datasets can have the same mean and variance, but different bumpiness. Bumpiness depends on how the data points are ordered, while centrality and volatility ignore order completely. So bumpiness is useful for datasets where order matters, in particular time series. Bumpiness also captures dependence among the data points, while centrality and variance do not. Note that a time series can have high volatility (high variance) and low bumpiness. The converse is also true.
The attached Excel spreadsheet shows computations of the bumpiness coefficient r for various time series. It is also of interest to readers who wish to learn new Excel concepts such as random number generation with Rand, indirect references with Indirect, Rank, Large and other powerful but not well-known Excel functions. It is also an example of a fully interactive Excel spreadsheet driven by two core parameters.
Finally, this article shows (1) how a new concept is thought of, (2) how a robust, modern definition then materializes, and (3) how a more meaningful definition is eventually created, based on and compatible with previous science.
1. How can bumpiness be defined?
Given a time series, an intuitive, scale-dependent and very robust metric would be the average acute angle measured on all the vertices (see chart below to visualize the concept). This metric is bounded:
This metric is totally insensitive to outliers. It is by all means a modern metric. However, we don't want to reinvent the wheel, so we will define bumpiness using a classical metric that has the same mathematical and theoretical appeal, and the same drawbacks, as the old-fashioned average (to measure centrality) or variance (to measure volatility).
We define the bumpiness as the auto-correlation of lag one, denoted here as r.
Three time series with the same mean, same variance, same values, but different bumpiness
Note that for many time series (for instance, autoregressive processes of order one), the lag-one autocorrelation is the largest of all autocorrelations in absolute value. This does not hold in general, for example for strongly seasonal series. When it does hold, r is the single best indicator of the autocorrelation structure of a time series. It is always between -1 and +1. It is close to 1 for very smooth time series, close to 0 for pure noise, very negative for time series that alternate rapidly between high and low values (for instance, periodic series with a very short period), and close to -1 for time series with huge oscillations from one observation to the next. You can produce an r very close to -1 by ordering pseudo-random deviates as follows: x(1), x(n), x(2), x(n-1), x(3), x(n-2)... where x(k) [k=1, ..., n] represent the order statistics for a set of n points, with x(1)=minimum, x(n)=maximum.
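As a minimal sketch of this definition, the Python code below computes r for three orderings of the same values: sorted (smooth), original random order (noise), and the interleaved order x(1), x(n), x(2), x(n-1)... described above. It assumes r is estimated as the Pearson correlation between consecutive observations, which is one common estimator and not necessarily the exact formula used in the spreadsheet. The function names are illustrative.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def lag1_autocorr(x):
    """Bumpiness r: correlation between x[t] and x[t+1] (assumed estimator)."""
    return pearson(x[:-1], x[1:])

def interleave_extremes(values):
    """Order the values as x(1), x(n), x(2), x(n-1), ... (order statistics)."""
    s = sorted(values)
    out = []
    lo, hi = 0, len(s) - 1
    while lo <= hi:
        out.append(s[lo])
        if lo != hi:
            out.append(s[hi])
        lo += 1
        hi -= 1
    return out

rng = random.Random(0)  # fixed seed so the run is replicable
values = [rng.random() for _ in range(1000)]

smooth = sorted(values)              # same values, increasing order
noise = values                       # same values, random order
bumpy = interleave_extremes(values)  # same values, alternating extremes

r_smooth = lag1_autocorr(smooth)
r_noise = lag1_autocorr(noise)
r_bumpy = lag1_autocorr(bumpy)
print(r_smooth, r_noise, r_bumpy)
```

All three series contain exactly the same values, so they share the same mean and variance; only the ordering, and therefore r, differs.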
A better but more complicated definition would involve all the autocorrelation coefficients embedded in a sum with decaying weights. It would be better in the sense that when the value is 0, it means that the data points are truly independent for most practical purposes.
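One possible form of such a weighted definition is sketched below in Python. The article does not specify the weights or how the coefficients are combined, so the geometric weights, the cutoff at 10 lags, and the use of absolute values (so that positive and negative coefficients cannot cancel) are all assumptions of this sketch. The example series depends only on its value two steps back, so its lag-one autocorrelation is near 0 even though the points are clearly not independent.

```python
import random

def autocorr(x, lag):
    """Pearson correlation between x[t] and x[t+lag] (assumed estimator)."""
    a, b = x[:-lag], x[lag:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def weighted_bumpiness(x, max_lag=10, decay=0.5):
    """Weighted average of |r_k| for k = 1..max_lag, with weights decay**k.

    Absolute values and geometric weights are assumptions of this sketch.
    """
    lags = range(1, max_lag + 1)
    weights = [decay ** k for k in lags]
    total = sum(w * abs(autocorr(x, k)) for k, w in zip(lags, weights))
    return total / sum(weights)

rng = random.Random(1)  # fixed seed so the run is replicable
n = 5000

# Pure noise: independent Gaussian deviates.
noise = [rng.gauss(0, 1) for _ in range(n)]

# Hypothetical series where x[t] depends on x[t-2] but not on x[t-1].
lag2 = [rng.gauss(0, 1), rng.gauss(0, 1)]
for _ in range(n - 2):
    lag2.append(0.8 * lag2[-2] + rng.gauss(0, 1))

r1_lag2 = autocorr(lag2, 1)   # near 0: lag one misses the dependence
r2_lag2 = autocorr(lag2, 2)   # clearly positive
w_noise = weighted_bumpiness(noise)
w_lag2 = weighted_bumpiness(lag2)
print(r1_lag2, r2_lag2, w_noise, w_lag2)
```

Here the weighted value stays near 0 for the noise series but is clearly positive for the lag-two series, which r alone would have classified as neutral.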
2. About the Excel spreadsheet
Click here to download the spreadsheet. It contains a base (smooth, r>0) time series in column G, and four other time series derived from the base time series:
Two core parameters can be fine-tuned: cells N1 and O1. Note that r can be positive even if the time series is trending down: r does not represent the trend. A metric that does measure trend is the correlation with time (also computed in the spreadsheet).
The creation of a neutral time series (r=0), based on a given set of data points (that is, preserving average, variance and indeed all values), is performed by re-shuffling the original values (column G) in a random order. It uses the pseudo-random permutation in column B, itself created from random deviates generated with RAND and ranked with the RANK Excel formula. The theoretical framework is based on the Analyticbridge Second Theorem:
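The distinction between r and trend can be illustrated with a short Python sketch. The series below is hypothetical (a downward line plus Gaussian noise, not the spreadsheet data), and r is again assumed to be the Pearson correlation between consecutive observations.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(2)  # fixed seed so the run is replicable
n = 500
time = list(range(n))

# Hypothetical series: steady downward trend plus noise.
series = [-0.1 * t + rng.gauss(0, 1) for t in time]

r = pearson(series[:-1], series[1:])  # bumpiness (lag-one autocorrelation)
trend = pearson(time, series)         # correlation with time
print(r, trend)
```

For this series r is strongly positive (the series is smooth) while the correlation with time is strongly negative (the series is trending down).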
Analyticbridge Second Theorem
A random permutation of non-independent numbers constitutes a sequence of independent numbers.
This is not a real theorem per se; however, it is a rather intuitive and easy way to explain the underlying concept. In short, the more data points, the more the re-shuffled series (using a random permutation) looks like random numbers (with a pre-specified, typically non-uniform statistical distribution), no matter what the original numbers are. It is also easy to check the statement empirically by computing a bunch of statistics on simulated re-shuffled data: all these statistics (e.g. auto-correlations) will be consistent with the re-shuffled values being (asymptotically) independent from each other.
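The re-shuffling step can be sketched in Python as follows, in the spirit of the RAND and RANK approach: draw one random deviate per data point and use the ordering of the deviates as a random permutation. The smooth base series (a sine wave) and the function names are assumptions made for illustration; they are not the spreadsheet's data or formulas.

```python
import math
import random

def lag1_autocorr(x):
    """Pearson correlation between x[t] and x[t+1] (assumed estimator of r)."""
    a, b = x[:-1], x[1:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def shuffle_by_rank(values, rng):
    """Reorder values using the ordering of random deviates (like RAND + RANK)."""
    deviates = [rng.random() for _ in values]
    # order[k] is the index of the k-th smallest deviate: a random permutation.
    order = sorted(range(len(values)), key=lambda i: deviates[i])
    return [values[i] for i in order]

rng = random.Random(3)  # fixed seed, so the permutation is "frozen"

# Hypothetical smooth base series: five periods of a sine wave.
base = [math.sin(2 * math.pi * t / 200) for t in range(1000)]
shuffled = shuffle_by_rank(base, rng)

r_base = lag1_autocorr(base)
r_shuffled = lag1_autocorr(shuffled)
print(r_base, r_shuffled)
```

The shuffled series keeps every original value, and hence the same mean and variance, while its r drops from nearly 1 to approximately 0. Seeding the generator plays the same role as freezing the values in column B.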
For those interested, click here to check out the first analyticbridge theorem.
Note that Excel has numerous issues. In particular, its random number generator is terrible, and values get re-computed each time you update the spreadsheet, making the results non-replicable (unless you "freeze" the values in column B).
3. Uses of the bumpiness coefficients
Economic time series should always be studied by detecting bumpiness in the first place, separating periods with high and low bumpiness, and understanding the mechanisms that create bumpiness. In some cases, the bumpiness might be too small to be noticed with the naked eye, but statistical tools should be able to detect it.
Another application is in high-frequency trading. Stocks with highly negative bumpiness in price (over short time windows) are perfect candidates for statistical trading, as they offer controlled, exploitable volatility - unlike a bumpiness close to zero, which corresponds to uncontrolled volatility (pure noise). And of course, stocks with highly positive bumpiness don't exist anymore. They did 30 years ago: they were the bread and butter of investors who kept a stock or index forever and saw it automatically grow year after year.
Generalization: How do you generalize this definition to higher dimensions, for instance to spatial processes? You could have a notion of directional bumpiness (North-South or East-West). Potential application: flight path optimization in real time to avoid seriously bumpy air (that is, highly negative wind speed and direction bumpiness).
A final word on statistics textbooks. All introductory textbooks mention centrality and volatility. None mention bumpiness. Even textbooks as thick as 800 pages will not mention bumpiness. The most advanced ones discuss generating functions and asymptotic theorems in detail, but the basic concept of bumpiness is beyond the scope of elementary statistics, according to these books and traditional statistics curricula. This is one of the reasons we have written our own book and created our modern data science apprenticeship, to offer more modern, practical training.
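One simple way to separate periods of high and low bumpiness is to compute r over a sliding window. The Python sketch below is only an illustration under stated assumptions: the window length of 100, the simulated series (smooth in its first half, rapidly alternating in its second half) and the function names are all hypothetical.

```python
import random

def lag1_autocorr(x):
    """Pearson correlation between x[t] and x[t+1] (assumed estimator of r)."""
    a, b = x[:-1], x[1:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def rolling_bumpiness(x, window):
    """r computed on every window x[s : s + window]."""
    return [lag1_autocorr(x[s:s + window]) for s in range(len(x) - window + 1)]

rng = random.Random(4)  # fixed seed so the run is replicable

# Hypothetical series: positively dependent for t < 500, alternating afterwards.
series = [0.0]
for t in range(1, 1000):
    phi = 0.9 if t < 500 else -0.9
    series.append(phi * series[-1] + rng.gauss(0, 1))

window = 100
rolling = rolling_bumpiness(series, window)

smooth_part = rolling[:401]  # windows starting at 0..400 lie entirely in t < 500
bumpy_part = rolling[500:]   # windows starting at 500..900 lie entirely in t >= 500
mean_smooth = sum(smooth_part) / len(smooth_part)
mean_bumpy = sum(bumpy_part) / len(bumpy_part)
print(mean_smooth, mean_bumpy)
```

The rolling value of r switches from strongly positive to strongly negative around the regime change, which is the kind of period separation suggested above. The window length trades off responsiveness against estimation noise.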
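One possible reading of directional bumpiness, sketched below in Python, is to compute the lag-one correlation separately between horizontally adjacent cells (East-West) and vertically adjacent cells (North-South) of a grid. This is an assumption made for illustration, not a definition taken from the article; the grid is synthetic and the function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def directional_bumpiness(grid):
    """Return (East-West r, North-South r) for a rectangular grid of values."""
    rows, cols = len(grid), len(grid[0])
    # Pairs of horizontally adjacent cells (same row, columns j and j+1).
    ew_a = [grid[i][j] for i in range(rows) for j in range(cols - 1)]
    ew_b = [grid[i][j + 1] for i in range(rows) for j in range(cols - 1)]
    # Pairs of vertically adjacent cells (same column, rows i and i+1).
    ns_a = [grid[i][j] for i in range(rows - 1) for j in range(cols)]
    ns_b = [grid[i + 1][j] for i in range(rows - 1) for j in range(cols)]
    return pearson(ew_a, ew_b), pearson(ns_a, ns_b)

# Synthetic grid: smooth along each row, alternating sign from row to row.
grid = [[0.2 * math.sin(j / 10) + (-1) ** i for j in range(50)]
        for i in range(40)]

r_east_west, r_north_south = directional_bumpiness(grid)
print(r_east_west, r_north_south)
```

For this grid the East-West value is close to 1 (smooth) and the North-South value is strongly negative (bumpy), so the two directions are clearly distinguished.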
Comment
It's an issue we have been dealing with in data visualisation for some time. When tracking performance over time, obviously data with 'bumpy' characteristics cannot be represented using normal trend lines. This might be sales that are irregular in timing or in value [a non-uniform statistical distribution], or it might be trying to isolate out the normal variance, what you might refer to as 'normal bumpiness' of data, so that we do not react to minor fluctuations. Having a standard metric that can be applied consistently to these different kinds of data scenarios will help to normalise the data in a way that provides more reliable insight. Being able to combine this with regression analysis to identify the factors driving the bumpiness, without it being misinterpreted as a major variate at one time and then not at another, would be helpful.
Also, Bill Luker Jr posted the following comment:
From a practitioner's perspective, is it a measure of noise, a detector of outliers that may show up as unaccounted-for noise, arising from the way, say, a process is producing the data (even a data-entry process), or from some other force or process/system giving rise to that particular noise? Yes?
So the thing would be to try various tactics for reducing bumpiness, maybe by screening those outliers, etc., and even running a TSA on the residuals after factoring or "partialing out" the bumpiness.
But isn't that part of whitening? An SOP in Box-Jenkins Analysis or old ARIMA models?
Help me understand this.
Thanks
Bill Luker
My answer:
Noise will produce moderate bumpiness. Strong bumpiness is caused by external forces that create negative correlations between observations at time t and time t+1 (or t+2, t+3, etc.). You can have outliers and low bumpiness, or the other way around. Perfectly periodic time series with a very short period are very bumpy (according to my definition of bumpiness), although they obviously have no outliers.
Instead of bumpiness, maybe we should be concerned about lumpiness. In many time series, events occur in spurts. A classic example is the "hot hand" in basketball.
Recent work has developed measures of lumpiness in time series, where a value of zero indicates equal spacing of events over time, and a value near one indicates the presence of the "hot hand" or spurts of events (like purchases or usage).
Hi Dan,
Yes, columns D, J and L are auxiliary columns with very simple patterns; I created them as static values rather than formulas:
Thanks,
Vincent
Hi Vincent,
A very interesting concept here. I'm working through your spreadsheet. How are you calculating the bumpy ranking (column D as well as J and L)? It appears these are static data from the original concept tab.
D.
Precision: