# AnalyticBridge

# Simple technique to improve poor predictive models

This technique does not exploit the original data used to produce the model, but just the predicted and observed values, and nothing else. It was initially designed in the context of time series, to improve daily weather forecasts or daily stock trading signals.

The enhanced model in the chart below is an example of improvement (higher ROI) obtained in the context of trading strategies:

Here's how it works:

Principle:

A good time series model produces estimates where the error between the observed value and the forecast is essentially a white noise process, with little or no auto-correlation. If strong auto-correlation or other dependence patterns are found in the time series of residual errors, then the predictive model can be enhanced.
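As a rough illustration of this principle, here is a minimal sketch that checks residuals for auto-correlation. It assumes numeric observed and predicted values (for categorical forecasts, a 0/1 error indicator could be used instead). The function names, the `max_lag` parameter, and the 2/sqrt(n) threshold (the usual approximate 95% band for white noise) are illustrative choices, not part of the original technique.

```python
def autocorrelation(residuals, lag):
    """Sample auto-correlation of a numeric sequence at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    var = sum((r - mean) ** 2 for r in residuals)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum(
        (residuals[t] - mean) * (residuals[t + lag] - mean)
        for t in range(n - lag)
    )
    return cov / var


def has_residual_structure(observed, predicted, max_lag=4):
    """True if any lag up to max_lag exceeds the approximate white-noise band."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    # Assumption: 2 / sqrt(n) as an approximate 95% band under white noise.
    threshold = 2 / len(residuals) ** 0.5
    return any(
        abs(autocorrelation(residuals, lag)) > threshold
        for lag in range(1, max_lag + 1)
    )


# Residuals that alternate in sign are strongly auto-correlated at lag 1.
observed = [1, 0] * 50
predicted = [0.5] * 100
print(has_residual_structure(observed, predicted))
```

If the check flags structure in the residuals, the base model is a candidate for the enhancement described below.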

Practical example:

For simplicity, let's use daily weather forecasts. Assume that the forecast (for a specific location) can take any of the following values: Sunny (S), Cloudy (C), Rain (R), Other (O). Let's define a path as any sequence of consecutive daily forecasts.

The length of a path is the number of days. For instance, S->R->R->R is a path of length 4. If you check all sequences S->R->R->R and find that, on average, the last prediction in that path is more often wrong than right, and that C (Cloudy) would be a better predictor than R (Rain) for the most recent day, then the enhanced model simply consists of replacing the last R prediction from the base model with a C prediction in all S->R->R->R paths. In short, Enhanced(S->R->R->R) = S->R->R->C.

Apply the same strategy for all paths of length 4 (paths with enough occurrences) and you get a better predictive model.
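The idea can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function names and the `min_count` parameter are hypothetical, the replacement rule is a simple majority count rather than a formal significance test, and the rules are learned and applied on the same data, so cross-validation would still be needed in practice.

```python
from collections import Counter, defaultdict


def learn_replacements(predicted, observed, length=4, min_count=20):
    """For each frequent forecast path, find the observed value that most
    often occurred on the path's last day, if it beats the base forecast."""
    outcomes = defaultdict(Counter)
    for end in range(length - 1, len(predicted)):
        path = tuple(predicted[end - length + 1 : end + 1])
        outcomes[path][observed[end]] += 1

    replacements = {}
    for path, counts in outcomes.items():
        if sum(counts.values()) < min_count:
            continue  # too few occurrences to trust
        best, best_hits = counts.most_common(1)[0]
        base_hits = counts[path[-1]]  # times the base forecast was right
        if best_hits > base_hits:
            replacements[path] = best
    return replacements


def enhance(predicted, replacements, length=4):
    """Replace the last forecast of every path that has a learned rule.
    Paths are always read from the base forecasts, not the enhanced ones."""
    enhanced = list(predicted)
    for end in range(length - 1, len(predicted)):
        path = tuple(predicted[end - length + 1 : end + 1])
        if path in replacements:
            enhanced[end] = replacements[path]
    return enhanced


# Toy data: the base model keeps forecasting S, R, R, R,
# but the fourth day always turns out cloudy.
predicted = list("SRRR") * 30
observed = list("SRRC") * 30
rules = learn_replacements(predicted, observed)
print(rules)
print(enhance(predicted, rules) == observed)
```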

Implementation tips:

1. Examine all paths of length 1, 2, 3 or 4 with at least 20 occurrences over the last 12 months. Use the enhanced strategy only for paths where the base strategy's prediction for the most recent day is significantly worse than the prediction from the enhanced strategy. Cross-validate the enhanced model.
2. Good cross-validation is required, as with all predictive models. Avoiding over-fitting is as straightforward as with any predictive technique: focus on paths with large volume AND large residual error. They represent 80% of the volume, yet less than 5% of all paths.
3. Bin your data if it is not categorical. You'll then find that 80% of the volume is contained in 5% of the paths. Focus on these 5%.
4. Binning the predicted values significantly increases the number of repeating sequences and reduces the total number of unique sequences. Extreme binning, resulting in binary forecasts, substantially reduces the number of different paths: you will have no more than 30 different paths of length less than or equal to 4 (2 + 4 + 8 + 16 = 30). If you have 600 days' worth of forecasts and only 30 possible paths, path redundancy will be huge, with each path having on average 20 clones.
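To make the path-counting tips concrete, here is a small sketch that counts how many distinct paths are possible for a given number of categories and tallies how often each path occurs in a forecast history. The helper names are hypothetical, and occurrences are counted with overlapping windows, which is an assumption on my part.

```python
from collections import Counter


def count_possible_paths(n_categories, max_length):
    """Number of distinct paths of length 1..max_length."""
    return sum(n_categories ** k for k in range(1, max_length + 1))


def path_counts(forecasts, max_length=4):
    """Occurrences of every path of length 1..max_length (overlapping windows)."""
    counts = Counter()
    for length in range(1, max_length + 1):
        for start in range(len(forecasts) - length + 1):
            counts[tuple(forecasts[start : start + length])] += 1
    return counts


def frequent_paths(forecasts, max_length=4, min_count=20):
    """Keep only paths with at least min_count occurrences."""
    counts = path_counts(forecasts, max_length)
    return {path: c for path, c in counts.items() if c >= min_count}


print(count_possible_paths(2, 4))  # binary forecasts: 2 + 4 + 8 + 16
print(count_possible_paths(4, 4))  # four categories (S, C, R, O)
print(len(frequent_paths(list("SRRR") * 30)))
```

With four categories instead of two, the number of possible paths grows quickly, which is why binning helps concentrate volume in a small number of paths.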

Over-fitting, in this context, has nothing to do with the underlying data: it has to do with enhancing paths where the number of occurrences is too small to have statistical significance.
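One possible way to guard against enhancing paths with too few occurrences is a one-sided exact binomial test on how often the base forecast was wrong on a given path. This is a sketch of one option, not necessarily the test the author had in mind; the function names, the 50% null error rate ("more often wrong than right") and the `alpha` level are assumptions, and no correction for testing many paths is included.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_significantly_wrong(n_wrong, n_total, alpha=0.05):
    """True if the base forecast is wrong more often than chance (p = 0.5)
    would explain at level alpha, for one path."""
    return binomial_tail(n_wrong, n_total) < alpha


# 3 errors out of 4 looks bad, but 4 occurrences are too few to conclude anything.
print(is_significantly_wrong(3, 4))
# 15 errors out of 20 is much stronger evidence.
print(is_significantly_wrong(15, 20))
```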

Related keywords: runs, time series, Markov chains, residual error, auto-correlations

Comment by Ralph Winters on November 3, 2012 at 8:16am

Vincent, I'm not sure what the statistical basis behind this method is. Would you advocate jumbling all of the coefficients of a bad linear regression model until it led to an improved R-squared?

-Ralph Winters

Comment by Eli Y. Kling on November 1, 2012 at 3:16am

Very interesting concept - I'll have to try it.

One thing you might sharpen in the above text is that you are discussing sequences of predictions (or modeled sequences).

I am not sure cross-validation is that straightforward to implement.

Just a thought on using this general idea in non-time-series situations: do a quick CHAID on the prediction errors on a test or validation set to evaluate the conditional bias drivers (if they exist).
