
What causes predictive models to fail - and how to fix them?

  • Over-fitting. If you perform a regression with 200 predictors (with strong cross-correlations among them), use meta-regression coefficients: that is, coefficients of the form f[Corr(Var, Response), a, b, c], where a, b, c are three meta-parameters (e.g. priors in a Bayesian framework). This reduces your number of parameters from 200 to 3 and eliminates most of the over-fitting.
  • Perform the right type of cross-validation. If your training set has 400,000 observations spread across 50 clients, but your test set (used for cross-validation) has 200,000 observations from only 3 clients or 5 days' worth of historical data, then your cross-validation methodology is badly flawed. Better: split your cross-validation data set into 5 subsets to compute confidence intervals. Do smart sampling.
  • Messy data. Make sure you've eliminated outliers and cleaned your data set. Use alternate (external) data sets to better control and reconcile data.
  • Data maintenance. When did you last update this lookup table? Five years ago? Time to do maintenance checks!
  • Use robust, data-driven procedures. Steer clear of normal distributions and simplistic models such as naive Bayes.
  • Poor design of experiment. Usually a sampling issue.
  • Confusing cause and effect, or ignoring hidden variables that actually explain unexpected correlations (e.g. my age is correlated with oil prices, but it does not cause them to rise; the real driver is inflation, which is correlated with both age and oil prices).
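The cross-validation bullet above boils down to splitting by entity rather than by row: if the training data covers 50 clients, validate on clients the model never saw. A minimal sketch of such a group-aware split in pure Python (the `group_kfold` helper and the toy client IDs are illustrative, not from the post):

```python
import random
from collections import defaultdict

def group_kfold(client_ids, n_splits=5, seed=0):
    """Yield (train, test) index lists such that every observation from a
    given client falls entirely inside one fold: the model is validated
    on clients it has never seen, not on a random subset of rows."""
    groups = defaultdict(list)
    for idx, cid in enumerate(client_ids):
        groups[cid].append(idx)
    clients = sorted(groups)
    random.Random(seed).shuffle(clients)
    folds = [[] for _ in range(n_splits)]
    for i, cid in enumerate(clients):          # deal clients round-robin into folds
        folds[i % n_splits].extend(groups[cid])
    for k in range(n_splits):
        test = set(folds[k])
        train = [i for i in range(len(client_ids)) if i not in test]
        yield train, sorted(test)

# Toy example: 12 observations across 6 clients, 3 folds.
ids = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"]
for train, test in group_kfold(ids, n_splits=3):
    print(len(train), "train rows,", len(test), "test rows")   # 8 and 4 each time
```

Computing the evaluation metric once per fold also gives the 5 subsets needed for the confidence intervals mentioned above.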






Comment by Keith Schleicher on July 19, 2011 at 3:02pm

Suggestions to avoid failure:

--Determine what measure(s) you will use to determine if your model solves the problem it is supposed to address

--Know thy data, especially why data is missing.  

--Have both a holdout sample (to ensure your algorithm doesn't overfit the data) and an out of time period sample to guard against the issue mentioned earlier

--Understand the data chronology (time is a commonly omitted variable that often needs to be accounted for)

--After constructing your model, calculate the residuals (actual-predicted) and re-model them using a different approach (preferably data driven) to see where the model doesn't fit well.
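The last suggestion - re-modeling the residuals with a different approach - can be sketched in a few lines of pure Python (the quadratic toy data and the single-split "stump" are illustrative choices, not from the comment):

```python
# Fit a straight line to curved data, then model the residuals with a
# different, data-driven approach (here a one-split regression stump)
# to locate where the original model fits poorly.
xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]                       # true relationship is quadratic

# Ordinary least squares for y = a + b*x, in closed form.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
resid = [y - (a + b * x) for x, y in zip(xs, ys)]   # actual minus predicted

def sse(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

# Best single split point on x that explains the residual structure.
best = min(range(1, n), key=lambda k: sse(resid[:k]) + sse(resid[k:]))
print("linear fit: a =", round(a, 1), "b =", round(b, 1))   # a = -57.0, b = 19.0
print("residual structure changes near x =", xs[best])
```

If the residual model finds strong structure, as it does here, the original model is mis-specified in that region.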

Comment by Nathan Och on June 6, 2011 at 12:53pm
Can the use of highly exogenous variables, or variables at the macro level, also degrade a model? I work mainly with micro-level data (specific to a person/account), and I find that introducing macro variables can sometimes cause the model to attribute impact to variables that actually have none per se, just a linear trend. Because these macro variables are on such a larger scale, it seems correlation, rather than causation, is found.
Comment by Jozo Kovac on June 5, 2011 at 3:12pm
... plant Random Forests :)
Comment by Ralph Winters on June 1, 2011 at 1:09pm

Assuming the wrong distribution, or the correct distribution but with the wrong parameters.  Consider the plight of Long-Term Capital Management, which assumed the correct distributional form ("normal") but got the variance (<9%) wrong.


-Ralph Winters

Comment by Name Withheld on May 30, 2011 at 8:24am

Some interesting points in here... Regarding Vincent's comment on random noise causing large shifts in parameter estimates, and this being one way to tell if you are overfitting; I often wonder if one could implement some sort of constraint in the optimisation algorithms so that the model stops learning once it starts to model the noise.


I think this is similar to what is explained in Ye's "On Measuring and Correcting the Effects of Data Mining and Model Selection" perhaps? I've not seen this approach implemented on an automatic sort of basis, but it seems like it could be a powerful way to keep the algorithm fitting just the signal rather than the noise as well. I might try and do something like this in R once I've worked out what I'm doing with that a bit better!
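One rough approximation of such a constraint is early stopping against a held-out set: halt the optimisation once held-out error stops improving, on the theory that any further fitting is chasing noise. A minimal sketch in Python rather than R (the degree-5 polynomial, learning rate, and patience threshold are all arbitrary choices for illustration):

```python
import random

# Over-flexible model (degree-5 polynomial) trained by full-batch gradient
# descent on noisy linear data; training halts once the held-out error
# stops improving, i.e. once further fitting would mostly chase noise.
rng = random.Random(42)
data = [(i / 60.0, 2 * i / 60.0 + rng.gauss(0, 0.3)) for i in range(60)]
rng.shuffle(data)
train, valid = data[:40], data[40:]

def predict(w, x):
    return sum(c * x ** p for p, c in enumerate(w))

def mse(w, pts):
    return sum((predict(w, x) - y) ** 2 for x, y in pts) / len(pts)

w = [0.0] * 6                                  # 6 coefficients: too flexible
lr, best, patience, bad = 0.1, float("inf"), 25, 0
for step in range(3000):
    grads = [0.0] * len(w)
    for x, y in train:
        err = predict(w, x) - y
        for p in range(len(w)):
            grads[p] += 2 * err * x ** p / len(train)
    for p in range(len(w)):
        w[p] -= lr * grads[p]
    v = mse(w, valid)
    if v < best - 1e-7:
        best, bad = v, 0
    else:
        bad += 1
        if bad >= patience:                    # held-out error has flatlined
            break
print("stopped after", step + 1, "steps; held-out MSE", round(best, 3))
```

Gradient boosting and neural network libraries expose the same idea as a built-in option, usually under names like "early stopping" with a patience parameter.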

Comment by Richard Boire on May 29, 2011 at 12:09pm

Your comments and Edmund's really hit the nail on the head when it comes to model failure, which is mostly about having proper validations. In our business, we do multiple validations:

1) 50/50, where the analytical file is split 50% into development and 50% into validation.

2) Out-of-time validation, where we sample the same population but in a period after the one used for model development.

3) Out-of-time validation, where we sample the same population but in a period before the one used for model development.

4) We will also try multiple model versions to see which one has the greatest stability across the above three validation options.
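Validations 1) through 3) amount to three deterministic slices of the analytical file. A sketch in Python with a toy daily transaction table (the field names and the development window are made up for illustration):

```python
import random
from datetime import date, timedelta

# Toy analytical file: one record per day over 2010-2011.
records = [{"day": date(2010, 1, 1) + timedelta(days=i), "amount": i % 7}
           for i in range(730)]

dev_start, dev_end = date(2010, 4, 1), date(2011, 3, 31)   # development window
dev = [r for r in records if dev_start <= r["day"] <= dev_end]

# 1) 50/50: split the development window into build and validation halves.
random.Random(1).shuffle(dev)
build, holdout = dev[:len(dev) // 2], dev[len(dev) // 2:]

# 2) Out of time, after: same population, period after development.
oot_after = [r for r in records if r["day"] > dev_end]

# 3) Out of time, before: same population, period before development.
oot_before = [r for r in records if r["day"] < dev_start]

print(len(build), len(holdout), len(oot_after), len(oot_before))   # 182 183 275 90
```

For point 4), score each candidate model on `holdout`, `oot_after` and `oot_before`, and keep the version whose performance is most stable across the three.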

Comment by Dr. Vincent Granville on May 29, 2011 at 11:26am

You also know that your model has problems if, when you introduce small random noise into your data, your parameter estimates vary wildly (whether or not your forecasted values are stable). 

When this happens, it means your parameter estimates are very sensitive to your dataset, and this high sensitivity means over-fitting is taking place. This routinely happens in large decision trees with hundreds of nodes, and in logistic / linear regression with a large number of correlated predictors.
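That sensitivity check is easy to run yourself: add small noise to the response, refit, and compare the coefficients. A sketch in pure Python with two nearly collinear predictors (the `ols2` helper and all the constants are illustrative, not from the comment):

```python
import random

rng = random.Random(7)
# Two almost-identical predictors: the classic recipe for unstable estimates.
x1 = [rng.random() for _ in range(100)]
x2 = [v + rng.gauss(0, 0.01) for v in x1]      # corr(x1, x2) is nearly 1
y = [3 * v + rng.gauss(0, 0.1) for v in x1]

def ols2(x1, x2, y):
    """Solve the 2x2 normal equations for y ~ b1*x1 + b2*x2 (no intercept)."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * v for a, v in zip(x1, y))
    t2 = sum(b * v for b, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det

b1, b2 = ols2(x1, x2, y)
# Perturb the response with noise no larger than its existing error term, refit.
y_jit = [v + rng.gauss(0, 0.1) for v in y]
j1, j2 = ols2(x1, x2, y_jit)

print("coefficient shifts:", round(abs(b1 - j1), 2), round(abs(b2 - j2), 2))
# Fitted values typically barely move even while the coefficients swing:
drift = max(abs((b1 - j1) * a + (b2 - j2) * b) for a, b in zip(x1, x2))
print("max shift in fitted values:", round(drift, 3))
```

The combined slope b1 + b2 stays close to the true value of 3 and the predictions are stable, which is exactly the "stable forecasts, unstable parameters" signature of over-fitting described above.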

See also my comment posted on LinkedIn to reduce risks of model failure:

  • Picking the right compound metrics out of raw variables (e.g. type of IP address - corporate, proxy, static, blacklisted, etc. - rather than the raw IP address, to feed a decision tree for spam detection)
  • Picking the right data set(s) in the first place, which sometimes means identifying, evaluating and obtaining external data (such as a list of blacklisted IP addresses in the above example)
Comment by Edmund Freeman on May 28, 2011 at 10:10pm
The usual trouble I've run into is that my model fits my data set fine, but there is some aspect of the data that is unique to that particular data set and doesn't hold up when I actually deploy the model. I've found a couple of fixes: 1) if possible, use data from multiple time periods; 2) make sure I really understand what the model is doing and that the meaning of the relevant data sources is stable.

© 2015 AnalyticBridge, a subsidiary and dedicated channel of Data Science Central LLC
