Data Intelligence, Business Analytics
Hi guys. I am working on a logistic model. In out-of-sample validation, it detected 80% of the defaulters. Next I tried out-of-time validation, and to my dismay the accuracy (percent detection) dropped to 33%. I am disappointed and wondering what could have happened. I have profiled both populations and found differences in the distributions of a few categorical variables.
Please share your ideas on what can be done to improve accuracy on the out-of-time dataset, or on what could have gone wrong. If I have to defend this before the client, what justification can I give for the drop in accuracy?
Thanks,
Ayush
Comment by Ayush Biyani on June 8, 2011 at 3:20am
Comment by Jozo Kovac on June 5, 2011 at 3:08pm Ayush, remove the predictors whose distributions differ between months and re-train the model. Your performance will decrease, but the predictions will be more stable and more useful.
Another approach is to train your model on the union of both time periods. But the first one is better.
Hope you've defended well :) Btw, let them explain the differences between the periods; maybe you'll find a better solution then.
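One way to decide which predictors "have different distributions" is a population stability index (PSI) per variable, comparing the development sample with the out-of-time sample. The sketch below is only an illustration: the variable names, the toy data, and the 0.25 cut-off (a commonly quoted rule of thumb, not something stated in this thread) are all assumptions.

```python
import math
from collections import Counter

def psi(expected, actual, eps=1e-6):
    """Population Stability Index for one categorical variable.

    expected: category values in the development sample
    actual:   category values in the out-of-time sample
    eps:      floor on proportions so empty categories do not cause log(0)
    """
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for cat in set(expected) | set(actual):
        e = max(e_counts[cat] / len(expected), eps)
        a = max(a_counts[cat] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical data: two categorical predictors in each sample.
dev = {
    "region": ["N"] * 50 + ["S"] * 50,
    "channel": ["web"] * 70 + ["branch"] * 30,
}
oot = {
    "region": ["N"] * 52 + ["S"] * 48,
    "channel": ["web"] * 30 + ["branch"] * 70,
}

PSI_THRESHOLD = 0.25  # assumed rule-of-thumb cut-off for a large shift

psi_by_var = {name: psi(dev[name], oot[name]) for name in dev}
unstable = [name for name, score in psi_by_var.items() if score > PSI_THRESHOLD]

for name, score in psi_by_var.items():
    print(f"{name}: PSI = {score:.4f}")
print("candidates to drop before re-training:", unstable)
```

In this toy data only `channel` is flagged; `region` barely moves and would stay in the model.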
Comment by Ralph Winters on June 3, 2011 at 9:34am Hari. I would consider using binary time series for a straight logistic regression problem. Alternatively, you can try looking at Cox regression, which uses a survival (hazard function) model instead of a logit model.
-Ralph Winters
Comment by Miles Garnsey on June 3, 2011 at 6:27am I'm by no means an expert on this sort of thing, but surely defaulters' behaviour is going to change with the economic situation to some degree. Was your training data from before the GFC, for example, while your testing data was from after?
Alternatively, was the data from different times of the year? There may be seasonal aspects to some of that behaviour. This has got to be a hard area to work in at the moment given the turbulence of the housing market in a lot of places around the world. Where does the data come from and what are the two periods if you don't mind me asking?
Comment by Ayush Biyani on June 2, 2011 at 10:06pm
Comment by Hariharan Sunder on June 2, 2011 at 5:48am Ayush,
Does your modeling sample consist of data collected from only one period in time? If so, the model may not necessarily hold good on an out-of-time sample. I think your modeling sample needs to be much more random so as to include the effects of different time periods.
Ralph,
Is there any way to include time-series modeling in logistic regression? If so, could you throw some light on it?
Thanks,
Hari
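The point about sampling across time periods can be sketched as follows: build the modeling sample from several periods, hold out the most recent one as the out-of-time sample, and track percent detection (the share of actual defaulters the model flags) period by period. Everything here is hypothetical, including the field names `period`, `default`, and `flagged` and the toy records; it assumes the model's flags have already been computed.

```python
from collections import defaultdict

def out_of_time_split(records, n_holdout_periods=1):
    """Train on the earlier periods, hold out the latest one(s)."""
    periods = sorted({r["period"] for r in records})
    holdout = set(periods[-n_holdout_periods:])
    train = [r for r in records if r["period"] not in holdout]
    test = [r for r in records if r["period"] in holdout]
    return train, test

def detection_rate_by_period(records):
    """Share of actual defaulters flagged by the model, per period."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        if r["default"] == 1:
            totals[r["period"]] += 1
            hits[r["period"]] += r["flagged"]
    return {p: hits[p] / totals[p] for p in sorted(totals)}

# Hypothetical scored records: (period, actual default, model flag).
raw = [
    ("2010-Q1", 1, 1), ("2010-Q1", 1, 1), ("2010-Q1", 0, 0),
    ("2010-Q2", 1, 1), ("2010-Q2", 1, 0),
    ("2010-Q3", 1, 1), ("2010-Q3", 1, 0), ("2010-Q3", 1, 0),
    ("2010-Q3", 1, 0), ("2010-Q3", 0, 1),
]
records = [{"period": p, "default": d, "flagged": f} for p, d, f in raw]

train, test = out_of_time_split(records)
rates = detection_rate_by_period(records)

print("training periods:", sorted({r["period"] for r in train}))
print("out-of-time periods:", sorted({r["period"] for r in test}))
for period, rate in rates.items():
    print(f"{period}: percent detection = {rate:.0%}")
```

A steady decline across periods, rather than a single sudden drop, would suggest gradual drift in the population, which is easier to explain to a client than an unexplained failure.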
Comment by Ralph Winters on June 1, 2011 at 2:56pm From what you describe, time can be a factor and you need to look into a different modeling methodology, especially if you are seeing some of the categorical variables change over time. Try looking into time-series cross-sectional modeling.
-Ralph Winters
© 2013 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC