how to choose predictive variables in my time series regression model

Hi,
I have to forecast daily sales for my company and I have a list of 300+ potential variables that can be predictive of the daily sales. How do I decide which one to include in my time series regression model? Do I have to go through steps like prewhitening and cross-correlation function for each of them? How do I check multicollinearity amont these 300+ explanatory variable? Thanks.

Replies to This Discussion

How far back does your data go? If you only have one or two years' worth of data, you could run into overfitting problems using purely data-driven methods.

That being said, cross-correlation is a good way to filter out variables that likely won't be very predictive (provided you compute the cross-correlation for each variable at all the lags and transforms you think make sense; a rough sketch of that screen is at the end of this reply).

Using domain expertise and limiting your selections and tests to variables (and transforms) that make sense is a good way to reduce overfitting.
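To make that lag-by-lag correlation screen concrete, here is a rough Python sketch (the thread is SAS-oriented, but the idea is the same; the DataFrame df, the "sales" column name, the 14-lag window and the 0.3 cutoff are all placeholders to adjust):

import pandas as pd

def lagged_correlations(df, target="sales", max_lag=14):
    # Correlation of the target with each candidate variable at lags 0..max_lag.
    candidates = [c for c in df.columns if c != target]
    rows = []
    for col in candidates:
        for lag in range(max_lag + 1):
            r = df[target].corr(df[col].shift(lag))
            rows.append({"variable": col, "lag": lag, "corr": r})
    return pd.DataFrame(rows).sort_values("corr", key=abs, ascending=False)

# Keep only variables whose best lag shows a non-trivial correlation, e.g. |r| > 0.3;
# the cutoff is arbitrary and should be sanity-checked against domain knowledge.

You can run the same screen on transformed versions of the candidates (logs, differences, moving averages) before deciding what to keep.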
Well, a lot of it depends on how you do the entire analysis. For choosing your variables, I believe you should be able to filter out the most important/contributing ones through factor analysis or OLS regression. That gives you a rough idea of which variables are good for the model to be built.

Second, with business knowledge, I think you should be able to narrow down the list of variables further. Finally, I think you will have to dive into the time series modelling itself.

Prewhitening is the process of making a series stationary before using it in the model. You may need to check the stationarity of your predictor variables for this.
Cross-correlation is a slightly tougher issue to solve. I believe that if you are using SAS, some procedures like PROC STATESPACE take care of multicollinearity. Otherwise, you may have to deal with the cross-correlation first using PROC CORR and make sure correlated variables don't both end up in the model!
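If it helps, here is a rough Python sketch of the stationarity check and prewhitening ideas (statsmodels stands in for the SAS procedures above; x and y are assumed to be pandas Series, and the AR(1) filter order is just an assumption you would tune):

from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.stattools import adfuller, ccf

def is_stationary(series, alpha=0.05):
    # Augmented Dickey-Fuller test: rejecting the unit-root null suggests stationarity.
    pvalue = adfuller(series.dropna())[1]
    return pvalue < alpha

def prewhitened_ccf(x, y, lags=1):
    # Fit an AR filter to the predictor x, apply the same filter to sales y,
    # then cross-correlate the two filtered series.
    phi = AutoReg(x, lags=lags).fit().params[1:]   # skip the constant term
    def apply_filter(s):
        out = s.astype(float)
        for k, coef in enumerate(phi, start=1):
            out = out - coef * s.shift(k)
        return out.dropna()
    x_w, y_w = apply_filter(x), apply_filter(y)
    n = min(len(x_w), len(y_w))
    return ccf(x_w.iloc[-n:], y_w.iloc[-n:])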
Thanks a lot. Where can I learn all this from? Could you recommend some book(s) or resources that I can turn to for issues like these? I took a training course with SAS on time series but it didn't cover these topics due to the lack of TIME.
Take a look at the following paper. It gives the variants that you could try.
http://www2.sas.com/proceedings/sugi30/080-30.pdf

But if you are looking for anything specific, I'd recommend searching for it in the SUGI papers list. It's the best place for SAS resources! It's linked from the above doc (the SUGI table of contents). I also think you should find papers there on UCM, STATESPACE, ARIMAX, etc. If not, the SAS website should have them too, and so should SAS Help!

There isn't too much to time series itself. It's an extension of regression, so if you've got the basics right, I think these tools are very helpful for finding what you need.
You can try regression first to weed out insignificant variables; that should substantially prune your variable list. Next, you can check the correlations or condition indices to tag correlated variables and keep those which have a higher significance with respect to sales (from your regression output); a rough sketch of both steps is at the end of this reply. This is the most time-consuming part.
If you're left with a manageable set, I would recommend using business sense along with basic scatter plots to zero in on variables. This way you can get to a stage where you are confident of which variables to include in the final model.
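As a concrete (if simplified) version of those two steps, here is a Python/statsmodels sketch; the 0.05 cutoff is the usual convention, and X and y are assumed to be a predictor DataFrame and a sales Series:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_pvalue(X, y, alpha=0.05):
    # Drop candidates whose OLS p-value exceeds alpha.
    model = sm.OLS(y, sm.add_constant(X)).fit()
    pvals = model.pvalues.drop("const")
    return X[pvals[pvals < alpha].index]

def collinearity_report(X):
    # Variance inflation factors plus the condition number of the design matrix.
    Xc = sm.add_constant(X)
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    return vifs, np.linalg.cond(Xc.values)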
Business knowledge (domain expertise) can definitely help in pruning the starting set of 300 variables down to a smaller set. But even if you cut it down to 100 variables, taking those and lags of different orders on them, you could still have an overwhelming number of "explanatory" variables for forecasting the dependent variable (daily sales). Sometimes a model in which lags of the dependent variable are used as explanatory variables, along with the other selected variables among the 300 (perhaps with lags on a few of them, chosen based on intuition), will not only reduce the number of explanatory variables, and thereby increase the degrees of freedom of the prediction model, but also provide more stable predictions. You can also use the first several principal components of the chosen predictor variables to deal with the multicollinearity issues which typically arise in such problems. This also cuts down the number of parameters and thereby increases the degrees of freedom of the model.
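One way to set up the "lags of the dependent variable plus a few selected predictors" idea is an autoregressive-distributed-lag style regression. A rough Python sketch follows; the "sales" column name, the seven lags, and the predictor list are all placeholders:

import statsmodels.api as sm

def lagged_sales_model(df, predictors, sales_col="sales", n_lags=7):
    # Regress sales on its own recent lags plus a short list of selected predictors.
    data = df.copy()
    lag_cols = []
    for k in range(1, n_lags + 1):
        name = f"{sales_col}_lag{k}"
        data[name] = data[sales_col].shift(k)
        lag_cols.append(name)
    data = data.dropna()
    X = sm.add_constant(data[lag_cols + list(predictors)])
    return sm.OLS(data[sales_col], X).fit()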

How many principal components to use is a judgement call, although there are statistical criteria that can be employed. One practical way is to see how the R-square improves as the number of principal components is increased: if it rises rapidly and then levels off, you can stop where it levels off. Once the prediction model is expressed in terms of principal components, which are mutually uncorrelated linear combinations of the original variables, the model can be translated back to an equivalent form in terms of the original variables so it is easier to interpret. This is called principal components regression.
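A minimal sketch of principal components regression in Python (scikit-learn here; the 30-component cap is just an assumption), including the R-square-versus-components curve and the translation back to the standardized original variables:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def pcr_r2_curve(X, y, max_components=30):
    # R-square of the regression on the first k components, for k = 1..max_components.
    Xs = StandardScaler().fit_transform(X)
    r2 = []
    for k in range(1, max_components + 1):
        scores = PCA(n_components=k).fit_transform(Xs)
        r2.append(LinearRegression().fit(scores, y).score(scores, y))
    return r2  # pick k where this curve stops improving much

def pcr_coefficients(X, y, k):
    # Coefficients of the k-component model translated back to the standardized
    # original variables (loadings transpose times the component coefficients).
    Xs = StandardScaler().fit_transform(X)
    pca = PCA(n_components=k).fit(Xs)
    beta_pc = LinearRegression().fit(pca.transform(Xs), y).coef_
    return pca.components_.T @ beta_pc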
You can use all of your variables. Apply game theory to get rid of multicollinearity and overfitting. This is the best way to get the most accuracy. Prewhitening, regression, STATESPACE, ARIMAX and cross-correlation get you a forecast; game theory will get you the most accuracy.
Hi, Murat:
Can you tell me more about how game theory can help? I don't have a background in economics, and from my very limited awareness of game theory, it's about two players competing to beat each other.
There is an automated general-to-specific procedure in PcGive, a module of the OxMetrics econometrics software. The procedure is extremely powerful and statistically robust. It can reduce a big model consisting of 300+ variables to a handful (usually 3-8) of alternative, well-specified, parsimonious models. In turn, you can either choose the model that you believe is the most appropriate or let the procedure choose the one which has the best fit.
The company that I work for sells this software. I know it sounds like a sales pitch, but this procedure is designed exactly for dealing with problems like yours. I can demonstrate it to you with your data if you want. Please feel free to contact me if you would like to give it a go.
