Is it necessary to exclude independent variables from a regression model simply because they are correlated? I am working on a logistic regression model built from a very large dataset, but with a very big imbalance in population size between the target classes, i.e. a very large number of goods and a small number of bads.
My model has proved consistent over a long period with these variables included; they each have their own independent definition and I need them in the model.
My understanding is that correlated IVs will destabilise the model if a later dataset has different correlations between them. In that case, the model will fail, because the previously determined coefficients for the correlated variables will no longer be correct. Collinearity basically makes it impossible to determine which variable contributes which part of the variance, because they are all contributing together.
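To make this concrete, a common diagnostic is the variance inflation factor (VIF): for each IV, regress it on the others and compute VIF = 1/(1 - R²). A minimal sketch on synthetic data (the data and the `vif` helper are illustrative, not from the question):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on the remaining columns (with an intercept).
    Equivalently, VIF_j = SS_tot / SS_res of that regression.
    """
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        out[j] = ((y - y.mean()) ** 2).sum() / (resid @ resid)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # strongly correlated with x1
x3 = rng.normal(size=500)             # independent of the others
X = np.column_stack([x1, x2, x3])
print(vif(X))  # large VIFs for the first two columns, near 1 for the third
```

Rules of thumb vary, but VIFs above 5 or 10 are usually taken as a sign that the coefficient estimates for those variables are unstable.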
If you need every variable in the model, then run a PCA on the IVs first and put the resulting components into the model. Alternatively, use another technique, such as an SVM or similar, which does its own variable selection.
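The PCA-first idea can be sketched as a scikit-learn pipeline; the synthetic data and parameter choices below are illustrative, and `class_weight="balanced"` is one common way to handle the good/bad imbalance mentioned in the question:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)  # correlated pair
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
# Binary outcome driven by x1 and x3
logit = 1.2 * x1 + 0.8 * x3
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardise, rotate onto uncorrelated principal components,
# then fit the logistic model on the components.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    LogisticRegression(class_weight="balanced"),
)
model.fit(X, y)
print(model.score(X, y))
```

The trade-off is interpretability: each component is a mixture of the original IVs, so the coefficients no longer map one-to-one to the variables' own definitions.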
If you think that you need all of the variables in the model, then there may be a latent variable, not in the model, which needs to be included. Short of that, it is OK to include correlated variables in the model as long as you understand that doing so increases the standard errors of their coefficients. It may make sense to bootstrap-sample and see how these are affected.
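The bootstrap check above can be sketched as follows: refit the model on resampled rows and look at the spread of each coefficient. The data here are synthetic stand-ins (one nearly collinear pair, one independent variable), chosen to show the inflated spread:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)             # independent
X = np.column_stack([x1, x2, x3])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(x1 + x3)))).astype(int)

coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    fit = LogisticRegression().fit(X[idx], y[idx])
    coefs.append(fit.coef_[0])
coefs = np.array(coefs)

# The collinear pair shows a much larger bootstrap spread than x3.
print(coefs.std(axis=0))
```

If the bootstrap spread of the correlated coefficients is acceptable for your use, keeping the variables is defensible; if individual coefficients swing wildly from sample to sample, that is the instability the answers above warn about.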