I have a few questions on handling multicollinearity.
1) Multicollinearity should be taken care of before running an OLS model. Does one 'necessarily' need to do the same for logistic regression too (even though it is not an assumption of logistic regression)?
Am I right in saying that one handles multicollinearity before building a logistic regression model for the purpose of variable reduction?
2) Let's say I have a dependent variable 'y' and 10 independent variables 'x1'-'x10', and suppose x1, x2, x3 and x4 are multicollinear with each other (found using PROC CORR, the VIF option in PROC REG using weights as recommended by Paul Allison, or PROC FACTOR - the three methods I know of - or even PROC VARCLUS, which I have only heard about).
How does one go about retaining or eliminating variables before building a model? In other words, how will I know which of x1, x2, x3, x4 to retain or drop? Is it OK to retain more than one variable out of x1-x4, or is there a hard and fast rule that only one of x1-x4 may be retained and the rest dropped?
On what basis - the highest factor loading on the plane, or is there anything more that needs to be looked at?
Sorry about that - pressed the post button too early. The experienced (as opposed to academic) approach is as follows:
1) This is only a problem if the variable's coefficient in the multivariate routine switches in sign from what you observe in the simple correlation between that independent variable and the dependent variable.
2) If a variable does switch in sign (very high multicollinearity), drop it and then see what other variable takes its place, presuming no sign switching with the new variable or variables and, of course, minimal compromise to your model's performance.
Hope this helps.
I'm not 100% sure about logistic regression, but for ordinary regression you should keep whichever variable has the strongest correlation, because it accounts for the most variance. If you have four variables all accounting for the same variance, it will only make prediction harder, because you won't know which variable is generating the predictions.
If they all account for a similar amount of variance, then I think you could really pick at random. It's not a hard and fast rule, but it is a good idea for making prediction more reliable and the model more stable.
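The "keep the strongest correlate" heuristic above amounts to one line of code. A minimal sketch, with a made-up collinear group where x1 is the cleanest copy of the shared signal:

```python
# Among a collinear group, retain the variable with the largest |corr(x, y)|.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
signal = rng.normal(size=n)
y = signal + rng.normal(scale=0.5, size=n)
# x1-x3 are noisy copies of the same signal, with increasing noise.
group = {
    "x1": signal + rng.normal(scale=0.2, size=n),
    "x2": signal + rng.normal(scale=0.5, size=n),
    "x3": signal + rng.normal(scale=1.0, size=n),
}
corrs = {name: abs(np.corrcoef(x, y)[0, 1]) for name, x in group.items()}
keep = max(corrs, key=corrs.get)
print(f"retain {keep} (|r| = {corrs[keep]:.3f})")
```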
But a better option might be to use some sort of dimensionality-reduction tool like PCA to actually optimise the variance accounted for by each feature, especially if each also accounts for a decent amount of unique variance.
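As a sketch of that PCA route (scikit-learn assumed, data made up): replace the collinear block x1-x4 with its principal components, and keep only as many components as carry real variance - often just one when the block shares a single underlying factor.

```python
# Replace a collinear block with its principal components before modelling.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n = 500
base = rng.normal(size=n)
# Four near-copies of one underlying factor, as in the x1-x4 example.
X = np.column_stack([base + rng.normal(scale=0.1, size=n) for _ in range(4)])

pca = PCA()
scores = pca.fit_transform(X)
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
# With one dominant factor, the first component captures nearly all the
# variance, so scores[:, 0] can stand in for x1-x4 in the regression.
```

The trade-off is interpretability: a component is a weighted blend of the original variables, which matters less for pure prediction than for explanation.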