Data Intelligence, Business Analytics
I am currently building a response model for an MPE client. The sample data set has 3,800 variables and 50,000 cases. How can I reduce the number of variables (approx. 2,500 continuous and 1,300 categorical) before fitting a logistic regression, and how should I take it forward?
You could use principal components analysis.
Try variable selection techniques such as stepwise selection or genetic algorithms.
I'd also recommend trying principal component analysis.
You could also use a decision tree algorithm (like C4.5) to generate a tree with limited depth, to figure out which variables are giving you the most information. Then you can throw out the rest.
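As a rough stand-in for the shallow-tree idea, here is a minimal sketch that ranks categorical variables by information gain against the response. Information gain is the split criterion that this family of trees builds on (C4.5 itself normalizes it into a gain ratio), so this is only an approximation of what the tree would pick at its root. The variable names and toy data are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the response minus its weighted entropy after splitting on `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: 8 cases, binary response.
response = [1, 1, 1, 0, 0, 0, 1, 0]
variables = {
    "gender": ["M", "F", "M", "F", "M", "F", "F", "M"],  # unrelated to the response
    "region": ["N", "N", "N", "S", "S", "S", "N", "S"],  # separates the response fully
}

gains = {name: information_gain(vals, response) for name, vals in variables.items()}
ranked = sorted(gains, key=gains.get, reverse=True)  # most informative first
```

With thousands of variables you would keep only the top of `ranked`. Continuous variables would need to be binned (or split at a threshold) before being scored this way.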
A third option would be to see if you have any variables that are highly correlated, and keep only one out of each set of correlated variables.
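The correlation idea can be sketched as a greedy filter: walk through the variables and keep one only if it is not highly correlated with anything already kept. This is a minimal illustration, not a tuned procedure. The 0.9 cutoff, the column names, and the data are assumptions, and it presumes no column is constant (which would give a zero standard deviation).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| with every already kept column is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data.
columns = {
    "income": [10, 20, 30, 40, 50],
    "income_x2": [20, 40, 60, 80, 100],  # exact multiple of income, so r = 1
    "age": [40, 25, 60, 35, 30],
}
kept = drop_correlated(columns)  # income_x2 is dropped as redundant
```

Which member of a correlated set survives depends on column order here, so it helps to put the variables you would rather keep (better fill rate, easier to explain) first.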
Thanks Mike. I will try these methodologies from my end and keep you posted.
I would say principal component analysis is a good way of reducing the number of variables, but the components are hard to interpret when you implement a logistic regression model or decide on strategies. In my experience, it is very hard to explain principal components to business partners and get them to accept that each component is a combination of all the variables.
You can try stepwise logistic regression, use VIF to reduce multicollinearity, and check correlations between variables. Also check fill rates and remove variables with less than a 60% fill rate, since you would otherwise have to impute the missing observations. You could also try variable clustering.
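The fill-rate check is usually the cheapest first pass. A minimal sketch, assuming missing values are represented as None and using the 60% cutoff mentioned above; the column names and data are hypothetical:

```python
def fill_rate(values):
    """Share of non-missing entries; None stands for a missing value here."""
    return sum(v is not None for v in values) / len(values)

def drop_sparse(columns, min_fill=0.6):
    """Keep only the columns whose fill rate is at least min_fill."""
    return {name: vals for name, vals in columns.items()
            if fill_rate(vals) >= min_fill}

# Hypothetical toy data.
columns = {
    "tenure": [12, 5, None, 8, 20],           # 4 of 5 filled
    "last_promo": [None, None, 3, None, 1],   # 2 of 5 filled, below the cutoff
    "segment": ["A", "B", "B", None, None],   # 3 of 5 filled, exactly at the cutoff
}
kept = drop_sparse(columns)
```

Running this before the correlation or VIF checks shrinks the variable list early, so the more expensive pairwise steps have less to process.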
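In R, something like this (Xvars is the predictor data frame, Yvar the response as a factor):
library(randomForest)
# tuneRF searches over mtry; doBest=TRUE returns the forest fitted at the best mtry
rf = tuneRF(Xvars, Yvar, stepFactor=1.2, doBest=TRUE)
rfi = rf$importance  # per-variable importance; keep the top-ranked variables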