Data Intelligence, Business Analytics
Hi All,
I am currently building a response model for an MPE client. The sample data set has 3,800 variables and 50,000 cases. How can I reduce the number of variables (approx. 2,500 continuous and 1,300 categorical) before fitting a logistic regression, and how should I take this forward?
Regards,
Ravi
Permalink Reply by Neil McGuigan on April 26, 2011 at 11:47am You could use principal components analysis, or try variable selection techniques such as stepwise selection or genetic algorithms.
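For anyone who wants to try the PCA route, here is a minimal sketch in base R. The matrix X is simulated stand-in data, not the poster's data set, and the 95% variance cutoff is only an assumption for illustration. PCA as shown applies to the continuous variables only; the categorical ones would need separate handling.

```r
# Hypothetical stand-in for the continuous predictors (simulated, not real data).
set.seed(1)
n <- 500
base <- matrix(rnorm(n * 5), n, 5)             # 5 underlying factors
X <- base %*% matrix(rnorm(5 * 40), 5, 40) +   # 40 observed, correlated variables
  matrix(rnorm(n * 40, sd = 0.1), n, 40)       # plus a little noise
colnames(X) <- paste0("x", 1:40)

# Centre and scale so that no variable dominates because of its units.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the leading components.
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the (assumed) 95% cutoff.
k <- which(var_explained >= 0.95)[1]

# Component scores that could be used as predictors in glm().
scores <- pca$x[, 1:k, drop = FALSE]
```

The columns of scores can then go into glm(..., family = binomial) in place of the original variables.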
Permalink Reply by Mike Olson on April 27, 2011 at 4:51pm I'd also recommend trying principal component analysis.
You could also use a decision tree algorithm (like C4.5) to generate a tree with limited depth, to figure out which variables are giving you the most information. Then you can throw out the rest.
A third option would be to see if you have any variables that are highly correlated, and keep only one out of each set of correlated variables.
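The correlation idea can be sketched in base R roughly as follows. The helper name filter_correlated, the 0.9 cutoff and the toy data are all assumptions made for illustration; the greedy rule (keep a variable only if it is not highly correlated with one already kept) is one simple way to do it, not the only one.

```r
# Hypothetical helper: greedily keep variables that are not highly correlated
# with any variable already kept. X must be numeric with column names.
filter_correlated <- function(X, cutoff = 0.9) {
  cm <- abs(cor(X, use = "pairwise.complete.obs"))
  cm[is.na(cm)] <- 0                      # e.g. constant columns give NA
  keep <- character(0)
  for (v in colnames(X)) {
    if (length(keep) == 0 || all(cm[v, keep] <= cutoff)) {
      keep <- c(keep, v)
    }
  }
  keep
}

# Toy example: a_copy is almost identical to a, b is independent.
set.seed(42)
a <- rnorm(200)
X <- cbind(a = a, a_copy = a + rnorm(200, sd = 0.01), b = rnorm(200))
kept <- filter_correlated(X, cutoff = 0.9)
```

Here kept contains a and b, and a_copy is dropped. Which member of a correlated set survives depends on column order, so it may be worth ordering the columns by business relevance first.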
Thanks Mike. I will try these methodologies from my end and keep you posted.
Permalink Reply by Puneet Agarwal on April 28, 2011 at 10:13am I would say principal component analysis is a good way of reducing the number of variables, but the principal components are hard to interpret when you implement a logistic regression model or decide on strategies. In my personal experience, it is very hard to explain principal components to business partners and get them to accept that each component is a combination of all the variables.
You can try stepwise logistic regression, use VIF to reduce multicollinearity, check correlations between variables, and check fill rates, removing variables with less than a 60% fill rate, since you would otherwise need to impute the missing observations. You could also try variable clustering.
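A rough base R sketch of the fill-rate and VIF checks, on simulated data. The column names are made up, the 60% threshold is the one mentioned above, and the VIF is computed from the inverse of the correlation matrix, which covers numeric variables only.

```r
# Simulated stand-in data; column names are hypothetical.
set.seed(7)
n <- 1000
df <- data.frame(income = rnorm(n), age = rnorm(n), sparse = rnorm(n))
df$income_dup <- df$income + rnorm(n, sd = 0.1)  # near-duplicate of income
df$sparse[sample(n, 700)] <- NA                  # only 30% filled
df$age[sample(n, 100)] <- NA                     # 90% filled

# Fill rate = share of non-missing values per variable.
fill_rate <- colMeans(!is.na(df))
kept <- names(fill_rate)[fill_rate >= 0.60]

# VIF on complete cases of the kept numeric variables:
# the diagonal of the inverse correlation matrix equals 1 / (1 - R^2_j).
Xc <- na.omit(df[, kept])
vif <- diag(solve(cor(Xc)))
names(vif) <- colnames(Xc)
```

In this toy case sparse is removed by the fill-rate rule, and income and income_dup show a very large VIF, so one of the two would be dropped before fitting the model.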
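Permalink Reply by Alex Zolot on May 26, 2011 at 10:40am In R, something like this:
install.packages('randomForest')  # one-time install
library(randomForest)
# Xvars: data frame of predictors; Yvar: the response as a factor (for classification)
rf <- tuneRF(Xvars, Yvar, stepFactor = 1.2, doBest = TRUE)  # tune mtry, then return the best forest
imp <- rf$importance[, 1]  # first importance column (MeanDecreaseGini for a default classification forest)
# plot the 30 most important variables, keeping their names as labels
barplot(head(sort(imp, decreasing = TRUE), 30), las = 2)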
© 2013 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC