Hi All, I am currently involved in building a response model for an MPE client. The sample data set has 3,800 variables and 50,000 cases. How can I reduce the number of variables (approx. 2,500 continuous and 1,300 categorical) before fitting a logistic regression? I would welcome your thoughts.
Regards,
Ravi


Replies to This Discussion

As you may be aware, two common methods for reducing the number of variables are principal component analysis and factor analysis.

 

For all the technical details, you can refer to the StatSoft textbook at http://www.statsoft.com/textbook/principal-components-factor-analysis/

Hi Ravi-
I have some suggestions about variable selection with a large set of candidate predictors. First, I'm not sure I agree that FA is the best way to go about this. I say this knowing full well that it is a widely used practice in industry. My objections to FA boil down to one strong fact -- it's like running the regression without a dependent variable. In my view it is essential to know how the predictors interact with the DV to be able to make a sensible decision about a predictor's inclusion in the model.
Leo Breiman developed the random forests methodology for classification trees with a large set of predictors. As you know the approach involves running thousand(s) of small trees where predictors are randomly sampled (a small handful per tree...maybe 10-20) as well as subjects (cases or observations) being sampled using the bootstrap. Breiman discovered that by aggregating the information across the iterations a kind of scorecard could be developed for each predictor. Variable selection would then focus on the highest ranking predictors.
This technique can easily be extended to multiple regression, ANOVA, logistic regression or just about any other multivariate technique. Since logistic regression can be quite CPU-intensive, it wouldn't be a mistake to use ANOVA as a less expensive first pass at the answer. This approach is easily implemented in SAS as I have used it many times with significant success. It is important to note however that algorithmic answers like this are not a substitute for or improvement over substantive knowledge of the category.
Hope this helps,
Tom
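To make the scorecard idea above concrete, here is a minimal sketch in plain Python. It is not Tom's SAS implementation and not Breiman's importance measure. It assumes numeric predictors, uses ordinary least squares on the 0/1 response as the cheap first pass in place of logistic regression, and scores each predictor by its mean absolute standardized coefficient across iterations. The function names, the scoring rule, and the synthetic data (only predictors 0 and 1 drive the response) are all illustrative assumptions.

```python
import random

def standardize(columns):
    """Scale each predictor column to mean 0 and standard deviation 1."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5 or 1.0
        out.append([(v - mean) / sd for v in col])
    return out

def solve_least_squares(X, y, ridge=1e-8):
    """Solve (X'X + ridge*I) b = X'y by Gaussian elimination with pivoting."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * p
    for i in reversed(range(p)):
        s = b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, p))
        coef[i] = s / A[i][i]
    return coef

def predictor_scorecard(columns, y, n_iter=200, subset_size=3, seed=0):
    """Average |standardized coefficient| per predictor over many small fits.

    Each iteration samples a few predictors at random and a bootstrap
    sample of cases, then fits a small least-squares model with intercept.
    """
    rng = random.Random(seed)
    cols = standardize(columns)
    n, p = len(y), len(cols)
    total, count = [0.0] * p, [0] * p
    for _ in range(n_iter):
        chosen = rng.sample(range(p), subset_size)   # random predictors
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap cases
        X = [[1.0] + [cols[j][i] for j in chosen] for i in rows]
        coef = solve_least_squares(X, [y[i] for i in rows])
        for j, b in zip(chosen, coef[1:]):           # skip the intercept
            total[j] += abs(b)
            count[j] += 1
    return [total[j] / count[j] if count[j] else 0.0 for j in range(p)]

def make_demo(n=300, p=6, seed=1):
    """Synthetic data (assumption): only predictors 0 and 1 affect y."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [1 if 2.0 * columns[0][i] - 1.5 * columns[1][i] + rng.gauss(0, 1) > 0
         else 0 for i in range(n)]
    return columns, y

columns, y = make_demo()
scores = predictor_scorecard(columns, y)
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
print("predictors ranked by score:", ranking)
```

With 3,800 candidates the same loop applies with a larger subset_size and more iterations. As noted above, the top of the ranking is only a shortlist for further refinement, and categorical predictors would need to be dummy-coded first.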

Tom - I don't understand your objection to factor analysis (or regression). Certainly the time taken to understand the relationships in the factors is comparable to the time needed to understand the hundreds of decision trees that the random forest process can output. And that still wouldn't solve your correlation problem. Any modeling technique using correlated variables will have the importance of those effects diminished, whether linear or not.

 

-Ralph Winters

Ralph-

Thanks for your post. You are correct in noting that the random forests approach doesn't solve the predictor correlation problem. It merely develops a greatly shortened scorecard or laundry list of predictors to be used in a further stage of model refinement. But I would also suggest that FA, too, merely develops a similar laundry list requiring refinement. If one has run an orthogonal factor solution, then it's guaranteed that the factors are uncorrelated with one another. There is no such guarantee, however, that the predictors are similarly uncorrelated, since each item has a loading on each factor which may or may not be nonzero. In addition, since the factor solution has been developed without reference to the predictors' relationship with the DV, further model refinement is a requirement. I remain unconvinced of the value of the information from FA in the absence of a DV.

Thanks,

Tom

Hi Ralph,

 

What I have against using factor analysis to reduce the number of predictors is that you would lose the explanatory power of the model. A model built on factors becomes very difficult to explain to business partners, and it is impossible to use a model built on factors to help in any strategy formulation. If the aim is just to build a model, then it's fine, but if it is for a business requirement, the model may be rendered useless.

 

Thanks,

Puneet

Tom and Puneet,

Another point I would like to add is that you can use factor analysis to simplify things by highlighting variables which have NO significant loadings on any particular factor. In these cases the variables will simply drop out of the original model. I have found this to be the case when using variables from enterprise databases, where there is much duplication and redundancy. No one forces you to perform factor analysis and actually use the factors!

 

-Ralph Winters
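A rough sketch of that screening step, in plain Python. It assumes numeric variables and substitutes unrotated principal-component loadings of the correlation matrix (found by power iteration with deflation) for a full factor analysis, which would also estimate communalities and usually apply a rotation. The function names, the 0.4 loading cutoff, and the synthetic data are illustrative assumptions, not anything stated in the thread.

```python
import random

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        z.append([(v - mean) / sd for v in col])
    p = len(z)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
            for i in range(p)]

def top_loadings(R, n_factors, n_steps=500):
    """Loadings (eigenvector * sqrt(eigenvalue)) of the leading components."""
    p = len(R)
    A = [row[:] for row in R]
    loadings = []
    for _ in range(n_factors):
        v = [1.0 + 0.01 * i for i in range(p)]  # arbitrary starting vector
        for _ in range(n_steps):                # power iteration
            w = [sum(A[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        eigval = sum(v[i] * A[i][j] * v[j] for i in range(p) for j in range(p))
        loadings.append([x * eigval ** 0.5 for x in v])
        # Deflate: remove the component just found before the next pass.
        A = [[A[i][j] - eigval * v[i] * v[j] for j in range(p)]
             for i in range(p)]
    return loadings

def weakly_loaded(columns, n_factors=2, threshold=0.4):
    """Indices of variables whose largest |loading| is below the cutoff."""
    loadings = top_loadings(correlation_matrix(columns), n_factors)
    return [j for j in range(len(columns))
            if max(abs(f[j]) for f in loadings) < threshold]

# Synthetic data (assumption): variables 0-2 share one latent factor,
# variables 3-4 share another, and variable 5 is unrelated noise.
rng = random.Random(0)
n = 400
factor_a = [rng.gauss(0, 1) for _ in range(n)]
factor_b = [rng.gauss(0, 1) for _ in range(n)]

def noisy(factor):
    return [v + 0.5 * rng.gauss(0, 1) for v in factor]

columns = [noisy(factor_a), noisy(factor_a), noisy(factor_a),
           noisy(factor_b), noisy(factor_b),
           [rng.gauss(0, 1) for _ in range(n)]]
dropped = weakly_loaded(columns)
print("variables with no large loading:", dropped)
```

This only flags variables that load weakly on every retained component. Like the factor solution itself, it ignores the DV, so whether a flagged variable should really leave the model still needs checking against the response.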

 

 


© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
