Data Intelligence, Business Analytics
As you may be aware, there are two common methods for reducing the number of variables: principal component analysis and factor analysis.
For the technical details, you can refer to the StatSoft textbook at http://www.statsoft.com/textbook/principal-components-factor-analysis/
Tom, I don't understand your objection to Factor Analysis (or Regression). Certainly the time taken to understand the relationships in the factors is comparable to the time taken to understand the hundreds of decision trees that the random forest process can output. And that still wouldn't solve your correlation problem. Any modeling technique that uses correlated variables will have the importance of those effects diminished, whether it is linear or not.
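For the linear case, one common way to put a number on this correlation problem is the variance inflation factor (VIF). The sketch below is only an illustration and not something proposed in the thread: the data are synthetic, the helper name variance_inflation_factors is made up for this example, and the only library assumed is NumPy.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        # VIF = 1 / (1 - R^2); large values mean column j is nearly redundant.
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic predictors (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # near-duplicate of x1
x3 = rng.normal(size=n)             # unrelated predictor

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], vifs.round(1))))
```

The two near-duplicate predictors should show very large VIFs while the unrelated one stays close to 1, which is the linear-model version of correlated effects having their individual importance diluted.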
Thanks for your post. You are correct in noting that the random forest approach doesn't solve the predictor correlation problem. It merely produces a greatly shortened scorecard, or laundry list, of predictors to be used in a further stage of model refinement. But I would suggest that FA, too, merely produces a similar laundry list that requires refinement. If one has run an orthogonal factor solution, then the factors are guaranteed to be uncorrelated with one another. There is no such guarantee, however, that the predictors are similarly independent, since each item has a loading on each factor, which may or may not be nonzero. In addition, since the factor solution has been developed without reference to the predictors' relationship with the DV, further model refinement is required. I remain unconvinced of the value of the information from FA in the absence of a DV.
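The point about orthogonal solutions can be checked on toy data. The sketch below uses principal components of the correlation matrix as a stand-in for an orthogonal factor solution (a proper factor analysis would estimate loadings differently), and the data, variable layout, and choice of two components are all assumptions made for illustration.

```python
import numpy as np

# Synthetic predictors driven by two latent variables (illustrative only).
rng = np.random.default_rng(1)
n = 2000
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
X = np.column_stack([
    f1 + 0.3 * rng.normal(size=n),
    f1 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    0.6 * f1 + 0.6 * f2 + 0.3 * rng.normal(size=n),  # loads on both
])

# Standardize, then eigendecompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr_X = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr_X)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                      # number of components kept (assumed)
scores = Z @ eigvecs[:, :k]                # component scores per observation
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])

corr_scores = np.corrcoef(scores, rowvar=False)
print("correlation between the two component scores:", corr_scores[0, 1])
print("correlation between predictors 1 and 2:", corr_X[0, 1])
print("loadings (rows = predictors, columns = components):")
print(loadings.round(2))
```

The component scores come out uncorrelated by construction, yet the original predictors remain strongly correlated and each one carries a loading on every component, so the list of predictors still needs refining against the DV.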
My objection to using Factor Analysis to reduce the number of predictors is that you lose the explanatory power of the model. A model built on factors becomes very difficult to explain to business partners, and it is nearly impossible to use such a model to help with strategy formulation. If the aim is just to build a model, then it's fine, but if it is for a business requirement, the model may be rendered useless.
Tom and Puneet,
Another point I would like to add is that you can use Factor Analysis to simplify things by highlighting variables that have NO significant loadings on any factor. Those variables can simply be dropped from the original model. I have found this to be the case when using variables from enterprise databases, where there is much duplication and redundancy. No one forces you to perform Factor Analysis and actually use the factors!
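A minimal sketch of that screening idea follows. It again uses principal-component loadings as a stand-in for factor loadings, and the synthetic data, the variable names, the two retained components, and the 0.4 cutoff for a "significant" loading are all assumptions chosen for illustration rather than anything prescribed in this thread.

```python
import numpy as np

# Synthetic data: five variables driven by two latent factors, two pure noise.
rng = np.random.default_rng(2)
n = 2000
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
names = ["a1", "a2", "a3", "b1", "b2", "noise_a", "noise_b"]
X = np.column_stack([
    f1 + 0.3 * rng.normal(size=n),
    f1 + 0.3 * rng.normal(size=n),
    f1 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])

k = 2            # number of components retained (assumed)
threshold = 0.4  # cutoff for a "significant" loading (assumed rule of thumb)

# Loadings = eigenvectors of the correlation matrix scaled by sqrt(eigenvalue).
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
top = np.argsort(eigvals)[::-1][:k]
loadings = eigvecs[:, top] * np.sqrt(eigvals[top])

# A variable is a candidate for dropping if it loads weakly on every component.
max_abs = np.abs(loadings).max(axis=1)
dropped = [name for name, m in zip(names, max_abs) if m < threshold]
kept = [name for name in names if name not in dropped]

print("candidates to drop:", dropped)
print("kept:", kept)
```

The output is a shortlist of the original variables, not a set of factors, so the model that follows can still be explained in terms the business recognizes.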