I work as a marketing statistician at an online retailer. We use logistic regression to build response models that decide which customers are worth mailing our catalog to. We usually have somewhere around 600,000 customers with 300 variables. So far, it has worked well compared to the RFM method.
Can anyone suggest other algorithms that might do better than logistic regression in some situations? Thank you so much for your reply.
It appears your goal is supervised learning. Have you tried a variable reduction technique that replaces the predictors with a smaller number of linear combinations of them? You may want to try Partial Least Squares Discriminant Analysis (PLS-DA) for your classification problem, especially if there is any collinearity among the predictor variables. PLS-DA is being used more and more in bioinformatics data mining problems, where multicollinearity is common.
Hi, Timothy:
Thanks. I used clustering to group the variables and selected one variable from each group to reduce the number of variables. I will certainly try PLS-DA. How do you actually do that in SAS? What procedure do I have to use?
Hi Yi-Chun,
The easiest way to do PLS-DA is to run PLS in R and then carry out discriminant analysis on the resulting scores. There is a package available for R. Please check the CRAN website.
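The two steps described above (PLS first, then discriminant analysis on the scores) can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not the CRAN package mentioned above. It assumes a binary 0/1 response and extracts only the first PLS component. With two classes, equal priors and a shared variance, linear discriminant analysis on a single score reduces to a cutoff halfway between the two class means, which is what the sketch uses. The function names `fit_pls_da`, `pls_score` and `predict_pls_da` and the toy data are hypothetical.

```python
import math

def pls_score(row, w, means):
    """Project one observation onto the PLS weight vector after centering."""
    return sum((v - m) * wj for v, m, wj in zip(row, means, w))

def fit_pls_da(X, y):
    """One-component PLS, then a midpoint cutoff on the scores (two-class LDA)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # First PLS weight vector: direction of covariance between centered X and y
    w = [sum((row[j] - means[j]) * (yi - y_mean) for row, yi in zip(X, y))
         for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("no predictor covaries with the response")
    w = [v / norm for v in w]
    scores = [pls_score(row, w, means) for row in X]
    ones = [s for s, yi in zip(scores, y) if yi == 1]
    zeros = [s for s, yi in zip(scores, y) if yi == 0]
    # Equal-prior, equal-variance LDA on one score: cutoff midway between class means
    cutoff = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return w, means, cutoff

def predict_pls_da(row, w, means, cutoff):
    """Responders score above the cutoff because w points toward y = 1."""
    return int(pls_score(row, w, means) > cutoff)

# Toy data (made up): two nearly collinear predictors, 1 = responded
X = [[1, 2.0], [2, 4.1], [3, 6.0], [6, 12.2], [7, 14.0], [8, 15.9]]
y = [0, 0, 0, 1, 1, 1]
w, means, cutoff = fit_pls_da(X, y)
print(predict_pls_da([1.5, 3.0], w, means, cutoff))
print(predict_pls_da([7.5, 15.0], w, means, cutoff))
```

Because the weight vector is built from the covariance with the response, collinear predictors are combined into one score instead of competing with each other, which is the property that makes PLS attractive here. A real analysis would extract several components and choose their number by cross-validation.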
Kind regards,
Tim
Hi, Timothy:
Thanks. Do you know of any papers or an introduction to PLS-DA? This is actually the first time I have heard of it. Also, I haven't used R before, although I used S-Plus quite a few years back. Thanks.
I'm concerned that your selection of one variable from each group is losing useful information.
In addition to Tim's excellent suggestion, I would recommend that you try CART or C5.0_with_boosting, and combine the results of the decision tree with Logistic Regression in a confidence-based voting scheme. You could also feed the "best" clustering solution into C5 and LogReg as new variables. That might really improve your results.
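One possible reading of a confidence-based voting scheme is sketched below. It assumes each model outputs a response probability for a customer, treats the distance of that probability from 0.5 as the model's confidence, and takes a confidence-weighted average. The function name `confidence_vote` and the weighting rule are illustrative assumptions; other voting rules fit the same description.

```python
def confidence_vote(p_tree, p_logit, threshold=0.5):
    """Combine two response probabilities, weighting each by its confidence."""
    # Confidence is taken as the distance from the undecided point 0.5 (assumption)
    w_tree = abs(p_tree - 0.5)
    w_logit = abs(p_logit - 0.5)
    total = w_tree + w_logit
    if total == 0:
        # Both models are completely undecided
        combined = 0.5
    else:
        combined = (w_tree * p_tree + w_logit * p_logit) / total
    # Return the combined probability and the resulting mail / do-not-mail decision
    return combined, combined >= threshold

# The tree is confident (0.9), the logistic regression is lukewarm (0.4)
probability, mail = confidence_vote(0.9, 0.4)
print(probability, mail)
```

With this rule, a model that is close to undecided has little influence, so the more confident model dominates when the two disagree.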
Hi, Ben:
Thank you very much for your input. Our company only has SAS/STAT, which does not offer CART or C5.0_with_boosting. What can I do then? Do I have to use an open-source package like R? Thanks.
So you only have the SAS statistics, but no Data Mining software?
R and WEKA are both extensible, and I recommend that you become familiar with both. But WEKA is much easier to get started with.
is the website. I also bought their book "Data Mining - Practical Machine Learning Tools and Techniques, 2nd edition" by Ian H. Witten & Eibe Frank. I recommend it highly, especially if you like the tool.
Partial Least Squares is available in SAS Enterprise Miner, and so is Principal Component Analysis, which you can use to reduce correlated predictor variables.
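To show what reducing correlated predictors with principal components means in practice, here is a minimal sketch in plain Python, not Enterprise Miner code. It finds only the first principal component by power iteration on the sample covariance matrix and replaces the predictors with a single score. The function names and the toy data are hypothetical, and the sketch assumes the starting vector is not orthogonal to the leading component.

```python
import math

def first_principal_component(X, iterations=200):
    """Return (loadings, column means) of the first principal component."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[v - m for v, m in zip(row, means)] for row in X]
    # Sample covariance matrix of the centered predictors
    cov = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration: repeated multiplication converges to the leading eigenvector
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in nxt))
        v = [c / norm for c in nxt]
    return v, means

def component_score(row, loadings, means):
    """Project one observation onto the component after centering."""
    return sum((x - m) * l for x, m, l in zip(row, means, loadings))

# Toy data (made up): two strongly correlated predictors
X = [[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9], [5, 5.1]]
loadings, means = first_principal_component(X)
scores = [component_score(row, loadings, means) for row in X]
print(loadings)
print(scores)
```

The two correlated columns collapse into one score with nearly equal loadings. Unlike PLS, the component is chosen without looking at the response, so it is not guaranteed to be the most predictive direction.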
As far as recommendations in SAS/STAT without Enterprise Miner go, maybe you can try PROC ROBUSTREG for interval-scaled response variables. It's basically an extension of PROC REG made to deal with outliers, which you probably have a lot of given the size of your database.
You can also try running survival analysis. SAS has quite a few procedures available, and the one I like most is the Cox proportional hazards model, which you can run using PROC PHREG. Unlike fully parametric survival procedures, it is semi-parametric, meaning it makes no distributional assumption about the baseline hazard (it does still assume proportional hazards). The downside is that I have yet to see the code needed to score it. It may still be useful to run it in order to examine the estimates, since it has a lot of advanced features that Logistic Regression lacks, such as accounting for late entry into the risk set (i.e. left truncation) and time-dependent covariates. I recommend a book written by Paul Allison if you're interested in this.
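For readers who want to see why the Cox model needs no distribution for the baseline hazard, the standard partial likelihood is written out below. The notation is an assumption of this note, not taken from the thread: $x_i$ is the covariate vector of customer $i$, $t_i$ the observed time, $\delta_i$ the event indicator (1 if the event was observed), and $e_j$ the entry time of customer $j$ (zero when there is no late entry). Ties in event times are ignored for simplicity.

```latex
L(\beta) \;=\; \prod_{i:\,\delta_i = 1}
  \frac{\exp\!\left(x_i^{\top}\beta\right)}
       {\sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right)},
\qquad
R(t_i) \;=\; \{\, j : e_j < t_i \le t_j \,\}
```

The baseline hazard cancels between numerator and denominator, so only $\beta$ has to be estimated. Left truncation enters only through the risk set $R(t_i)$: a customer counts as at risk at time $t_i$ only after entering and before leaving observation.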
If you have missing data, SAS has an excellent multiple imputation procedure, PROC MI.
Again, Paul Allison has written an excellent monograph on it.
Other than that, I would recommend Neural Networks and Decision Trees, but I haven't tried running these outside of Enterprise Miner, so I'm not sure whether you can run them directly in SAS/STAT.