AnalyticBridge

Social Network For Analytic Professionals

Hi,

I work as a marketing statistician at an online retailer. We use logistic regression to build a response model that decides which customers are worth mailing our catalog to. We usually have roughly 600,000 customers with 300 variables. So far, it has worked well compared to the RFM method.

Can anyone suggest other algorithms that might outperform logistic regression in some situations? Thank you so much for your reply.


Replies to This Discussion

Hi Yi-Chun,

It appears your goal is supervised learning. Have you tried a variable reduction technique that replaces the predictors with a smaller number of linear combinations of variables? You may want to try Partial Least Squares Discriminant Analysis (PLS-DA) for your classification problem, especially if there is any collinearity among the predictor variables. PLS-DA is increasingly used in bioinformatics data mining problems, where multicollinearity is common.

Cheers!
Tim


Hi, Timothy:
Thanks. I used clustering to group the variables and selected one variable from each group to reduce the number of variables. I will certainly try PLS-DA. How do you actually do that in SAS? Which procedure do I have to use?


Hi Yi-Chun,
The easiest way to do PLS-DA is to run PLS in R and then carry out discriminant analysis on the resulting scores. There is a package available for R. Please check the CRAN website.
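For readers who want to see the two steps spelled out, here is a minimal sketch in plain Python rather than R. It assumes a binary response coded 0/1, extracts only a single PLS component, and uses a simple midpoint cutoff between the class mean scores in place of a full discriminant analysis. The function names and the toy data are made up for illustration and do not come from any package.

```python
# One-component PLS-DA sketch (illustrative only, not a package API).
# Assumptions: y is coded 0/1, one PLS component is enough, and a midpoint
# cutoff between class mean scores stands in for discriminant analysis.

def fit_pls_da(X, y):
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center predictors and response.
    Xc = [[v - m for v, m in zip(row, x_means)] for row in X]
    yc = [v - y_mean for v in y]
    # PLS weight vector: direction of maximum covariance with the response.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0:
        raise ValueError("predictors have no covariance with the response")
    w = [v / norm for v in w]
    # Scores: each customer projected onto the weight vector.
    scores = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    s1 = [s for s, label in zip(scores, y) if label == 1]
    s0 = [s for s, label in zip(scores, y) if label == 0]
    cutoff = (sum(s1) / len(s1) + sum(s0) / len(s0)) / 2
    return {"x_means": x_means, "w": w, "cutoff": cutoff}

def predict_pls_da(model, row):
    centered = [v - m for v, m in zip(row, model["x_means"])]
    score = sum(a * b for a, b in zip(centered, model["w"]))
    return 1 if score > model["cutoff"] else 0

# Toy data: the first two predictors are nearly collinear.
X = [[1.0, 2.1, 0.5], [1.2, 2.3, 0.1], [0.9, 1.8, 0.4],
     [3.0, 6.1, 0.3], [3.2, 6.3, 0.6], [2.9, 5.9, 0.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_pls_da(X, y)
predictions = [predict_pls_da(model, row) for row in X]
```

Because the collinear predictors are folded into one score, no matrix inversion is needed, which is why the approach tolerates multicollinearity. A real analysis would extract several components and choose their number by cross-validation.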
Kind regards,
Tim


Hi, Timothy:
Thanks. Do you know of any papers or introductions to PLS-DA? This is actually the first time I have heard of it. Also, I haven't used R before, although I used S-Plus quite a few years back. Thanks.


Yi-Chun

I'm concerned that selecting one variable from each group loses useful information.
In addition to Tim's excellent suggestion, I would recommend that you try CART or C5.0 with boosting, and combine the results of the decision tree with logistic regression in a confidence-based voting scheme. You could also feed the cluster assignments from the "best" clustering solution into C5.0 and logistic regression as new input variables. That might really improve your results.
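"Confidence-based voting" can be set up in several ways. One simple reading, sketched below in Python, is to let whichever model is further from 0.5 decide each customer. The function name and the probabilities are hypothetical, and the sketch assumes both models output a response probability between 0 and 1.

```python
# Confidence-based voting sketch for two models (illustrative only).
# Assumption: "confidence" is the distance of a predicted probability
# from 0.5; the more confident model decides. Ties go to the tree.

def confidence_vote(p_tree, p_logit):
    """Return (label, probability) from the more confident model."""
    conf_tree = abs(p_tree - 0.5)
    conf_logit = abs(p_logit - 0.5)
    chosen = p_tree if conf_tree >= conf_logit else p_logit
    return (1 if chosen >= 0.5 else 0), chosen

# Hypothetical probabilities for three customers: (tree, logistic).
pairs = [(0.90, 0.40), (0.55, 0.10), (0.30, 0.45)]
votes = [confidence_vote(t, l) for t, l in pairs]
```

An alternative under the same assumptions is to average the two probabilities, possibly weighted by each model's validation accuracy. Either way, the combination should be checked on a holdout mailing.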

kind regards,
Ben Dickman
Central Connecticut State Univ.
www.ccsu.edu/datamining/


Hi, Ben:
Thank you very much for your input. Our company only has SAS/STAT, which does not offer CART or C5.0 with boosting. What can I do then? Do I have to use an open source package like R? Thanks.


Yi-Chun,

So you only have the SAS statistics, but no Data Mining software?
R and WEKA are both extensible, and I recommend that you become
familiar with both. But WEKA is much easier to get started with.

http://www.cs.waikato.ac.nz/ml/weka/

is the website. I also bought their book "Data Mining - Practical Machine Learning Tools and Techniques, 2nd edition" by Ian H. Witten & Eibe Frank. I recommend it highly, especially if you like the tool.

kind regards,
Ben Dickman
Central Connecticut State Univ.
www.ccsu.edu/datamining/


Partial Least Squares is available in SAS Enterprise Miner, and so is Principal Component Analysis, which you can use to reduce correlated predictor variables.

As for recommendations in SAS/STAT without Enterprise Miner, you could try PROC ROBUSTREG for interval-scaled response variables. It is basically an extension of PROC REG designed to deal with outliers, which you probably have a lot of given the size of your database.

You can also try running survival analysis. SAS has quite a few procedures available, and the one I like most is the Cox proportional hazards model, which you can run using PROC PHREG. Unlike other survival procedures, it is semi-parametric, meaning it makes no distributional assumption about the baseline hazard. The downside is that I have yet to see the code needed to score it. It may still be useful to run it in order to examine the estimates, since it has a lot of advanced features that are not available in logistic regression, such as accounting for late entry into the risk set (i.e. left truncation) and time-dependent covariates. I recommend a book written by Paul Allison if you're interested in this.
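On the scoring question, one part is straightforward once the coefficient estimates are available: under the Cox model the hazard is the baseline hazard times exp(x'beta), so customers can be ranked by relative risk without knowing the baseline. Below is a minimal Python sketch of that step. The variable names and coefficient values are invented for illustration and are not output from any real model.

```python
import math

# Relative-risk scoring sketch for a fitted Cox model (illustrative only).
# Assumptions: coefficients come from an already fitted model, there are no
# time-dependent covariates, and the reference profile defaults to all zeros.

def cox_relative_risk(coefficients, covariates, reference=None):
    """Hazard ratio of one customer versus a reference profile."""
    if reference is None:
        reference = {name: 0.0 for name in coefficients}
    linear_predictor = sum(
        beta * (covariates[name] - reference[name])
        for name, beta in coefficients.items()
    )
    return math.exp(linear_predictor)

# Hypothetical estimates and one hypothetical customer.
betas = {"recency_months": -0.05, "orders_last_year": 0.30}
customer = {"recency_months": 6.0, "orders_last_year": 2.0}
risk = cox_relative_risk(betas, customer)
```

This score only ranks customers. Turning it into a response probability by a given date also requires an estimate of the baseline survival function, which this sketch leaves out.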

If you have missing data, SAS has an excellent multiple imputation procedure, PROC MI.
Again, Paul Allison has written an excellent monograph on it.
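After the model has been fitted on each imputed data set, the per-data-set results still have to be combined. The usual approach is Rubin's rules, sketched below in Python under the assumption that each imputed data set yields one coefficient estimate and its squared standard error. The function name and the numbers are made up for illustration.

```python
# Rubin's rules sketch for pooling multiply imputed estimates
# (illustrative only). Assumption: one estimate and one variance
# (squared standard error) per imputed data set, with at least two sets.

def pool_estimates(estimates, variances):
    m = len(estimates)
    pooled = sum(estimates) / m                     # pooled point estimate
    within = sum(variances) / m                     # average within-imputation variance
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between          # total variance
    return pooled, total

# Hypothetical coefficient estimates from three imputed data sets.
pooled, total_variance = pool_estimates([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what makes the pooled standard error reflect the uncertainty caused by the missing values, which a single imputation would understate.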

Other than that I would recommend Neural Networks and Decision Trees but I haven't tried running these outside of Enterprise Miner so I'm not sure if you can run them directly in SAS/STAT.


Dear Mr. Tsai,
Try using decision trees (CART).
Could you share your data with me so that I can have a look at it?

All my best regards


Hi,
I can't share the company's data with you. However, thank you for sharing the idea with me.



© 2010   Created by Vincent Granville
