
'Opt-out' Customer Profiling Analysis

Inductive Decision Tree (IDT) or logistic regression analysis may be used to identify individuals or households who should NOT be the target of direct marketing campaigns, thus increasing response rates, lowering e-mail cost-per-click and increasing the return on investment (ROI) per responder. Conducting this analysis requires first collecting data on customers who do not wish to be directly marketed to. In Canada, a person may request that he or she be exempt from direct marketing campaigns in one or more of the following ways:

- 'opting out' of an e-mail marketing campaign
- returning a hand-delivered letter marked 'Do Not Mail' or 'Return to Sender' via Canada Post
- contacting a customer call centre representative and asking to be exempt from future marketing campaigns
- registering his or her name on the Government of Canada's 'National Do Not Call' list

A data analyst may use one or more of the above data sources to create a boolean 'Contact?' field for each customer record, where each record is uniquely identified by an e-mail address, phone number or mailing address. When e-mail, phone number and full mailing address are all missing, the FSA (Forward Sortation Area, the first three characters of a Canadian postal code) may be used as the unit of analysis instead. In this case, the number of 'Contact?' = 'Yes' instances may be calculated for each FSA, and a calculated index variable, 'Contact?_FSA_Index', can serve as the target variable to be profiled.
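The FSA roll-up can be sketched in a few lines of Python. This is only an illustration: the post does not define how 'Contact?_FSA_Index' is calculated, so the sketch assumes a common convention, 100 times the FSA's 'Yes' rate divided by the overall 'Yes' rate. The function name and the sample postal codes are hypothetical.

```python
from collections import defaultdict

def contact_fsa_index(records):
    """Return {fsa: index} for (postal_code, is_yes) pairs.

    Assumed definition (not from the post):
    index = 100 * (FSA 'Yes' rate) / (overall 'Yes' rate).
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for postal_code, is_yes in records:
        # FSA = first three characters of the postal code
        fsa = postal_code.replace(" ", "").upper()[:3]
        total[fsa] += 1
        yes[fsa] += int(is_yes)
    overall_yes = sum(yes.values())
    if overall_yes == 0:
        raise ValueError("no 'Yes' records, so the index is undefined")
    overall_rate = overall_yes / sum(total.values())
    return {fsa: 100 * (yes[fsa] / total[fsa]) / overall_rate for fsa in total}

# Hypothetical records: (postal code, True when 'Contact?' is 'Yes')
records = [
    ("M5V 2T6", True), ("M5V 1J1", True), ("M5V 3A8", False), ("M5V 0A1", True),
    ("K1A 0B1", True), ("K1A 0A6", False), ("K1A 0H3", False), ("K1A 0K2", False),
]
index = contact_fsa_index(records)
```

An index above 100 marks an FSA where 'Yes' is more common than in the file as a whole, and an index below 100 marks one where it is less common.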

At this point, an important decision must be made: will the analysis try to answer the general question, 'Who does not want to be directly marketed to by us via any channel?' or will three different analyses be conducted for e-mail, regular mail and telemarketing respectively?

For our purposes, let's focus on identifying individuals who have opted out of e-mail marketing campaigns and compare them to a sample of customers who have not. A stratified sampling technique will ensure an adequate number of 'Yes' and 'No' values for the 'Contact?' variable if e-mail opt-out customers are a significantly smaller segment.

Behavioural variables such as websites visited, ads clicked on, and time spent on the web may be used to profile the difference between the 'Yes' and 'No' Contact? groups. Depending on the e-mail provider, detailed information for each accountholder may also be available, such as gender, age and postal code.
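As a rough sketch of the stratified sampling step, the Python below draws the same number of records from each group so that the smaller opt-out group is not swamped. The `opted_out` field, the group sizes and the function name are assumptions made for illustration. How `opted_out` maps onto 'Yes' or 'No' depends on how 'Contact?' is coded in your data.

```python
import random

def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    strata = {}
    for rec in records:
        strata.setdefault(key(rec), []).append(rec)
    sample = []
    for members in strata.values():
        k = min(per_stratum, len(members))  # never ask for more than exists
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical file: 1,000 customers, 50 of whom (5%) opted out.
customers = [{"id": i, "opted_out": i % 20 == 0} for i in range(1000)]
sample = stratified_sample(customers, key=lambda r: r["opted_out"], per_stratum=50)
```

Here the sample holds all 50 opt-outs plus 50 randomly chosen non-opt-outs. Because the sample is balanced rather than representative, rates measured on it are not population rates.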

At the very least, FSA should be a standard field of information tagged onto every e-mail customer's record before he or she is marketed to: it should be kept populated by e-mail marketers. Demographic, attitudinal and psychographic data from Generation5, Environics Analytics and MapInfo may be appended onto customer records which contain a valid FSA.

An Inductive Decision Tree built with the CHAID algorithm can then help profile the significant differences (if any) between customers who have opted out of e-mail marketing campaigns and those who have not. Segments containing e-mail accountholders where the probability of an e-mail opt-out is, for example, 0.8 or greater should be removed from future e-mail marketing lists.
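The suppression rule can be sketched as follows, assuming each customer has already been assigned to a segment (for instance, a leaf of the tree). The segment labels, field names and function names are hypothetical, and the sketch assumes the opt-out rates are measured on a file that reflects the real customer mix, not on an oversampled training file.

```python
from collections import defaultdict

def high_optout_segments(history, threshold=0.8):
    """Return the segments whose observed opt-out rate is >= threshold.

    `history` is an iterable of (segment, opted_out) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [opt-outs, total]
    for segment, opted_out in history:
        counts[segment][0] += int(opted_out)
        counts[segment][1] += 1
    return {seg for seg, (outs, total) in counts.items() if outs / total >= threshold}

def suppress(mailing_list, segments_to_drop):
    """Keep only the customers whose segment is not being suppressed."""
    return [c for c in mailing_list if c["segment"] not in segments_to_drop]

# Hypothetical history: segment A opts out 90% of the time, segment B 20%.
history = [("A", True)] * 9 + [("A", False)] + [("B", True)] * 2 + [("B", False)] * 8
drop = high_optout_segments(history)

mailing_list = [
    {"email": "a@example.com", "segment": "A"},
    {"email": "b@example.com", "segment": "B"},
]
kept = suppress(mailing_list, drop)
```

With these numbers only segment A crosses the 0.8 threshold, so only the segment B customer stays on the list.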


If the model holds true, future e-mail marketing campaigns should generate a higher response rate.


Tags: analysis, call, customer, datamining, do, list, not, opt-out, profiling


Comment by Tom Wolfer on August 23, 2010 at 3:20pm
Arun, some very, very interesting thoughts here. I will do my best to address them based on what I know and what I would be doing with the results.

1. I can't really comment on whether you would get better model results if you attempt to predict NR vs. R. The only way to know this for sure is to try the model both ways, shaping and using the data that you have available. I know that this may not help much; however, I don't want to mislead.

However, to your point about the size of the R vs. NR samples and how it impacts your decision on who to model, I can say this. Even if I do have a small sample (say 5,000 NRs) and a large sample (say 100,000 Rs), I can still choose to model NRs. I may generate a random sample of 5,000 Rs and compare this subset to the full set of NRs. In this case, I could not use the absolute probabilities that are generated by a CHAID IDT. However, at least with an IDT, I could isolate segments in which the probability of an NR was statistically increased based on certain characteristics: e.g. age, income, average # e-mail logins per week, etc. I think the same could also be said for a logistic regression. For example, we might not be able to say that, in absolute terms, we can isolate a population in which 80% will be Rs or NRs. However, I do believe that, using your decile solution, we could still very effectively isolate the deciles in which the probability of finding an R or NR in the customer population is RELATIVELY much higher than in the 50/50 split of the sample that we used to develop the model. If, for example, the highest relative NR population probability is contained in deciles 1-3 as a result of our model, then we could select the entire set of customers who fell within deciles 1-3 when the model was scored to the full customer set. Does this make sense? I am not as well schooled in logistic regression as I used to be: I have not used it a lot lately.
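The relative-ranking idea above can be sketched in Python. This is only an illustration under assumed inputs: the scores stand in for whatever a model trained on the balanced sample produces when scored on the full customer set, and the function names and toy data are hypothetical. Lift here means the NR rate in a decile divided by the overall NR rate, so it uses the scores only for ordering, not as absolute probabilities.

```python
def assign_deciles(scores):
    """Return a decile per score, in input order (1 = highest, 10 = lowest)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n + 1
    return deciles

def lift_by_decile(deciles, is_nr):
    """Return {decile: NR rate in the decile / overall NR rate}."""
    overall = sum(is_nr) / len(is_nr)
    if overall == 0:
        raise ValueError("no NRs in the scored file, so lift is undefined")
    lift = {}
    for d in range(1, 11):
        flags = [flag for dec, flag in zip(deciles, is_nr) if dec == d]
        if flags:
            lift[d] = (sum(flags) / len(flags)) / overall
    return lift

# Toy scored file: 20 customers; the five highest-scoring ones are NRs.
scores = [i / 20 for i in range(20)]
is_nr = [i >= 15 for i in range(20)]

deciles = assign_deciles(scores)
lift = lift_by_decile(deciles, is_nr)

# Select everyone in deciles 1-3, as in the example above.
selected = [i for i, d in enumerate(deciles) if d <= 3]
```

In practice the lift table would be checked on a holdout file that reflects the real R:NR mix before deciding which deciles to act on.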

3. As for your question about whether to use CHAID vs. logistic? Well, I was always taught to use one as a check against the other to see which is more accurate. However, the great thing about IDTs is that their results are very easy to understand and explain, and this is very important when attempting to get business managers to understand the value of the result, so that they will implement the model. And this, after all, is what I see as success: model implementation!

Comment by Arun on August 23, 2010 at 10:46am
Tom,

Thanks for this amazing post, it's got me thinking on some stuff I've been wanting to understand lately on Logistic Regression.

I'll try to pen down my thoughts here...

In a 'Response Modeling' framework, where we use a logistic regression to predict a responder, we end up with a model that has a 'lift' over the random model. In essence, we have improved the selection of customers to ensure we capture more of the responsive customers (and, conversely, fewer of the non-responsive customers), which I think is what the opt-out model gets at.

My questions -
1. Both models are going to increase my response rate, but which one would be better suited?
2. Am I better off trying to predict a responder or a non-responder?

The ratio of responders to non-responders makes a difference in the probability estimates. So, is there a situation where I can use one method over the other? For example, when I have R:NR = 5%:95%, I can choose to predict 'R', since I can then still use the top 6-7 deciles for targeting, thereby removing the likely NRs.

3. When I have far fewer Rs than NRs (or vice versa), modeling for R (or NR) tends to predict NR with better accuracy. This seems to be the case since 95% of the data (in the 5:95 case) will be a '0', hence misclassification is greatly reduced.
Why is this so? Is there a way to improve prediction accuracy and reduce the misclassification rate (ROC curves not to be used)?
I've heard that running a CHAID before the logistic regression is a good indicator and booster of the model's performance.

There's a lot of thought involved, and I'm currently reading up a lot and discussing where possible! :)
Any help would be much appreciated and mutually helpful!

Thanks,
Arun


© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
