
All the hard work we put into the “model” on the right-hand side of the model equation is only as accurate as the dependent variable was, to start with, in reflecting the business problem at hand. Yet no statistics class I ever took said the first word about the dependent variable and, in practice as well, it is often taken “AS IS” with all that that implies.

 

Since dependent variable definitions are highly situation-specific, I think it would help us all to contribute our own stories about good things we’ve done when defining the “model’s goal,” and let everyone take away what they can for their own problems.

 

Some of my own stories:

 

No Dependent Available

Targeting for a new model car: I once worked on a targeting project for a new car that had, at that point, never been sold. In other words, there was NO sales history. A group of managers and I judgmentally determined how similar the new car was to competitive cars that did have a sales history, and then we modeled that similarity-based sales history as the dependent. The mix of science and judgment worked quite well in predicting new sales.

 

Bad Dependent Available

Customer Attrition: This is an area that is often modeled poorly, because the initial temptation is to take everyone within a time period and define ALL those who later leave the company as the attritors. While this sounds okay at first pass, it works poorly because many customers don’t just leave; they phase out little by little. The problem is that a model that says everyone who has quickly drawn down their bank balance to $5 will soon leave the bank isn’t very insightful and, more importantly, it comes too late. A better definition is to count all these ghost accounts as another form of attrition. This is sometimes resisted by modelers because it will drop their “stated accuracy” (R2 or whatever) like a rock, but it is clearly more useful for the business to offer an actionable prediction with a low R2 than a prediction with a high R2 that can’t be used.
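As a rough illustration of the two definitions, here is a minimal Python sketch. The account fields and the 10% draw-down threshold are hypothetical assumptions chosen for the example, not details from the original project.

```python
from dataclasses import dataclass

@dataclass
class Account:
    account_id: str
    closed: bool          # account formally closed during the outcome window
    start_balance: float  # balance at the start of the window
    end_balance: float    # balance at the end of the window

def naive_attrition(acct: Account) -> int:
    """Dependent = 1 only if the customer formally left."""
    return int(acct.closed)

def broad_attrition(acct: Account, ghost_ratio: float = 0.10) -> int:
    """Dependent = 1 if the customer left OR the account became a 'ghost'.

    A ghost is an open account drawn down below ghost_ratio of its starting
    balance. ghost_ratio is an assumed threshold to be agreed with the business.
    """
    if acct.closed:
        return 1
    if acct.start_balance > 0 and acct.end_balance < ghost_ratio * acct.start_balance:
        return 1
    return 0

accounts = [
    Account("A", closed=True, start_balance=2000.0, end_balance=0.0),
    Account("B", closed=False, start_balance=3000.0, end_balance=5.0),   # ghost account
    Account("C", closed=False, start_balance=1500.0, end_balance=1400.0),
]
naive_labels = [naive_attrition(a) for a in accounts]
broad_labels = [broad_attrition(a) for a in accounts]
```

Under the naive definition account B counts as retained; under the broader one it is an attritor, which moves the event to be predicted back to a point where the business can still act.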

 

Many Results but Little History

Soccer Modeling: I modeled European soccer outcomes for betting purposes for several years. One of the challenges was that the primary betting result (Home Win / Draw / Away Win) had very little granularity, but the predictions had to be very accurate in order to beat the odds consistently enough to make money over the long run. One thing that helped a lot was to do a multi-stage estimation: first you estimated each team's ability in terms of shots taken, corners, fouls, cards, etc., and then you used those estimates as predictors of who would win. It was an effective way to take advantage of both game history and the data structure to get more finely tuned results.
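The two-stage idea can be sketched as follows in Python. The match history is made up, and simple averages of shots stand in for whatever first-stage estimator was actually used; the second-stage outcome model (for example, a multinomial model of home win / draw / away win) is not shown.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical match history: (home, away, home_shots, away_shots)
history = [
    ("Reds", "Blues", 14, 8),
    ("Blues", "Greens", 10, 10),
    ("Greens", "Reds", 6, 12),
    ("Reds", "Greens", 16, 4),
]

# Stage 1: estimate each team's ability on a fine-grained statistic (shots).
shots_for = defaultdict(list)
shots_against = defaultdict(list)
for home, away, h_shots, a_shots in history:
    shots_for[home].append(h_shots)
    shots_against[home].append(a_shots)
    shots_for[away].append(a_shots)
    shots_against[away].append(h_shots)

ability = {
    team: {"attack": mean(shots_for[team]), "defence": mean(shots_against[team])}
    for team in shots_for
}

def expected_shot_margin(home, away):
    """Stage 2 predictor: expected home shots minus expected away shots."""
    exp_home = (ability[home]["attack"] + ability[away]["defence"]) / 2
    exp_away = (ability[away]["attack"] + ability[home]["defence"]) / 2
    return exp_home - exp_away

# This margin would be one input to the final win/draw/loss model.
margin = expected_shot_margin("Reds", "Blues")
```

The same pattern would be repeated for corners, fouls, and cards, giving the final model several continuous predictors instead of only the three-valued match result.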

 

PS. I’m looking for an analytic position in the Dallas, Washington DC, or North Carolina areas if you know someone who’s looking.

 

PLEASE SHARE YOUR OWN DEPENDENT VARIABLE STORIES

 


Replies to This Discussion

Hi David. I like your comments, and I agree with your assessment of the dependent variable challenges. As far as attrition is concerned, I also agree, which is why I think that it is important to develop a 'pattern of behaviour' for each customer before he or she attrites: perhaps define attrition as the point when he or she begins to 'deviate' from that behaviour rather than when the behaviour stops altogether. For example, if a customer usually buys a carton of milk every two days at the same store using the same credit card, a first red flag that he or she may attrite is a shift to buying only once a week. Clustering algorithms may be helpful in identifying customers according to 'behaviour patterns'. I have written a blog article about this.
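A minimal Python sketch of the 'deviation from pattern' flag, using the milk example. The function names, the two-gap recent window, and the factor of 2 are assumptions made for illustration only.

```python
from datetime import date
from statistics import median

def purchase_gaps(dates):
    """Days between consecutive purchases."""
    ordered = sorted(dates)
    return [(b - a).days for a, b in zip(ordered, ordered[1:])]

def deviates_from_pattern(dates, recent_n=2, factor=2.0):
    """Flag a customer whose recent purchase gaps are much longer than usual.

    Compares the median of the last recent_n gaps with the median of all
    earlier gaps; recent_n and factor are assumed tuning parameters.
    """
    gaps = purchase_gaps(dates)
    if len(gaps) <= recent_n:
        return False  # not enough history to establish a baseline
    baseline = median(gaps[:-recent_n])
    recent = median(gaps[-recent_n:])
    return recent >= factor * baseline

# Buys every two days, then slows to weekly.
slowing = [date(2010, 8, d) for d in (1, 3, 5, 7, 9, 11, 18, 25)]
# Keeps buying every two days.
steady = [date(2010, 8, d) for d in (1, 3, 5, 7, 9, 11, 13, 15)]

slowing_flag = deviates_from_pattern(slowing)
steady_flag = deviates_from_pattern(steady)
```

The flag date, rather than the date of the last purchase, would then serve as the attrition event in the dependent variable.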

Thanks
Tom, that sounds like good advice. Since I've cross posted the discussion, would it be okay with you if I reposted your comment on other analytic forums? I think the more we share the more we all benefit.
David, I am okay with that. But, please, if you post my advice anywhere else, please create a link to my original comments to you and/or somehow reference me as being the source of the comment with a link to me, okay?

However, no worries about the cross-posting....have a great long weekend, by the way!...I am just gearing up for Labour Day weekend here in Canada.....
Tom, thanks for the okay. I've listed you as the author in any case, but the link idea sounds like a nice touch that I'll add to all the cross postings. I'll repost a batch of responses in the coming week.

Hi David,

 

I am working on a project to predict prospective customers for a new car. The automobile company has 6 million customer records in its database covering 2 years (mostly blank as well). How can I predict the prospective customers? Initially I planned to build a logistic model for that. Please share your thoughts.

 

Thanks,

Ravi

REPOSTED WITH PERMISSION OF THE CONTRIBUTOR

Group: Predictive Modeling, Data Mining, Actuary / Actuarial and Statistics
Group Discussion: The Most Important and Least Thought about Variable: The Dependent

Well spoken, sir! While I'm not a real statistician (I'm a programmer), I've certainly seen my share of DPV problems. One key to avoiding them is for the analyst and (especially) his data master to take nothing for granted and to always check things out for themselves, just as any good mechanic would. Data prep may not be glamorous, and your client/boss may regard it as something to be minimized, if not avoided altogether, but there really is no substitute for a good initial examination of the data before modeling begins, and a good, modelable dataset created in accordance therewith. I've yet to see a database that was designed for analysis. I'm sure they exist, but I suspect that they're rare.

Posted by John Ries


REPOSTED WITH PERMISSION OF THE CONTRIBUTOR

Group: The R Project for Statistical Computing
Discussion: The Most Important and Least Thought about Variable: The Dependent

I totally agree with the statement of the problem. Interesting how it's done in different fields. For example, in Organizational Psychology, where people frequently deal with courts, the "Criterion problem" is a well-recognized and widely discussed issue. What is "job performance"? How do we operationalize it in order to predict it?

Posted by Dimitri Liakhovitski



REPOSTED WITH PERMISSION OF THE CONTRIBUTOR

Group: Dallas R Users Group
Discussion: The Most Important and Least Thought about Variable: The Dependent

Great comments. I can remember one instance, when working for a client, in which there was a misunderstanding about what we were modeling. We decided on one dependent variable and the client wanted us to work on another. It's very important when working with clients to make sure that expectations and assumptions are clear up front.

Posted by Larry D'Agostino, P.E.


REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent

My experience has mostly been in the area of CPG but I have seen many of the problems you mentioned. A few more:

1. Trying to use one definition for multiple decision issues. An example from category management: these analyses usually start with "estimate the average weekly sales of the product." However, to allocate shelf space, you need to consider the average weekly sales from the shelf area and ignore the contribution from secondary locations such as end-aisle displays. To determine whether the product should be included in the assortment mix or dropped, on the other hand, you need to include the secondary location sales.
2. Incomplete specification of the dependent. A typical CPG product has multiple UPCs, including holiday packs or "special packs" that for a brief period supplant the "regular item" on shelves (think Hershey's Kisses in red Valentine, orange Thanksgiving, or red and green Christmas wrappers that replace the regular product). If modeling category assortment at the UPC level, these must be properly handled.
3. Improper level of aggregation. We are often tempted to "model the data we have" and hope that we can contribute to understanding. However, the data we have is often either too aggregate (or occasionally too disaggregate) for the problem we are studying. I first encountered this in the study of the relation between advertising expenditures and market sales (or share) in the 60s, both with annual data and with bi-monthly Nielsen estimates. Over the years there have been several studies of aggregation bias in price/promotion models conducted on market-level models that aggregate over multiple retail chains. Now I see some examples coming up with people trying to model individual household loyalty marketing data to understand retail sales, when (in my judgement) a higher level of aggregation is called for.
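The first point in the list above can be made concrete with a small Python sketch: the same sales rows give two different dependents depending on the decision. The data, the location names, and the function are hypothetical.

```python
from collections import defaultdict

# Hypothetical weekly rows: (product, week, location, units sold)
sales = [
    ("cocoa", 1, "shelf", 40), ("cocoa", 1, "end_aisle", 60),
    ("cocoa", 2, "shelf", 44), ("cocoa", 2, "end_aisle", 0),
]

def avg_weekly_units(rows, locations=None):
    """Average weekly units per product, optionally restricted to some locations.

    Weeks with no rows at all for the chosen locations are left out of the average.
    """
    weekly = defaultdict(int)  # (product, week) -> units
    for product, week, location, units in rows:
        if locations is None or location in locations:
            weekly[(product, week)] += units
    per_product = defaultdict(list)
    for (product, _week), units in weekly.items():
        per_product[product].append(units)
    return {p: sum(u) / len(u) for p, u in per_product.items()}

# Dependent for the shelf-space decision: shelf sales only.
shelf_dependent = avg_weekly_units(sales, locations={"shelf"})
# Dependent for the keep-or-drop decision: all locations.
assortment_dependent = avg_weekly_units(sales)
```

In this made-up case the product averages 42 units a week from the shelf but 72 units a week overall, so using one number for both decisions would mislead at least one of them.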

Posted by John Totten


REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent

@David,
Regarding "No statistics class I ever took said the first word about the dependent variable and, in practice as well, it is often taken “AS IS” with all that that implies," I must point out that your experience does in fact constitute a sample of 1. When I was teaching graduate-level statistics and methods (and the latter is crucial in this regard), the concepts of measurement theory, measurement error, and the consequences for the models of errors in both independent and dependent variables were emphasized quite heavily. (Also a sample of 1.) Now, it is still the case that we are often forced to model what are at best imperfect -- and oftentimes ludicrous -- outcome measures.

Posted by David Mangen
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent

Another problem is using a dependent variable that is merely correlated with the variable you are really researching, or that has a "hidden" relationship with the real variable. This is usually done as a matter of convenience. Take homicides, for example: the homicide count is itself often used incorrectly as a dependent variable. It often shows up in studies regarding gun control and can be used to prove either argument. It usually needs a moderating variable to explain it.

-Ralph Winters
Posted by Ralph Winters


REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent

The most I ever heard regarding the dependent variable had to do with issues of its distribution and scale (e.g., interval, normally distributed in the case of regression). And we often fail to sufficiently evaluate the psychometric properties of our outcome measures.

Posted by Barth Riley
I just want to say I think the whole discussion is massively interesting, as much for the way it is capturing attention on LinkedIn as on Analytic Bridge. That is important because it should be our mission to share and communicate quantitative ideas right now; they are the future, I think.
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent

Thanks for your thoughts on the dependent variable. The dependent variable indeed needs a careful definition and must reflect the target of the study and the business objectives. Here are a few experiences that we have had.

In credit risk, the definition of the dependent variable is critical. If you use a late definition and include a lot of intermediate credit risk indicators as predictors, the models will only surface what is already obvious.

In retention management, it is very important that you define the dependent variable exactly and in line with the business requirement. If you use reduced engagement, for example, then you are including a lot of people who have reduced engagement for reasons other than disinterest or dissatisfaction.

In direct marketing, you have to run several models for different dependent variables through the various stages of the acquisition process to understand engagement and conversions throughout.

Posted by Meduri Ravi Kumar


REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Analytics, Predictive Modeling & Statistical Analyses Professionals Group
Discussion: The Most Important and Least Thought about Variable: The Dependent

In Medicine, we worry quite a bit about the dependent variable. Check the literature on surrogate outcomes.

Steve Simon, www.pmean.com

Posted by Steve Simon


REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Business Analytics, Data Mining and Predictive Modeling
Discussion: The Most Important and Least Thought about Variable: The Dependent

The dependent variable in statistics is analogous to the labels in supervised learning, where we have training cases that are by and large labeled. Your experience of having to devise your own dependent variable resembles experiences in data mining where at times we have to artificially duplicate training cases in less represented classes to avoid training bias. The process of labeling training data is laborious and expensive, and at times we have to devise quick methods of labeling in order to develop good models, or have to contend with unsupervised learning techniques or additive learning.
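For readers unfamiliar with duplicating cases in less represented classes, here is a minimal Python sketch of random oversampling. The function name, the seed, and the toy data are assumptions for illustration; this is one simple variant, not a recommendation of a specific method.

```python
import random
from collections import Counter

def oversample(cases, label_of, seed=0):
    """Duplicate randomly chosen cases from smaller classes until every
    class has as many cases as the largest one."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    by_label = {}
    for case in cases:
        by_label.setdefault(label_of(case), []).append(case)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k may be 0).
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Toy training set: (features, label), with 'churn' under-represented.
training = [((i,), "stay") for i in range(8)] + [((100,), "churn"), ((101,), "churn")]
balanced = oversample(training, label_of=lambda case: case[1])
class_counts = Counter(label for _features, label in balanced)
```

Duplicated cases carry no new information, so any accuracy measured on them will be optimistic; evaluation should use data that was held out before oversampling.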

Posted by Ernest Mugambi


REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent

David,

One of the things that this thread has reminded me of is a structured process that can be used to develop measures, either independent or dependent variables. Long ago I wrote some academic pieces regarding this process, but the intellectual origins (at least to the best of my knowledge) stem from some work done in the mid-1960s by Murray Straus. He coined the phrase, "the rational approach to measurement," to refer to a deductive logical model by which one can conceptually explicate what in fact is meant by any measure that you intend to create.

To briefly illustrate, the process entails starting out at the highest level, and then drilling down and laying out the different dimensions and facets that are applicable to the measure. So, the concept of customer loyalty might start by having behavioral and psychological dimensions. In a multi-tiered organization, you might overlay this with the different divisions of the company -- let's say for this example that the client company (whose loyalty is being assessed) has three different divisions. But you also can take into account the sponsoring organization's divisional structure, which for the sake of this discussion we'll assume has four different divisions.

This produces a 2 x 3 x 4 matrix where you might begin to look for or develop indicators appropriate for each cell of the matrix, and that your total loyalty index might be some combination of these different sub-indices. In some cases, a cell may be a structural zero -- logically impossible -- at which point it can be ruled out.

I certainly do not intend this brief example to constitute a recommendation for how I believe customer loyalty should be measured. What I hope is that it illustrates the logical process that can be used to develop a measure. For what it is worth, when I have used this approach in survey-research related endeavors I have typically found that the psychometric properties of the measures that I develop are quite good. In essence, the logical process forces you to carefully identify what is distinct to each cell of the matrix, and where the confusion exists across cells of the matrix. The process also lends itself quite readily to confirmatory modeling procedures when it comes time to do the statistical analysis of the data.
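The 2 x 3 x 4 layout can be enumerated in a few lines of Python. The division names, the choice of structural zero, and the unweighted mean used to combine sub-indices are all hypothetical placeholders for this sketch.

```python
from itertools import product
from statistics import mean

dimensions = ["behavioral", "psychological"]
client_divisions = ["client_1", "client_2", "client_3"]
sponsor_divisions = ["sponsor_1", "sponsor_2", "sponsor_3", "sponsor_4"]

# Every cell of the 2 x 3 x 4 matrix.
cells = list(product(dimensions, client_divisions, sponsor_divisions))

# Hypothetical structural zero: a combination that is logically impossible.
structural_zeros = {("behavioral", "client_3", "sponsor_4")}
measurable_cells = [c for c in cells if c not in structural_zeros]

def loyalty_index(sub_indices):
    """Unweighted mean of the sub-index scores available for measurable cells.

    This is only one possible way to combine the sub-indices.
    """
    scores = [sub_indices[c] for c in measurable_cells if c in sub_indices]
    return mean(scores)

# Made-up sub-index scores on a 0 to 1 scale.
example_scores = {cell: 0.5 for cell in measurable_cells}
example_scores[("psychological", "client_1", "sponsor_1")] = 1.0
index = loyalty_index(example_scores)
```

Listing the cells explicitly makes it easy to see which ones still lack an indicator and which have been ruled out.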

Posted by David Mangen



REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent

Straus, Murray A., 1964 "Measuring families." Pp. 335-400 in Harold T. Christensen (Ed.), Handbook of Marriage and the Family. Chicago:Rand McNally.

Posted by David Mangen
This discussion has turned very, very interesting. Hopefully my post on Link Analysis tomorrow will generate as much discussion....
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Business Analytics, Data Mining and Predictive Modeling
Discussion: The Most Important and Least Thought about Variable: The Dependent

David,

Your original question applies only to so-called supervised learning/modeling, since unsupervised modeling such as clustering does not have a dependent variable.

I have been building predictive models for over a decade. True, no class will teach you how to do dependent variables. Your value, as an employee of some kind, lies in how you define the dependent variable so that it solves the business problem at hand. In a typical modeling project, more than 80% of the time, at least in my experience, goes into hashing out the definition of the model universe, which is to say building the dependent variable. Once the universe is built and signed off by your customers, modelers simply start to 'process the model'. Defining the dependent variable is forever an art. Your brain, equipped with all the technical skills, should serve as the intersection of all the important paths: business people, data managers, technology people, vendors and, needless to say, the almighty senior management. It is your job to design a dish that makes everybody as happy as possible.

Posted by Jia (Jason) Xin

© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC