
All the hard work we put into the “model” on the right-hand side of the model equation is only as accurate as the dependent variable was, to start with, in reflecting the business problem at hand. Yet no statistics class I ever took said the first word about the dependent variable, and in practice it is often taken “AS IS,” with all that that implies.

 

Since dependent variable definitions are highly situation-specific, I think it would help us all to contribute our anecdotal stories about good things we’ve done when defining the “model’s goal” and let everyone take away what they can for their own problems.

 

Some of my own stories:

 

No Dependent Available

Targeting for a new model car: I once worked on a targeting project for a new car that had, at that point, never been sold. In other words, there was NO sales history. A group of managers and I judgmentally determined how similar the new car was to competitive cars that did have a sales history, and then we modeled that similarity-based sales history as the dependent. The mix of science and judgment worked quite well in predicting new sales.

 

Bad Dependent Available

Customer Attrition: This is an area that is often modeled poorly, because the initial temptation is to take everyone within a time period and define ALL those who later leave the company as the attritors. While this sounds okay at first pass, it works poorly because many customers don’t just leave; they phase out little by little. The problem is that a model that says everyone who has quickly drawn down their bank balance to $5 will soon leave the bank isn’t very insightful and, more importantly, comes too late. A better definition is to count all these ghost accounts as another form of attrition. This is sometimes resisted by modelers because it will drop their “stated accuracy” (R² or whatever) like a rock, but it is clearly more useful for the business to offer an actionable prediction with a low R² than a prediction with a high R² that can’t be used.
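To make the two definitions concrete, here is a minimal sketch of how the dependent could be coded either way. The `Account` fields, the function names, and the $5 ghost floor are illustrative assumptions, not details of any real project; the only point is that the broader label also flags accounts that were drained but never formally closed.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # All field names are hypothetical, chosen for illustration only.
    account_id: str
    start_balance: float  # balance at the start of the outcome window
    end_balance: float    # balance at the end of the outcome window
    closed: bool          # True if the account was formally closed


def naive_label(acct: Account) -> int:
    """Attritor = formally closed account only."""
    return 1 if acct.closed else 0


def broad_label(acct: Account, ghost_floor: float = 5.0) -> int:
    """Attritor = closed account OR ghost account.

    A ghost account here is one that started above `ghost_floor` and was
    drawn down to `ghost_floor` or less without being closed. The floor
    value is an assumption and would need to be set from the data.
    """
    is_ghost = acct.start_balance > ghost_floor and acct.end_balance <= ghost_floor
    return 1 if (acct.closed or is_ghost) else 0


accounts = [
    Account("a", 2000.0, 0.0, True),     # closed outright
    Account("b", 1500.0, 5.0, False),    # drained but still open
    Account("c", 1000.0, 950.0, False),  # active
]
for acct in accounts:
    print(acct.account_id, naive_label(acct), broad_label(acct))
```

With a label like this, the predictors would be measured before the drawdown starts, so the model has a chance to flag the account while something can still be done.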

 

Many Results but Little History

Soccer Modeling: I modeled European soccer outcomes for betting purposes for several years. One of the challenges was that the primary betting result (Home Win / Draw / Away Win) had very little granularity, but the predictions had to be very accurate in order to beat the odds consistently enough to make money over the long run. One thing that helped a lot was to do a multi-stage estimation: first you estimated each team’s ability in terms of shots taken, corners, fouls, cards, etc., and then you used those estimates as predictors of who would win. It was an effective way to take advantage of both game history and the data structure to get more finely tuned results.

 

PS. I’m looking for an analytic position in the Dallas, Washington DC, or North Carolina areas if you know someone who’s looking.

 

PLEASE SHARE YOUR OWN DEPENDENT VARIABLE STORIES

 


Replies to This Discussion

Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent

I think this is a great topic, and even more so I agree that the Y is often the most disregarded element. One approach I have used is to modify it (with respect to the business need) by converting it to a different level of granularity or by exploring a natural hierarchy that the variable exhibits. For example, a buying amount can be decomposed into a model for buying and a model for buying amount, etc. Additionally, in such situations the continuous variable might be binned to stabilize the process while still being actionable.
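As a rough illustration of that decomposition, the sketch below splits the buying amount into a probability of buying and an average amount among buyers, then recombines them as expected amount = P(buy) x E(amount | buy). The per-segment averages stand in for the two fitted models, and the segment names and record layout are made up for the example.

```python
from collections import defaultdict


def two_part_estimates(records):
    """Estimate P(buy), E(amount | buy) and their product per segment.

    `records` is an iterable of (segment, amount) pairs, where an amount
    of 0 means the customer did not buy. This layout is an assumption.
    """
    customers = defaultdict(int)
    buyers = defaultdict(int)
    spend = defaultdict(float)
    for segment, amount in records:
        customers[segment] += 1
        if amount > 0:
            buyers[segment] += 1
            spend[segment] += amount

    estimates = {}
    for segment, n in customers.items():
        p_buy = buyers[segment] / n
        avg_if_buy = spend[segment] / buyers[segment] if buyers[segment] else 0.0
        estimates[segment] = {
            "p_buy": p_buy,                         # part 1: did they buy?
            "avg_amount_if_buy": avg_if_buy,        # part 2: how much, given a purchase
            "expected_amount": p_buy * avg_if_buy,  # recombined target
        }
    return estimates


records = [("new", 0), ("new", 40), ("new", 0), ("new", 60),
           ("loyal", 100), ("loyal", 0)]
for segment, est in two_part_estimates(records).items():
    print(segment, est)
```

In a real project each part would be its own model (for instance a classifier for the purchase and a regression on buyers only), which lets each one use the predictors that matter for that step.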

David: I'm kind of curious about your soccer modeling. Being from Europe, it strikes a chord ;) Can you share more detail, if that's possible?

Posted by Georgi Georgiev

Georgi,

I stopped betting professionally a couple of years ago when the largest betting exchange, Betfair, opened a sister company in Malta that bet its own money against the exchange participants like a bookmaker does. This seemed to coincide with a vast improvement in the accuracy of the betting prices/odds and didn't leave me enough margin to cover the 3-5% commission on winning bets. If you paid for the more detailed Carling Opta data perhaps you could still out-predict them, but I didn't want to up the stakes just when everything was going sour. I'd made 289% on funds in 2004, but I think those days are gone forever.

That said, the models were systems of non-linear equations that had the basic form (Team A Strength less Team B Strength) = Team A's # of shots taken at goal. You had to estimate these strengths across games because, of course, all the estimates were interdependent; i.e., it's easier to get a shot at Wolverhampton's goal than it is to get one at Man United's. Step two was to take everyone's strengths on different game attributes and use those as the predictors in an equation to predict the win. There was a lot more to it than this, but that was the basic strategy.
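For readers who want to see the shape of step one, here is a deliberately simplified sketch. It is not the original system: the real models were non-linear, whereas this assumes a plain linear form, shots by A at B's goal = attack strength of A minus defensive strength of B, and fits it by alternating least squares. The function name, the team labels, and the shot counts are all invented for the example.

```python
def fit_shot_strengths(games, sweeps=200):
    """Fit attack and defense strengths from shot counts.

    `games` is a list of (attacking_team, defending_team, shots) tuples.
    Assumed model (a linear simplification): shots = attack[a] - defense[d].
    Each sweep solves for all attack strengths with defense held fixed,
    then for all defense strengths with attack held fixed. Strengths are
    only identified up to a common shift, so only differences matter.
    """
    teams = sorted({t for a, d, _ in games for t in (a, d)})
    attack = {t: 0.0 for t in teams}
    defense = {t: 0.0 for t in teams}
    for _ in range(sweeps):
        for t in teams:
            rows = [(d, s) for a, d, s in games if a == t]
            if rows:
                attack[t] = sum(s + defense[d] for d, s in rows) / len(rows)
        for t in teams:
            rows = [(a, s) for a, d, s in games if d == t]
            if rows:
                defense[t] = sum(attack[a] - s for a, s in rows) / len(rows)
    return attack, defense


# Invented shot counts: (team shooting, team defending, shots at goal).
games = [
    ("ManU", "Wolves", 14), ("ManU", "Spurs", 12),
    ("Wolves", "ManU", 4), ("Wolves", "Spurs", 6),
    ("Spurs", "ManU", 7), ("Spurs", "Wolves", 11),
]
attack, defense = fit_shot_strengths(games)
for a, d, shots in games:
    print(a, "at", d, "observed", shots, "fitted", round(attack[a] - defense[d], 2))
```

Because every team's attack estimate depends on the defenses it faced, and the reverse, the strengths have to be solved jointly across games, which is the interdependence described above. The fitted strengths for shots, corners, and so on would then feed the second-stage equation that predicts the result.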
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent

In my experience, the dependent variable has to be formulated with respect to the business issue being researched. Besides this and the usual challenges of data availability, time, organizational dynamics and cost, I think the issue of level vs. % change vs. change is an area of challenge. Regardless of the metric (e.g. sales), I find that using % change vs. level can produce different results, some subtle but some too important to overlook.
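A small made-up example shows how the three formulations can rank the same units differently. The store names and sales figures are invented, and `dependent_variants` is a hypothetical helper, not part of any library.

```python
def dependent_variants(sales):
    """Return level, change, and % change versions of a sales series.

    `sales` is a list of period totals in time order. The first period is
    dropped from all three outputs so they line up one-to-one.
    """
    pairs = list(zip(sales, sales[1:]))
    return {
        "level": [curr for _, curr in pairs],
        "change": [curr - prev for prev, curr in pairs],
        "pct_change": [(curr - prev) / prev for prev, curr in pairs],
    }


# Two invented stores over two periods.
store_a = dependent_variants([1000, 1100])
store_b = dependent_variants([100, 150])

for name in ("level", "change", "pct_change"):
    print(name, "A:", store_a[name], "B:", store_b[name])
```

Here store A leads on level and on absolute change while store B leads on % change, so a model trained on one formulation would be rewarding a different kind of store than a model trained on another.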

Posted by Shwetal Patel
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent

I agree that the dependent variable is at times loosely defined, but it is definitely not the least thought about. Most of the effort culminates when a business leader / stakeholder tracks the performance and calls for a corrective measure / improvement. In other words, the objective function is directly related to the business problem at hand.

Quantitative methods rely heavily on the correctness of data. In most cases, a closer look at the data is required to arrive at and agree upon the definition. If one were to model the likelihood of a customer being a 'Hit and Run' (using the credit card just once), what should the ideal wait period be: 3/6/9/12 months? It cannot be indefinite. We need to look at the data to make an assertive statement such as 'if a customer has not come back in 90 days, it's highly unlikely that he is ever coming back'. Of course it is possible that a few customers turn up later, but the impact would be meager.

There can only be an optimum definition, like an optimum solution in predictive modeling or optimization. The trade-off is necessary to tackle the bigger issue. A quick solution is required to mitigate losses / utilize the opportunity at the earliest. One can always revisit the implemented strategy and fine-tune it based on its performance. For example, one might find that new customers (early months on book) behave differently from older accounts; a segmented approach can be taken to handle such cases. It's a continuous improvement process. To reiterate my point, one can only hope for an 'optimum' solution and not a 'perfect' solution, at least when dealing with predictive analytics. Adequate validation (such as testing the model on an out-of-time sample) and simulation are essential to test the predictive power and expected impact of any solution.
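One simple way to look at the data for the wait-period question is sketched below: among customers who did eventually use the card a second time, what share had done so within each candidate window? The function names, the day-based windows, and the sample gaps are assumptions for illustration; in practice only customers observed for at least the longest window should be included, otherwise late returners are undercounted.

```python
def share_returned_by(gaps_in_days, windows=(90, 180, 270, 360)):
    """Share of eventual returners who came back within each window.

    `gaps_in_days` holds, per customer, the days between first and second
    card use, or None if no second use was observed.
    """
    returners = [g for g in gaps_in_days if g is not None]
    if not returners:
        return {w: 0.0 for w in windows}
    return {w: sum(g <= w for g in returners) / len(returners) for w in windows}


def smallest_window(shares, target=0.9):
    """Shortest candidate window capturing at least `target` of returners."""
    for window in sorted(shares):
        if shares[window] >= target:
            return window
    return None


# Invented data: ten customers returned, five never did.
gaps = [5, 12, 20, 33, 41, 58, 70, 77, 88, 240] + [None] * 5
shares = share_returned_by(gaps)
print(shares)
print("suggested wait period (days):", smallest_window(shares))
```

The target share is itself a business judgment: a lower target gives a faster but noisier 'Hit and Run' flag, which is the trade-off between a quick solution and a perfect one.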

Working on real-time business cases would help one grasp the nuances of technique in any course. I find 'on the job training' (OJT) the most effective way of learning. Of course, one needs to know the theory, or else the 'analytics' itself would seem to be a black box.

Posted by Vinodh Kumar

Excellent, thought-provoking article, thanks.


© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
