AnalyticBridge

Social Network For Analytic Professionals

In large datasets, reducing the sheer number of variables to simplify a model can be a cumbersome task, often calling on both the art and the science of data mining at once. Simplifying a model in development carries the additional challenge of ‘explainability’, which falls squarely on the shoulders of the analyst or analytic team. For the second problem there are rarely easy answers, but what about the first? The problem of initially choosing variables (among many) for input into a developing model has been filled with controversy (take, for example, stepwise regression). It arguably boils down to both the methods and the strategy for using those methods. What early strategies and methods are you applying to large datasets to optimize the initial identification of predictors? What strategies and methods are you employing to reduce the number of variables in the model?

Secondly, simplifying the parameters (‘levels’) of each variable can be an equally arduous task. Take, for instance, continuous variables. Whereas ‘manual’ grouping of the parameters within a variable (e.g. according to their distances from some pre-defined value) may be appropriate for a low number of parameters, curve-fitting (e.g. using orthogonal polynomials of various orders) may be more appropriate for a higher number of parameters within the variable. The question here is: what is the optimal cutoff between ‘high’ and ‘low’? In other words, at what point (perhaps a specific number of parameters) do you believe it is better to fit a curve than to group ‘manually’?
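To make the grouping-versus-curve-fitting trade-off concrete, here is a minimal sketch in Python using NumPy. Everything in it is an assumption made for illustration: the synthetic data, the helper names (binned_fit, legendre_fit, rmse), and the choice of Legendre polynomials as the orthogonal family. It compares equal-width grouping with a polynomial fit that uses the same number of fitted parameters (4 group means versus 4 coefficients).

```python
import numpy as np
from numpy.polynomial import legendre

# Synthetic data (an assumption for illustration): a smooth signal plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
y = np.sin(2.0 * x) + rng.normal(scale=0.1, size=x.size)


def binned_fit(x, y, n_bins):
    """'Manual' grouping: equal-width bins, each predicted by its mean response."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    means = np.array([y[idx == b].mean() for b in range(n_bins)])
    return means[idx]


def legendre_fit(x, y, degree):
    """Curve fitting: least-squares fit in the Legendre polynomial basis.

    Legendre polynomials are orthogonal on [-1, 1] under a uniform weight,
    so x is assumed to be scaled to that interval already.
    """
    coefs = legendre.legfit(x, y, degree)
    return legendre.legval(x, coefs)


def rmse(y, y_hat):
    """In-sample root-mean-square error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


# Same number of fitted parameters in both approaches.
binned_rmse = rmse(y, binned_fit(x, y, n_bins=4))
poly_rmse = rmse(y, legendre_fit(x, y, degree=3))

print(f"4 groups:            RMSE = {binned_rmse:.3f}")
print(f"degree-3 polynomial: RMSE = {poly_rmse:.3f}")
```

On a smooth relationship like this one, the curve tends to win at equal parameter count; on a stepped or threshold-like relationship, the grouping may do better. Repeating the comparison on held-out data over a range of group counts and polynomial degrees is one way to look for the cutoff empirically rather than fixing it in advance.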

Finally, and back to the issue of variable reduction, the question must arise of when ‘enough is enough’. Variable reduction in datasets with high dimensionality can certainly be an experiment in madness. Aside from the business constraints on the construction of the model and indicators of ‘overfitting’ (are there any other constraints?), what helps you determine when a model in development is ‘good enough’? What strategies have you employed to defend what is ‘good enough’?

Tags: curve fitting, dimension reduction, grouping, large dataset, parameterization, variable reduction

Replies to This Discussion

In my experience, datasets are resources, not simply subjects of summary. Also, summaries are goal oriented, at least in terms of measures used to assess lossiness in a summary.

I have used repeated subsampling to assess stability of a conclusion throughout the dataset. I have also explored adapting fast marching methods to characterize the topography outside primary modes.
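As an illustration only, one way a repeated-subsampling stability check might look in Python with NumPy is sketched below. The synthetic data, the 20% subsample fraction, the 200 rounds, and the helper name subsample_slopes are all assumptions, and the "conclusion" being tested (that a fitted slope is positive) is a stand-in for whatever finding is of interest.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y depends weakly on x.
rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)


def subsample_slopes(x, y, n_rounds=200, fraction=0.2, rng=None):
    """Refit a simple linear slope on many random subsamples of the data.

    Each round draws `fraction` of the rows without replacement and
    records the least-squares slope of y on x for that subsample.
    """
    rng = np.random.default_rng() if rng is None else rng
    size = int(fraction * x.size)
    slopes = np.empty(n_rounds)
    for i in range(n_rounds):
        idx = rng.choice(x.size, size=size, replace=False)
        slopes[i] = np.polyfit(x[idx], y[idx], 1)[0]
    return slopes


slopes = subsample_slopes(x, y, rng=rng)

# Stability summaries: how often the conclusion holds and how much it moves.
share_positive = float(np.mean(slopes > 0))
spread = float(slopes.std())

print(f"share of subsamples with a positive slope: {share_positive:.2f}")
print(f"standard deviation of the slope across subsamples: {spread:.3f}")
```

A conclusion that holds in nearly every subsample, with a small spread, is arguably easier to defend than one that flips from draw to draw.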

© 2010   Created by Vincent Granville
