In large datasets, reducing the sheer number of variables to simplify a model can be a cumbersome task, often calling on both the art and the science of data mining at once. Simplifying a model in development carries the additional challenge of ‘explainability’, which falls squarely on the shoulders of the analyst or analytic team. For the second problem there are rarely easy answers, but what about the first? The problem of initially choosing variables (among many) for input into a developing model has been filled with controversy (take, for example, stepwise regression). It arguably boils down to both the methods themselves and the strategy for using them. What early strategies and methods are you applying to large datasets to optimize the initial identification of predictors? What strategies and methods are you employing to reduce the number of variables in the model?
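To make the first question concrete, here is a minimal sketch of one possible early screening pass. It is only an illustration, not a recommended method. The function name `screen_predictors`, its thresholds, and the simulated data are all hypothetical. It drops near-constant columns, ranks the rest by absolute correlation with the target, and skips any candidate that is nearly collinear with a predictor already kept.

```python
import numpy as np

def screen_predictors(X, y, min_var=1e-8, max_pair_corr=0.9, top_k=10):
    """Return column indices that survive a simple filter-style screen.

    All thresholds are illustrative assumptions, not tuned values.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 1: drop columns with (near) zero variance.
    keep = np.flatnonzero(X.var(axis=0) > min_var)

    # Step 2: score remaining columns by |Pearson correlation| with the target.
    Xc = X[:, keep] - X[:, keep].mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    score = np.abs(Xc.T @ yc) / denom
    order = keep[np.argsort(-score)]  # best-scoring columns first

    # Step 3: greedily keep columns that are not near-duplicates of kept ones.
    selected = []
    for j in order:
        if len(selected) == top_k:
            break
        redundant = any(
            abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) >= max_pair_corr
            for s in selected
        )
        if not redundant:
            selected.append(int(j))
    return selected

# Simulated data (hypothetical): 20 candidate predictors, 2 of them informative.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)  # near-duplicate of column 0
X[:, 2] = 1.0                                   # constant column
y = 3.0 * X[:, 0] + 2.0 * X[:, 5] + rng.normal(size=n)

selected = screen_predictors(X, y, top_k=5)
print(selected)
```

A rank-based screen like this always fills its `top_k` slots, so pure noise columns can slip in, and a univariate filter can miss predictors that only matter jointly. That is part of why the choice of strategy, and not just the method, is the real question here.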
Secondly, simplifying the parameters (‘levels’) of each variable can be an equally arduous task. Take continuous variables, for instance. Whereas ‘manual’ grouping of the parameters within a variable (e.g. according to their distances from some pre-defined value) may be appropriate when there are few parameters, curve-fitting (e.g. using orthogonal polynomials of various orders) may be more appropriate when the variable has many. The question here is: what is the optimal cutoff between ‘high’ and ‘low’? In other words, at what point (perhaps a specific number of parameters) do you believe it becomes better to fit a curve than to use a ‘manual’ form of grouping?
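For anyone who wants to experiment with this trade-off, the sketch below compares the two approaches on simulated data. Everything in it is an assumption made for illustration: the 40-level predictor, the smooth response, the equal-width bins standing in for ‘manual’ grouping, and Legendre polynomials standing in for the orthogonal polynomial basis.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)

# Hypothetical data: one predictor with 40 distinct levels, 25 rows per level,
# and a smooth response plus noise.
levels = np.arange(40, dtype=float)
x = np.repeat(levels, 25)
y = 0.002 * (x - 20.0) ** 2 + rng.normal(scale=0.3, size=x.size)

def binned_fit(x, y, n_bins):
    """'Manual' grouping: equal-width bins, each row predicted by its bin mean.

    Assumes n_bins >= 2 and that every bin contains at least one row.
    """
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.digitize(x, edges[1:-1])  # bin index in 0..n_bins-1
    means = np.array([y[idx == b].mean() for b in range(n_bins)])
    return means[idx]

def legendre_fit(x, y, degree):
    """Curve fitting: least-squares fit in a Legendre polynomial basis."""
    # Rescale x to [-1, 1], the interval on which Legendre polynomials are orthogonal.
    t = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    coefs = legendre.legfit(t, y, degree)
    return legendre.legval(t, coefs)

def rss(y, yhat):
    """Residual sum of squares."""
    return float(np.sum((y - yhat) ** 2))

binned = binned_fit(x, y, n_bins=4)   # 4 fitted parameters (bin means)
poly0 = legendre_fit(x, y, degree=0)  # 1 fitted parameter (intercept only)
poly2 = legendre_fit(x, y, degree=2)  # 3 fitted parameters

print("RSS, 4 bins:      ", rss(y, binned))
print("RSS, degree-2 fit:", rss(y, poly2))
```

Varying the number of levels, `n_bins`, and `degree`, and comparing fits on held-out rows instead of in-sample RSS, is one way to probe where a cutoff might lie for a given dataset.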
Finally, and back to the issue of variable reduction, the question arises of when ‘enough is enough’. Variable reduction in datasets with high dimensionality can certainly be an experiment in madness. Aside from the business constraints on the construction of the model and indicators of ‘overfitting’ (are there any other constraints?), what helps you determine when a model in development is ‘good enough’? What strategies have you employed to defend what is ‘good enough’?
Tags: curve fitting, dimension reduction, grouping, large dataset, parameterization, variable reduction