I've been trying to understand the CART algorithm for classification trees, and I've got a rudimentary implementation up and running, but generating all possible splits to consider across all variables is computationally inefficient. Right now I'm doing an exhaustive search which means it gets out of hand past when a variable has 10 classes to consider (2 ** 10 - 1 things to consider yikes!).
I know there must be a way to reduce the search space to find an optimal split, but I can't seem to find a straight forward explanation in the book. Is the idea here to use sub sampling when predictor's unique values get large enough?
The next part is a crazy idea I had, and I wonder if this might work please be gentle. :-) Thinking about it by my self I thought I know what a pure split might be. So what if I built a couple of sets that are pure from the target response then worked it backwards to find the right predictor's split. For example say I have the following data set where P1 is a predictor and T1 is the target response where both are categorical:
P1 T1
a -
b +
a -
d +
c -
b +
e +
So I would split T1 it into two pure sets + = ( b, d, e ) and - = ( a,c ). Calls those two sets TL and TR respectively. Since there's no overlap between those two sets I've found the best split. What do we do if there is overlap
P1 T1
a -
a +
b +
b -
b -
c +
c +
d -
e -
e -
So doing the same thing we'd get TR = ( a, b, c ) and TL = ( a, b, d, e ). Now we need to make a trade off because we have overlap between these two sets. Either by taking a & b out of TR's set, or TL's set. The set of overlap (a&b) will add some amount of impurity to either set. At this point I could calculate the impurity of putting the overlap set with TR or TL, and pick the best one. For multi-class target responses it would be still be 2 ** L - 1 combinations worst case, but by concentrating on the overlap only it would be L - number of values fitting into a pure set. Do this across all of the predictors and compare their relative gain of purity.
I think this first would limit my big search to just the target response's size. And, it's dividing up the what's pure from the impure, and just doing permutations on the impure rather than the whole space. I guess worst case is that you'd end up doing everything. However, you could balance that with other predictors that have less impurity and expand those first, and maybe forget the one that's highly impure.
I know this is well trodden territory so what I'm suggesting probably won't work.
I'd still appreciate any help pointing me in the right direction.
Thanks,
Charlie