Cost of Finding the Best Split in a Regression Tree

Question

You are growing a regression (CART) tree. At a node you have $m$ samples and $p$ features, and you want to find the single best split: the feature and threshold that minimize the resulting L2 (sum-of-squared-error) loss of the two children. 1. What is the complexity of a brute-force evaluation of a…

Accepted Answer

How to Think About It: For each feature you must consider every threshold between consecutive sorted values, and for each threshold you must score the L2 loss of the left/right partition. The naive cost is dominated by (a) sorting each feature and (b) recomputing each child's sum of squares. The win comes from updating the loss incrementally as the threshold sweeps, instead of recomputing it.

Part 1 -- Brute force: For one feature, sort the $m$ values ($O(m \log m)$) and consider $m-1$ thresholds. If you recompute each side's SSE from scratch (an $O(m)$ pass) at every threshold, that is $O(m^2)$ per feature, which dominates the sort, or $O(m^2 p)$ across all $p$ features. (Some candidates quote $O(m^2 p)$ as the headline brute-force number.)

Part 2 -- Incremental update: Sweep the threshold from left to right, moving one point from the right child to the left child at each step. Maintain running counts and running sums $\sum x$ and $\sum x^2$ of the target values on each side. The SSE of a group of $n$ values is $\text{SSE} = \sum x^2 - \frac{(\sum x)^2}{n},$ which you can update in $O(1)$ when a point moves across. So after the initial so…
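To make the contrast between Part 1 and Part 2 concrete, here is a minimal Python sketch for a single feature. The function names (`best_split_brute`, `best_split_incremental`), the midpoint choice of threshold, and the rule of skipping positions where two sorted feature values are equal are illustrative assumptions, not something the answer specifies. The running sums are taken over the target values `ys`, while `xs` holds the feature values used for sorting.

```python
def sse(values):
    """SSE of a group around its mean, recomputed from scratch in O(len(values))."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values)


def _sorted_by_feature(xs, ys):
    """Sort both lists by feature value (the O(m log m) step)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    return [xs[i] for i in order], [ys[i] for i in order]


def best_split_brute(xs, ys):
    """Brute force: rescore both children from scratch at each threshold, O(m^2)."""
    xs_s, ys_s = _sorted_by_feature(xs, ys)
    best_loss, best_thr = float("inf"), None
    for k in range(1, len(xs_s)):
        if xs_s[k] == xs_s[k - 1]:
            continue  # assumption: no threshold fits between equal feature values
        loss = sse(ys_s[:k]) + sse(ys_s[k:])
        if loss < best_loss:
            # assumption: place the threshold at the midpoint of the two neighbours
            best_loss, best_thr = loss, (xs_s[k - 1] + xs_s[k]) / 2
    return best_loss, best_thr


def best_split_incremental(xs, ys):
    """Incremental sweep: one sort, then an O(1) update per threshold."""
    xs_s, ys_s = _sorted_by_feature(xs, ys)
    m = len(xs_s)
    # Running count, sum, and sum of squares of the targets on each side.
    left_n, left_sum, left_sq = 0, 0.0, 0.0
    right_n, right_sum, right_sq = m, sum(ys_s), sum(y * y for y in ys_s)
    best_loss, best_thr = float("inf"), None
    for k in range(1, m):
        y = ys_s[k - 1]  # this point moves from the right child to the left child
        left_n, left_sum, left_sq = left_n + 1, left_sum + y, left_sq + y * y
        right_n, right_sum, right_sq = right_n - 1, right_sum - y, right_sq - y * y
        if xs_s[k] == xs_s[k - 1]:
            continue  # sums are still updated above; only the scoring is skipped
        # SSE = sum of squares - (sum)^2 / n, applied to each child.
        loss = (left_sq - left_sum ** 2 / left_n) + (right_sq - right_sum ** 2 / right_n)
        if loss < best_loss:
            best_loss, best_thr = loss, (xs_s[k - 1] + xs_s[k]) / 2
    return best_loss, best_thr
```

Running either function once per feature and keeping the lowest loss gives the node's best split. The two versions should agree up to floating-point rounding; the sum-of-squares form can lose precision when the targets are large relative to their spread, so a production implementation may prefer a more numerically stable update.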

Hints

Worked Solution

Intuition