Cost of Finding the Best Split in a Regression Tree

Machine Learning · Medium · Free problem

You are growing a regression (CART) tree. At a node you have $m$ samples and $p$ features, and you want to find the single best split: the feature and threshold that minimize the resulting L2 (sum-of-squared-error) loss of the two children.

  1. What is the complexity of a brute-force evaluation of all candidate splits at this node?
  1. Can you do better than re-summing the loss from scratch for every candidate threshold? What complexity does the improved method achieve?

Hints

  1. Per feature you sort the values and test every threshold; the naive cost is recomputing each child's SSE from scratch.
  2. Write the group SSE as $\sum x^2 - (\sum x)^2/n$ so it depends only on running aggregates.
  3. Sweep the threshold and move one point across at a time, updating count, $\sum x$, $\sum x^2$ in $O(1)$.

Worked Solution

How to Think About It: For each feature you must consider every threshold between consecutive sorted values, and for each threshold you must score the L2 loss of the left/right partition. The naive cost is dominated by (a) sorting each feature and (b) re-computing each child's sum of squares. The win comes from updating the loss incrementally as the threshold sweeps, instead of recomputing it.

Part 1 -- Brute force: For one feature, sort the $m$ values ($O(m \log m)$) and consider $m-1$ thresholds. If you recompute each side's SSE from scratch (an $O(m)$ pass) at every threshold, that is $O(m^2)$ per feature, or $O(m^2 p)$ across all $p$ features. (Some candidates quote $O(m^2 p)$ as the headline brute-force number.)

Part 2 -- Incremental update: Sweep the threshold from left to right, moving one point from the right child to the left child at each step. Maintain running counts and running sums $\sum x$ and $\sum x^2$ on each side. The SSE of a group is $\text{SSE} = \sum x^2 - \frac{(\sum x)^2}{n},$ which you can update in $O(1)$ when a point moves across. So after the initial sort, scoring all thresholds for one feature is $O(m)$, and the per-feature cost is dominated by the sort: $O(m \log m)$. Across all features the best-split search is $O(p\, m \log m)$ -- and if features are pre-sorted once, the per-node sweep is $O(pm)$.

Answer: Brute force is $O(m^2 p)$. Using running sums to update L2 loss in $O(1)$ per threshold brings it down to $O(pm\log m)$ (sort-bound), or $O(pm)$ per node if you keep features pre-sorted.

Intuition

The trick -- expressing variance/SSE via the first two moments $\sum x$ and $\sum x^2$ -- turns an $O(m)$ recompute into an $O(1)$ update, which is the single most important optimization in every practical decision-tree implementation (it is what makes scikit-learn and XGBoost split-finding fast). The broader lesson is that any 'evaluate a statistic over a sliding partition' problem should make you reach for prefix sums or running moments. Pre-sorting features once and reusing the order across the whole tree is the further refinement real libraries use.

Open the full interactive solver →