Variance Reduction in Random Forests via Feature Subsampling

Machine Learning · Medium · Free problem

Consider an ensemble method that averages $B$ i.i.d. base predictors, each with variance $\sigma^2$ and pairwise correlation $\rho$.

1. Show that the variance of the ensemble prediction $\hat{f}$ is: $\text{Var}(\hat{f}) = \rho \sigma^2 + \frac{(1 - \rho)}{B} \sigma^2$

In Random Forests, each tree is built using a random subset of $m$ candidate features at each split (out of $p$ total features). This feature subsampling reduces the pairwise correlation $\rho$ between trees. Suppose that in a simplified model, $\rho = \rho(m)$ is a decreasing function of $m$ being small (i.e., fewer candidate features per split means less correlated trees), while the bias of each individual tree increases as $m$ decreases (because each tree sees fewer features and fits a weaker model).

Explain the variance-bias tradeoff as $m$ varies from

$ to $p$, and describe a principled method for choosing $m$ that minimizes an oracle estimate of test MSE.

Hints

Expand $\text{Var}(\frac{1}{B}\sum T_b)$ by writing out the full double sum of covariances -- separate the diagonal and off-diagonal terms.
For the variance floor, think about what happens as $B \to \infty$. The term that survives is controlled entirely by $\rho$, which is what feature subsampling targets.
Random Forests provide a free validation estimate via out-of-bag (OOB) predictions. Use OOB MSE evaluated over a grid of $m$ values to pick the optimal feature subsample size.

Worked Solution

How to Think About It: This problem gets at the fundamental reason Random Forests work. When you average $B$ things that are positively correlated, you do not get the full

/B$ variance reduction you would with independent averages. The residual correlation $\rho$ puts a floor on how much averaging can help. Random Forests attack this by deliberately decorrelating the trees -- each tree only gets to look at $m$ features per split, so different trees end up using different splits and making different errors. The catch: restricting features makes each individual tree worse (higher bias). The art is finding the sweet spot.

Key Insight: The ensemble variance formula has two terms: $\rho \sigma^2$ (irreducible by adding more trees) and $(1-\rho)\sigma^2 / B$ (shrinks with more trees). Feature subsampling targets the first term by reducing $\rho$.

Part (i): Deriving the Ensemble Variance

Let $\hat{f} = \frac{1}{B} \sum_{b=1}^{B} T_b$, where each tree $T_b$ has $\text{Var}(T_b) = \sigma^2$ and $\text{Corr}(T_b, T_{b'}) = \rho$ for $b \neq b'$.

$\text{Var}(\hat{f}) = \text{Var}\!\left(\frac{1}{B} \sum_{b=1}^{B} T_b\right) = \frac{1}{B^2} \sum_{b=1}^{B} \sum_{b'=1}^{B} \text{Cov}(T_b, T_{b'})$

The double sum has $B$ diagonal terms (where $b = b'$), each contributing $\sigma^2$, and $B(B-1)$ off-diagonal terms, each contributing $\rho \sigma^2$:

$\text{Var}(\hat{f}) = \frac{1}{B^2} \left[ B \sigma^2 + B(B-1) \rho \sigma^2 \right]$

$= \frac{\sigma^2}{B} + \frac{(B-1)}{B} \rho \sigma^2$

$= \rho \sigma^2 + \frac{(1-\rho)}{B} \sigma^2$

The last step rearranges by writing $\frac{1}{B} = \frac{\rho}{B} + \frac{1-\rho}{B}$ and $\frac{B-1}{B}\rho = \rho - \frac{\rho}{B}$, so combining gives $\rho \sigma^2 + \frac{(1-\rho)\sigma^2}{B}$.

As $B \to \infty$, the second term vanishes and $\text{Var}(\hat{f}) \to \rho \sigma^2$. This is the variance floor -- no amount of bagging can push below it.

Part (ii): The Variance-Bias Tradeoff and Choosing $m$

The test MSE decomposes as: $\text{MSE} = \text{Bias}^2 + \text{Variance}$

As $m$ varies:

Small $m$ (e.g., $m = 1$): Each tree only sees one feature per split. Trees are highly decorrelated ($\rho$ is small), so the variance term $\rho \sigma^2$ is low. But each tree is a very weak learner -- it cannot capture complex interactions -- so individual tree bias is high. The squared bias dominates.

Large $m$ (e.g., $m = p$): Each tree sees all features. This is just bagging -- every tree tends to pick the same strong features first, so $\rho$ is high. The variance floor $\rho \sigma^2$ is large and does not shrink much with $B$. However, each tree is a strong learner with low bias.

Intermediate $m$: The sweet spot. Trees are reasonably decorrelated (moderate $\rho$) while still being competent predictors (moderate bias). The total MSE is minimized.

Principled choice of $m$:

Use out-of-bag (OOB) error as an oracle estimate of test MSE. Since each tree is trained on a bootstrap sample, roughly

/e \approx 37\%$ of observations are left out of each tree. Predict each observation using only the trees that did not see it, and compute the OOB MSE. Then:

Train Random Forests for a grid of $m$ values (common choices: $m \in \{\lfloor \sqrt{p} \rfloor, \lfloor p/3 \rfloor, \lfloor p/2 \rfloor, p\}$ for regression, or finer grids).
For each $m$, compute the OOB MSE.
Select the $m$ that minimizes OOB MSE.

This is principled because OOB error is an honest estimate of generalization error -- it only evaluates each observation on trees that never trained on it -- and it comes for free without needing a separate validation set.

Practical rules of thumb: For regression, $m = p/3$ is a common default. For classification, $m = \sqrt{p}$. But these are starting points -- the OOB procedure above should be used for any serious application.

Answer: The ensemble variance is $\text{Var}(\hat{f}) = \rho \sigma^2 + (1-\rho)\sigma^2 / B$. Decreasing $m$ reduces $\rho$ (lowering the variance floor) but increases individual tree bias. The optimal $m$ is chosen by minimizing OOB MSE over a grid of candidate values, balancing the variance reduction from decorrelation against the bias increase from feature restriction.

Intuition

The core lesson here is that averaging correlated predictors gives diminishing returns. If all your trees make the same mistakes (high $\rho$), averaging a thousand of them barely helps. The genius of Random Forests is attacking this correlation directly: by randomly restricting which features each tree can use at each split, you force the trees to find different patterns in the data, making their errors more independent. This is the same principle behind portfolio diversification -- you want assets (or predictors) that do not all move together.

The tradeoff is real and unavoidable. Decorrelating the trees too aggressively (very small $m$) makes each tree so weak that even a perfect ensemble of weak learners cannot recover. The practical takeaway: defaults like $m = p/3$ for regression and $m = \sqrt{p}$ for classification work surprisingly well across many problems, but always validate with OOB error. In quant work, this same decorrelation principle shows up whenever you combine multiple alpha signals or model forecasts -- you want diversity in your ensemble, but not at the cost of each component being useless.

Open the full interactive solver →