Overfitting When Features Approach Sample Size
In a linear regression model with $p$ features and $n$ data points, what problems arise as $p$ approaches $n$?
Specifically:
1. What happens to the in-sample $R^2$ and out-of-sample performance?
2. What happens to the coefficient estimates $\hat{\beta}$?
3. What happens when $p = n$ and when $p > n$?
4. Is there a rule of thumb for a reasonable maximum $p$ relative to $n$ to avoid overfitting?
Hints
- Think about degrees of freedom: a model with $p$ parameters fit to $n$ points has $n - p$ residual degrees of freedom. What happens as that approaches zero?
- For pure noise data, the expected in-sample $R^2$ is approximately $p/n$. What does this imply when $p = n/2$?
- Consider the matrix $X^\top X$ that appears in the OLS formula $\hat{\beta} = (X^\top X)^{-1} X^\top y$. What happens to its condition number as $p \to n$?
Worked Solution
How to Think About It: A linear model with $p$ free parameters can fit any $p$ data points exactly (assuming the points are in general position). So as $p$ approaches $n$, you are giving the model almost enough knobs to memorize the training data. The fit looks great in-sample, but the model is fitting noise, not signal. This is one of the most important concepts in applied statistics, and interviewers expect you to articulate it crisply.
Key Insight: The expected in-sample $R^2$ for random (pure noise) data is approximately $p/n$. So if $p = n/2$, you will see $R^2 \approx 0.5$ even when there is zero true relationship. This is the clearest way to see why large $p/n$ is dangerous.
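The $p/n$ claim is easy to check by simulation. The sketch below is illustrative only: the function names, sample sizes, and the use of Gaussian noise are assumptions, not part of the original problem. It fits OLS with an intercept to a response that is independent of the features and averages the in-sample $R^2$ over many draws.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def mean_noise_r2(n, p, n_trials=200, seed=0):
    """Average in-sample R^2 when y has no relationship to X at all."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)                 # pure noise target
        values.append(r_squared(X, y))
    return float(np.mean(values))

if __name__ == "__main__":
    n = 100
    for p in (5, 25, 50, 90):
        print(f"p={p:3d}  p/n={p / n:.2f}  mean R^2={mean_noise_r2(n, p):.3f}")
```

With an intercept and Gaussian noise the exact expectation is $p/(n-1)$, which is why the averages should track $p/n$ closely for moderate $n$.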
The Method:
1. In-sample vs. out-of-sample performance:
   - In-sample $R^2$ is biased upward. For noise data, $E[R^2] \approx p/n$.
   - As $p \to n$, in-sample $R^2 \to 1$, but out-of-sample $R^2$ can go negative (predictions worse than the mean).
   - The gap between in-sample and out-of-sample performance is the hallmark of overfitting.
2. Coefficient instability:
   - The OLS estimator variance is $\text{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}$.
   - As $p \to n$, the matrix $X^\top X$ becomes ill-conditioned. Small changes in the data cause large swings in $\hat{\beta}$.
   - Individual coefficients become unreliable -- their confidence intervals blow up.
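3. Boundary cases:
   - $p = n$: Perfect fit (assuming $X$ has full rank). $R^2 = 1$ and the residuals are zero, but the model is simply memorizing the data. There are zero residual degrees of freedom left for estimating $\sigma^2$.
   - $p > n$: $X^\top X$ is singular (rank at most $n < p$), so $(X^\top X)^{-1}$ does not exist. The least-squares problem is underdetermined and has infinitely many solutions that fit the training data exactly. Standard OLS cannot pick one; you need an extra criterion such as the minimum-norm solution or regularization.
4. Rule of thumb:
   - Conservative: keep $p \le n/10$, or $p \le n/20$ if you want to be stricter, for reliable inference (that is, at least 10 to 20 observations per feature).
   - The adjusted $R^2$ formula, $R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n-1)}{n - p - 1}$ (with $p$ counting predictors, excluding the intercept), penalizes for $p$ and gives a more honest assessment.
   - Cross-validated $R^2$ is even better: it directly estimates out-of-sample performance.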
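The four points above can be seen together in one small experiment. The following is a sketch under stated assumptions: the data-generating process (three informative features with coefficient 0.5, unit-variance Gaussian noise), the name `sweep`, and all sizes are hypothetical choices for illustration. It fits nested OLS models of increasing $p$ on a fixed training set of $n = 60$ and reports in-sample $R^2$, adjusted $R^2$, out-of-sample $R^2$ on a large test set, and the condition number of $X^\top X$.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to the feature matrix."""
    return np.column_stack([np.ones(X.shape[0]), X])

def sweep(n=60, p_values=(3, 15, 30, 45, 55), n_test=2000, seed=0):
    """Fit nested OLS models with the first p features, for each p (3 <= p < n - 1)."""
    rng = np.random.default_rng(seed)
    p_max = max(p_values)
    beta_true = np.zeros(p_max)
    beta_true[:3] = 0.5                        # assumption: only 3 features carry signal
    X_tr = rng.standard_normal((n, p_max))
    X_te = rng.standard_normal((n_test, p_max))
    y_tr = X_tr @ beta_true + rng.standard_normal(n)
    y_te = X_te @ beta_true + rng.standard_normal(n_test)
    sst_tr = ((y_tr - y_tr.mean()) ** 2).sum()
    sst_te = ((y_te - y_tr.mean()) ** 2).sum()  # benchmark: predict the training mean
    rows = []
    for p in p_values:
        A_tr, A_te = design(X_tr[:, :p]), design(X_te[:, :p])
        beta_hat, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
        r2_in = 1 - ((y_tr - A_tr @ beta_hat) ** 2).sum() / sst_tr
        r2_out = 1 - ((y_te - A_te @ beta_hat) ** 2).sum() / sst_te
        r2_adj = 1 - (1 - r2_in) * (n - 1) / (n - p - 1)
        cond = np.linalg.cond(A_tr.T @ A_tr)    # grows as p approaches n
        rows.append({"p": p, "r2_in": r2_in, "r2_adj": r2_adj,
                     "r2_out": r2_out, "cond": cond})
    return rows

if __name__ == "__main__":
    for r in sweep():
        print(f"p={r['p']:3d}  in={r['r2_in']:.3f}  adj={r['r2_adj']:.3f}  "
              f"out={r['r2_out']:.3f}  cond={r['cond']:.1f}")
```

Because the models are nested, in-sample $R^2$ can only rise as $p$ grows, while the out-of-sample column shows what that extra fit is actually worth.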
Practical Considerations:
- When $p$ is large relative to $n$, use regularization: Ridge ($L_2$) shrinks coefficients toward zero, Lasso ($L_1$) sets some to exactly zero (feature selection).
- PCA can reduce dimensionality before regression.
- In finance, this is especially relevant: you often have hundreds of candidate factors but only a few years of monthly returns ($n \approx 60$). Running a kitchen-sink regression is a recipe for data mining.
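To make the regularization point concrete, here is a minimal sketch comparing OLS with ridge at $p/n = 50/60$. Everything specific in it is an assumption for illustration: the simulated data, the helper names `ridge_fit` and `compare`, and the penalty `lam=20.0`, which is not tuned (in practice you would choose it by cross-validation). Ridge is computed in closed form on centered data so the intercept is not penalized.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on centered data; lam=0 gives OLS (needs p <= n - 1)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + lam * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

def compare(n=60, p=50, lam=20.0, n_test=2000, n_trials=20, seed=0):
    """Average test MSE of OLS and ridge over repeated simulated datasets."""
    rng = np.random.default_rng(seed)
    beta_true = np.zeros(p)
    beta_true[:3] = 0.5                       # assumption: sparse true signal
    mse = {"ols": [], "ridge": []}
    for _ in range(n_trials):
        X_tr = rng.standard_normal((n, p))
        X_te = rng.standard_normal((n_test, p))
        y_tr = X_tr @ beta_true + rng.standard_normal(n)
        y_te = X_te @ beta_true + rng.standard_normal(n_test)
        for name, penalty in (("ols", 0.0), ("ridge", lam)):
            b0, b = ridge_fit(X_tr, y_tr, penalty)
            mse[name].append(np.mean((y_te - (b0 + X_te @ b)) ** 2))
    return {name: float(np.mean(vals)) for name, vals in mse.items()}

if __name__ == "__main__":
    result = compare()
    print(f"test MSE  OLS: {result['ols']:.2f}   ridge: {result['ridge']:.2f}")
```

The penalty trades a little bias for a large reduction in variance, which is exactly the trade you want when $X^\top X$ is close to singular.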
Answer: As $p \to n$, in-sample $R^2$ inflates toward 1 (with $E[R^2] \approx p/n$ for noise), coefficients become unstable due to ill-conditioning of $X^\top X$, and out-of-sample performance degrades. At $p = n$, the fit is perfect but meaningless; at $p > n$, $X^\top X$ is singular and OLS has no unique solution. Rule of thumb: keep $p \le n/10$ (or $n/20$ to be stricter), or use regularization.
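The $p > n$ case can be illustrated directly. In this sketch (the sizes, seed, and the name `underdetermined_demo` are assumptions, and no intercept is used to keep it short), `np.linalg.lstsq` returns the minimum-norm coefficient vector, and adding any vector from the null space of $X$ produces a different coefficient vector with the same perfect fit.

```python
import numpy as np

def underdetermined_demo(n=20, p=30, seed=0):
    """Show that p > n gives many exact fits with different coefficients."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    # rank(X^T X) = rank(X) <= n < p, so the p x p matrix X^T X is singular.
    rank = int(np.linalg.matrix_rank(X))

    # Among all exact fits, lstsq returns the one with the smallest norm.
    beta_min, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The last p - n rows of Vt span the null space of X (full SVD).
    _, _, Vt = np.linalg.svd(X)
    beta_alt = beta_min + 5.0 * Vt[-1]         # a different exact solution

    return {
        "rank": rank,
        "max_resid_min": float(np.abs(y - X @ beta_min).max()),
        "max_resid_alt": float(np.abs(y - X @ beta_alt).max()),
        "norm_min": float(np.linalg.norm(beta_min)),
        "norm_alt": float(np.linalg.norm(beta_alt)),
    }

if __name__ == "__main__":
    for key, value in underdetermined_demo().items():
        print(f"{key}: {value}")
```

Both coefficient vectors reproduce the training data to numerical precision, so the data alone cannot tell you which one to prefer.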
Intuition
The core lesson is that in-sample fit is cheap. Any model with enough parameters can memorize training data -- the question is whether it has learned anything generalizable. The ratio $p/n$ is the simplest diagnostic: when it is large, you should be deeply skeptical of in-sample performance metrics. Adjusted $R^2$ and cross-validation exist precisely to correct for this.
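As a rough illustration of why cross-validation is the more honest metric, the sketch below computes a $k$-fold cross-validated $R^2$ for OLS next to the in-sample value. The function names, the choice of 5 folds, and the pure-noise test data are assumptions; each held-out fold is scored against the training-fold mean as the benchmark.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to the feature matrix."""
    return np.column_stack([np.ones(X.shape[0]), X])

def in_sample_r2(X, y):
    """Ordinary in-sample R^2 of an OLS fit with an intercept."""
    A = design(X)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1.0 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def cv_r2(X, y, k=5, seed=0):
    """k-fold cross-validated R^2: every point is predicted by a model
    that never saw it during fitting."""
    idx = np.random.default_rng(seed).permutation(len(y))
    ss_res, ss_tot = 0.0, 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta, *_ = np.linalg.lstsq(design(X[train]), y[train], rcond=None)
        pred = design(X[fold]) @ beta
        ss_res += ((y[fold] - pred) ** 2).sum()
        ss_tot += ((y[fold] - y[train].mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 100, 40
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)                 # pure noise: nothing to learn
    print(f"in-sample R^2:       {in_sample_r2(X, y):.3f}")
    print(f"cross-validated R^2: {cv_r2(X, y):.3f}")
```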
In quant finance this is a constant battle. You have a universe of potential alpha signals (features) and limited history (data points). Every backtest that looks too good probably has a high effective $p/n$, especially after accounting for the signals you tried and discarded. The rule of thumb $p \le n/10$ is a starting point, but the deeper principle is: always evaluate out-of-sample, and be honest about how many things you tried.
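The "signals you tried and discarded" effect can also be simulated. This is a hypothetical setup, not a real backtest: `k = 200` candidate signals and `n = 60` months of returns are all drawn as independent noise, the best in-sample signal is selected, and its correlation is then measured on a fresh, equally sized sample. The function names and parameters are assumptions.

```python
import numpy as np

def column_corr(S, r):
    """Pearson correlation of each column of S with the vector r."""
    Sc = S - S.mean(axis=0)
    rc = r - r.mean()
    return (Sc.T @ rc) / (np.linalg.norm(Sc, axis=0) * np.linalg.norm(rc))

def best_signal_experiment(n=60, k=200, n_trials=50, seed=0):
    """Mean |correlation| of the in-sample winner, in-sample vs. on fresh data."""
    rng = np.random.default_rng(seed)
    in_sample, out_sample = [], []
    for _ in range(n_trials):
        # k candidate signals and the returns are all independent noise.
        corr_in = column_corr(rng.standard_normal((n, k)), rng.standard_normal(n))
        best = int(np.argmax(np.abs(corr_in)))   # pick the best-looking signal
        # Fresh period: the same signal index, new independent draws.
        corr_out = column_corr(rng.standard_normal((n, k)), rng.standard_normal(n))
        in_sample.append(abs(corr_in[best]))
        out_sample.append(abs(corr_out[best]))
    return float(np.mean(in_sample)), float(np.mean(out_sample))

if __name__ == "__main__":
    avg_in, avg_out = best_signal_experiment()
    print(f"winner's mean |corr| in-sample:     {avg_in:.3f}")
    print(f"winner's mean |corr| out-of-sample: {avg_out:.3f}")
```

Even though the final model uses a single feature, the search over 200 candidates is what inflates the in-sample number, which is the sense in which the effective $p/n$ is high.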