Overfitting When Features Approach Sample Size

Question

In a linear regression model with $p$ features and $n$ data points, what problems arise as $p$ approaches $n$?

Specifically:

1. What happens to the in-sample $R^2$ and out-of-sample performance?
2. What happens to the coefficient estimates $\hat{\beta}$?
3. What happens when $p = n$ and when $p > n$?
4. Is there a rule of thumb for a reasonable maximum $p$ relative to $n$ to avoid overfitting?

Accepted Answer

How to Think About It: A linear model with $p$ parameters can perfectly fit any $p$ data points (assuming general position). So when $p$ approaches $n$, you are giving the model almost enough knobs to memorize the training data. The fit looks great in-sample, but the model is fitting noise, not signal. This is one of the most important concepts in applied statistics, and interviewers expect you to articulate it crisply.

Key Insight: The expected in-sample $R^2$ for random (pure noise) data is approximately $p/n$. So if $p = n/2$, you will see $R^2 \approx 0.5$ even when there is zero true relationship. This is the clearest way to see why large $p/n$ is dangerous.

The Method:

1. In-sample vs. out-of-sample performance:
   - In-sample $R^2$ is biased upward. For noise data, $E[R^2] \approx p/n$.
   - As $p \to n$, in-sample $R^2 \to 1$, but out-of-sample $R^2$ can go negative (predictions worse than the mean).
   - The gap between in-sample and out-of-sample performance is the hallmark of overfitting.
2. Coefficient instability:
   - The OLS estimator variance is $\text{Var…
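The Key Insight above ($E[R^2] \approx p/n$ for pure noise) is easy to check numerically. The following is a minimal simulation sketch, not part of the original answer: the function name `noise_r2`, the Gaussian design, the inclusion of an intercept, and the choice to score out-of-sample $R^2$ against the mean of the fresh responses are all assumptions made for illustration.

```python
import numpy as np


def noise_r2(n, p, n_trials=200, seed=0):
    """Average in-sample and out-of-sample R^2 when y is pure noise.

    Hypothetical helper for illustration: X and y are independent
    standard normal draws, so the true relationship is exactly zero.
    """
    rng = np.random.default_rng(seed)
    r2_in, r2_out = [], []
    for _ in range(n_trials):
        # Training data: features and response are unrelated.
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)
        # Fresh data from the same (null) process for out-of-sample scoring.
        X_new = rng.standard_normal((n, p))
        y_new = rng.standard_normal(n)

        # Add an intercept column and fit by ordinary least squares.
        A = np.column_stack([np.ones(n), X])
        A_new = np.column_stack([np.ones(n), X_new])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)

        # R^2 = 1 - SS_residual / SS_total.
        ss_res_in = np.sum((y - A @ beta) ** 2)
        ss_tot_in = np.sum((y - y.mean()) ** 2)
        r2_in.append(1.0 - ss_res_in / ss_tot_in)

        ss_res_out = np.sum((y_new - A_new @ beta) ** 2)
        ss_tot_out = np.sum((y_new - y_new.mean()) ** 2)
        r2_out.append(1.0 - ss_res_out / ss_tot_out)
    return float(np.mean(r2_in)), float(np.mean(r2_out))


if __name__ == "__main__":
    n = 100
    for p in (10, 50, 90):
        r2_train, r2_test = noise_r2(n, p)
        print(f"p/n = {p / n:.2f}  in-sample R^2 = {r2_train:.3f}  "
              f"out-of-sample R^2 = {r2_test:.3f}")
```

Under these assumptions, the average in-sample $R^2$ should land close to $p/n$ at each setting, while the out-of-sample $R^2$ should be negative and get worse as $p$ approaches $n$, which is the gap described in step 1.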


Hints

Worked Solution

Intuition