Limitations of R-Squared
Consider $R^2 = 1 - SSE/SST$ as a measure of regression model quality. Identify and explain its key limitations. Specifically, address:
- How does $R^2$ handle large residual errors versus small ones?
- Can two models with very different absolute error magnitudes have the same $R^2$?
- What happens to $R^2$ when you add more predictors to the model, regardless of whether they are genuinely useful?
For each point, explain why it is (or is not) a real limitation, and suggest what you would use instead.
Hints
- Think about what $R^2$ actually measures -- it is a ratio of sums of squares. What information is lost when you take a ratio?
- Consider what happens to $SSE$ when you add a new predictor. Can the OLS solution with more variables ever fit the same data worse in-sample than the solution with fewer?
- Compare adjusted $R^2$ to raw $R^2$. The degrees-of-freedom correction introduces a penalty for complexity -- when does adding a variable fail to clear that bar?
Worked Solution
How to Think About It: $R^2$ tells you what fraction of the variance in $y$ your model explains. That sounds great, but it has blind spots that can mislead you if you rely on it uncritically. The core issues are that $R^2$ is a ratio, so it says nothing about the absolute magnitude of errors; it is built from squared residuals, so large errors are weighted disproportionately; and it can never go down when you add more variables. Understanding these limitations is essential for anyone building regression models in practice.
Key Insight: $R^2$ is a relative measure of fit that ignores model complexity. It never penalizes adding a regressor, even a useless one, because the unrestricted model nests the restricted one and can therefore always reproduce its fit.
The Method:
Limitation 1: Sensitivity to outliers (squared residuals)
$R^2$ depends on $SSE = \sum (y_i - \hat{y}_i)^2$, which squares the residuals, so a single large error contributes disproportionately. For example, one residual of 10 contributes the same to $SSE$ as one hundred residuals of 1 (each case adds 100), even though the total absolute error in the second case is ten times larger. This means $R^2$ can be heavily influenced by a few outliers, either inflating or deflating it depending on where the outliers fall: a large residual raises $SSE$ and lowers $R^2$, while an extreme point that the fitted line passes close to raises $SST$ far more than $SSE$ and pushes $R^2$ up.
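The equal-$SSE$ example above can be checked numerically. The following is a minimal sketch, assuming both sets of residuals come from models evaluated on the same 100 observations; the names `one_big` and `many_small` are illustrative only.

```python
import numpy as np

# Two hypothetical residual vectors over the same 100 observations.
one_big = np.zeros(100)
one_big[0] = 10.0           # a single residual of 10, all others exactly 0
many_small = np.ones(100)   # one hundred residuals of 1

def sse(residuals):
    """Sum of squared errors."""
    return float(np.sum(residuals ** 2))

def mae(residuals):
    """Mean absolute error."""
    return float(np.mean(np.abs(residuals)))

sse_big, sse_small = sse(one_big), sse(many_small)
mae_big, mae_small = mae(one_big), mae(many_small)

print(sse_big, sse_small)   # 100.0 100.0 -> identical SSE, hence identical R^2 for a given SST
print(mae_big, mae_small)   # 0.1 1.0     -> a tenfold gap in mean absolute error
```

Because $SSE$ is the only place the residuals enter $R^2$, the two cases are indistinguishable to $R^2$ even though an absolute-error metric separates them clearly.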
Limitation 2: $R^2$ is scale-independent (ratio problem)
Since $R^2 = 1 - SSE/SST$, it is a ratio. Two models can have the same $R^2$ despite vastly different absolute error levels. Consider: Model A has $SSE = 40$ and $SST = 100$, while Model B has $SSE = 4000$ and $SST = 10{,}000$. Both give $R^2 = 0.6$, but Model B's sum of squared errors is 100 times larger; with the same number of observations, its typical error (RMSE) is 10 times larger. In practice, the scale of your errors matters enormously -- a pricing model with $R^2 = 0.6$ and residuals in pennies is very different from one with $R^2 = 0.6$ and residuals in dollars.
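One way to see the ratio problem concretely is to rescale the target. The sketch below uses simulated data (an assumption, not data from the problem): multiplying $y$ and the predictions by 10 multiplies both $SSE$ and $SST$ by 100, so $R^2$ is unchanged while RMSE grows tenfold.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST."""
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - sse / sst)

def rmse(y, y_hat):
    """Root mean squared error, in the units of y."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# "Model A": a simple linear fit on simulated data.
y_a = 2.0 * x + rng.normal(size=200)
slope, intercept = np.polyfit(x, y_a, 1)   # degree-1 fit returns [slope, intercept]
pred_a = slope * x + intercept

# "Model B": the same problem with the target expressed in units 10x larger.
y_b = 10.0 * y_a
pred_b = 10.0 * pred_a

r2_a, r2_b = r_squared(y_a, pred_a), r_squared(y_b, pred_b)
rmse_a, rmse_b = rmse(y_a, pred_a), rmse(y_b, pred_b)

print(r2_a, r2_b)       # equal up to floating-point rounding
print(rmse_a, rmse_b)   # rmse_b is 10x rmse_a
```

Reporting RMSE (or MAE) next to $R^2$ restores the information about error scale that the ratio discards.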
Limitation 3: $R^2$ never decreases when adding predictors
This is the most important limitation. Adding any regressor -- even pure noise -- cannot decrease $R^2$. The reason: OLS with $k+1$ regressors minimizes $SSE$ over a strictly larger set of coefficient vectors, and setting the new coefficient to zero reproduces the $k$-regressor fit exactly. So $SSE_{k+1} \le SSE_k$, and because $SST$ depends only on $y$ and is unchanged, $R^2_{k+1} \ge R^2_k$. This makes $R^2$ useless for choosing among nested models: it never prefers the simpler one.
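The nesting argument can be illustrated with a small simulation. This is a sketch under assumed settings (200 observations, one genuine predictor, ten added noise columns); `ols_r_squared` is a hypothetical helper, not part of any library.

```python
import numpy as np

def ols_r_squared(X, y):
    """In-sample R^2 of an OLS fit of y on X (X must include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = float(resid @ resid)
    sst = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - sse / sst

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)     # only x genuinely drives y

X = np.column_stack([np.ones(n), x])       # intercept + the real predictor
r2_path = [ols_r_squared(X, y)]

for _ in range(10):
    noise = rng.normal(size=n)             # regressor unrelated to y by construction
    X = np.column_stack([X, noise])
    r2_path.append(ols_r_squared(X, y))

# Each step can only hold R^2 steady or raise it (up to rounding).
non_decreasing = all(b >= a - 1e-12 for a, b in zip(r2_path, r2_path[1:]))
print(non_decreasing)   # True
```

The individual increments are typically small, but they accumulate, which is why in-sample $R^2$ drifts upward as noise regressors pile up.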
What to use instead:
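- Adjusted $R^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$ (with $n$ observations and $k$ regressors plus an intercept): penalizes each additional regressor through the degrees of freedom. Adding a variable raises adjusted $R^2$ only if its $t$-statistic satisfies $|t| > 1$, which is a much weaker bar than conventional significance.
- Information criteria (AIC, BIC): penalize complexity more formally. BIC charges $\ln n$ per parameter versus 2 for AIC, so it is the more conservative of the two once $n \ge 8$.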
- Out-of-sample metrics: cross-validated $R^2$ or RMSE on held-out data.
- RMSE or MAE: give you absolute error magnitudes, not just relative fit.
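The $|t| > 1$ rule for adjusted $R^2$ can be checked by simulation. The sketch below is illustrative: `fit_ols`, `adjusted_r2`, and `aic_bic` are hypothetical helpers, `p` counts all estimated coefficients including the intercept, and the AIC/BIC expressions are the Gaussian-likelihood forms with additive constants dropped.

```python
import numpy as np

def fit_ols(X, y):
    """Return (sse, t_stats) for OLS of y on X; X includes the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = float(resid @ resid)
    sigma2 = sse / (n - p)                      # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)       # classical OLS covariance of beta
    return sse, beta / np.sqrt(np.diag(cov))

def adjusted_r2(sse, sst, n, p):
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

def aic_bic(sse, n, p):
    base = n * np.log(sse / n)                  # constants dropped (assumption)
    return base + 2 * p, base + p * np.log(n)

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
sst = float(np.sum((y - y.mean()) ** 2))

X_small = np.column_stack([np.ones(n), x])
sse_small, _ = fit_ols(X_small, y)
adj_small = adjusted_r2(sse_small, sst, n, X_small.shape[1])
aic_small, bic_small = aic_bic(sse_small, n, X_small.shape[1])

# Try 20 different pure-noise candidates, one at a time.
results = []
for _ in range(20):
    candidate = rng.normal(size=n)
    X_big = np.column_stack([X_small, candidate])
    sse_big, t_big = fit_ols(X_big, y)
    adj_big = adjusted_r2(sse_big, sst, n, X_big.shape[1])
    results.append((abs(t_big[-1]), adj_big > adj_small))

# Adjusted R^2 improves exactly when the new variable's |t| exceeds 1.
agree = all((t > 1.0) == improved for t, improved in results)
print(agree)
```

Since $|t| > 1$ happens by chance fairly often for a noise variable, adjusted $R^2$ is only a mild safeguard; the information criteria and out-of-sample checks listed above are stricter.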
Answer: The main limitations of $R^2$ are: (1) it squares residuals, making it sensitive to outliers; (2) it is a ratio, so it does not reflect absolute error magnitude; and (3) it is monotonically non-decreasing in the number of regressors, making it unsuitable for model selection. Use adjusted $R^2$, information criteria, or out-of-sample metrics for model selection, and report RMSE or MAE when absolute error magnitude matters.
Intuition
The fundamental issue with $R^2$ is that it cannot distinguish genuine explanatory power from sheer model flexibility. A model with 99 regressors plus an intercept and 100 observations has as many free coefficients as data points, so it will (generically) have $R^2 = 1$ even if every regressor is pure noise -- you are just fitting the noise perfectly. This is overfitting, and $R^2$ cannot detect it because it only looks at in-sample fit.
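The perfect-fit claim is easy to reproduce. The following sketch assumes 100 observations, 99 pure-noise regressors plus an intercept column (100 coefficients in total), and a target that is itself pure noise; the out-of-sample $R^2$ is computed on a fresh draw from the same process.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST, with SST taken around the mean of the y supplied."""
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))

rng = np.random.default_rng(1)
n, k = 100, 99

# Training data: the target and every regressor are independent noise.
y_train = rng.normal(size=n)
X_train = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # 100 x 100

beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
r2_in = r_squared(y_train, X_train @ beta)

# Fresh data from the same process, scored with the already-fitted coefficients.
y_test = rng.normal(size=n)
X_test = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
r2_out = r_squared(y_test, X_test @ beta)

print(r2_in)    # 1 up to floating-point rounding
print(r2_out)   # negative: worse than predicting the test mean
```

A negative out-of-sample $R^2$ means the fitted model predicts worse than a constant, which is exactly the failure that in-sample $R^2$ hides.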
In quant finance, this trap is especially dangerous. Factor models for returns are notorious for looking great in-sample (high $R^2$ with many factors) but failing out-of-sample. The discipline of using adjusted $R^2$, AIC/BIC, or walk-forward cross-validation is what separates a good quant model from a curve-fit. Always ask: does the model predict well on data it has never seen, not just on the data used to build it?