Common Regression Pathologies and Fixes

Regression · Medium · Free problem

You are reviewing a colleague's regression analysis and notice the results look suspicious -- some coefficients have unexpected signs, the residual plots look off, and the standard errors seem too small.

Walk through the most common issues that arise in regression analysis. For each one:

  1. What does it look like (how do you detect it)?
  2. Why does it cause problems (what goes wrong with OLS)?
  3. How do you fix it?

Cover at least: multicollinearity, heteroscedasticity, autocorrelation, non-linearity, outliers, omitted variable bias, and endogeneity.

Hints

  1. Start by classifying issues into two buckets: those that bias your coefficients (omitted variables, endogeneity, non-linearity) and those that leave coefficients unbiased but affect your standard errors (heteroscedasticity and autocorrelation make them wrong; multicollinearity makes them large). This distinction determines how urgent the fix is.
  2. For each issue, think about which OLS assumption it violates. The Gauss-Markov assumptions are: linearity, strict exogeneity ($E[\epsilon | X] = 0$), homoscedasticity, no serial correlation, and full rank.
  3. Residual plots are your first line of defense. Plot residuals vs. fitted values (heteroscedasticity, non-linearity), residuals vs. time (autocorrelation), and check VIF (multicollinearity) before interpreting any coefficient.

Worked Solution

How to Think About It: OLS is built on a set of assumptions -- linearity, independence, homoscedasticity, no perfect multicollinearity, exogeneity -- and when these assumptions fail, different things break. Some violations affect coefficient estimates (bias), some affect standard errors (wrong inference), and some affect both. The key to diagnosing regression problems is knowing which assumption is violated and what breaks as a result. A good practitioner checks residual plots before looking at any coefficients.

Key Insight: Not all regression problems are created equal. Multicollinearity and heteroscedasticity affect precision but not bias -- your coefficients are still unbiased, just noisy or with wrong standard errors. Omitted variables and endogeneity create bias -- your coefficients are systematically wrong. This distinction matters enormously for what you do about it.

The Method:

1. Multicollinearity
   - *Detection:* Variance Inflation Factor (VIF) for each predictor. VIF $> 5$ is a warning; VIF $> 10$ is a serious problem. Also look for coefficients that flip signs or have huge standard errors while the overall $R^2$ is high.
   - *What breaks:* OLS is still unbiased, but individual coefficient estimates have inflated variance. $(X^T X)^{-1}$ has large entries when columns of $X$ are nearly collinear.
   - *Fixes:* Ridge regression (adds $\lambda I$ to $X^T X$, shrinking coefficients). PCA regression (replace correlated predictors with orthogonal components). Or simply remove one of the collinear predictors if you do not need both.
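As a rough illustration of the VIF check, the sketch below computes $\text{VIF}_j = 1/(1 - R_j^2)$ by regressing each predictor on the others. The simulated data, the variable names, and the helper `vif` are assumptions made for this example; only NumPy is used.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (X holds predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1 (simulated)
x3 = rng.normal(size=n)             # unrelated predictor

vifs = vif(np.column_stack([x1, x2, x3]))
print("VIFs:", np.round(vifs, 1))   # x1 and x2 should be far above 10, x3 near 1
```

In practice a library routine such as `variance_inflation_factor` from `statsmodels.stats.outliers_influence` would typically replace the hand-rolled helper.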

2. Heteroscedasticity
   - *Detection:* Breusch-Pagan test (regress squared residuals on predictors). Residual-vs-fitted plot showing a fan or cone shape.
   - *What breaks:* OLS coefficients are still unbiased and consistent, but the standard errors are wrong. Typically, SEs are understated, leading to false significance.
   - *Fixes:* Use heteroscedasticity-robust standard errors (White/HC standard errors -- always a good default). Weighted Least Squares (WLS) if you know the variance function. Log-transform the dependent variable if the variance scales with the level.
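A minimal sketch of the detection and the default fix, on simulated data where the noise standard deviation grows with $x$. It computes the studentized (Koenker) form of the Breusch-Pagan statistic, $n R^2$ from regressing squared residuals on the predictors, and compares classical standard errors with White (HC0) ones. The data-generating process and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(1.0, 5.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n) * x**2  # noise sd grows with x (simulated)

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
e2 = e**2

# Breusch-Pagan (Koenker form): n * R^2 of the auxiliary regression e^2 ~ X.
# Under homoscedasticity it is approximately chi-square with 1 df here,
# so values well above about 3.84 reject at the 5% level.
g, *_ = np.linalg.lstsq(X, e2, rcond=None)
r2_aux = 1.0 - np.sum((e2 - X @ g) ** 2) / np.sum((e2 - e2.mean()) ** 2)
lm_stat = n * r2_aux

# Classical SEs assume one common error variance.
se_classical = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - k))

# White (HC0) sandwich: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1.
meat = X.T @ (X * e2[:, None])
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("BP LM statistic:", round(lm_stat, 1))
print("slope SE classical vs robust:", se_classical[1], se_robust[1])
```

In this setup the classical slope SE comes out smaller than the robust one, which is the "understated SE" pattern described above. The coefficient estimates themselves are identical under both.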

3. Autocorrelation (Serial Correlation)
   - *Detection:* Durbin-Watson statistic (values near 2 are good; values near 0 indicate positive autocorrelation and values near 4 indicate negative autocorrelation). ACF/PACF plots of residuals.
   - *What breaks:* Like heteroscedasticity, OLS coefficients are unbiased (as long as the regressors are strictly exogenous) but standard errors are wrong. With positive autocorrelation, SEs are too small (effective sample size is smaller than $n$).
   - *Fixes:* Newey-West (HAC) standard errors that account for serial correlation. Add lagged dependent variables or lagged predictors, keeping in mind that a lagged dependent variable combined with still-autocorrelated errors makes OLS inconsistent. Use GLS if the autocorrelation structure is known. For strong autocorrelation, model the error process explicitly (ARIMA errors).
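The following sketch simulates a regression with AR(1) errors and a persistent regressor, then computes the Durbin-Watson statistic and Newey-West standard errors with Bartlett weights by hand. The AR coefficient, the lag length of 10, and the helper names `ar1` and `newey_west_se` are assumptions chosen for the example, not recommendations.

```python
import numpy as np

def ar1(rho, n, rng):
    """Simulate an AR(1) series with unit-variance innovations."""
    z = rng.normal(size=n)
    out = np.empty(n)
    out[0] = z[0]
    for t in range(1, n):
        out[t] = rho * out[t - 1] + z[t]
    return out

def newey_west_se(X, e, lags):
    """HAC standard errors using Bartlett kernel weights 1 - l/(lags+1)."""
    Xe = X * e[:, None]
    S = Xe.T @ Xe
    for l in range(1, lags + 1):
        w = 1.0 - l / (lags + 1.0)
        G = Xe[l:].T @ Xe[:-l]      # lag-l autocovariance of the scores
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(XtX_inv @ S @ XtX_inv))

rng = np.random.default_rng(2)
n = 1000
x = ar1(0.8, n, rng)                # persistent regressor (simulated)
u = ar1(0.8, n, rng)                # positively autocorrelated errors
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta

# Durbin-Watson is roughly 2 * (1 - first-order residual autocorrelation).
dw = np.sum(np.diff(e) ** 2) / np.sum(e**2)

se_classical = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - X.shape[1]))
se_hac = newey_west_se(X, e, lags=10)

print("Durbin-Watson:", round(dw, 2))
print("slope SE classical vs HAC:", se_classical[1], se_hac[1])
```

With positive autocorrelation in both the regressor and the errors, the Durbin-Watson value sits well below 2 and the HAC slope SE is noticeably larger than the classical one.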

4. Non-linearity
   - *Detection:* Residual-vs-fitted plot showing a curved pattern (U-shape, wave). Residuals-vs-predictor plots for each variable.
   - *What breaks:* The linear model is misspecified. Coefficients are biased and predictions are systematically wrong in parts of the input space.
   - *Fixes:* Add polynomial terms ($x^2$, $x^3$), interaction terms ($x_1 \cdot x_2$), or use splines. Transform variables (log, square root). Use a non-linear model (GAM, random forest) if linearity is fundamentally wrong.
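To make the misspecification concrete, the sketch below fits a straight line to data generated from a quadratic, then refits with an $x^2$ term. The true coefficients (0.5 on $x$, 0.8 on $x^2$) and all names are assumptions of the simulation.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(-2.0, 3.0, size=n)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.5, size=n)  # simulated

# Misspecified: straight line only.
b_lin, e_lin = ols(np.column_stack([np.ones(n), x]), y)

# Respecified: add the squared term.
b_quad, e_quad = ols(np.column_stack([np.ones(n), x, x**2]), y)

# A numeric stand-in for the curved residual plot: residuals from the
# linear fit are still correlated with x^2, those from the quadratic fit are not.
corr_lin = np.corrcoef(e_lin, x**2)[0, 1]
corr_quad = np.corrcoef(e_quad, x**2)[0, 1]

print("slope on x, linear fit:   ", round(b_lin[1], 2))
print("slope on x, quadratic fit:", round(b_quad[1], 2))
print("corr(resid, x^2): linear", round(corr_lin, 2), "quadratic", round(corr_quad, 2))
```

The coefficient on $x$ in the straight-line fit absorbs part of the omitted curvature, so it lands far from 0.5; adding the squared term recovers both coefficients.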

5. Outliers and Leverage Points
   - *Detection:* Cook's distance (measures influence of each observation on all fitted values). Leverage values $h_{ii}$ from the hat matrix ($h_{ii} > 2p/n$ is high leverage). Studentized residuals $> 3$ in absolute value.
   - *What breaks:* OLS minimizes squared errors, so outliers pull the fit disproportionately. A single extreme point can flip coefficient signs.
   - *Fixes:* Investigate outliers first (data error? genuine extreme observation?). Robust regression (Huber loss, M-estimators). Median regression ($L_1$ loss). In finance, winsorize extreme returns at the 1st/99th percentile.
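The three diagnostics can be computed directly from the hat matrix. The sketch below plants one bad high-leverage point in otherwise clean simulated data; the planted values and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 6.0, -4.0              # planted high-leverage, badly fitting point

X = np.column_stack([np.ones(n), x])
p = X.shape[1]                      # number of estimated parameters
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'.
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Internally studentized residuals and Cook's distance.
s2 = (e @ e) / (n - p)
r_int = e / np.sqrt(s2 * (1.0 - h))
cooks = r_int**2 * h / (p * (1.0 - h))

# Externally studentized residuals (each point judged without itself).
r_ext = r_int * np.sqrt((n - p - 1) / (n - p - r_int**2))

flagged = np.where((h > 2 * p / n) & (np.abs(r_ext) > 3))[0]
print("fitted slope with outlier:", round(beta[1], 2))
print("most influential index:", int(np.argmax(cooks)))
print("flagged by leverage and residual rules:", flagged)
```

The planted point should dominate Cook's distance and drag the fitted slope well below the simulated value of 1, which shows how a single observation can distort the fit.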

6. Omitted Variable Bias
   - *Detection:* Hard to detect directly -- this is a modeling judgment. Suspicious signs: adding a new variable dramatically changes existing coefficients. Theory predicts a relationship but data does not show it.
   - *What breaks:* If an omitted variable $z$ is correlated with both an included predictor $x$ and the outcome $y$, the coefficient on $x$ is biased: $\text{plim}(\hat{\beta}_x) = \beta_x + \beta_z \cdot \delta_{zx}$ where $\delta_{zx}$ is the regression coefficient of $z$ on $x$.
   - *Fixes:* Include the omitted variable. Use instrumental variables if the omitted variable is unobservable. Fixed effects (panel data) to absorb unobserved heterogeneity.
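The bias formula can be checked numerically. In the simulation below (assumed values: $\beta_x = 2$, $\beta_z = 3$, $\delta_{zx} = 0.6$), the short regression that omits $z$ should give a coefficient near $2 + 3 \cdot 0.6 = 3.8$, and the in-sample decomposition holds exactly.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)                  # z is correlated with x
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)  # z also drives y

ones = np.ones(n)
b_long = ols(np.column_stack([ones, x, z]), y)    # correct model
b_short = ols(np.column_stack([ones, x]), y)      # z omitted
delta_zx = ols(np.column_stack([ones, x]), z)[1]  # regression of z on x

# beta_x + beta_z * delta_zx, using the sample estimates
implied = b_long[1] + b_long[2] * delta_zx

print("long regression coefficient on x: ", round(b_long[1], 3))
print("short regression coefficient on x:", round(b_short[1], 3))
print("beta_x + beta_z * delta_zx:       ", round(implied, 3))
```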

7. Endogeneity
   - *Detection:* Hausman test (compares OLS and IV estimates -- if they differ significantly, endogeneity is present). Domain knowledge is often the primary diagnostic.
   - *What breaks:* $E[\epsilon \mid X] \neq 0$ -- the error is correlated with the predictor. OLS coefficients are biased and inconsistent (the bias does not go away with more data).
   - *Fixes:* Instrumental Variables / Two-Stage Least Squares (2SLS). Find an instrument $z$ that is correlated with $x$ (relevance) but uncorrelated with $\epsilon$ (exogeneity). Natural experiments, regression discontinuity, or difference-in-differences designs.
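A minimal 2SLS sketch on simulated data, assuming a valid instrument by construction: `z` shifts `x` but enters `y` only through `x`, while an unobserved shock `v` moves both `x` and the error. The true slope of 2 and all names are assumptions of the example.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(6)
n = 20000
z = rng.normal(size=n)                 # instrument: relevant and exogenous
v = rng.normal(size=n)                 # unobserved shock hitting x and the error
x = 0.8 * z + v + rng.normal(size=n)
eps = 1.5 * v + rng.normal(size=n)     # error correlated with x through v
y = 1.0 + 2.0 * x + eps

ones = np.ones(n)
X = np.column_stack([ones, x])
Z = np.column_stack([ones, z])

b_ols = ols(X, y)                      # inconsistent: picks up the v channel

# Stage 1: project x on the instrument. Stage 2: regress y on the projection.
x_hat = Z @ ols(Z, x)
b_2sls = ols(np.column_stack([ones, x_hat]), y)

print("OLS slope: ", round(b_ols[1], 3))
print("2SLS slope:", round(b_2sls[1], 3))
# Note: standard errors from a naive second-stage OLS are not valid;
# a dedicated IV routine computes them from the residuals y - X @ b_2sls.
```

Here the OLS slope stays above the true value no matter how large `n` is, while the 2SLS slope settles near 2.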

Answer: The seven common regression pathologies fall into two categories: (1) violations that affect inference but not bias (multicollinearity, heteroscedasticity, autocorrelation -- fix with robust SEs, ridge, or GLS), and (2) violations that cause bias (non-linearity, omitted variables, endogeneity -- fix with model respecification, adding variables, or instrumental variables). Outliers can cause both. Always start with residual plots before looking at coefficient tables.

Intuition

The meta-lesson here is that OLS is remarkably resilient to some assumption violations and catastrophically fragile to others. Heteroscedasticity and autocorrelation are annoying but manageable -- your coefficients are still right, you just need better standard errors. But omitted variable bias and endogeneity are fatal -- your coefficients are wrong, period, and no amount of robust standard errors can fix that. Knowing this hierarchy lets you triage regression problems quickly.

In finance, the most common and dangerous issue is endogeneity through simultaneity. Does high trading volume cause price movement, or does price movement cause high volume? Both. Regressing one on the other with OLS gives you a coefficient with no causal interpretation. This is why finance researchers obsess over natural experiments and instrumental variables -- they are among the few ways to get causal estimates when everything is endogenous. For a practitioner, the practical rule is: be very skeptical of any regression coefficient where causality could run in both directions.
