Common Regression Pathologies and Fixes

Regression · Medium · Free problem

You are reviewing a colleague's regression analysis and notice the results look suspicious -- some coefficients have unexpected signs, the residual plots look off, and the standard errors seem too small.

Walk through the most common issues that arise in regression analysis. For each one:

What does it look like (how do you detect it)?
Why does it cause problems (what goes wrong with OLS)?
How do you fix it?

Cover at least: multicollinearity, heteroscedasticity, autocorrelation, non-linearity, outliers, omitted variable bias, and endogeneity.

Hints

Start by classifying issues into two buckets: those that bias your coefficients (omitted variables, endogeneity, non-linearity) and those that leave the coefficients unbiased but damage inference (heteroscedasticity and autocorrelation make the usual standard errors wrong; multicollinearity makes them large). This distinction determines how urgent the fix is.
For each issue, think about which OLS assumption it violates. The Gauss-Markov assumptions are: linearity, strict exogeneity ($E[\epsilon | X] = 0$), homoscedasticity, no serial correlation, and full rank.
Residual plots are your first line of defense. Plot residuals vs. fitted values (heteroscedasticity, non-linearity), residuals vs. time (autocorrelation), and check VIF (multicollinearity) before interpreting any coefficient.
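Two of the checks named in the hints can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, not a full diagnostic suite: the helper names (`ols_residuals`, `vif`, `durbin_watson`), the simulated predictors, and the AR(1) coefficient of 0.8 are all assumptions made for the example.

```python
import numpy as np

def ols_residuals(X, y):
    """Fit OLS with an intercept and return the residuals."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return y - Xc @ beta

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        resid = ols_residuals(others, target)
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

def durbin_watson(resid):
    """Near 2: little first-order autocorrelation. Near 0: strong positive autocorrelation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1 (simulated)
x3 = rng.normal(size=n)              # unrelated predictor
X = np.column_stack([x1, x2, x3])

# AR(1) errors to mimic serial correlation (coefficient 0.8 is an assumption)
e = np.zeros(n)
shocks = rng.normal(size=n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + shocks[t]
y = 1.0 + x1 + x3 + e

vifs = vif(X)
dw = durbin_watson(ols_residuals(X, y))
print("VIF:", np.round(vifs, 1))
print("Durbin-Watson:", round(dw, 2))
```

In this setup the two near-duplicate predictors should show very large VIFs while the unrelated one stays close to 1, and the Durbin-Watson statistic should fall well below 2.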

Worked Solution

How to Think About It: OLS is built on a set of assumptions -- linearity, independence, homoscedasticity, no perfect multicollinearity, exogeneity -- and when these assumptions fail, different things break. Some violations affect coefficient estimates (bias), some affect standard errors (wrong inference), and some affect both. The key to diagnosing regression problems is knowing which assumption is violated and what breaks as a result. A good practitioner checks residual plots before looking at any coefficients.

Key Insight: Not all regression problems are created equal. Multicollinearity and heteroscedasticity affect precision but not bias -- your coefficients are still unbiased, but multicollinearity makes them noisy and heteroscedasticity makes the usual standard errors wrong. Omitted variables and endogeneity create bias -- your coefficients are systematically wrong, and more data does not help. This distinction matters enormously for what you do about it.
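A small Monte Carlo makes the bias-versus-precision distinction concrete. The sketch below is illustrative only: the true slope of 2.0, the omitted variable `z`, and its coefficients (1.5 on `y`, 0.7 on `x`) are assumptions chosen for the example. Under these choices the textbook omitted-variable formula predicts a bias of about 1.5 × 0.7 = 1.05 in the second case.

```python
import numpy as np

def ols(X, y):
    """OLS with an intercept; returns [intercept, slopes...]."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
n, reps, true_beta = 200, 2000, 2.0
hetero_est, ovb_est = [], []

for _ in range(reps):
    x = rng.normal(size=n)

    # Case 1: error variance grows with |x|, but E[error | x] = 0 still holds
    y_het = true_beta * x + np.abs(x) * rng.normal(size=n)
    hetero_est.append(ols(x, y_het)[1])

    # Case 2: z drives y and is correlated with x, but is left out of the regression
    z = 0.7 * x + rng.normal(size=n)
    y_ovb = true_beta * x + 1.5 * z + rng.normal(size=n)
    ovb_est.append(ols(x, y_ovb)[1])

hetero_mean = float(np.mean(hetero_est))
ovb_mean = float(np.mean(ovb_est))
print("heteroscedastic errors, mean slope:", round(hetero_mean, 3))
print("omitted variable, mean slope:      ", round(ovb_mean, 3))
```

Averaged over replications, the heteroscedastic case should center on the true slope, while the omitted-variable case should center near 3.05 regardless of how many replications are run.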

The Method:

1. Multicollinearity - *Detection:* Variance Inflation Factor (VIF) for each predictor. VIF


Intuition

The meta-lesson here is that OLS is remarkably resilient to some assumption violations and catastrophically fragile to others. Heteroscedasticity and autocorrelation are annoying but manageable -- as long as the regressors remain exogenous, your coefficients are still unbiased, and you just need better standard errors. But omitted variable bias and endogeneity are fatal -- your coefficients are biased and inconsistent, and no amount of robust standard errors can fix that. Knowing this hierarchy lets you triage regression problems quickly.
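As one example of "better standard errors", here is a minimal sketch of White's heteroscedasticity-consistent (HC0) sandwich estimator next to the classical formula. The function name `ols_with_se` and the simulated data, where the error spread grows with |x|, are assumptions for illustration. In practice a library implementation would normally be used instead of hand-rolled matrix algebra.

```python
import numpy as np

def ols_with_se(X, y):
    """Return OLS coefficients, classical SEs, and White (HC0) robust SEs."""
    Xc = np.column_stack([np.ones(len(y)), X])
    n, k = Xc.shape
    XtX_inv = np.linalg.inv(Xc.T @ Xc)
    beta = XtX_inv @ Xc.T @ y
    resid = y - Xc @ beta

    # Classical: sigma^2 (X'X)^{-1}, valid only under homoscedasticity
    sigma2 = resid @ resid / (n - k)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

    # HC0 sandwich: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
    meat = Xc.T @ (Xc * resid[:, None] ** 2)
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_robust

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=n)  # error spread grows with |x|

beta, se_classical, se_robust = ols_with_se(x, y)
print("slope:       ", round(beta[1], 3))
print("classical SE:", round(se_classical[1], 4))
print("robust SE:   ", round(se_robust[1], 4))
```

The slope estimate is the same under both formulas. Only the reported uncertainty changes, and in this simulated design the classical standard error on the slope is noticeably too small.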

In finance, the most common and dangerous issue is endogeneity through simultaneity. Does high trading volume cause price movement, or does price movement cause high volume? Both. Regressing one on the other with OLS gives you a coefficient that blends the two directions and has no causal interpretation. This is why finance researchers obsess over natural experiments and instrumental variables -- they are among the few ways to get causal estimates when everything is endogenous. For a practitioner, the practical rule is: be very skeptical of any regression coefficient where causality could plausibly run in both directions.
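The simultaneity problem can be illustrated with a toy two-equation system. Everything in this sketch is hypothetical: the structural parameters `a`, `b`, `c`, the instrument `z` (assumed to shift volume without affecting price directly), and the unit-variance shocks. It compares the OLS slope of price on volume with a simple instrumental-variable (Wald) estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
a, b, c = 0.5, 0.4, 1.0   # hypothetical structural parameters

z = rng.normal(size=n)    # hypothetical instrument: shifts volume, not price directly
u = rng.normal(size=n)    # price shock
w = rng.normal(size=n)    # volume shock

# Structural system (both directions at once):
#   price  = a * volume + u
#   volume = b * price + c * z + w
# Solving the two equations jointly gives the reduced form:
price = (a * c * z + a * w + u) / (1 - a * b)
volume = (c * z + w + b * u) / (1 - a * b)

def slope(x, y):
    """Simple-regression slope of y on x: cov(x, y) / var(x)."""
    cov = np.cov(x, y)
    return cov[0, 1] / cov[0, 0]

ols_slope = slope(volume, price)
# IV (Wald) estimator: cov(z, price) / cov(z, volume)
iv_slope = np.cov(z, price)[0, 1] / np.cov(z, volume)[0, 1]

print("true effect of volume on price:", a)
print("OLS slope:", round(ols_slope, 3))
print("IV slope: ", round(iv_slope, 3))
```

Under these assumptions the OLS slope settles above the true effect because the price shock `u` feeds back into volume, while the IV estimate recovers `a` only because `z` is valid by construction. With real data, instrument validity is the hard part and cannot be checked this easily.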