Regression Diagnostics: Outliers, Influence, and Multicollinearity

Regression · Medium · Free problem

You are running an OLS regression $y = X\beta + \varepsilon$ with an intercept. Define the hat matrix $H = X(X^\top X)^{-1}X^\top$, the leverage of observation $i$ as $h_{ii} = H_{ii}$, and the residuals $e = y - \hat{y}$.

Define externally studentized residuals and Cook's distance. Explain how each is used to distinguish outliers (points with large residuals) from influential points (points that shift the fit substantially when removed).

Propose a concrete workflow to flag problematic observations when $\varepsilon$ may be heavy-tailed, so the usual normality assumption is doubtful. Your workflow should include at least one robust alternative to OLS (e.g., LAD regression or an M-estimator) and at least one sensitivity check (e.g., refitting after downweighting or removing flagged points).

Explain how multicollinearity interacts with influence measures. How can high collinearity make leverage diagnostics misleading, and how would you diagnose it?

Hints

Think about why using the full-sample residual variance to standardize residuals can be problematic when the point you are testing might itself be an outlier.
Cook's distance decomposes into two factors -- one measuring residual size and one measuring leverage. Write out the decomposition $D_i = (r_i^2 / p) \cdot h_{ii}/(1 - h_{ii})$ and think about what each factor captures.
For the multicollinearity part, consider how $(X^\top X)^{-1}$ behaves when columns of $X$ are nearly linearly dependent, and why this makes leverage values unreliable on their own.

Worked Solution

How to Think About It: Regression diagnostics is about answering a simple practical question: which data points are driving your results, and should you be worried about that? In quant work -- fitting factor models, calibrating curves, running cross-sectional regressions -- you need to know if one observation is dragging your coefficients around. The core tension is that a point can be an outlier (large residual) without being influential (it might be in a region with lots of other data), and it can be influential without being an outlier (high leverage points often have small residuals precisely because the fit bends toward them). Getting this distinction right is what separates a thoughtful analyst from someone who just runs lm() and moves on.

Key Insight: Leverage $h_{ii}$ tells you how unusual a point's $x$-values are; the studentized residual tells you how unusual its $y$-value is given those $x$-values; Cook's distance combines both to measure the point's actual impact on the fitted coefficients.

The Method:

*Part (i): Definitions and Interpretation*

The internally studentized residual is $r_i = e_i / (s \sqrt{1 - h_{ii}})$, where $s^2 = \text{RSS} / (n - p)$. The problem is that observation $i$'s own residual contaminates $s$. If the point is a massive outlier, $s$ is inflated, and $r_i$ gets pulled back toward zero -- the outlier masks itself.

The externally studentized residual fixes this:

$t_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}}$

where $s_{(i)}^2$ is the residual variance estimated from the regression that omits observation $i$. Under normality, $t_i \sim t_{n-p-1}$, which gives you a clean test: flag points where $|t_i| > t_{n-p-1,\, \alpha/(2n)}$, the upper $\alpha/(2n)$ quantile (a two-sided test, Bonferroni-corrected for $n$ comparisons). This detects outliers -- points whose $y$-values are unexpected given their $x$-values.
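As a rough illustration of the two definitions, the sketch below computes $r_i$ and $t_i$ with NumPy. It avoids refitting $n$ times by using the standard identity $(n-p-1)\,s_{(i)}^2 = (n-p)\,s^2 - e_i^2/(1-h_{ii})$, then checks one value against an explicit leave-one-out refit. The function name, the toy data, and the planted outlier at index 15 are assumptions made for the example, not part of the problem.

```python
import numpy as np

def ols_residual_diagnostics(X, y):
    """Return residuals e, leverages h, internal r_i and external t_i.

    X must already contain the intercept column.
    """
    n, p = X.shape
    # Hat-matrix diagonal from a thin QR factorization: H = Q Q^T.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    s2 = e @ e / (n - p)
    r = e / np.sqrt(s2 * (1.0 - h))
    # Leave-one-out variance without refitting:
    # (n - p - 1) s_(i)^2 = (n - p) s^2 - e_i^2 / (1 - h_ii)
    s2_loo = ((n - p) * s2 - e**2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_loo * (1.0 - h))
    return e, h, r, t

# Hypothetical toy data: a clean line with small deterministic noise
# and one gross outlier planted at index 15.
n = 30
k = np.arange(n)
x = np.linspace(-2.0, 2.0, n)
y = 1.0 + 2.0 * x + 0.3 * np.sin(7.0 * k)
y[15] += 6.0
X = np.column_stack([np.ones(n), x])

e, h, r, t = ols_residual_diagnostics(X, y)

# Brute-force check for observation 15: estimate s_(15) from a refit
# that omits it, then plug into the definition of t_i.
keep = k != 15
b_loo, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
res_loo = y[keep] - X[keep] @ b_loo
s_loo = np.sqrt(res_loo @ res_loo / (n - 1 - X.shape[1]))
t15_direct = e[15] / (s_loo * np.sqrt(1.0 - h[15]))

print(f"r_15 = {r[15]:.2f}, t_15 = {t[15]:.2f}, direct = {t15_direct:.2f}")
```

In this setup $|t_{15}|$ comes out larger than $|r_{15}|$, because the planted outlier no longer inflates its own scale estimate. That is the masking effect described above.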

Cook's distance measures how much all $n$ fitted values change when observation $i$ is deleted:

$D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^\top (X^\top X)(\hat{\beta} - \hat{\beta}_{(i)})}{p \, s^2} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$

Notice that $D_i$ is the product of two factors: how large the residual is (outlier-ness) and how far the point is from the center of the data in $x$-space (leverage). Common rules of thumb flag $D_i > 4/n$ (a sensitive screen) or $D_i > 1$ (a conservative one). Cook's distance detects influential points -- observations that move $\hat{\beta}$ substantially.

The distinction matters: a high-leverage point with a small residual has low $|t_i|$ but can still have large $D_i$ if the leverage is extreme. Conversely, a point with a huge residual but $h_{ii} \approx 1/n$ (sitting at the center of the $x$-cloud) has a large $|t_i|$ but modest $D_i$.
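The contrast between the two diagnostics can be sketched numerically. The example below builds a small hypothetical data set with one low-leverage outlier (index `A`) and one high-leverage point (index `B`), computes $D_i$ from the closed form, and confirms it against the deletion definition. All names and data values are assumptions chosen for illustration.

```python
import numpy as np

def influence_table(X, y):
    """OLS fit plus leverage h, external t_i and Cook's D_i (closed form)."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    s2 = e @ e / (n - p)
    r = e / np.sqrt(s2 * (1.0 - h))
    # Standard identity linking internal and external studentization.
    t = r * np.sqrt((n - p - 1) / (n - p - r**2))
    D = r**2 / p * h / (1.0 - h)
    return beta, s2, h, t, D

def cooks_by_deletion(X, y, beta, s2):
    """Cook's distance from its definition: refit without each point."""
    n, p = X.shape
    XtX = X.T @ X
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        d = beta - b_i
        D[i] = d @ XtX @ d / (p * s2)
    return D

# Hypothetical data: 20 points with x in [-1, 1] plus one far out at x = 10.
x = np.concatenate([np.linspace(-1.0, 1.0, 20), [10.0]])
n = len(x)
y = 1.0 + 2.0 * x + 0.1 * np.sin(7.0 * np.arange(n))
A, B = 10, 20
y[A] += 3.0   # outlier near the center of the x-cloud (low leverage)
y[B] -= 4.0   # high-leverage point pushed off the true line
X = np.column_stack([np.ones(n), x])

beta, s2, h, t, D = influence_table(X, y)
D_del = cooks_by_deletion(X, y, beta, s2)

for name, i in (("A", A), ("B", B)):
    print(f"{name}: h = {h[i]:.3f}, t = {t[i]:.2f}, D = {D[i]:.3f}")
```

With these particular values, point `A` has a large $|t_i|$ but a modest $D_i$, while point `B` has an unremarkable $|t_i|$ and a very large $D_i$. The fit has bent toward `B`, which shrinks its residual.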

*Part (ii): Robust Workflow for Heavy-Tailed Errors*

Run standard OLS and compute $t_i$, $h_{ii}$, $D_i$, and DFFITS $= t_i \sqrt{h_{ii}/(1 - h_{ii})}$ for all observations.

Flag candidate points: $|t_i| > 3$, $h_{ii} > 2p/n$, $D_i > 4/n$. Keep a union of flagged sets.

Fit a robust alternative -- an M-estimator (e.g., Huber or bisquare/Tukey) or Least Absolute Deviations (LAD/median regression). These downweight or ignore large residuals automatically. Compare the robust $\hat{\beta}_{\text{rob}}$ to $\hat{\beta}_{\text{OLS}}$. If they differ substantially (relative to standard errors), your OLS results are being driven by tail observations.

Sensitivity check: Refit OLS after removing or hard-downweighting all flagged points. If the coefficients shift by more than, say, one standard error, then your conclusions depend on a handful of observations -- which is a red flag regardless of the distributional assumption.

Assess the tails: Draw a Q-Q plot of the externally studentized residuals against the $t_{n-p-1}$ distribution. If the tails are heavy, consider whether the robust fit is more trustworthy, or whether the heavy tails are economically meaningful (e.g., crisis observations that should legitimately influence the model).

Document and decide: Report both OLS and robust estimates. In practice, the question is not "should I delete outliers" but "do my conclusions change if I accommodate heavy tails?" If yes, use the robust fit or at least flag the fragility.
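The core of this workflow can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a production implementation. The Huber M-estimator is fitted by iteratively reweighted least squares with the conventional tuning constant $c = 1.345$ and a MAD-based scale, both of which are choices made here rather than prescribed by the problem. The function names and the contaminated toy data are hypothetical.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares via row scaling by sqrt(w)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def ols_flags(X, y):
    """OLS fit, standard errors, and the union of the three flags."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    e = y - X @ beta
    s2 = e @ e / (n - p)
    r = e / np.sqrt(s2 * (1.0 - h))
    t = r * np.sqrt((n - p - 1) / (n - p - r**2))
    D = r**2 / p * h / (1.0 - h)
    flags = (np.abs(t) > 3) | (h > 2 * p / n) | (D > 4 / n)
    se = np.sqrt(s2 * np.diag(XtX_inv))
    return beta, se, flags

def huber_irls(X, y, c=1.345, max_iter=100, tol=1e-10):
    """Huber M-estimator by IRLS (assumed tuning: c = 1.345, MAD scale)."""
    w = np.ones(len(y))
    beta = wls(X, y, w)
    for _ in range(max_iter):
        e = y - X @ beta
        # Robust scale: normalized median absolute deviation.
        scale = np.median(np.abs(e - np.median(e))) / 0.6745
        u = np.abs(e) / max(scale, 1e-12)
        w = c / np.maximum(u, c)  # 1 inside the threshold, c/u beyond it
        new_beta = wls(X, y, w)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, w

# Hypothetical data: true line y = 1 + 2x, three gross errors at high x.
n = 40
x = np.linspace(0.0, 10.0, n)
y = 1.0 + 2.0 * x + 0.2 * np.sin(7.0 * np.arange(n))
y[-3:] += 30.0
X = np.column_stack([np.ones(n), x])

beta_ols, se_ols, flags = ols_flags(X, y)           # OLS fit and flags
beta_hub, w_hub = huber_irls(X, y)                  # robust alternative
beta_trim, _, _ = ols_flags(X[~flags], y[~flags])   # sensitivity refit
shift_in_se = (beta_ols - beta_trim) / se_ols

print("OLS slope:    ", round(beta_ols[1], 3))
print("Huber slope:  ", round(beta_hub[1], 3))
print("Trimmed slope:", round(beta_trim[1], 3))
print("Shift in OLS standard errors:", np.round(shift_in_se, 2))
```

Here the OLS slope is pulled well away from 2 by the three contaminated points, while the Huber fit and the trimmed refit both stay close to it. A gap of that size between `beta_ols` and `beta_hub` is the signal the workflow looks for.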

*Part (iii): Multicollinearity and Influence*

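Multicollinearity distorts leverage values and can make influence diagnostics misleading in several ways:

Leverage inflation: $h_{ii} = x_i^\top (X^\top X)^{-1} x_i$, where $x_i^\top$ is the $i$-th row of $X$. The leverages always sum to $p$, so collinearity does not raise leverage across the board; it changes which points carry it. When columns of $X$ are nearly collinear, $(X^\top X)^{-1}$ is very large along the direction in which the predictors barely vary, so points that are only slightly unusual in any single predictor can have inflated $h_{ii}$ because they are unusual in the direction of the collinear combination, i.e., they break the near-collinear pattern.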

Coefficient instability: Even moderate-leverage points can cause wild swings in individual $\hat{\beta}_j$ values because the coefficients are poorly determined. Cook's distance might look modest (it measures the overall shift in $\hat{\beta}$), but a single coefficient might flip sign.

Diagnosis: Compute the Variance Inflation Factor $\text{VIF}_j = 1/(1 - R_j^2)$ where $R_j^2$ is the $R^2$ from regressing $x_j$ on all other predictors. A $\text{VIF}$ above the usual 5 to 10 range indicates problematic collinearity. Also examine the condition number of the design matrix $X$ with its columns scaled to unit length (the ratio of its largest to smallest singular value) -- values above about 30 signal trouble. The condition number of $X^\top X$ is the square of this, so the threshold of 30 does not apply to it directly. Once you identify collinear groups, look at $\text{DFBETAS}_{ij}$ (the scaled change in each individual coefficient when observation $i$ is removed) rather than just Cook's distance, since a point can be influential for one coefficient without moving the overall fit much.
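These diagnostics can be sketched with NumPy alone. The example assumes a hypothetical design with two nearly identical predictors, where observation 0 is ordinary in each predictor separately but breaks the collinear pattern. DFBETAS uses the closed form $\hat{\beta} - \hat{\beta}_{(i)} = (X^\top X)^{-1} x_i \, e_i / (1 - h_{ii})$, scaled by $s_{(i)} \sqrt{[(X^\top X)^{-1}]_{jj}}$. Variable names, the seed, and the data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.02 * rng.standard_normal(n)   # nearly collinear with x1
x1[0], x2[0] = 0.1, 0.4                   # typical marginally, off-pattern jointly
y = 1.0 + x1 + x2 + 0.5 * rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1]

def vif(X):
    """VIF_j = TSS_j / RSS_j = 1 / (1 - R_j^2); column 0 is the intercept."""
    out = []
    for j in range(1, X.shape[1]):
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        rss = np.sum((X[:, j] - others @ coef) ** 2)
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(tss / rss)
    return np.array(out)

vifs = vif(X)

# Condition number of X after scaling each column to unit length.
X_scaled = X / np.linalg.norm(X, axis=0)
cond_number = np.linalg.cond(X_scaled)

# Leverage and DFBETAS from closed-form deletion formulas.
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
s2 = e @ e / (n - p)
s_loo = np.sqrt(((n - p) * s2 - e**2 / (1.0 - h)) / (n - p - 1))
delta = (X @ XtX_inv) * (e / (1.0 - h))[:, None]   # row i: beta - beta_(i)
dfbetas = delta / (s_loo[:, None] * np.sqrt(np.diag(XtX_inv)))

print("VIFs:", np.round(vifs, 1))
print("Condition number (scaled X):", round(cond_number, 1))
print("Leverage of obs 0:", round(h[0], 3), "vs 2p/n =", round(2 * p / n, 3))
print("DFBETAS of obs 0:", np.round(dfbetas[0], 2))
```

A commonly quoted cutoff for DFBETAS is $|\text{DFBETAS}_{ij}| > 2/\sqrt{n}$. In this setup observation 0 carries the largest leverage even though neither of its predictor values is extreme on its own, which is the situation described under leverage inflation above.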

Answer: Externally studentized residuals detect outliers by using a leave-one-out variance estimate; Cook's distance detects influential points by combining residual size with leverage. Under heavy tails, supplement OLS diagnostics with robust regression (M-estimators or LAD) and sensitivity checks (refit without flagged points). Multicollinearity concentrates leverage on points that break the collinear pattern and destabilizes individual coefficients -- diagnose it with VIF and condition numbers, and use DFBETAS to check per-coefficient influence.

Intuition

The core lesson here is that "outlier" and "influential point" are fundamentally different concepts that get conflated all the time. An outlier is a point that does not fit the model -- large residual. An influential point is one that changes the model -- remove it and the coefficients shift. The two overlap but are not the same, and the most dangerous influential points are the ones that do NOT look like outliers because the regression line has bent to accommodate them. This is why leverage matters: it tells you how much pull a point has on the fit, regardless of whether the fit currently looks good at that point.

In practice -- running factor regressions, calibrating yield curve models, fitting vol surfaces -- you encounter this constantly. A single day's return during a crisis can dominate an entire year's regression. The workflow of comparing OLS to a robust fit is not just a textbook exercise; it is how you figure out whether your model is capturing a real relationship or just chasing a few extreme observations. The multicollinearity angle adds another layer: when predictors are correlated, influence gets distributed in non-obvious ways, and you can miss influential points entirely if you only look at Cook's distance without checking per-coefficient sensitivity via DFBETAS.
