Regression Diagnostics: Outliers, Influence, and Multicollinearity
You are running an OLS regression $y = X\beta + \varepsilon$ with an intercept, where $X$ is an $n \times p$ design matrix of full column rank (the intercept column counts toward $p$). Define the hat matrix $H = X(X^\top X)^{-1}X^\top$, the leverage of observation $i$ as $h_{ii} = H_{ii}$, and the residuals $e = y - \hat{y}$.
- Define externally studentized residuals and Cook's distance. Explain how each is used to distinguish outliers (points with large residuals) from influential points (points that shift the fit substantially when removed).
- Propose a concrete workflow to flag problematic observations when $\varepsilon$ may be heavy-tailed, so the usual normality assumption is doubtful. Your workflow should include at least one robust alternative to OLS (e.g., LAD regression or an M-estimator) and at least one sensitivity check (e.g., refitting after downweighting or removing flagged points).
- Explain how multicollinearity interacts with influence measures. How can high collinearity make leverage diagnostics misleading, and how would you diagnose it?
Hints
- Think about why using the full-sample residual variance to standardize residuals can be problematic when the point you are testing might itself be an outlier.
- Cook's distance factors into two terms -- one measuring residual size and one measuring leverage. Write out the decomposition $D_i = (r_i^2 / p) \cdot h_{ii}/(1 - h_{ii})$, where $r_i$ is the internally studentized residual, and think about what each factor captures.
- For the multicollinearity part, consider how $(X^\top X)^{-1}$ behaves when columns of $X$ are nearly linearly dependent, and why this makes leverage values unreliable on their own.
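To make the last hint concrete, here is a minimal sketch for the special case of an intercept plus exactly two predictors. In that case both variance inflation factors equal $1/(1-\rho^2)$, where $\rho$ is the sample correlation of the two predictors, and the corresponding diagonal entry of the inverse centered cross-product matrix is the VIF divided by that predictor's own sum of squares. The helper name `two_predictor_vif` and the toy data are illustrative assumptions, not part of the problem.

```python
def two_predictor_vif(x1, x2):
    """VIF for a model with an intercept and exactly two predictors.

    Illustrative helper (hypothetical name). Returns the shared VIF, the
    (1,1) entry of the inverse centered cross-product matrix, and the
    centered sum of squares of x1.
    """
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))

    rho2 = s12 * s12 / (s11 * s22)      # squared sample correlation
    vif = 1.0 / (1.0 - rho2)            # blows up as rho2 -> 1

    # Closed-form 2x2 inverse: entry (1,1) is s22 / det.
    inv_11 = s22 / (s11 * s22 - s12 * s12)
    return vif, inv_11, s11


# Toy data (made up for illustration).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_near = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]   # almost a copy of x1
x2_far = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]    # only loosely related to x1

vif_near, inv_near, s11 = two_predictor_vif(x1, x2_near)
vif_far, inv_far, _ = two_predictor_vif(x1, x2_far)
print(f"near-collinear: VIF = {vif_near:.1f}, inverse diagonal = {inv_near:.3f}")
print(f"loosely related: VIF = {vif_far:.2f}, inverse diagonal = {inv_far:.3f}")
```

On this toy data the nearly collinear pair yields a VIF far above the common rule-of-thumb threshold of 10, while the loosely related pair stays modest. With more than two predictors the same idea applies, but each VIF then comes from regressing one predictor on all the others.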
Worked Solution
How to Think About It: Regression diagnostics answer a simple practical question: which data points are driving your results, and should you be worried about that? In quant work -- fitting factor models, calibrating curves, running cross-sectional regressions -- you need to know whether one observation is dragging your coefficients around. The core tension is that a point can be an outlier (large residual) without being influential (it might sit in a region with lots of other data), and it can be influential without being an outlier (high-leverage points often have small residuals precisely because the fit bends toward them). Getting this distinction right is what separates a thoughtful analyst from someone who just runs lm() and moves on.
Key Insight: Leverage $h_{ii}$ tells you how unusual a point's $x$-values are; the studentized residual tells you how unusual its $y$-value is given those $x$-values; Cook's distance combines both to measure the point's actual impact on the fitted coefficients.
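As a minimal sketch of how these three quantities relate, the snippet below computes them for a simple regression with an intercept and one predictor ($p = 2$), where leverage has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The function name `simple_ols_diagnostics` and the toy data are illustrative assumptions. The externally studentized residual is obtained from the internal one through the standard identity $t_i = r_i \sqrt{(n-p-1)/(n-p-r_i^2)}$, so no refitting is needed.

```python
import math


def simple_ols_diagnostics(x, y):
    """Leverage, studentized residuals, and Cook's distance for y = a + b*x.

    Illustrative helper (hypothetical name); p = 2 (intercept and slope).
    """
    n, p = len(x), 2
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)   # full-sample variance estimate

    rows = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx          # leverage h_ii
        r = e / math.sqrt(s2 * (1.0 - h))              # internally studentized
        # Externally studentized; assumes r*r < n - p, which fails only if
        # the fit with this point deleted is exact.
        t = r * math.sqrt((n - p - 1) / (n - p - r * r))
        cook = (r * r / p) * h / (1.0 - h)             # Cook's distance
        rows.append({"leverage": h, "internal": r, "external": t, "cook": cook})
    return rows


# Toy data (made up): index 5 is off-trend in y; index 9 is far out in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0, 7.0, 8.1, 8.9, 20.2]

rows = simple_ols_diagnostics(x, y)
for i, row in enumerate(rows):
    print(i, {k: round(v, 3) for k, v in row.items()})
```

In this toy data the point at $x = 20$ has by far the largest leverage but sits on the trend, so its residual is small. The point at $x = 6$ has low leverage and the largest studentized residual. Its external value is much larger in magnitude than its internal one, because removing it shrinks the variance estimate.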
The Method:
*Part (i): Definitions and Interpretation*
The internally studentized residual is $r_i = e_i / (s \sqrt{1 - h_{ii}})$, where $s^2 = \text{RSS} / (n - p)$. The problem is that observation $i