Robust Regression for Factor Models via IRLS
You are fitting a cross-sectional factor model for excess returns. Your data: a vector of excess returns $y$ and a standardized factor matrix $X$ (each column has zero mean and unit variance). Instead of ordinary least squares, you use Huber loss with parameter $c > 0$:
$\hat{\beta} = \arg\min_{\beta} \sum_{i} \rho_c(y_i - x_i^\top \beta)$
where $\rho_c(u) = \begin{cases} \tfrac{1}{2}u^2 & |u| \leq c \\ c|u| - \tfrac{1}{2}c^2 & |u| > c \end{cases}$.
Address the following:
- Derive the iteratively reweighted least squares (IRLS) update. Show what weight each observation receives as a function of its current residual, and write out the weighted normal equations.
- State conditions under which the Huber M-estimator is asymptotically normal. Give the sandwich form of the asymptotic covariance matrix.
- Explain intuitively when robust regression improves out-of-sample $R^2$ relative to OLS. What has to be true about the return distribution for robustness to help?
Hints
- Write the first-order condition using the Huber influence function $\psi_c(u) = \rho_c'(u)$, then factor it as $\psi_c(u) = w(u) \cdot u$ to expose the weighted least squares structure.
- The IRLS weight for observation $i$ is $w_i = \min(1, c / |r_i|)$, where $r_i$ is the current residual. Plug these into the WLS normal equations and iterate.
- For asymptotic normality, use the sandwich form $V = A^{-1} B A^{-\top}$ where $A = E[\psi_c'(r_i) x_i x_i^\top]$ and $B = E[\psi_c(r_i)^2 x_i x_i^\top]$; for Huber, $A$ simplifies to the expected weight matrix times $X^\top X$.
Worked Solution
How to Think About It: OLS is the maximum likelihood estimator under Gaussian errors. The moment your residuals have fat tails -- which they always do in cross-sectional equity returns -- OLS starts chasing outliers. Every blow-up stock in a given month gets a massive residual and drags your beta estimates toward fitting that one observation. Huber loss says: up to $c$ standard deviations, treat residuals normally; beyond that, cap the influence to $O(c)$ rather than letting it grow as $O(u^2)$. The practical consequence is that IRLS just runs weighted least squares over and over, downweighting large residuals until convergence. The asymptotic theory says this still gives you a consistent, asymptotically normal estimator -- you just pay a small efficiency cost when errors really are Gaussian.
Key Insight: IRLS works because the first-order condition for Huber loss is equivalent to a weighted least squares problem, where the weights depend on the current residuals. Iterating that mapping converges to the M-estimator.
The Method:
*Part 1 -- IRLS derivation.*
The first-order conditions are:
$\sum_i \psi_c(y_i - x_i^\top \beta) \, x_i = 0$
where $\psi_c(u) = \rho_c'(u)$ is the Huber influence function:
$\psi_c(u) = \begin{cases} u & |u| \leq c \\ c \cdot \text{sign}(u) & |u| > c \end{cases}$
Write $\psi_c(u) = w(u) \cdot u$ by defining the weight function:
$w(u) = \frac{\psi_c(u)}{u} = \begin{cases} 1 & |u| \leq c \\ c / |u| & |u| > c \end{cases}$
This is well-defined for $u \neq 0$ (handle zero residuals separately -- weight is 1). The first-order condition becomes:
$\sum_i w(r_i^{(t)}) \, r_i^{(t)} \, x_i = 0 \quad \text{where } r_i^{(t)} = y_i - x_i^\top \beta^{(t)}$
Substituting $r_i = y_i - x_i^\top \beta$:
$X^\top W^{(t)} (y - X\beta) = 0$
where $W^{(t)} = \text{diag}(w_1^{(t)}, \ldots, w_n^{(t)})$ and $w_i^{(t)} = w(r_i^{(t)})$.
Solving for $\beta^{(t+1)}$:
$\boxed{\beta^{(t+1)} = (X^\top W^{(t)} X)^{-1} X^\top W^{(t)} y}$
This is just weighted OLS with time-varying weights. Repeat until convergence (typically 5-20 iterations). The algorithm is called IRLS because the weights are updated at each step using the residuals from the previous step.
Weights interpretation: Observations within $c$ residual units keep weight 1 (treated like OLS). Observations with large residuals get weight $c/|r_i|$, which shrinks toward zero as the residual grows. This is the soft downweighting that makes Huber robust without hard-discarding points.
*Part 2 -- Asymptotic normality.*
Under standard M-estimation theory, let $r_i = y_i - x_i^\top \beta_0$ be the true residuals. Define:
$A = E\left[\psi_c'(r_i) \, x_i x_i^\top\right], \quad B = E\left[\psi_c(r_i)^2 \, x_i x_i^\top\right]$
Required conditions: - The true errors $\varepsilon_i$ are i.i.d. with a density $f$ that is symmetric, continuous at $\pm c$, and has finite second moments. - $X$ is full rank and $\frac{1}{n} X^\top X \to Q$ for a positive definite $Q$. - The loss $\rho_c$ is convex and $\psi_c$ is bounded (which holds by construction).
Then:
$\sqrt{n}(\hat{\beta} - \beta_0) \xrightarrow{d} N(0, V)$
where the sandwich covariance is:
$V = A^{-1} B A^{-\top}$
For the Huber estimator specifically:
$A = E\left[\mathbf{1}\{|r_i| \leq c\} \, x_i x_i^\top\right] + 0 \cdot \mathbf{1}\{|r_i| > c\} = E\left[w(r_i) \, x_i x_i^\top\right]$
(since $\psi_c'(u) = 1$ for $|u| < c$ and $0$ for $|u| > c$, ignoring the kink).
In practice, you estimate $A$ and $B$ by their sample analogs:
$\hat{A} = \frac{1}{n} \sum_i \mathbf{1}\{|\hat{r}_i| \leq c\} x_i x_i^\top, \quad \hat{B} = \frac{1}{n} \sum_i \psi_c(\hat{r}_i)^2 x_i x_i^\top$
If errors are truly Gaussian, this sandwich collapses and Huber efficiency relative to OLS is $\left[2\Phi(c) - 1 - 2c\phi(c)\right] / \left[c^2 (1 - 2\Phi(-c))^2 / F(c)\right]^{-1}$ (roughly 95% efficient at $c = 1.345\sigma$, the standard default).
*Part 3 -- When robust regression beats OLS out-of-sample.*
OLS is BLUE (best linear unbiased estimator) if errors are Gaussian. The moment the error distribution has heavier tails, OLS over-weights the outliers and its beta estimates have inflated variance. That variance translates directly to out-of-sample $R^2$: a noisy beta estimate has higher prediction error.
Robust regression improves out-of-sample $R^2$ when:
- Fat-tailed idiosyncratic returns. Cross-sectional return distributions routinely have excess kurtosis of 5-10. A stock that gaps on an unexpected earnings print has a huge residual -- OLS chases it, robust regression ignores it.
- Outliers carry no information about factor betas. If the outliers are truly idiosyncratic shocks (not factor-driven), there is no signal in them, only noise. Downweighting them reduces estimator variance at zero bias cost.
- Small cross-section or unstable factor loadings. With $n = 300$ stocks, even a handful of extreme residuals can meaningfully shift your beta estimates. Robustness matters most when $n/k$ (observations per parameter) is not huge.
Robust regression does NOT help when: - Errors are genuinely Gaussian (you pay an efficiency penalty for nothing). - The outliers are informative (e.g., a sector rotation where the top/bottom decile stocks actually tell you something about factor exposure). - $c$ is mis-tuned (too large: no robustness; too small: too much bias, you are now doing a near-median regression).
Answer: IRLS update is $\beta^{(t+1)} = (X^\top W^{(t)} X)^{-1} X^\top W^{(t)} y$ with weights $w_i = \min(1, c/|r_i^{(t)}|)$. The estimator is asymptotically normal under i.i.d. symmetric errors with the sandwich covariance $V = A^{-1} B A^{-\top}$. Robust regression improves out-of-sample $R^2$ when residuals are fat-tailed and outliers carry no factor signal -- the typical regime for cross-sectional equity returns.
Intuition
The core insight is that OLS and Huber regression differ only in how they treat large residuals. OLS has a quadratic loss, so the gradient of the loss grows linearly with the residual -- a residual of 10 exerts 10 times more pull on $\hat{\beta}$ than a residual of 1. Huber caps that pull at $c$, so tail observations get bounded influence. IRLS is just a clever way to solve the resulting non-linear equations: at each step, pretend the weights are fixed, solve a weighted least squares problem, then update the weights. It is coordinate descent on a smooth convex problem, and it converges reliably.
In cross-sectional factor models, this matters more than people realize. Monthly cross-sectional return distributions routinely have kurtosis well above 3 -- think biotechs gapping on FDA decisions, or small-caps with short squeezes. OLS treats these as informative data points about factor loadings, which they are not. A robust estimator quarantines these outliers, producing beta estimates that are more stable month-to-month and better calibrated for out-of-sample prediction. The practical lesson: always look at your residual distribution before deciding whether to use OLS or a robust alternative. If you see $|z| > 4$ residuals for more than 1-2% of your cross-section, Huber (or Tukey bisquare) is almost certainly worth the added complexity.