Robust Regression with Heavy-Tailed Noise

Regression · Medium · Free problem

You are building a regression model and you suspect that a small fraction of observations are outliers -- bad data, fat-tailed noise, whatever the cause. Standard least squares is going to be sensitive to those points and you know it.

  1. Compare the objectives of least squares (MSE) and least absolute deviations (MAE) in terms of what conditional summary each one is targeting. Why does that difference matter when your noise distribution has heavy tails?
  1. Propose a concrete robust regression approach using Huber loss. Describe an algorithm to fit it -- specifically, explain how iteratively reweighted least squares (IRLS) works and why it is a natural fit here.
  1. How would you select the Huber tuning parameter $\delta$? And once you have a fitted model, how would you flag influential points?

Hints

  1. Start by asking what each loss function implies about the noise distribution -- OLS and LAD are both MLEs, just under different distributional assumptions.
  2. Huber loss has a closed-form derivative that looks like a weighted residual, which means you can convert the optimization into a sequence of weighted least squares problems (IRLS).
  3. For the tuning parameter, you need a scale estimate that is itself robust to outliers -- the standard OLS residual variance is contaminated; use the MAD instead, then apply Huber's 1.345 rule.

Worked Solution

How to Think About It: When you fit OLS, you are implicitly assuming your noise is Gaussian -- but you are not just assuming that; you are betting the whole fit on it. A single observation 10 standard deviations from the bulk will drag your coefficients meaningfully, because squaring the residual turns that one point into a catastrophically large term in your objective. MAE is better -- it is implicitly assuming Laplace noise -- but it has its own problem: the gradient is undefined at zero and the estimator can be inefficient when the data really is clean. Huber loss threads the needle: squared near zero (efficient), linear in the tails (robust). The real question for an interviewer is: do you know *why* each loss implies a different target, and can you describe a practical algorithm to fit the robust version?

Key Insight: OLS targets the conditional mean and is the MLE under Gaussian noise. LAD targets the conditional median and is the MLE under Laplace noise. Huber loss corresponds to a Gaussian-Laplace mixture and targets a trimmed or Winsorized mean -- it smoothly interpolates between the two.

The Method:

*Part 1: What each objective is really doing*

For a regression $Y = X\beta + \epsilon$, the OLS objective $\min_{\beta} \sum_{i} (y_i - x_i^T \beta)^2$ is the MLE under $\epsilon \sim \mathcal{N}(0, \sigma^2)$. It estimates the conditional mean $E[Y \mid X]$. Squaring the residuals gives outliers quadratically growing influence: a residual of

\sigma$ contributes 4 times as much as one of $\sigma$, a residual of $5\sigma$ contributes 25 times as much. Under heavy-tailed noise, this blows up.

The LAD objective $\min_{\beta} \sum_{i} |y_i - x_i^T \beta|$ is the MLE under $\epsilon \sim \text{Laplace}(0, b)$. It estimates the conditional median. Influence grows only linearly with the residual, so outliers are down-weighted rather than amplified. The catch: it can be inefficient when the bulk of the data really is Gaussian.

*Part 2: Huber loss and IRLS*

Huber loss combines the two: $L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & |r| \leq \delta \\ \delta \left(|r| - \frac{\delta}{2}\right) & |r| > \delta \end{cases}$

For small residuals it behaves like OLS (quadratic, efficient). For large residuals it behaves like LAD (linear, robust). The parameter $\delta$ is the cutoff between the two regimes.

The natural fitting algorithm is iteratively reweighted least squares (IRLS). The idea: Huber loss can be written as a weighted sum of squared residuals, where the weights depend on the current residuals. So you alternate between (a) computing weights from current residuals, and (b) solving a weighted OLS problem with those weights.

Formally, the gradient of the Huber objective at $\beta$ is: $\frac{\partial}{\partial \beta} \sum_i L_\delta(r_i) = -\sum_i \psi_\delta(r_i) x_i$ where $\psi_\delta(r) = \min(\delta, \max(-\delta, r))$ is the Huber influence function (the derivative of $L_\delta$). You can rewrite the normal equations as a weighted least squares problem with weight $w_i = \psi_\delta(r_i) / r_i$ (set $w_i = 1$ if $r_i = 0$).

IRLS algorithm: 1. Initialize $\hat{\beta}^{(0)}$ via OLS. 2. Compute residuals $r_i^{(t)} = y_i - x_i^T \hat{\beta}^{(t)}$. 3. Compute weights $w_i^{(t)} = \min(1, \delta / |r_i^{(t)}|)$. 4. Solve weighted OLS: $\hat{\beta}^{(t+1)} = (X^T W^{(t)} X)^{-1} X^T W^{(t)} y$. 5. Repeat steps 2-4 until convergence (e.g., $\|\hat{\beta}^{(t+1)} - \hat{\beta}^{(t)}\|_2 < \varepsilon$).

Note what this is doing: outlier observations (large $|r_i|$) get downweighted toward $\delta / |r_i|$, while well-behaved observations keep $w_i = 1$ and contribute fully. The algorithm is just reweighted OLS at each step, so each iteration is fast and numerically stable.

*Part 3: Selecting $\delta$ and flagging influential points*

Tuning $\delta$: The standard choice is $\delta = 1.345 \hat{\sigma}$, which gives 95% efficiency relative to OLS when the errors are truly Gaussian (due to Huber's theoretical analysis). You need a robust estimate of scale -- use the normalized MAD: $\hat{\sigma} = \frac{\text{median}(|r_i - \text{median}(r_i)|)}{0.6745}$ Do not use the OLS residual variance for this -- it is contaminated by outliers. You can also cross-validate $\delta$ directly if you have enough data, but the

.345 \hat{\sigma}$ rule is the standard default.

Flagging influential points: After fitting, flag observations using: - Standardized residuals: $|\hat{r}_i| / \hat{\sigma} > 2.5$ is a common threshold. - Cook's distance: measures how much all fitted values change when observation $i$ is deleted. $D_i > 4/n$ or $D_i > 1$ (stricter) are common flags. - Hat matrix leverage: $h_{ii} = x_i^T (X^T X)^{-1} x_i > 2p/n$ flags high-leverage points (points unusual in $X$-space, independent of whether their $Y$ value is an outlier). - DFFITS / DFBETAS: direct measures of how each observation shifts individual fitted values or coefficients.

In practice: flag points with both high leverage and large standardized residual -- those are the ones actually distorting your fit.

Answer: OLS targets the conditional mean (MLE under Gaussian noise) and gives outliers quadratic influence; LAD targets the conditional median (MLE under Laplace noise) and gives outliers linear influence. Huber loss blends the two with parameter $\delta$. Fit it via IRLS -- reweight each observation by $\min(1, \delta/|r_i|)$ and solve weighted OLS at each step. Set $\delta = 1.345 \hat{\sigma}$ using the MAD-based scale estimate. Flag influential points by combining standardized residuals, Cook's distance, and hat-matrix leverage.

Intuition

The deeper principle here is that every regression loss function is implicitly a distributional assumption. When you pick OLS, you are not just choosing a convenient objective -- you are assuming Gaussian noise and accepting that your estimator is optimal in that world and fragile outside it. Least absolute deviations is the same logic applied to Laplace noise. Huber loss is an explicit acknowledgment that your model for the noise is uncertain: you trust the bulk to be roughly Gaussian, but you want insurance against the tails being heavier than expected. This is a general pattern in robust statistics -- you trade some efficiency in the clean case for bounded influence in the contaminated case.

In practice, the IRLS algorithm is worth understanding deeply because the same reweighting idea appears all over applied statistics: M-estimation for location/scale, robust PCA, iterative approaches to generalized linear models, even some forms of regularized regression. The key insight is that linear estimators are easy to compute but fragile, so you make them adaptive by iterating the weights. The

.345 \hat{\sigma}$ rule for $\delta$ is a classic example of calibrating a robust procedure to a reference distribution -- you lose almost nothing when the Gaussian assumption holds, and you gain substantial protection when it does not.

Open the full interactive solver →