Robust Regression with Heavy-Tailed Noise

Regression · Medium · Free problem

You are building a regression model and you suspect that a small fraction of observations are outliers -- bad data, fat-tailed noise, whatever the cause. Standard least squares is going to be sensitive to those points and you know it.

Compare the objectives of least squares (MSE) and least absolute deviations (MAE) in terms of what conditional summary each one is targeting. Why does that difference matter when your noise distribution has heavy tails?

Propose a concrete robust regression approach using Huber loss. Describe an algorithm to fit it -- specifically, explain how iteratively reweighted least squares (IRLS) works and why it is a natural fit here.

How would you select the Huber tuning parameter $\delta$? And once you have a fitted model, how would you flag influential points?

Hints

Start by asking what each loss function implies about the noise distribution -- OLS and LAD are both MLEs, just under different distributional assumptions.
Huber loss has a closed-form derivative that looks like a weighted residual, which means you can convert the optimization into a sequence of weighted least squares problems (IRLS).
For the tuning parameter, you need a scale estimate that is itself robust to outliers -- the standard OLS residual variance is contaminated; use the MAD instead, then apply Huber's 1.345 rule.

Worked Solution

How to Think About It: When you fit OLS, you are implicitly assuming your noise is Gaussian -- and you are betting the whole fit on that assumption. A single observation 10 standard deviations from the bulk will drag your coefficients meaningfully, because squaring the residual makes that one point count roughly 100 times as much in your objective as a typical one-standard-deviation residual. MAE is better -- it implicitly assumes Laplace noise -- but it has its own problems: the gradient is undefined at zero, and the estimator is less efficient than OLS when the data really are clean. Huber loss threads the needle: squared near zero (efficient), linear in the tails (robust). The real question for an interviewer is: do you know *why* each loss implies a different target, and can you describe a practical algorithm to fit the robust version?

Key Insight: OLS targets the conditional mean and is the MLE under Gaussian noise. LAD targets the conditional median and is the MLE under Laplace noise. Huber loss is the MLE under a density that is Gaussian in the center with Laplace-like (exponential) tails, and its target sits between the two: it treats small residuals the way the mean does and large residuals the way the median does, much as a Winsorized mean would.
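For reference, here is one common way to write the three losses side by side. The symbols $\rho$, $\psi_\delta$, $w$, and $r$ are notation introduced for this sketch and do not come from the text above. The factor $\tfrac{1}{2}$ on the squared loss is a convention that leaves the minimizer unchanged.

$$
\rho_{\text{OLS}}(r) = \tfrac{1}{2} r^2, \qquad
\rho_{\text{LAD}}(r) = |r|, \qquad
\rho_{\delta}(r) =
\begin{cases}
\tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\
\delta \left( |r| - \tfrac{1}{2} \delta \right) & \text{if } |r| > \delta
\end{cases}
$$

$$
\psi_\delta(r) = \rho_\delta'(r) = \max\bigl(-\delta, \min(\delta, r)\bigr), \qquad
w(r) = \frac{\psi_\delta(r)}{r} = \min\!\left(1, \frac{\delta}{|r|}\right)
$$

Here $r = y_i - x_i^T \beta$ is a residual. The derivative $\psi_\delta$ is the residual clipped to $[-\delta, \delta]$, which is what bounds the influence of any single point. The ratio $w(r)$ is the weight a reweighting scheme such as IRLS would assign to that point.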

The Method:

*Part 1: What each objective is really doing*

For a regression $Y = X\beta + \epsilon$, the OLS objective $\min_{\beta} \sum_{i} (y_i - x_i^T \beta)^2$ is the MLE under $\epsilon \sim \mathcal{N}(0, \sigma^2)$. It estimates the conditional mean $E[Y \mid X]$. Squaring the residuals gives outliers quadratically growing influence: a residual of

Intuition

The deeper principle here is that every regression loss function is implicitly a distributional assumption. When you pick OLS, you are not just choosing a convenient objective -- you are assuming Gaussian noise and accepting that your estimator is optimal in that world and fragile outside it. Least absolute deviations is the same logic applied to Laplace noise. Huber loss is an explicit acknowledgment that your model for the noise is uncertain: you trust the bulk to be roughly Gaussian, but you want insurance against the tails being heavier than expected. This is a general pattern in robust statistics -- you trade some efficiency in the clean case for bounded influence in the contaminated case.
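A small numerical check can make the "loss implies a target" idea concrete. The sketch below fits a single constant to contaminated data by brute-force grid search under each loss. The data, the grid, and the helper name `huber` are all made up for illustration, and $\delta = 1.345$ is used directly on the assumption that the clean noise has unit scale.

```python
import numpy as np

def huber(r, delta):
    """Huber loss: quadratic for |r| <= delta, linear beyond (illustrative helper)."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

# Synthetic data: 200 clean points around 5, plus 10 gross outliers at 100.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(5.0, 1.0, size=200), np.full(10, 100.0)])

# Evaluate each loss for every candidate constant c on a fine grid.
grid = np.linspace(0.0, 20.0, 4001)
resid = y[None, :] - grid[:, None]          # shape (n_grid, n_obs)

best_sq = grid[np.argmin((resid**2).sum(axis=1))]
best_abs = grid[np.argmin(np.abs(resid).sum(axis=1))]
best_huber = grid[np.argmin(huber(resid, delta=1.345).sum(axis=1))]

print(f"squared-loss minimizer : {best_sq:.2f}  (sample mean   = {y.mean():.2f})")
print(f"absolute-loss minimizer: {best_abs:.2f}  (sample median = {np.median(y):.2f})")
print(f"Huber-loss minimizer   : {best_huber:.2f}")
```

The squared-loss minimizer lands on the sample mean, which the outliers pull well away from 5. The absolute-loss minimizer lands on the sample median, and the Huber minimizer stays close to the bulk of the data.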

In practice, the IRLS algorithm is worth understanding deeply because the same reweighting idea appears all over applied statistics: M-estimation for location/scale, robust PCA, iterative approaches to generalized linear models, even some forms of regularized regression. The key insight is that linear estimators are easy to compute but fragile, so you make them adaptive by iterating the weights. The $1.345 \hat{\sigma}$ rule for $\delta$ is a classic example of calibrating a robust procedure to a reference distribution -- you lose almost nothing when the Gaussian assumption holds, and you gain substantial protection when it does not.
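The reweighting loop can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation. The function names `mad_scale` and `huber_irls` are hypothetical. The scale is re-estimated from the MAD at every iteration, which is one common choice among several. The weight for each point is $\min(1, \delta / |r_i|)$ with $\delta = 1.345 \hat{\sigma}$. The cutoff of 0.5 used to flag points is an arbitrary threshold chosen for the demo.

```python
import numpy as np

def mad_scale(r):
    """Median absolute deviation, rescaled to estimate sigma under Gaussian noise."""
    return 1.4826 * np.median(np.abs(r - np.median(r)))

def huber_irls(X, y, c=1.345, max_iter=50, tol=1e-8):
    """Huber regression via IRLS (illustrative). Returns (beta, weights, scale)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # start from the OLS fit
    for _ in range(max_iter):
        r = y - X @ beta
        delta = c * max(mad_scale(r), 1e-12)           # delta = 1.345 * robust scale
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)                                # weighted LS via row scaling
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    # Recompute scale and weights at the final coefficients for diagnostics.
    r = y - X @ beta
    scale = max(mad_scale(r), 1e-12)
    w = np.minimum(1.0, c * scale / np.maximum(np.abs(r), 1e-12))
    return beta, w, scale

# Synthetic demo: true line y = 1 + 2x, with 10 of 200 responses shifted by +30.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=n)
outlier_idx = rng.choice(n, size=10, replace=False)
y[outlier_idx] += 30.0

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_hub, w, scale = huber_irls(X, y)
flagged = np.flatnonzero(w < 0.5)                      # heavily downweighted points

print("OLS   coefficients:", np.round(beta_ols, 3))
print("Huber coefficients:", np.round(beta_hub, 3))
print("flagged indices   :", flagged)
```

The final weights double as a simple diagnostic: points with weights far below 1 are the ones the fit has chosen to discount, so they are natural candidates for inspection. Note that Huber weights respond only to large residuals, so a point with extreme $x$ values (high leverage) may need a separate check.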