Loss Function Minimizers and Regression Variants

Regression · Medium · Free problem

Answer the following questions about optimization and regression.

What value of $A$ minimizes $\sum_{i=1}^{n} (x_i - A)^2$? Prove it.

What value of $A$ minimizes $\sum_{i=1}^{n} |x_i - A|$? Why?

Consider the linear model $Y = X\beta$ with squared-error loss $L(\beta) = \|Y - X\beta\|^2$.

(a) Derive the OLS estimator $\hat{\beta}$.

(b) What goes wrong if $X^T X$ is singular or ill-conditioned?

(d) How is the lasso estimator computed, and why does it not have a closed-form solution?

Hints

For parts 1 and 2, think about which summary statistic minimizes each loss function. Squared loss and absolute loss have fundamentally different minimizers.
For OLS, take the matrix gradient of $(Y - X\beta)^T(Y - X\beta)$ and set it to zero. Ridge regression just adds $\lambda I$ to $X^T X$.
Lasso has no closed form in general because the penalty term $|\beta_j|$ is not differentiable at zero. It is typically solved iteratively, with a soft-thresholding operator applied coordinate by coordinate.
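The OLS and ridge hints can be sketched numerically. Setting the gradient $-2X^T(Y - X\beta)$ to zero gives the normal equations $X^T X \beta = X^T Y$, and ridge solves $(X^T X + \lambda I)\beta = X^T Y$ instead. The example below is a minimal sketch on simulated data; the sample size, noise level, the near-duplicate column, and $\lambda = 1$ are all assumptions chosen for illustration, not values from the problem.

```python
import numpy as np

def ols(X, y):
    """Solve the normal equations X^T X beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Simulated data (illustrative assumption).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)  # almost an exact copy of x1

X_good = np.column_stack([x1, rng.normal(size=n)])  # roughly independent columns
X_bad = np.column_stack([x1, x2])                   # near-collinear columns

beta_true = np.array([1.0, 2.0])
y_good = X_good @ beta_true + 0.1 * rng.normal(size=n)
y_bad = X_bad @ beta_true + 0.1 * rng.normal(size=n)

# On a well-conditioned design, the normal equations agree with a
# least-squares solver.
beta_ols = ols(X_good, y_good)
beta_lstsq = np.linalg.lstsq(X_good, y_good, rcond=None)[0]

# Condition numbers: near-collinearity makes X^T X nearly singular,
# and adding lam * I lifts the smallest eigenvalue away from zero.
cond_good = np.linalg.cond(X_good.T @ X_good)
cond_bad = np.linalg.cond(X_bad.T @ X_bad)
cond_bad_ridge = np.linalg.cond(X_bad.T @ X_bad + 1.0 * np.eye(2))

# Ridge stays well defined on the near-collinear design.
beta_ridge = ridge(X_bad, y_bad, lam=1.0)

print(cond_good, cond_bad, cond_bad_ridge)
print(beta_ridge)
```

With two nearly identical columns the data cannot separate the two coefficients, so ridge splits the combined effect roughly evenly between them rather than recovering the individual values.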

Worked Solution

How to Think About It: These four questions form a natural progression from the simplest optimization problem (minimize squared loss over a scalar) to the full machinery of regularized regression. The common thread is: the loss function you choose determines which summary statistic you recover. Squared loss gives you the mean, absolute loss gives you the median, and adding penalty terms to squared loss gives you shrunken versions of OLS that trade bias for stability.

Formal Solution:

Part 1: Minimizer of $L_2$ loss -- the mean.

Take the derivative and set it to zero:

$\frac{d}{dA} \sum_{i=1}^{n} (x_i - A)^2 = -2 \sum_{i=1}^{n} (x_i - A) = 0$

$\sum_{i=1}^{n} x_i = nA \implies A^{*} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

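The second derivative is $\frac{d^2}{dA^2} \sum_{i=1}^{n} (x_i - A)^2 = 2n > 0$, so the objective is strictly convex and $A^{*} = \bar{x}$ is the unique global minimizer.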
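As a quick sanity check of Part 1, the first-order condition can be verified numerically. This is a sketch on a made-up sample (the five values and the search grid are assumptions for illustration): the gradient vanishes at the sample mean, and a brute-force scan over candidate values of $A$ lands on the same point.

```python
import numpy as np

def squared_loss(x, a):
    """Sum of squared deviations of the sample x from the candidate value a."""
    return float(np.sum((x - a) ** 2))

# Made-up sample for illustration only.
x = np.array([2.0, 3.0, 5.0, 7.0, 13.0])
x_bar = float(x.mean())  # 30 / 5 = 6.0

# First-order condition: -2 * sum(x_i - A) is zero at A = mean.
gradient_at_mean = -2.0 * float(np.sum(x - x_bar))

# Brute-force scan over a grid with step 0.01, which contains 6.0.
grid = np.linspace(0.0, 15.0, 1501)
losses = np.array([squared_loss(x, a) for a in grid])
best_on_grid = float(grid[np.argmin(losses)])

print(x_bar, gradient_at_mean, best_on_grid)
```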

Intuition

This sequence of problems illustrates a deep principle: the loss function you choose determines the estimator you get, and each choice embeds different assumptions about your data. Squared loss produces the mean, which is optimal under Gaussian noise but sensitive to outliers. Absolute loss produces the median, which is robust but harder to optimize because the loss is not differentiable at the data points. OLS is the multivariate extension of mean estimation, and its vulnerabilities (a singular or ill-conditioned $X^T X$, unstable coefficient estimates) show up when there is too little data or too much collinearity among the predictors.
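The mean-versus-median contrast can be seen on a small example. The subgradient of $\sum_i |x_i - A|$ is $-\sum_i \operatorname{sign}(x_i - A)$, which is zero when as many points lie above $A$ as below it, which is the defining property of the median. The sketch below uses a made-up sample with one outlier (the data and the grid are assumptions for illustration) and then makes the outlier more extreme to compare how far each statistic moves.

```python
import numpy as np

def absolute_loss(x, a):
    """Sum of absolute deviations of the sample x from the candidate value a."""
    return float(np.sum(np.abs(x - a)))

# Made-up sample; the last value plays the role of an outlier.
x = np.array([1.0, 2.0, 4.0, 8.0, 100.0])
median = float(np.median(x))  # 4.0
mean = float(x.mean())        # 115 / 5 = 23.0

# Brute-force scan: the absolute loss is minimized at the median.
grid = np.linspace(0.0, 100.0, 10001)  # step 0.01
abs_losses = np.array([absolute_loss(x, a) for a in grid])
best_abs = float(grid[np.argmin(abs_losses)])

# Make the outlier ten times larger and see which statistic moves.
x_shifted = x.copy()
x_shifted[-1] = 1000.0
mean_shift = float(x_shifted.mean()) - mean            # mean moves by 180
median_shift = float(np.median(x_shifted)) - median    # median does not move

print(best_abs, mean_shift, median_shift)
```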

Ridge and lasso are the two canonical regularization strategies, and their difference comes down to geometry: the $L_2$ ball is round, so ridge shrinks all coefficients smoothly toward zero without zeroing any of them, while the $L_1$ ball has corners on the coordinate axes, so lasso can drive some coefficients exactly to zero. This is why lasso performs feature selection and ridge does not. In practice, elastic net (a blend of both penalties) is a common choice when you want some of the sparsity of lasso together with the stability of ridge.
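The sparsity contrast can be illustrated with a small coordinate-descent sketch. It assumes the objective $\frac{1}{2}\|Y - X\beta\|^2 + \lambda \|\beta\|_1$ (the factor $\frac{1}{2}$ is a scaling convention chosen here, not taken from the problem statement), for which each coordinate update is a soft-thresholding step. The simulated data, the value of `lam`, the fixed iteration count, and the function names are all illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; returns exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)  # x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # Inner product of column j with the residual that excludes
            # coordinate j's own contribution.
            rho = X[:, j] @ residual + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            # Keep the residual in sync with the updated coefficient.
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Simulated data with two active features out of five (assumption).
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_lasso = lasso_cd(X, y, lam=20.0)
beta_ridge = ridge(X, y, lam=20.0)

print("lasso:", beta_lasso)
print("ridge:", beta_ridge)
```

In this setup the lasso coefficients on the inactive features come out as exact zeros, because the soft-threshold returns 0 whenever the correlation with the residual is at most `lam` in magnitude. The ridge coefficients on the same features are small but nonzero.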