Ridge and LASSO: Closed-Form Solutions and Gradient Descent
Consider two standard regularized linear regression objectives. In both cases you have a design matrix $X \in \mathbb{R}^{n \times p}$, a response vector $y \in \mathbb{R}^n$, a coefficient vector $\beta \in \mathbb{R}^p$, and a regularization parameter $\lambda > 0$.
Ridge (L2 penalty): minimize $\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
LASSO (L1 penalty): minimize $\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
- Derive the closed-form solution for the Ridge regression coefficients.
- Explain why the LASSO objective does not admit a closed-form solution in general, and derive the solution for the special case of a single predictor (the soft-thresholding operator).
- Write out the gradient descent update rule for Ridge. Then explain how you would modify the optimization approach for LASSO, given that the L1 penalty is not differentiable everywhere.
Hints
- Both objectives are convex, but only one is differentiable everywhere. Think about what happens when you try to take the gradient of $|\beta_j|$ at zero.
- For Ridge, expand the objective, differentiate with respect to $\beta$, and solve the resulting linear system. The key identity is $(X^T X + \lambda I)^{-1}$.
- For LASSO optimization, consider splitting the problem into a smooth part (squared loss) and a non-smooth part (L1 penalty). The proximal operator for the L1 norm is the soft-thresholding function.
Worked Solution
How to Think About It: Both Ridge and LASSO add a penalty to ordinary least squares to prevent overfitting, but the geometry of their penalties is fundamentally different. Ridge uses a squared L2 penalty -- smooth and differentiable everywhere -- so you can take a derivative, set it to zero, and solve. LASSO uses an L1 penalty -- it has a kink at zero for each coordinate -- so standard calculus breaks down. This kink is not a bug; it is the feature that drives coefficients exactly to zero and produces sparse models. When an interviewer asks this, they want to see that you understand why one has a closed form and the other does not, and that you know the practical workarounds for optimizing the non-smooth objective.
Part 1: Ridge Closed-Form Solution
The Ridge objective is:
$L(\beta) = (y - X\beta)^T(y - X\beta) + \lambda \beta^T \beta$
Expanding and taking the gradient with respect to $\beta$:
$\nabla_\beta L = -2X^T(y - X\beta) + 2\lambda \beta$
Setting the gradient to zero:
$X^T X \beta + \lambda \beta = X^T y$
$(X^T X + \lambda I)\beta = X^T y$
$\hat{\beta}_{\text{Ridge}} = (X^T X + \lambda I)^{-1} X^T y$
Notice that $X^T X + \lambda I$ is always invertible for $\lambda > 0$, even when $X^T X$ is singular (i.e., when $p > n$ or columns are collinear). Adding $\lambda I$ shifts every eigenvalue of $X^T X$ up by $\lambda$, so none are zero. This is one of the practical advantages of Ridge -- it is always well-defined.
Part 2: Why LASSO Has No Closed Form
The LASSO objective is:
$L(\beta) = \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
The problem is the term $|\beta_j|$. It is not differentiable at $\beta_j = 0$. You cannot take a standard gradient and set it to zero across all coordinates simultaneously (when $p > 1$ and predictors are correlated), because the non-smooth terms interact through $X$.
However, for the single-predictor case (or equivalently, one coordinate update with all others fixed), we can solve explicitly. Consider minimizing over a single scalar $\beta_j$:
$L(\beta_j) = (z_j - \beta_j)^2 + \lambda |\beta_j|$
where $z_j$ is the partial residual OLS coefficient (the least-squares estimate using the residual after removing the contribution of all other predictors). Using subgradient analysis:
- If $z_j > \lambda / 2$: the optimum is $\hat{\beta}_j = z_j - \lambda / 2$
- If $z_j < -\lambda / 2$: the optimum is $\hat{\beta}_j = z_j + \lambda / 2$
- If $|z_j| \leq \lambda / 2$: the optimum is $\hat{\beta}_j = 0$
This is the soft-thresholding operator:
$\hat{\beta}_j = \text{sign}(z_j)\left(|z_j| - \frac{\lambda}{2}\right)_+$
where $(x)_+ = \max(x, 0)$. Coefficients with small OLS estimates get killed to exactly zero -- this is the sparsity mechanism of LASSO.
Part 3: Gradient Descent for Ridge and LASSO
Ridge gradient descent is straightforward. The gradient is smooth, so the standard update with learning rate $\eta$ is:
$\beta \leftarrow \beta - \eta \left( X^T(X\beta - y) + \lambda \beta \right)$
Note the penalty contributes an additive $\lambda \beta$ term -- this shrinks coefficients toward zero by a multiplicative factor each step (sometimes called "weight decay" in ML).
LASSO requires more care because $|\beta_j|$ is non-differentiable at zero. Three common approaches:
- Subgradient descent: Replace the gradient of $|\beta_j|$ with a subgradient: use $\text{sign}(\beta_j)$ when $\beta_j \neq 0$ and any value in $[-1, 1]$ when $\beta_j = 0$. This converges but slowly -- $O(1/\sqrt{t})$ vs. $O(1/t)$ for smooth problems -- and it does not produce exact zeros.
- Proximal gradient descent (ISTA): This is the standard approach. Take a gradient step on the smooth part (the squared loss), then apply the proximal operator for the L1 penalty (which is just the soft-thresholding operator from Part 2):
$\beta^{(t+1/2)} = \beta^{(t)} - \eta X^T(X\beta^{(t)} - y)$
$\beta^{(t+1)}_j = \text{sign}(\beta^{(t+1/2)}_j)\left(|\beta^{(t+1/2)}_j| - \eta \lambda\right)_+$
This converges at rate $O(1/t)$ and does produce exact zeros.
- Coordinate descent: Cycle through coordinates one at a time, applying the single-variable soft-thresholding solution from Part 2. This is the method used by
glmnetand is typically the fastest in practice.
Answer:
- Ridge: $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$ (closed-form, always invertible).
- LASSO: no general closed-form due to the non-differentiability of the L1 penalty at zero. The single-coordinate solution is the soft-thresholding operator $\hat{\beta}_j = \text{sign}(z_j)(|z_j| - \lambda/2)_+$.
- Ridge uses standard gradient descent with update $\beta \leftarrow \beta - \eta(X^T(X\beta - y) + \lambda \beta)$. LASSO is best solved by proximal gradient descent (gradient step then soft-threshold) or coordinate descent.
Intuition
The core difference between Ridge and LASSO comes down to the geometry of their constraint sets. The L2 ball is round -- it shrinks all coefficients smoothly toward zero but never reaches it. The L1 diamond has corners on the axes, and those corners are where the penalty contours meet the loss contours for typical data. That is why LASSO produces exact zeros and Ridge does not. This geometric picture also explains why LASSO has no closed-form: the corners create non-differentiable points that prevent a clean matrix-algebra solution.
In practice, this distinction matters enormously. If you are building a linear model with hundreds of features and suspect most are noise, LASSO (or elastic net) will do feature selection for you automatically. Ridge is better when you believe all features contribute but want to stabilize estimates against collinearity. Understanding the optimization landscape -- smooth vs. non-smooth, gradient descent vs. proximal methods -- is not just an academic exercise; it directly affects how you tune and debug these models in production.