Online Ridge Regression via Sherman-Morrison

Regression · Hard · Free problem

You receive features $x_t \in \mathbb{R}^p$ and responses $y_t \in \mathbb{R}$ one observation at a time. You want to maintain the ridge regression solution:

$\hat{\beta}_t = (X_t^\top X_t + \lambda I)^{-1} X_t^\top y_t$

online, where $X_t$ is the $t \times p$ design matrix and $\lambda > 0$ is the ridge parameter.

Using the Sherman-Morrison formula, derive an $O(p^2)$-per-update recursion for the inverse matrix $A_t^{-1} = (X_t^\top X_t + \lambda I)^{-1}$ given $A_{t-1}^{-1}$ and the new observation $x_t$.

Derive the corresponding update for $\hat{\beta}_t$ in terms of $\hat{\beta}_{t-1}$, $A_t^{-1}$, $x_t$, and $y_t$.

What numerical stability safeguards should be used in practice?

Hints

The new observation adds a rank-1 term $x_t x_t^\top$ to the Gram matrix. What formula inverts a matrix after a rank-1 perturbation?
Apply Sherman-Morrison: $(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}$ with $u = v = x_t$. Define a gain vector $k_t$ to simplify notation.
For the coefficient update, substitute the Sherman-Morrison form of $A_t^{-1}$ into $\hat{\beta}_t = A_t^{-1}(z_{t-1} + x_t y_t)$ and simplify. You should get $\hat{\beta}_t = \hat{\beta}_{t-1} + k_t(y_t - x_t^\top \hat{\beta}_{t-1})$.

Worked Solution

How to Think About It: Recomputing the full ridge solution from scratch each time a new observation arrives costs $O(p^3)$ for the matrix inversion. Since each new observation $x_t$ adds a rank-1 update $x_t x_t^\top$ to the Gram matrix, the Sherman-Morrison formula lets you update the inverse in $O(p^2)$. This is the same idea behind recursive least squares (RLS), which is a workhorse in signal processing and online learning. The key is to maintain both the inverse matrix and the coefficient vector incrementally.

Key Insight: The matrix $A_t = X_t^\top X_t + \lambda I = A_{t-1} + x_t x_t^\top$ differs from $A_{t-1}$ by a rank-1 matrix, so Sherman-Morrison applies directly.

The Method:

Part 1: Inverse update via Sherman-Morrison.

We have $A_t = A_{t-1} + x_t x_t^\top$. The Sherman-Morrison formula gives:

$A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{1 + x_t^\top A_{t-1}^{-1} x_t}$

Define the gain vector: $k_t = \frac{A_{t-1}^{-1} x_t}{1 + x_t^\top A_{t-1}^{-1} x_t}$

Then: $A_t^{-1} = A_{t-1}^{-1} - k_t x_t^\top A_{t-1}^{-1}$

Cost: computing $A_{t-1}^{-1} x_t$ is $O(p^2)$, the scalar denominator is $O(p)$, and the outer product update is $O(p^2)$. Total: $O(p^2)$ per step.

Initialization: $A_0^{-1} = \frac{1}{\lambda} I$.

Part 2: Coefficient update.

Define $z_t = X_t^\top y_t = z_{t-1} + x_t y_t$. Then: $\hat{\beta}_t = A_t^{-1} z_t = A_t^{-1}(z_{t-1} + x_t y_t)$

Using the Sherman-Morrison form of $A_t^{-1}$: $\hat{\beta}_t = (A_{t-1}^{-1} - k_t x_t^\top A_{t-1}^{-1})(z_{t-1} + x_t y_t)$ $= A_{t-1}^{-1} z_{t-1} + A_{t-1}^{-1} x_t y_t - k_t x_t^\top A_{t-1}^{-1} z_{t-1} - k_t x_t^\top A_{t-1}^{-1} x_t y_t$ $= \hat{\beta}_{t-1} + A_{t-1}^{-1} x_t y_t - k_t x_t^\top \hat{\beta}_{t-1} - k_t (x_t^\top A_{t-1}^{-1} x_t) y_t$

Note that $k_t = \frac{A_{t-1}^{-1} x_t}{1 + x_t^\top A_{t-1}^{-1} x_t}$, so $A_{t-1}^{-1} x_t = k_t(1 + x_t^\top A_{t-1}^{-1} x_t)$. Let $s_t = x_t^\top A_{t-1}^{-1} x_t$. Then:

$\hat{\beta}_t = \hat{\beta}_{t-1} + k_t(1 + s_t) y_t - k_t x_t^\top \hat{\beta}_{t-1} - k_t s_t y_t$ $= \hat{\beta}_{t-1} + k_t(y_t + s_t y_t - x_t^\top \hat{\beta}_{t-1} - s_t y_t)$ $= \hat{\beta}_{t-1} + k_t(y_t - x_t^\top \hat{\beta}_{t-1})$

This is the classic recursive least squares update: $\boxed{\hat{\beta}_t = \hat{\beta}_{t-1} + k_t \left(y_t - x_t^\top \hat{\beta}_{t-1}\right)}$

The term $(y_t - x_t^\top \hat{\beta}_{t-1})$ is the prediction error using the old coefficients. The gain vector $k_t$ determines how much the coefficients adjust.

Initialization: $\hat{\beta}_0 = 0$.

Part 3: Numerical stability safeguards.

Symmetry enforcement: After each update, set $A_t^{-1} \leftarrow \frac{1}{2}(A_t^{-1} + (A_t^{-1})^\top)$ to correct floating-point asymmetry drift.

Denominator guard: The scalar
+ x_t^\top A_{t-1}^{-1} x_t$ should always be positive (since $A_{t-1}^{-1}$ is positive definite and $\lambda > 0$). Add a floor: if it drops below $\epsilon \approx 10^{-12}$, skip the update or re-initialize.

Condition number monitoring: Track $\|A_t^{-1}\|$ or its trace. If the condition number grows too large, re-compute $A_t^{-1}$ from scratch periodically (e.g., every $T$ steps) using a Cholesky factorization.

4. Joseph form update: For better numerical stability, use the Joseph (symmetric) form: $A_t^{-1} = (I - k_t x_t^\top) A_{t-1}^{-1} (I - k_t x_t^\top)^\top + k_t k_t^\top$ This guarantees positive semi-definiteness is preserved, at the cost of roughly

Online Ridge Regression via Sherman-Morrison

Hints

Worked Solution

Intuition