Online Ridge Regression via Sherman-Morrison

Regression · Hard · Free problem

You receive features $x_t \in \mathbb{R}^p$ and responses $y_t \in \mathbb{R}$ one observation at a time. You want to maintain the ridge regression solution:

$\hat{\beta}_t = (X_t^\top X_t + \lambda I)^{-1} X_t^\top y_t$

online, where $X_t$ is the $t \times p$ design matrix and $\lambda > 0$ is the ridge parameter.

  1. Using the Sherman-Morrison formula, derive an $O(p^2)$-per-update recursion for the inverse matrix $A_t^{-1} = (X_t^\top X_t + \lambda I)^{-1}$ given $A_{t-1}^{-1}$ and the new observation $x_t$.
  1. Derive the corresponding update for $\hat{\beta}_t$ in terms of $\hat{\beta}_{t-1}$, $A_t^{-1}$, $x_t$, and $y_t$.
  1. What numerical stability safeguards should be used in practice?

Hints

  1. The new observation adds a rank-1 term $x_t x_t^\top$ to the Gram matrix. What formula inverts a matrix after a rank-1 perturbation?
  2. Apply Sherman-Morrison: $(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}$ with $u = v = x_t$. Define a gain vector $k_t$ to simplify notation.
  3. For the coefficient update, substitute the Sherman-Morrison form of $A_t^{-1}$ into $\hat{\beta}_t = A_t^{-1}(z_{t-1} + x_t y_t)$ and simplify. You should get $\hat{\beta}_t = \hat{\beta}_{t-1} + k_t(y_t - x_t^\top \hat{\beta}_{t-1})$.

Worked Solution

How to Think About It: Recomputing the full ridge solution from scratch each time a new observation arrives costs $O(p^3)$ for the matrix inversion. Since each new observation $x_t$ adds a rank-1 update $x_t x_t^\top$ to the Gram matrix, the Sherman-Morrison formula lets you update the inverse in $O(p^2)$. This is the same idea behind recursive least squares (RLS), which is a workhorse in signal processing and online learning. The key is to maintain both the inverse matrix and the coefficient vector incrementally.

Key Insight: The matrix $A_t = X_t^\top X_t + \lambda I = A_{t-1} + x_t x_t^\top$ differs from $A_{t-1}$ by a rank-1 matrix, so Sherman-Morrison applies directly.

The Method:

Part 1: Inverse update via Sherman-Morrison.

We have $A_t = A_{t-1} + x_t x_t^\top$. The Sherman-Morrison formula gives:

$A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{1 + x_t^\top A_{t-1}^{-1} x_t}$

Define the gain vector: $k_t = \frac{A_{t-1}^{-1} x_t}{1 + x_t^\top A_{t-1}^{-1} x_t}$

Then: $A_t^{-1} = A_{t-1}^{-1} - k_t x_t^\top A_{t-1}^{-1}$

Cost: computing $A_{t-1}^{-1} x_t$ is $O(p^2)$, the scalar denominator is $O(p)$, and the outer product update is $O(p^2)$. Total: $O(p^2)$ per step.

Initialization: $A_0^{-1} = \frac{1}{\lambda} I$.

Part 2: Coefficient update.

Define $z_t = X_t^\top y_t = z_{t-1} + x_t y_t$. Then: $\hat{\beta}_t = A_t^{-1} z_t = A_t^{-1}(z_{t-1} + x_t y_t)$

Using the Sherman-Morrison form of $A_t^{-1}$: $\hat{\beta}_t = (A_{t-1}^{-1} - k_t x_t^\top A_{t-1}^{-1})(z_{t-1} + x_t y_t)$ $= A_{t-1}^{-1} z_{t-1} + A_{t-1}^{-1} x_t y_t - k_t x_t^\top A_{t-1}^{-1} z_{t-1} - k_t x_t^\top A_{t-1}^{-1} x_t y_t$ $= \hat{\beta}_{t-1} + A_{t-1}^{-1} x_t y_t - k_t x_t^\top \hat{\beta}_{t-1} - k_t (x_t^\top A_{t-1}^{-1} x_t) y_t$

Note that $k_t = \frac{A_{t-1}^{-1} x_t}{1 + x_t^\top A_{t-1}^{-1} x_t}$, so $A_{t-1}^{-1} x_t = k_t(1 + x_t^\top A_{t-1}^{-1} x_t)$. Let $s_t = x_t^\top A_{t-1}^{-1} x_t$. Then:

$\hat{\beta}_t = \hat{\beta}_{t-1} + k_t(1 + s_t) y_t - k_t x_t^\top \hat{\beta}_{t-1} - k_t s_t y_t$ $= \hat{\beta}_{t-1} + k_t(y_t + s_t y_t - x_t^\top \hat{\beta}_{t-1} - s_t y_t)$ $= \hat{\beta}_{t-1} + k_t(y_t - x_t^\top \hat{\beta}_{t-1})$

This is the classic recursive least squares update: $\boxed{\hat{\beta}_t = \hat{\beta}_{t-1} + k_t \left(y_t - x_t^\top \hat{\beta}_{t-1}\right)}$

The term $(y_t - x_t^\top \hat{\beta}_{t-1})$ is the prediction error using the old coefficients. The gain vector $k_t$ determines how much the coefficients adjust.

Initialization: $\hat{\beta}_0 = 0$.

Part 3: Numerical stability safeguards.

  1. Symmetry enforcement: After each update, set $A_t^{-1} \leftarrow \frac{1}{2}(A_t^{-1} + (A_t^{-1})^\top)$ to correct floating-point asymmetry drift.
  1. Denominator guard: The scalar
    + x_t^\top A_{t-1}^{-1} x_t$ should always be positive (since $A_{t-1}^{-1}$ is positive definite and $\lambda > 0$). Add a floor: if it drops below $\epsilon \approx 10^{-12}$, skip the update or re-initialize.
  1. Condition number monitoring: Track $\|A_t^{-1}\|$ or its trace. If the condition number grows too large, re-compute $A_t^{-1}$ from scratch periodically (e.g., every $T$ steps) using a Cholesky factorization.

4. Joseph form update: For better numerical stability, use the Joseph (symmetric) form: $A_t^{-1} = (I - k_t x_t^\top) A_{t-1}^{-1} (I - k_t x_t^\top)^\top + k_t k_t^\top$ This guarantees positive semi-definiteness is preserved, at the cost of roughly

\times$ the flops.

  1. Feature scaling: Normalize features to similar scales before processing to avoid ill-conditioning in $A_t$.

Answer: The $O(p^2)$ recursions are: (1) $k_t = A_{t-1}^{-1} x_t / (1 + x_t^\top A_{t-1}^{-1} x_t)$, (2) $A_t^{-1} = A_{t-1}^{-1} - k_t x_t^\top A_{t-1}^{-1}$, and (3) $\hat{\beta}_t = \hat{\beta}_{t-1} + k_t(y_t - x_t^\top \hat{\beta}_{t-1})$. Stability safeguards include symmetry enforcement, denominator guarding, periodic re-inversion, and optionally the Joseph form update.

Intuition

This is recursive least squares (RLS), one of the most important algorithms in adaptive filtering and online learning. The core idea is beautifully simple: each new data point shifts the coefficient estimate by the prediction error times a gain vector. The gain vector $k_t$ shrinks over time as the inverse covariance matrix $A_t^{-1}$ contracts -- you become less responsive to new data as you accumulate more observations. This is exactly the Bayesian interpretation: the posterior precision increases with more data, so each new observation matters less. In quant finance, RLS is used for online beta estimation, adaptive hedging, and real-time factor model updates. The numerical stability issues are not academic -- in production systems running for months, floating-point drift in the inverse matrix can cause the filter to diverge, which is why periodic re-initialization and the Joseph form are standard practice.

Open the full interactive solver →