Recursive Least Squares with Forgetting Factor

Regression · Hard · Free problem

You are building an online regression model for a streaming signal. Pairs $(x_t, y_t)$ arrive one at a time, where $x_t \in \mathbb{R}^d$ is a feature vector, $y_t \in \mathbb{R}$ is the response, and the model is linear:

$y_t = x_t^\top \beta + \varepsilon_t$

where $\varepsilon_t$ is zero-mean noise. You want to maintain a running estimate $\hat{\beta}_t$ without re-solving the full least squares problem at every step. You also want a forgetting factor $\lambda \in (0, 1]$ that exponentially down-weights old observations so the model can adapt to non-stationarity.

Derive the recursive update equations for the coefficient estimate $\hat{\beta}_t$ and the inverse covariance matrix $P_t$ (sometimes called the gain form of RLS). Start from the weighted least squares objective and show how the new-observation update avoids re-inverting a $d \times d$ matrix at each step.

Analyze the bias-variance trade-off as $\lambda$ varies. What happens when $\lambda = 1$? What happens as $\lambda \to 0$? What is the effective window length?

Discuss at least two numerical stability safeguards that matter in production (e.g., what can go wrong with $P_t$ over time, and how do you fix it?).

Hints

Start from the exponentially weighted least squares objective and write the recursive form of the normal equations -- this gives you $R_t = \lambda R_{t-1} + x_t x_t^\top$.
To avoid inverting $R_t$ from scratch at each step, apply the Sherman-Morrison (matrix inversion lemma) to update $P_t = R_t^{-1}$ directly.
For the bias-variance analysis, think about the effective sample size $N_{\text{eff}} = 1/(1-\lambda)$ and what happens to estimation variance when you are fitting $d$ parameters from only $N_{\text{eff}}$ observations.

Worked Solution

How to Think About It: Recursive Least Squares is the Kalman filter applied to a static linear model -- each new observation updates your estimate of $\beta$ by a correction proportional to the prediction error. The forgetting factor $\lambda$ makes it a time-varying Kalman filter where older data is exponentially discounted. In quant work, RLS with forgetting shows up everywhere: online hedging, adaptive signal estimation, and real-time factor model fitting. The core trick is the matrix inversion lemma (Sherman-Morrison-Woodbury), which lets you update the inverse covariance in $O(d^2)$ instead of $O(d^3)$.

Quick Estimate: For intuition on the forgetting factor, if $\lambda = 0.99$ the effective window length is

/(1 - \lambda) = 100$ observations. With $\lambda = 0.95$, it is 20 observations. So a small change in $\lambda$ drastically changes how reactive the model is. When $\lambda = 1$, you recover standard (non-forgetting) RLS, which uses all data equally.

Approach: We start from the weighted least squares objective, then apply the matrix inversion lemma to derive a recursive update.

Formal Solution:

*Step 1: Weighted least squares objective.*

At time $t$, we minimize the exponentially weighted sum of squared errors:

$J_t(\beta) = \sum_{i=1}^{t} \lambda^{t-i} \left( y_i - x_i^\top \beta \right)^2$

The solution satisfies the normal equations $R_t \hat{\beta}_t = z_t$, where:

$R_t = \sum_{i=1}^{t} \lambda^{t-i} x_i x_i^\top, \quad z_t = \sum_{i=1}^{t} \lambda^{t-i} x_i y_i$

Note the recursions: $R_t = \lambda R_{t-1} + x_t x_t^\top$ and $z_t = \lambda z_{t-1} + x_t y_t$.

*Step 2: Define $P_t = R_t^{-1}$ and apply the matrix inversion lemma.*

We need $P_t = (\lambda R_{t-1} + x_t x_t^\top)^{-1}$. By Sherman-Morrison:

$P_t = \frac{1}{\lambda} \left( P_{t-1} - \frac{P_{t-1} x_t x_t^\top P_{t-1}}{\lambda + x_t^\top P_{t-1} x_t} \right)$

Define the gain vector:

$k_t = \frac{P_{t-1} x_t}{\lambda + x_t^\top P_{t-1} x_t}$

Then:

$P_t = \frac{1}{\lambda} \left( P_{t-1} - k_t x_t^\top P_{t-1} \right)$

*Step 3: Derive the coefficient update.*

The prediction error (innovation) is:

$e_t = y_t - x_t^\top \hat{\beta}_{t-1}$

The coefficient update is:

$\hat{\beta}_t = \hat{\beta}_{t-1} + k_t \, e_t$

This is the complete RLS recursion. Each step costs $O(d^2)$ -- dominated by the outer products in the $P_t$ update.

*Initialization:* Set $P_0 = \delta^{-1} I$ for small $\delta > 0$ (large initial covariance, reflecting ignorance) and $\hat{\beta}_0 = 0$.

Bias-Variance Trade-off as $\lambda$ Varies:

$\lambda = 1$ (no forgetting): All past observations are weighted equally. The effective sample size grows as $t$. Variance decreases as $O(1/t)$, but the estimator cannot adapt to parameter changes -- high bias if $\beta$ is drifting.
$\lambda$ close to 0 (aggressive forgetting): Only the most recent observations matter. The effective window length is $N_{\text{eff}} \approx 1/(1 - \lambda)$. Variance is high (you are estimating $d$ parameters from roughly $N_{\text{eff}}$ observations), but bias is low because you track changes quickly.
Optimal $\lambda$: Balances tracking speed against estimation noise. In practice, typical values are $0.95$--$0.999$ depending on the application. The effective window $N_{\text{eff}} = 1/(1-\lambda)$ should be several multiples of $d$ to keep the estimation well-conditioned.

Numerical Stability Safeguards:

Symmetry enforcement: After each update, $P_t$ should be symmetric in theory, but floating-point arithmetic introduces asymmetry over thousands of updates. Fix: replace $P_t$ with $(P_t + P_t^\top)/2$ periodically or after every update.

Positive definiteness monitoring: With $\lambda < 1$, the
/\lambda$ scaling in the $P_t$ update can cause eigenvalues to grow without bound if the data is not persistently exciting (i.e., if $x_t$ does not span all $d$ dimensions often enough). $P_t$ can lose positive definiteness due to roundoff. Fix: add a small ridge $P_t \leftarrow P_t + \epsilon I$ when eigenvalues get too small or too large, or use the square-root form (Cholesky factor of $P_t$) which preserves positive definiteness by construction.

Square-root / UD factorization: Instead of maintaining $P_t$ directly, maintain its Cholesky decomposition $P_t = L_t L_t^\top$ (or the $UDU^\top$ factorization). Updates are applied to the factors. This costs more per step but has much better numerical properties over long runs -- it is the standard approach in Kalman filter implementations.

Covariance resetting: If the system detects that $P_t$ has become ill-conditioned (e.g., condition number exceeds a threshold), reset $P_t = \alpha I$ for some $\alpha$. This is a blunt instrument but effective as a safety net.

Answer:

The RLS updates with forgetting factor $\lambda$ are:

$k_t = \frac{P_{t-1} x_t}{\lambda + x_t^\top P_{t-1} x_t}$

$\hat{\beta}_t = \hat{\beta}_{t-1} + k_t (y_t - x_t^\top \hat{\beta}_{t-1})$

$P_t = \frac{1}{\lambda}(P_{t-1} - k_t x_t^\top P_{t-1})$

The effective window length is $N_{\text{eff}} = 1/(1 - \lambda)$. Smaller $\lambda$ tracks changes faster (low bias) but with more noise (high variance). Numerical stability requires enforcing symmetry of $P_t$, monitoring positive definiteness, and ideally using a square-root or UD factorization of the covariance matrix.

Intuition

Recursive Least Squares is really just the Kalman filter wearing a different hat. The gain vector $k_t$ is the Kalman gain, and $P_t$ is the estimation error covariance. The forgetting factor $\lambda$ plays the role of process noise in a Kalman filter -- it inflates uncertainty over time, which keeps the filter receptive to new data. When $\lambda = 1$, you trust old data forever and variance shrinks monotonically. When $\lambda < 1$, you are implicitly saying "the true $\beta$ might be drifting," and the filter compromises between stability and responsiveness.

In practice, RLS with forgetting is one of the workhorses of adaptive signal processing and online quant models. It shows up in real-time hedge ratio estimation, adaptive execution algorithms, and rolling factor exposure models. The most common failure mode is numerical: $P_t$ loses positive definiteness after thousands of updates, and your estimates blow up. This is why production implementations almost always use the square-root or UD factorization rather than maintaining $P_t$ directly. The other common mistake is choosing $\lambda$ without thinking about the effective window -- a value of $0.99$ sounds close to

$, but it means you are effectively using only the last 100 observations, which may not be enough to estimate a high-dimensional $\beta$ reliably.

Open the full interactive solver →