Time-Varying Beta via Kalman Filter

Time Series · Hard

An asset's excess return $r_t$ depends on a factor $f_t$ through a time-varying beta:

$r_t = \beta_t f_t + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma_\varepsilon^2)$

The latent beta follows a random walk:

$\beta_t = \beta_{t-1} + \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2)$

with $\varepsilon_t$ and $\eta_t$ independent.

  1. Derive the Kalman filter recursions: the prediction step for $\hat{\beta}_{t|t-1}$ and $P_{t|t-1}$, the update step for $\hat{\beta}_{t|t}$ and $P_{t|t}$, and the Kalman gain $K_t$.
  2. Write down the log-likelihood of the observed returns $\{r_1, \ldots, r_T\}$ under these Gaussian assumptions, using the prediction-error decomposition.
  3. Explain how you would estimate the hyperparameters $(\sigma_\varepsilon^2, \sigma_\eta^2)$ by maximum likelihood.

Hints

  1. Think of the Kalman filter as alternating between two steps: a prediction step that propagates uncertainty forward, and an update step that incorporates new information.
  2. The Kalman gain $K_t$ balances prior uncertainty $P_{t|t-1}$ against observation noise $\sigma_\varepsilon^2$. Write the innovation variance $F_t = f_t^2 P_{t|t-1} + \sigma_\varepsilon^2$.
  3. For the log-likelihood, use the prediction-error decomposition: each innovation $v_t = r_t - f_t \hat{\beta}_{t|t-1}$ is $N(0, F_t)$, so the log-likelihood is $-\frac{1}{2}\sum [\ln F_t + v_t^2/F_t]$ plus a constant.

Worked Solution

How to Think About It: This is a classic state-space model. You have a latent quantity (beta) that drifts over time, and you observe it noisily through returns. The Kalman filter is the optimal linear filter for this setup -- it gives you the best estimate of the hidden beta at each point in time, along with a measure of how uncertain that estimate is. Think of it as a Bayesian update: your prior on $\beta_t$ comes from yesterday's estimate plus the random walk noise, and your likelihood comes from today's return observation. The key to the whole thing is the Kalman gain, which tells you how much to trust the new observation versus your prior.

Key Insight: The Kalman filter is just recursive Bayesian updating for Gaussian linear models. At each step, you predict forward, then correct using the new data point.

The Method:

Step 1 -- Prediction (time update):

Given the filtered estimate $\hat{\beta}_{t-1|t-1}$ with variance $P_{t-1|t-1}$, predict:

$\hat{\beta}_{t|t-1} = \hat{\beta}_{t-1|t-1}$

$P_{t|t-1} = P_{t-1|t-1} + \sigma_\eta^2$

The predicted state is just yesterday's estimate (random walk has no drift), and the uncertainty grows by $\sigma_\eta^2$.

Step 2 -- Prediction error and its variance:

The one-step-ahead prediction error ("innovation") is:

$v_t = r_t - f_t \hat{\beta}_{t|t-1}$

Its variance is:

$F_t = f_t^2 P_{t|t-1} + \sigma_\varepsilon^2$

Step 3 -- Update (measurement update):

The Kalman gain is:

$K_t = \frac{P_{t|t-1} f_t}{F_t} = \frac{P_{t|t-1} f_t}{f_t^2 P_{t|t-1} + \sigma_\varepsilon^2}$

The filtered estimate and variance:

$\hat{\beta}_{t|t} = \hat{\beta}_{t|t-1} + K_t v_t$

$P_{t|t} = (1 - K_t f_t) P_{t|t-1}$
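
The gain and both update equations follow from conditioning a bivariate Gaussian. The sketch below assumes the factor values are treated as known, and uses $\mathcal{F}_{t-1}$ (notation introduced here, not in the problem statement) for the information in $r_1, \ldots, r_{t-1}$. The off-diagonal term is $\mathrm{Cov}(\beta_t, f_t \beta_t + \varepsilon_t) = f_t P_{t|t-1}$.

```latex
\begin{aligned}
\begin{pmatrix}\beta_t \\ r_t\end{pmatrix} \Big|\, \mathcal{F}_{t-1}
  &\sim N\!\left(
    \begin{pmatrix}\hat{\beta}_{t|t-1} \\ f_t \hat{\beta}_{t|t-1}\end{pmatrix},
    \begin{pmatrix}P_{t|t-1} & f_t P_{t|t-1} \\ f_t P_{t|t-1} & F_t\end{pmatrix}
  \right), \\
E[\beta_t \mid r_t, \mathcal{F}_{t-1}]
  &= \hat{\beta}_{t|t-1} + \frac{f_t P_{t|t-1}}{F_t}\,(r_t - f_t \hat{\beta}_{t|t-1})
   = \hat{\beta}_{t|t-1} + K_t v_t, \\
\mathrm{Var}[\beta_t \mid r_t, \mathcal{F}_{t-1}]
  &= P_{t|t-1} - \frac{(f_t P_{t|t-1})^2}{F_t}
   = (1 - K_t f_t)\, P_{t|t-1}.
\end{aligned}
```

The coefficient multiplying the innovation in the conditional mean is exactly $K_t$, which is why the gain has covariance over variance form.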

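Note that $K_t f_t = f_t^2 P_{t|t-1} / F_t \in [0, 1)$ when $\sigma_\varepsilon^2 > 0$, so the update never increases uncertainty, and it strictly reduces it whenever $f_t \neq 0$. If $f_t = 0$, the return carries no information about beta and $P_{t|t} = P_{t|t-1}$.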
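
As a minimal sketch of Steps 1 to 3, the recursions can be collected into one function. The name `kalman_step` and its argument names are illustrative assumptions, not part of the problem statement, and the numbers in the usage line are made up so the arithmetic can be checked by hand.

```python
def kalman_step(beta_prev, P_prev, r_t, f_t, sig_eps2, sig_eta2):
    """One predict/update cycle for the scalar random-walk beta model."""
    # Prediction: the random walk keeps the mean; variance grows by sig_eta2.
    beta_pred = beta_prev
    P_pred = P_prev + sig_eta2
    # Innovation and its variance.
    v = r_t - f_t * beta_pred
    F = f_t * f_t * P_pred + sig_eps2
    # Kalman gain and measurement update.
    K = P_pred * f_t / F
    beta_filt = beta_pred + K * v
    P_filt = (1.0 - K * f_t) * P_pred
    return beta_filt, P_filt, v, F, K


# Hand-checkable example: P_pred = 2, v = 1, F = 4, K = 0.5.
beta, P, v, F, K = kalman_step(
    beta_prev=1.0, P_prev=1.0, r_t=2.0, f_t=1.0, sig_eps2=2.0, sig_eta2=1.0
)
print(beta, P, v, F, K)  # 1.5 1.0 1.0 4.0 0.5
```

Returning $v_t$ and $F_t$ alongside the filtered state is a design choice that makes the likelihood in the next step cheap to accumulate.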

Step 4 -- Log-likelihood (prediction-error decomposition):

Each innovation $v_t \sim N(0, F_t)$ under the model. The log-likelihood is:

$\ell(\sigma_\varepsilon^2, \sigma_\eta^2) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\left[\ln F_t + \frac{v_t^2}{F_t}\right]$

This is the beauty of the prediction-error decomposition: you get the likelihood as a byproduct of running the filter forward.
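
A sketch of that byproduct calculation, using only the standard library. The function name `kalman_loglik` and the defaults `beta0=0.0`, `P0=1e6` (a large variance standing in for a diffuse prior) are assumptions made for illustration. With a very large `P0` the first term is dominated by the prior, and some implementations exclude it from the sum.

```python
import math


def kalman_loglik(r, f, sig_eps2, sig_eta2, beta0=0.0, P0=1e6):
    """Gaussian log-likelihood via the prediction-error decomposition."""
    beta, P, ll = beta0, P0, 0.0
    for r_t, f_t in zip(r, f):
        P = P + sig_eta2                   # predicted variance P_{t|t-1}
        v = r_t - f_t * beta               # innovation v_t
        F = f_t * f_t * P + sig_eps2       # innovation variance F_t
        ll += -0.5 * (math.log(2.0 * math.pi) + math.log(F) + v * v / F)
        K = P * f_t / F                    # Kalman gain
        beta = beta + K * v                # filtered mean
        P = (1.0 - K * f_t) * P            # filtered variance
    return ll


# One observation with v = 1 and F = 4, so the term can be checked by hand.
ll = kalman_loglik([2.0], [1.0], sig_eps2=2.0, sig_eta2=1.0, beta0=1.0, P0=1.0)
print(ll)
```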

Step 5 -- MLE for hyperparameters:

To estimate $(\sigma_\varepsilon^2, \sigma_\eta^2)$:

  1. Initialize $\hat{\beta}_{0|0}$ and $P_{0|0}$ (diffuse prior: large $P_{0|0}$, or use the first few observations).
  2. For candidate values $(\sigma_\varepsilon^2, \sigma_\eta^2)$, run the Kalman filter forward through all $T$ observations, computing $v_t$ and $F_t$ at each step.
  3. Evaluate the log-likelihood $\ell(\sigma_\varepsilon^2, \sigma_\eta^2)$.
  4. Maximize $\ell$ over the two hyperparameters using numerical optimization (e.g., L-BFGS-B with positivity constraints, or optimize over log-transformed parameters).
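
The four steps above can be sketched end to end. To stay dependency-free, this sketch swaps in a coarse log-spaced grid search for the gradient-based optimizer named in step 4. In practice the grid optimum would typically serve as a starting point for something like L-BFGS-B over log-transformed variances. The simulated data, the seed, the grid bounds, and the names `kalman_loglik` and `grid_search_mle` are all assumptions for illustration.

```python
import math
import random


def kalman_loglik(r, f, sig_eps2, sig_eta2, beta0=0.0, P0=1e6):
    """Log-likelihood from one forward pass of the filter (large P0 ~ diffuse)."""
    beta, P, ll = beta0, P0, 0.0
    for r_t, f_t in zip(r, f):
        P = P + sig_eta2
        v = r_t - f_t * beta
        F = f_t * f_t * P + sig_eps2
        ll += -0.5 * (math.log(2.0 * math.pi) + math.log(F) + v * v / F)
        K = P * f_t / F
        beta = beta + K * v
        P = (1.0 - K * f_t) * P
    return ll


def grid_search_mle(r, f, grid):
    """Return (loglik, sig_eps2, sig_eta2) with the highest likelihood on the grid."""
    return max(
        (kalman_loglik(r, f, se2, sn2), se2, sn2) for se2 in grid for sn2 in grid
    )


# Simulate returns from the model with hypothetical true variances.
rng = random.Random(0)
true_eps2, true_eta2 = 0.25, 0.01
beta_true, r, f = 1.0, [], []
for _ in range(500):
    beta_true += rng.gauss(0.0, math.sqrt(true_eta2))
    f_t = rng.gauss(0.0, 1.0)
    f.append(f_t)
    r.append(beta_true * f_t + rng.gauss(0.0, math.sqrt(true_eps2)))

# Log-spaced candidates from 1e-4 to 10; positivity holds by construction.
grid = [10.0 ** (k / 4.0) for k in range(-16, 5)]
best_ll, best_eps2, best_eta2 = grid_search_mle(r, f, grid)
print(best_ll, best_eps2, best_eta2)
```

Searching on a log scale mirrors the log-transform trick in step 4: every candidate is positive, and the search resolves small variances as finely as large ones.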

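The signal-to-noise ratio $\sigma_\eta^2 / \sigma_\varepsilon^2$ controls how fast the filter lets beta move. If $\sigma_\eta^2 \to 0$, beta is nearly constant, the Kalman gain shrinks toward zero as observations accumulate, and the filter approaches recursive least squares. If $\sigma_\eta^2$ is large relative to $\sigma_\varepsilon^2$, the filter trusts each new observation heavily.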
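
To see the dependence on the ratio explicitly, consider the simplified case $f_t \equiv 1$. This is an assumption made only for this illustration: with a time-varying factor there is generally no steady state. Writing $P$ for the steady-state value of $P_{t|t-1}$ and $q = \sigma_\eta^2 / \sigma_\varepsilon^2$, the variance recursion reaches a fixed point:

```latex
\begin{aligned}
P &= \frac{P\,\sigma_\varepsilon^2}{P + \sigma_\varepsilon^2} + \sigma_\eta^2
\;\Longrightarrow\; P^2 - \sigma_\eta^2 P - \sigma_\eta^2 \sigma_\varepsilon^2 = 0, \\
\frac{P}{\sigma_\varepsilon^2} &= \frac{q + \sqrt{q^2 + 4q}}{2}, \\
K &= \frac{P}{P + \sigma_\varepsilon^2}
   = \frac{q + \sqrt{q^2 + 4q}}{2 + q + \sqrt{q^2 + 4q}}.
\end{aligned}
```

In this special case the steady-state gain depends on the two variances only through $q$: it behaves like $\sqrt{q}$ as $q \to 0$ and tends to 1 as $q \to \infty$.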

Practical Considerations:

  - Initialization matters for short series. A diffuse prior ($P_{0|0}$ very large) is standard.
  - The likelihood surface can be flat in the $\sigma_\eta^2$ direction, making optimization tricky. Grid search over the signal-to-noise ratio first can help.
  - In practice, you would also check whether the standardized innovations $v_t / \sqrt{F_t}$ look like i.i.d. standard normals as a diagnostic.
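
The innovation diagnostic can be sketched as follows. The function names, the toy return and factor values, and the variance settings are hypothetical. The summary statistics are one simple choice among many, and a formal test such as Ljung-Box or a normality test would be a natural next step. Under a well-specified model the mean should be near 0, the variance near 1, and the lag-1 autocorrelation near 0.

```python
import math


def standardized_innovations(r, f, sig_eps2, sig_eta2, beta0=0.0, P0=1e6):
    """Run the filter and return v_t / sqrt(F_t) for each observation."""
    beta, P, z = beta0, P0, []
    for r_t, f_t in zip(r, f):
        P = P + sig_eta2
        v = r_t - f_t * beta
        F = f_t * f_t * P + sig_eps2
        z.append(v / math.sqrt(F))
        K = P * f_t / F
        beta = beta + K * v
        P = (1.0 - K * f_t) * P
    return z


def summarize(z):
    """Return (mean, variance, lag-1 autocorrelation) of a sequence."""
    n = len(z)
    mean = sum(z) / n
    var = sum((x - mean) ** 2 for x in z) / n
    lag1 = sum((z[i] - mean) * (z[i - 1] - mean) for i in range(1, n)) / (n * var)
    return mean, var, lag1


# Toy data (made up). The first value is dropped because, with a near-diffuse
# start, it mostly reflects the prior rather than the model fit.
r = [0.012, -0.004, 0.020, 0.007, -0.015, 0.003]
f = [0.010, -0.006, 0.015, 0.004, -0.012, 0.005]
z = standardized_innovations(r, f, sig_eps2=1e-4, sig_eta2=1e-3)[1:]
print(summarize(z))
```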

Answer: The Kalman filter recursions are: predict $\hat{\beta}_{t|t-1} = \hat{\beta}_{t-1|t-1}$, $P_{t|t-1} = P_{t-1|t-1} + \sigma_\eta^2$; update with gain $K_t = P_{t|t-1} f_t / (f_t^2 P_{t|t-1} + \sigma_\varepsilon^2)$, giving $\hat{\beta}_{t|t} = \hat{\beta}_{t|t-1} + K_t(r_t - f_t \hat{\beta}_{t|t-1})$ and $P_{t|t} = (1 - K_t f_t) P_{t|t-1}$. The log-likelihood uses the prediction-error decomposition, and the hyperparameters are estimated by numerically maximizing this likelihood.

Intuition

The Kalman filter is the workhorse of time-varying parameter estimation in finance. It captures a fundamental tension: you want your beta estimate to be responsive to genuine regime changes but not whipsawed by noise. The signal-to-noise ratio $\sigma_\eta^2 / \sigma_\varepsilon^2$ is the single knob that controls this tradeoff. When this ratio is small, the filter behaves like OLS with a long lookback window. When it is large, the filter behaves like a very short moving average. MLE chooses the ratio that best explains the observed pattern of prediction errors.

This framework shows up constantly in practice -- not just for betas, but for any slowly drifting parameter: volatility, correlations, factor loadings, even alpha signals. The prediction-error decomposition is also the basis for model comparison: you can compare a constant-beta model (which sets $\sigma_\eta^2 = 0$) against a time-varying one by comparing log-likelihoods, giving you a principled way to decide whether the added complexity is justified. One caveat: $\sigma_\eta^2 = 0$ lies on the boundary of the parameter space, so the usual chi-squared reference distribution for a likelihood-ratio test does not apply directly.
