Yule-Walker Equations, PACF, and AR Order Selection

Time Series · Hard · Free problem

Suppose $r_t$ is a covariance-stationary process that follows an $\text{AR}(p)$ model with unknown order $p$:

$r_t = \phi_1 r_{t-1} + \phi_2 r_{t-2} + \cdots + \phi_p r_{t-p} + \varepsilon_t$

where $\varepsilon_t$ is white noise with variance $\sigma^2$.

  1. Derive the Yule-Walker equations that relate the autocovariances $\gamma(h)$ to the AR coefficients $\phi_1, \ldots, \phi_p$. Then derive the Durbin-Levinson recursion, which lets you compute the partial autocorrelation function (PACF) iteratively without solving the full system at each order.
  1. Using the PACF, propose a consistent rule for selecting the true order $p$. Discuss how information criteria (AIC, BIC) provide an alternative or complement to PACF cutoff. What are the small-sample pitfalls of each approach?

Hints

  1. Multiply both sides of the AR equation by $r_{t-h}$ and take expectations -- what happens to the noise term when $h \geq 1$?
  2. The Durbin-Levinson recursion builds the order-$m$ fit from the order-$(m-1)$ fit. The key quantity is the residual autocovariance at lag $m$ not explained by the shorter model, normalized by the prediction error variance.
  3. For a true $\text{AR}(p)$, the PACF $\phi_{m,m} = 0$ for $m > p$. Under this null, $\hat{\phi}_{m,m}$ is approximately $N(0, 1/T)$, giving you a $\pm 1.96/\sqrt{T}$ significance band.

Worked Solution

How to Think About It: The Yule-Walker equations are just what you get when you multiply both sides of the AR equation by $r_{t-h}$ and take expectations -- you are exploiting the fact that the noise $\varepsilon_t$ is uncorrelated with past values. The beauty of the Durbin-Levinson recursion is that it lets you "grow" the AR fit from order 1 to order 2 to order $m$ without re-solving the entire system each time. The last coefficient you add at each step is exactly the partial autocorrelation at that lag -- it measures how much new predictive information that lag contributes after accounting for all shorter lags. For order selection, the key signature of an $\text{AR}(p)$ process is that the PACF cuts off sharply after lag $p$.

Part 1: Deriving the Yule-Walker Equations

Start from the model:

$r_t = \sum_{j=1}^{p} \phi_j r_{t-j} + \varepsilon_t$

Multiply both sides by $r_{t-h}$ (for $h \geq 1$) and take expectations. Since $r_t$ is covariance-stationary, $E[r_t r_{t-h}] = \gamma(h)$, and since $\varepsilon_t$ is uncorrelated with $r_{t-h}$ for $h \geq 1$:

$\gamma(h) = \sum_{j=1}^{p} \phi_j \gamma(h - j), \quad h = 1, 2, \ldots, p$

In matrix form, writing $\boldsymbol{\gamma} = (\gamma(1), \ldots, \gamma(p))^\top$ and $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_p)^\top$:

$\boldsymbol{\Gamma}_p \boldsymbol{\phi} = \boldsymbol{\gamma}$

where $\boldsymbol{\Gamma}_p$ is the $p \times p$ Toeplitz matrix with $(i,j)$-entry $\gamma(|i - j|)$:

$\boldsymbol{\Gamma}_p = \begin{pmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(p-1) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(p-2) \\ \vdots & & \ddots & \vdots \\ \gamma(p-1) & \gamma(p-2) & \cdots & \gamma(0) \end{pmatrix}$

The solution is $\boldsymbol{\phi} = \boldsymbol{\Gamma}_p^{-1} \boldsymbol{\gamma}$. The noise variance follows from the $h = 0$ equation:

$\sigma^2 = \gamma(0) - \sum_{j=1}^{p} \phi_j \gamma(j)$

Part 1 (continued): The Durbin-Levinson Recursion

The Durbin-Levinson algorithm avoids inverting $\boldsymbol{\Gamma}_m$ from scratch at each order $m$. Let $\phi_{m,j}$ denote the $j$-th AR coefficient in the best linear predictor of order $m$, and let $v_m$ be the corresponding prediction error variance.

Initialize: $v_0 = \gamma(0)$.

For $m = 1, 2, 3, \ldots$:

  1. Compute the $m$-th partial autocorrelation:

$\phi_{m,m} = \frac{\gamma(m) - \sum_{j=1}^{m-1} \phi_{m-1,j}\, \gamma(m - j)}{v_{m-1}}$

  1. Update the intermediate coefficients for $j = 1, \ldots, m-1$:

$\phi_{m,j} = \phi_{m-1,j} - \phi_{m,m}\, \phi_{m-1,m-j}$

  1. Update the prediction error variance:

$v_m = v_{m-1}(1 - \phi_{m,m}^2)$

The quantity $\phi_{m,m}$ is the partial autocorrelation at lag $m$ -- it is the correlation between $r_t$ and $r_{t-m}$ after removing the linear effect of $r_{t-1}, \ldots, r_{t-m+1}$. For a true $\text{AR}(p)$ process, $\phi_{m,m} = 0$ for all $m > p$.

Why it works: At each step, the numerator of $\phi_{m,m}$ computes the residual autocovariance at lag $m$ that is not explained by the order-$(m-1)$ fit. Dividing by $v_{m-1}$ normalizes it into a correlation. The update formula for $\phi_{m,j}$ follows from the block matrix identity relating $\boldsymbol{\Gamma}_m^{-1}$ to $\boldsymbol{\Gamma}_{m-1}^{-1}$ -- it is a Schur complement argument on the Toeplitz structure.

Part 2: Order Selection

PACF cutoff rule: For a true $\text{AR}(p)$ process, the theoretical PACF is exactly zero for lags

Loading problems...
gt; p$. In practice, estimate the sample PACF $\hat{\phi}_{m,m}$ using sample autocovariances in the Durbin-Levinson recursion. Under the null that the true order is $p$, for $m > p$:

$\hat{\phi}_{m,m} \stackrel{\text{approx}}{\sim} N\!\left(0, \frac{1}{T}\right)$

So you select $\hat{p}$ as the largest lag where $|\hat{\phi}_{m,m}|$ exceeds the $95\%$ band $\pm 1.96 / \sqrt{T}$. This is consistent: the true PACF values at lags $\leq p$ are nonzero and the sample estimates converge, while the estimates at lags

Loading problems...
gt; p$ shrink to zero.

Information criteria approach: Fit AR models of orders $m = 0, 1, \ldots, m_{\max}$ and select the order minimizing:

$\text{AIC}(m) = \ln \hat{\sigma}_m^2 + \frac{2m}{T}, \qquad \text{BIC}(m) = \ln \hat{\sigma}_m^2 + \frac{m \ln T}{T}$

where $\hat{\sigma}_m^2$ is the estimated noise variance from the order-$m$ fit. BIC is consistent (selects the true order with probability 1 as $T \to \infty$) because its penalty grows with $T$. AIC tends to overfit -- it has a nonzero probability of selecting too large an order even asymptotically -- but it can yield better out-of-sample forecasts in finite samples because it balances bias and variance rather than targeting the true model.

Small-sample pitfalls:

Answer: The Yule-Walker equations are $\boldsymbol{\Gamma}_p \boldsymbol{\phi} = \boldsymbol{\gamma}$, derived by multiplying the AR equation by lagged values and taking expectations. The Durbin-Levinson recursion computes the PACF iteratively: $\phi_{m,m}$ equals the residual autocovariance at lag $m$ (after the order-$(m-1)$ fit) divided by the prediction error variance $v_{m-1}$. For order selection, a consistent approach is to use the PACF cutoff (select the last significant lag beyond $\pm 1.96/\sqrt{T}$) or minimize BIC. In small samples, combine both: use the PACF plot to narrow the candidate range, then let BIC choose within that range.

Intuition

The Yule-Walker equations are the workhorse of linear time series analysis. They translate the recursive structure of an AR process into a linear system involving the autocovariance function, which you can estimate from data. The Durbin-Levinson recursion is the clever part: instead of solving a growing linear system from scratch at each candidate order, you update the solution incrementally. The last coefficient added at each step -- the partial autocorrelation -- has a clean interpretation: it is the marginal explanatory power of lag $m$ after all shorter lags have been accounted for. This is directly analogous to the $t$-statistic on the last regressor added to a regression.

In practice, order selection is where most mistakes happen. The PACF cutoff is intuitive and works well for clean AR processes, but real financial data rarely cooperates -- you get borderline partial autocorrelations, seasonal patterns, and structural breaks that make the cutoff ambiguous. Information criteria provide an automated alternative, but each has its bias: AIC leans toward overfitting, BIC toward underfitting in small samples. Experienced practitioners typically plot the PACF to get a qualitative sense of the order, then confirm with BIC, and always check that the residuals from the selected model look like white noise. The meta-lesson is that model selection is never purely mechanical -- the diagnostics are a conversation, not a verdict.

Open the full interactive solver →