Yule-Walker Equations, PACF, and AR Order Selection
Suppose $r_t$ is a covariance-stationary process that follows an $\text{AR}(p)$ model with unknown order $p$:
$r_t = \phi_1 r_{t-1} + \phi_2 r_{t-2} + \cdots + \phi_p r_{t-p} + \varepsilon_t$
where $\varepsilon_t$ is white noise with variance $\sigma^2$.
- Derive the Yule-Walker equations that relate the autocovariances $\gamma(h)$ to the AR coefficients $\phi_1, \ldots, \phi_p$. Then derive the Durbin-Levinson recursion, which lets you compute the partial autocorrelation function (PACF) iteratively without solving the full system at each order.
- Using the PACF, propose a consistent rule for selecting the true order $p$. Discuss how information criteria (AIC, BIC) provide an alternative or complement to PACF cutoff. What are the small-sample pitfalls of each approach?
Hints
- Multiply both sides of the AR equation by $r_{t-h}$ and take expectations -- what happens to the noise term when $h \geq 1$?
- The Durbin-Levinson recursion builds the order-$m$ fit from the order-$(m-1)$ fit. The key quantity is the residual autocovariance at lag $m$ not explained by the shorter model, normalized by the prediction error variance.
- For a true $\text{AR}(p)$, the PACF $\phi_{m,m} = 0$ for $m > p$. Under this null, $\hat{\phi}_{m,m}$ is approximately $N(0, 1/T)$, giving you a $\pm 1.96/\sqrt{T}$ significance band.
Worked Solution
How to Think About It: The Yule-Walker equations are just what you get when you multiply both sides of the AR equation by $r_{t-h}$ and take expectations -- you are exploiting the fact that the noise $\varepsilon_t$ is uncorrelated with past values. The beauty of the Durbin-Levinson recursion is that it lets you "grow" the AR fit from order 1 to order 2 to order $m$ without re-solving the entire system each time. The last coefficient you add at each step is exactly the partial autocorrelation at that lag -- it measures how much new predictive information that lag contributes after accounting for all shorter lags. For order selection, the key signature of an $\text{AR}(p)$ process is that the PACF cuts off sharply after lag $p$.
Part 1: Deriving the Yule-Walker Equations
Start from the model:
$r_t = \sum_{j=1}^{p} \phi_j r_{t-j} + \varepsilon_t$
Multiply both sides by $r_{t-h}$ (for $h \geq 1$) and take expectations. Since $r_t$ is covariance-stationary, $E[r_t r_{t-h}] = \gamma(h)$, and since $\varepsilon_t$ is uncorrelated with $r_{t-h}$ for $h \geq 1$:
$\gamma(h) = \sum_{j=1}^{p} \phi_j \gamma(h - j), \quad h = 1, 2, \ldots, p$
In matrix form, writing $\boldsymbol{\gamma} = (\gamma(1), \ldots, \gamma(p))^\top$ and $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_p)^\top$:
$\boldsymbol{\Gamma}_p \boldsymbol{\phi} = \boldsymbol{\gamma}$
where $\boldsymbol{\Gamma}_p$ is the $p \times p$ Toeplitz matrix with $(i,j)$-entry $\gamma(|i - j|)$:
$\boldsymbol{\Gamma}_p = \begin{pmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(p-1) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(p-2) \\ \vdots & & \ddots & \vdots \\ \gamma(p-1) & \gamma(p-2) & \cdots & \gamma(0) \end{pmatrix}$
The solution is $\boldsymbol{\phi} = \boldsymbol{\Gamma}_p^{-1} \boldsymbol{\gamma}$. The noise variance follows from the $h = 0$ equation:
$\sigma^2 = \gamma(0) - \sum_{j=1}^{p} \phi_j \gamma(j)$
Part 1 (continued): The Durbin-Levinson Recursion
The Durbin-Levinson algorithm avoids inverting $\boldsymbol{\Gamma}_m$ from scratch at each order $m$. Let $\phi_{m,j}$ denote the $j$-th AR coefficient in the best linear predictor of order $m$, and let $v_m$ be the corresponding prediction error variance.
Initialize: $v_0 = \gamma(0)$.
For $m = 1, 2, 3, \ldots$:
- Compute the $m$-th partial autocorrelation:
$\phi_{m,m} = \frac{\gamma(m) - \sum_{j=1}^{m-1} \phi_{m-1,j}\, \gamma(m - j)}{v_{m-1}}$
- Update the intermediate coefficients for $j = 1, \ldots, m-1$:
$\phi_{m,j} = \phi_{m-1,j} - \phi_{m,m}\, \phi_{m-1,m-j}$
- Update the prediction error variance:
$v_m = v_{m-1}(1 - \phi_{m,m}^2)$
The quantity $\phi_{m,m}$ is the partial autocorrelation at lag $m$ -- it is the correlation between $r_t$ and $r_{t-m}$ after removing the linear effect of $r_{t-1}, \ldots, r_{t-m+1}$. For a true $\text{AR}(p)$ process, $\phi_{m,m} = 0$ for all $m > p$.
Why it works: At each step, the numerator of $\phi_{m,m}$ computes the residual autocovariance at lag $m$ that is not explained by the order-$(m-1)$ fit. Dividing by $v_{m-1}$ normalizes it into a correlation. The update formula for $\phi_{m,j}$ follows from the block matrix identity relating $\boldsymbol{\Gamma}_m^{-1}$ to $\boldsymbol{\Gamma}_{m-1}^{-1}$ -- it is a Schur complement argument on the Toeplitz structure.
Part 2: Order Selection
PACF cutoff rule: For a true $\text{AR}(p)$ process, the theoretical PACF is exactly zero for lags
$\hat{\phi}_{m,m} \stackrel{\text{approx}}{\sim} N\!\left(0, \frac{1}{T}\right)$
So you select $\hat{p}$ as the largest lag where $|\hat{\phi}_{m,m}|$ exceeds the $95\%$ band $\pm 1.96 / \sqrt{T}$. This is consistent: the true PACF values at lags $\leq p$ are nonzero and the sample estimates converge, while the estimates at lags
Information criteria approach: Fit AR models of orders $m = 0, 1, \ldots, m_{\max}$ and select the order minimizing:
$\text{AIC}(m) = \ln \hat{\sigma}_m^2 + \frac{2m}{T}, \qquad \text{BIC}(m) = \ln \hat{\sigma}_m^2 + \frac{m \ln T}{T}$
where $\hat{\sigma}_m^2$ is the estimated noise variance from the order-$m$ fit. BIC is consistent (selects the true order with probability 1 as $T \to \infty$) because its penalty grows with $T$. AIC tends to overfit -- it has a nonzero probability of selecting too large an order even asymptotically -- but it can yield better out-of-sample forecasts in finite samples because it balances bias and variance rather than targeting the true model.
Small-sample pitfalls:
- PACF cutoff: With short series ($T < 100$), the confidence bands $\pm 1.96/\sqrt{T}$ are wide, making it hard to distinguish small but nonzero partial autocorrelations from noise. You may underfit. Also, if the true process is near the stationarity boundary (roots of the characteristic polynomial near the unit circle), convergence of $\hat{\phi}_{m,m}$ is slow and the normal approximation is poor.
- AIC: Overfits in small samples. With $T = 50$ and $m_{\max} = 10$, the penalty