Newey-West HAC Estimator in AR(1) Noise

Statistics · Hard · Free problem

You are running a predictive regression

$r_t = \beta \, s_{t-1} + \varepsilon_t$

where the error term follows an AR(1) process $\varepsilon_t = \rho \, \varepsilon_{t-1} + u_t$ with $u_t \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$.

Because the errors are serially correlated, the usual OLS standard errors for $\hat{\beta}$ are wrong. You want to use a Newey-West HAC (heteroskedasticity-and-autocorrelation-consistent) estimator to get a valid estimate of $\text{Var}(\hat{\beta})$.

Derive the Newey-West estimator of $\text{Var}(\hat{\beta})$ using the Bartlett kernel with bandwidth $L$.

Show that the estimator has a leading bias of order $O(1/L)$ and identify the form of this bias.

Discuss how to choose $L$ to balance the bias-variance trade-off, and state the standard rate for optimal bandwidth as a function of $n$.

Hints

Start by writing $\text{Var}(\hat{\beta})$ in terms of the autocovariance function $\gamma_j$ of the errors. The key object you need to estimate is the long-run variance $\Omega = \sum_{j=-\infty}^{\infty} \gamma_j$.
The Bartlett kernel applies weights $w_j = 1 - j/(L+1)$ to the sample autocovariances. Think about what bias this taper introduces compared to using the raw autocovariances -- especially the contribution from the $j/(L+1)$ term.
To find the optimal $L$, set up the MSE as $O(1/L^2) + O(L/n)$ and minimize. The squared-bias term $O(1/L^2)$ comes from the taper; the variance term $O(L/n)$ comes from summing $L$ noisy autocovariance estimates.

Worked Solution

How to Think About It: The core issue is straightforward: the usual OLS standard errors assume serially uncorrelated errors, but your errors have autocorrelation $\rho$. The true variance of $\hat{\beta}$ depends on the entire autocovariance structure of $\varepsilon_t$. In practice, you do not know $\rho$ -- and even if you did, the real world has more complicated serial dependence than a clean AR(1). The Newey-West estimator is the workhorse solution: estimate the autocovariances from the residuals and weight them with a kernel that tapers off at higher lags. The Bartlett kernel (a linear taper that reaches zero at lag $L+1$) is the classic choice because it guarantees a positive semi-definite variance estimate.

Quick Estimate: For an AR(1) with autocorrelation $\rho$, the true long-run variance of $\varepsilon_t$ is $\sigma^2 / (1 - \rho)^2$. If $\rho = 0.5$ and $\sigma^2 = 1$, that is $1 / 0.25 = 4$. The naive OLS variance (which ignores autocorrelation) would use $\gamma_0 = \sigma^2 / (1 - \rho^2) = 4/3$, understating the long-run variance by a factor of $(1 + \rho)/(1 - \rho) = 3$. With $n = 200$ and bandwidth $L = 6$, the Bartlett taper shrinks the estimate by roughly $\frac{2}{L+1} \cdot \frac{\rho}{1 - \rho^2} \approx 19\%$, which is noticeable but small compared to the factor-of-3 error from ignoring autocorrelation entirely.

Formal Derivation:

Start with the OLS estimator:

$\hat{\beta} = \beta + \frac{\sum_{t=1}^{n} s_{t-1} \varepsilon_t}{\sum_{t=1}^{n} s_{t-1}^2}$

The exact variance is:

$\text{Var}(\hat{\beta}) = \frac{\sum_{t=1}^{n} \sum_{r=1}^{n} s_{t-1} s_{r-1} \, \text{Cov}(\varepsilon_t, \varepsilon_r)}{\left(\sum_{t=1}^{n} s_{t-1}^2\right)^2}$

Define the autocovariance $\gamma_j = \text{Cov}(\varepsilon_t, \varepsilon_{t-j})$. For the AR(1) process, $\gamma_j = \sigma^2 \rho^{|j|} / (1 - \rho^2)$. The key quantity in the numerator is the long-run variance:

$\Omega = \sum_{j=-\infty}^{\infty} \gamma_j = \frac{\sigma^2}{(1 - \rho)^2}$
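The closed form can be checked numerically. The sketch below sums the AR(1) autocovariances over a wide symmetric lag window and compares the result with $\sigma^2 / (1 - \rho)^2$; the function names and the values $\rho = 0.5$, $\sigma^2 = 1$ are illustrative choices.

```python
import numpy as np

def ar1_autocov(j, rho, sigma2):
    """Population autocovariance gamma_j of a stationary AR(1) error process."""
    return sigma2 * rho ** np.abs(j) / (1.0 - rho ** 2)

def ar1_long_run_variance(rho, sigma2):
    """Closed form Omega = sigma^2 / (1 - rho)^2."""
    return sigma2 / (1.0 - rho) ** 2

rho, sigma2 = 0.5, 1.0
lags = np.arange(-200, 201)                       # wide enough that the tail is negligible
omega_sum = ar1_autocov(lags, rho, sigma2).sum()  # gamma_0 + 2 * sum_{j>=1} gamma_j
omega_closed = ar1_long_run_variance(rho, sigma2)
print(omega_sum, omega_closed)                    # both are 4 up to rounding
```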

The Newey-West estimator replaces this with a kernel-weighted sum of sample autocovariances. Using the Bartlett kernel with bandwidth $L$:

$\hat{\Omega}_{NW} = \hat{\gamma}_0 + 2 \sum_{j=1}^{L} w_j \, \hat{\gamma}_j$

where the Bartlett weights are $w_j = 1 - j/(L+1)$ and the sample autocovariances are:

$\hat{\gamma}_j = \frac{1}{n} \sum_{t=j+1}^{n} \hat{\varepsilon}_t \, \hat{\varepsilon}_{t-j}$

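with $\hat{\varepsilon}_t$ being the OLS residuals. As written, $\hat{\Omega}_{NW}$ estimates the long-run variance of the errors alone. For the variance of $\hat{\beta}$, the same formula is applied to the score series $\hat{g}_t = s_{t-1} \hat{\varepsilon}_t$ in place of $\hat{\varepsilon}_t$; call the result $\hat{\Omega}_{NW}^{g}$. The bias and variance orders derived below hold for either series. The Newey-West variance estimate is then:

$\widehat{\text{Var}}(\hat{\beta}) = \frac{\hat{\Omega}_{NW}^{g}}{n \left(\frac{1}{n} \sum_{t=1}^{n} s_{t-1}^2\right)^2}$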
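A minimal NumPy sketch of the estimator, assuming the Bartlett kernel is applied to the score series $s_{t-1} \hat{\varepsilon}_t$ so that the sandwich formula has the right units. The function names, the simulated data, and the parameter values (including the 0.9 persistence of the signal) are illustrative and not part of the original problem.

```python
import numpy as np

def bartlett_long_run_variance(x, L):
    """gamma_0 + 2 * sum_{j=1..L} (1 - j/(L+1)) * gamma_j, with 1/n autocovariances."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    omega = x @ x / n                                # gamma_0 (x assumed mean zero)
    for j in range(1, L + 1):
        gamma_j = x[j:] @ x[:-j] / n                 # sample autocovariance at lag j
        omega += 2.0 * (1.0 - j / (L + 1.0)) * gamma_j
    return omega

def newey_west_variance(r, s_lag, L):
    """Return (beta_hat, NW variance) for r_t = beta * s_{t-1} + eps_t, no intercept."""
    r = np.asarray(r, dtype=float)
    s_lag = np.asarray(s_lag, dtype=float)
    n = r.shape[0]
    sxx = s_lag @ s_lag
    beta_hat = (s_lag @ r) / sxx
    resid = r - beta_hat * s_lag
    scores = s_lag * resid                           # sums to zero by the OLS normal equation
    omega_g = bartlett_long_run_variance(scores, L)
    return beta_hat, omega_g / (n * (sxx / n) ** 2)

def simulate_ar1(n, phi, rng):
    """Stationary Gaussian AR(1) path with unit innovation variance."""
    shocks = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = shocks[0] / np.sqrt(1.0 - phi ** 2)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + shocks[t]
    return x

rng = np.random.default_rng(0)
n, beta, rho = 500, 0.1, 0.5
eps = simulate_ar1(n, rho, rng)                      # AR(1) errors
s_lag = simulate_ar1(n, 0.9, rng)                    # persistent signal, already lagged
r = beta * s_lag + eps

beta_hat, var_nw = newey_west_variance(r, s_lag, L=8)
_, var_white = newey_west_variance(r, s_lag, L=0)
print(f"beta_hat={beta_hat:.4f}  NW se={np.sqrt(var_nw):.4f}  L=0 se={np.sqrt(var_white):.4f}")
```

With $L = 0$ the function reduces to the heteroskedasticity-robust (White) variance, which makes a convenient sanity check before turning on the lag terms.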

Bias Analysis:

The bias comes from two sources: (a) the kernel truncates autocovariances beyond lag $L$, and (b) the Bartlett weights down-weight the included lags.

The true long-run variance is $\Omega = \gamma_0 + 2 \sum_{j=1}^{\infty} \gamma_j$. The expected value of $\hat{\Omega}_{NW}$ (replacing sample autocovariances with population values) is:

$E[\hat{\Omega}_{NW}] \approx \gamma_0 + 2 \sum_{j=1}^{L} \left(1 - \frac{j}{L+1}\right) \gamma_j$

The bias is:

$\text{Bias} = E[\hat{\Omega}_{NW}] - \Omega = -2 \sum_{j=1}^{L} \frac{j}{L+1} \gamma_j - 2 \sum_{j=L+1}^{\infty} \gamma_j$

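For the AR(1) process where $\gamma_j$ decays geometrically, the truncation term (second sum) is exponentially small for moderate $L$. The dominant bias term comes from the Bartlett taper. Because $j \, \gamma_j$ is summable, the partial sum can be replaced by its limit:

$\text{Bias} \approx -\frac{2}{L+1} \sum_{j=1}^{L} j \, \gamma_j \approx -\frac{C(\rho, \sigma^2)}{L+1}, \qquad C(\rho, \sigma^2) = 2 \sum_{j=1}^{\infty} j \, \gamma_j = \frac{2 \sigma^2 \rho}{(1 - \rho^2)(1 - \rho)^2}$

where $C(\rho, \sigma^2)$ is a constant measuring how slowly the autocovariances die out. So the leading bias is $O(1/L)$: it is negative for $\rho > 0$ and shrinks as the bandwidth grows. It is proportional to $\sum_j |j| \gamma_j$, the first-order generalized derivative of the spectral density at frequency zero, because the Bartlett kernel is only a first-order kernel; second-order kernels such as Parzen or quadratic spectral instead have bias governed by the second derivative. Finite-sample effects (the $1/n$ divisor in $\hat{\gamma}_j$ and the use of residuals) contribute at most $O(L/n)$, which is of smaller order whenever $L^2/n \to 0$.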
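The bias expression can also be evaluated directly. The sketch below plugs population AR(1) autocovariances into the two sums of the bias formula (taper and truncation) and prints $(L+1) \times \text{Bias}$, which settles near $-2 \sum_{j \ge 1} j \, \gamma_j$ as $L$ grows. The function name, the cutoff `j_max`, and the values $\rho = 0.5$, $\sigma^2 = 1$ are illustrative assumptions.

```python
import numpy as np

def bartlett_population_bias(L, rho, sigma2, j_max=500):
    """Taper and truncation parts of E[Omega_NW] - Omega, using population gamma_j."""
    j = np.arange(1, j_max + 1)
    gamma = sigma2 * rho ** j / (1.0 - rho ** 2)
    taper = -2.0 * np.sum(j[:L] / (L + 1.0) * gamma[:L])   # down-weighting of lags 1..L
    truncation = -2.0 * np.sum(gamma[L:])                  # lags beyond L (cut off at j_max)
    return taper, truncation

rho, sigma2 = 0.5, 1.0
omega = sigma2 / (1.0 - rho) ** 2
limit = -2.0 * sigma2 * rho / ((1.0 - rho ** 2) * (1.0 - rho) ** 2)  # -2 * sum_j j*gamma_j

for L in (4, 8, 16, 32):
    taper, truncation = bartlett_population_bias(L, rho, sigma2)
    total = taper + truncation
    print(L, round(taper, 4), round(truncation, 6), round((L + 1) * total, 4))

# Relative bias at the bandwidth used in the quick estimate (L = 6)
rel_bias_L6 = sum(bartlett_population_bias(6, rho, sigma2)) / omega
print(rel_bias_L6)
```

Scaling by $L + 1$ rather than reading off the raw bias makes the rate visible: a roughly constant scaled value is what a $1/L$ decay looks like.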

Bias-Variance Trade-off and Optimal Bandwidth:

Bias: $O(1/L)$ -- decreases with $L$. A larger bandwidth puts weights closer to one on the low-order lags and cuts off less of the tail.
Variance: Each $\hat{\gamma}_j$ is estimated with roughly $O(1/n)$ variance. You are summing $L$ such terms, so the variance of $\hat{\Omega}_{NW}$ is $O(L/n)$ -- increases with $L$.

The MSE is:

$\text{MSE} = \text{Bias}^2 + \text{Var} \approx \frac{c_1}{L^2} + c_2 \frac{L}{n}$

Minimizing over $L$ by taking the derivative and setting it to zero:

$-\frac{2 c_1}{L^3} + \frac{c_2}{n} = 0 \implies L^{*} = \left( \frac{2 c_1 n}{c_2} \right)^{1/3} \propto n^{1/3}$

This is the Andrews (1991) / Newey-West (1994) result: the optimal bandwidth for the Bartlett kernel grows at rate $n^{1/3}$. In practice, a common plug-in rule uses the AR(1) coefficient $\hat{\rho}$ to set:

$L^{*} = \left\lfloor \left( \frac{3n}{2} \right)^{1/3} \left( \frac{2\hat{\rho}}{1 - \hat{\rho}^2} \right)^{2/3} \right\rfloor$
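The plug-in rule translates directly into code. In the sketch below the function names are hypothetical, and estimating $\hat{\rho}$ by a no-intercept regression of the residuals on their first lag is an assumption about how the coefficient is obtained.

```python
import math
import numpy as np

def ar1_coefficient(resid):
    """Least-squares slope of resid_t on resid_{t-1} (no intercept)."""
    e = np.asarray(resid, dtype=float)
    return float(e[1:] @ e[:-1] / (e[:-1] @ e[:-1]))

def bartlett_plugin_bandwidth(n, rho_hat):
    """floor( (3n/2)^(1/3) * (2*rho/(1 - rho^2))^(2/3) ), for |rho_hat| < 1."""
    if abs(rho_hat) >= 1.0:
        raise ValueError("rho_hat must lie strictly inside (-1, 1)")
    # Square first and take the cube root of the product, so negative rho_hat is handled.
    alpha = (2.0 * rho_hat / (1.0 - rho_hat ** 2)) ** 2
    return math.floor((1.5 * n * alpha) ** (1.0 / 3.0))

print(bartlett_plugin_bandwidth(200, 0.5))    # 8
print(bartlett_plugin_bandwidth(1250, 0.5))   # 14
```

Note that the rule returns $L^{*} = 0$ when $\hat{\rho} = 0$, since there is then no serial correlation to correct for.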

Answer:

The Newey-West HAC estimator uses $\hat{\Omega}_{NW} = \hat{\gamma}_0 + 2 \sum_{j=1}^{L} (1 - j/(L+1)) \hat{\gamma}_j$ with Bartlett weights to estimate the long-run variance. The leading bias is $O(1/L)$, arising from the kernel's taper. The variance of the estimator is $O(L/n)$, so the MSE-optimal bandwidth scales as $L^{*} \propto n^{1/3}$. Choosing $L$ too small underestimates the long-run variance when $\rho > 0$ (taper and truncation bias); choosing $L$ too large inflates estimation noise. The $n^{1/3}$ rate is the standard result for the Bartlett kernel.

Intuition

The Newey-West estimator is one of the most important tools in applied time series econometrics, and the reason it shows up in quant interviews is that nearly every predictive regression in finance has autocorrelated residuals -- whether from overlapping returns, slow-moving factors, or model misspecification. The fundamental tension is simple: you need to estimate how much serial correlation inflates your standard errors, but estimating the autocovariance function itself introduces noise. The Bartlett kernel is the conservative choice because it always gives you a positive semi-definite estimate (so you never get a negative variance), but it pays for this with a non-trivial bias that shrinks only at rate $1/L$ as the bandwidth grows.

The $n^{1/3}$ optimal rate is worth memorizing because it comes up constantly and tells you something practical: bandwidth should grow slowly with sample size. With daily data over 5 years ($n \approx 1250$), you'd use roughly $L \approx 10$-$15$ lags. The common mistake is picking $L$ too large because "more autocorrelation structure is better" -- but in finite samples, each extra lag you include adds estimation noise, and the Bartlett taper cannot fully control it. When an interviewer asks about HAC estimation, they want to hear you articulate this bias-variance trade-off concretely, not just recite the formula.
