Predictive Content of Order Imbalance
You are testing whether signed order flow predicts short-horizon returns. Let $r_{t+1}$ be the return over the next period and $q_t$ be signed volume (buyer-initiated minus seller-initiated trades, normalized) at time $t$. You run an OLS regression:
$r_{t+1} = \alpha + \beta q_t + \varepsilon_{t+1}$
Because financial returns and order flow both exhibit heteroskedasticity and serial correlation, standard OLS standard errors are invalid. Work through the following:
- Derive the Newey-West HAC $t$-statistic for $H_0: \beta = 0$ with bandwidth $L$. Write out the HAC covariance matrix estimator explicitly and show how the $t$-stat is formed.
- Explain how to select the lag $L$ as a function of sample size $T$ and the autocorrelation structure of the residuals. What are the practical trade-offs?
- Describe an out-of-sample forecasting test using a rolling estimation window. Define what you would report: forecast $R^2$ and the Diebold-Mariano test statistic. What null hypothesis does the DM test evaluate, and what does rejection mean here?
Hints
- The OLS estimate $\hat{\beta}$ is unchanged -- only its standard error needs correction. Think of the HAC variance as the usual sandwich $(X'X)^{-1} \hat{\Omega} (X'X)^{-1}$, where $\hat{\Omega}$ sums up weighted sample autocovariance matrices of the score $x_t \hat{\varepsilon}_t$.
- For lag selection, the Newey-West rule of thumb is $L \approx 4(T/100)^{2/9}$; the Andrews plug-in requires fitting an AR(1) to residuals and using the implied spectral density at frequency zero. Both balance bias (too-small $L$) against variance (too-large $L$).
- For the out-of-sample test, form rolling forecasts $\hat{r}_{s+1} = \hat{\alpha}_s + \hat{\beta}_s q_s$ and compare to the historical mean benchmark. Forecast $R^2_{\text{OOS}} = 1 - \sum \hat{e}_s^2 / \sum \tilde{e}_s^2$; the Diebold-Mariano statistic tests $H_0: E[d_s] = 0$ where $d_s = \tilde{e}_s^2 - \hat{e}_s^2$, using a HAC variance for $\bar{d}$.
Worked Solution
How to Think About It: Order imbalance is one of the most-studied predictors in microstructure -- when buyer-initiated volume exceeds seller-initiated volume, prices tend to drift up. The key statistical challenge is that both $q_t$ and $\varepsilon_{t+1}$ are autocorrelated and heteroskedastic: intraday volatility clusters, order flow is persistent, and residuals inherit both. Conventional OLS standard errors will then typically be too small, making $\hat{\beta}$ look significant when it is not. Newey-West is the standard fix. This is a three-part problem: derive the estimator, pick the bandwidth, then design the out-of-sample test. Each part has a common mistake -- get them right and you have shown you understand how empirical finance actually works.
Key Insight: OLS gives you $\hat{\beta}$ -- that part is fine. The problem is the variance of $\hat{\beta}$. The HAC estimator corrects this variance for serial correlation and heteroskedasticity in the residuals, up to lag $L$. Everything flows from that correction.
Part 1: Newey-West HAC $t$-Statistic
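Stack the coefficients as $\hat{\theta} = (\hat{\alpha}, \hat{\beta})'$. The OLS estimator is $\hat{\theta} = (X'X)^{-1} X'y$, where $X$ is the $T \times 2$ matrix of $[1, q_t]$ observations and $y$ stacks the returns $r_{t+1}$. The sandwich variance estimator is:
$\widehat{\text{Var}}(\hat{\theta}) = (X'X)^{-1} \hat{\Omega} (X'X)^{-1}$
where $\hat{\Omega}$ estimates the covariance matrix of $X'\varepsilon = \sum_t x_t \varepsilon_t$. Under the classical assumptions (homoskedastic, serially uncorrelated errors), $\hat{\Omega} = \hat{\sigma}^2 X'X$ and the sandwich collapses to the familiar $\hat{\sigma}^2 (X'X)^{-1}$. The Newey-West correction replaces this with: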
$\hat{\Omega}_{\text{NW}} = \hat{\Gamma}_0 + \sum_{j=1}^{L} w_j \left( \hat{\Gamma}_j + \hat{\Gamma}_j' \right)$
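where the Bartlett (triangular) kernel weights are $w_j = 1 - \frac{j}{L+1}$, and the autocovariance matrices of the score $x_t \hat{\varepsilon}_t$ are left as sums (no $1/T$ factor) so that $\hat{\Omega}_{\text{NW}}$ has the same scale as $X'X$ in the sandwich:
$\hat{\Gamma}_j = \sum_{t=j+1}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_{t-j} x_t x_{t-j}'$
Here $\hat{\varepsilon}_t = r_{t+1} - \hat{\alpha} - \hat{\beta} q_t$ are OLS residuals and $x_t = [1, q_t]'$. If you prefer to average the $\hat{\Gamma}_j$ (divide by $T$), multiply the sandwich by $T$ to compensate.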
The Newey-West variance of the slope estimate $\hat{\beta}$ is the $(2,2)$ element of the sandwich:
$\widehat{\text{Var}}_{\text{NW}}(\hat{\beta}) = \left[(X'X)^{-1} \hat{\Omega}_{\text{NW}} (X'X)^{-1}\right]_{22}$
The Newey-West $t$-statistic for $H_0: \beta = 0$ is then:
$t_{\text{NW}} = \frac{\hat{\beta}}{\sqrt{\widehat{\text{Var}}_{\text{NW}}(\hat{\beta})}}$
Under $H_0$, this is asymptotically $N(0,1)$ as $T \to \infty$, provided $L \to \infty$ slowly enough that $L/T \to 0$ (Newey and West (1987) assume $L = o(T^{1/4})$; the exact rate condition varies across papers). In practice, use critical values from $t_{T-2}$ for finite samples, but at the sample sizes typical in microstructure research ($T > 500$), the normal approximation is fine.
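The derivation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `newey_west_tstat` and its arguments are assumptions made here, the inputs are taken to be aligned so that `r_next[t]` is the return following `q[t]`, and the score sums are left unnormalized so that the sandwich with $(X'X)^{-1}$ has the right scale.

```python
import numpy as np

def newey_west_tstat(r_next, q, L):
    """OLS of r_next on [1, q] with Newey-West (Bartlett) standard errors.

    r_next[t] is assumed to be the return realized after q[t];
    L is the truncation lag. Returns (beta_hat, se_nw, t_nw) for the slope.
    """
    r_next = np.asarray(r_next, dtype=float)
    q = np.asarray(q, dtype=float)
    T = len(q)
    X = np.column_stack([np.ones(T), q])      # T x 2 design matrix
    XtX_inv = np.linalg.inv(X.T @ X)
    theta = XtX_inv @ X.T @ r_next            # (alpha_hat, beta_hat)
    resid = r_next - X @ theta
    scores = X * resid[:, None]               # row t holds x_t * eps_t

    omega = scores.T @ scores                 # Gamma_0 (unnormalized sum)
    for j in range(1, L + 1):
        w = 1.0 - j / (L + 1.0)               # Bartlett weight w_j
        gamma_j = scores[j:].T @ scores[:-j]  # sum over t of s_t s_{t-j}'
        omega += w * (gamma_j + gamma_j.T)

    cov = XtX_inv @ omega @ XtX_inv           # sandwich covariance of theta
    se = np.sqrt(cov[1, 1])                   # (2,2) element -> slope
    return theta[1], se, theta[1] / se
```

With `L = 0` the loop is skipped and the result reduces to White's heteroskedasticity-robust standard error, which makes a handy sanity check before trusting larger bandwidths.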
Part 2: Lag Selection
The choice of $L$ is a bias-variance trade-off. Too small: you miss real serial correlation in the residuals, so $\hat{\Omega}_{\text{NW}}$ is downward biased and the $t$-stat is inflated. Too large: you include noisy autocovariance estimates at long lags, inflating the variance of the estimator and losing power.
The standard automatic bandwidth rules:
- Newey-West (1994) rule of thumb: $L = \lfloor 4(T/100)^{2/9} \rfloor$. For $T = 1000$, $4 \cdot 10^{2/9} \approx 6.67$, so $L = 6$ after taking the floor (or $7$ if you round to the nearest integer).
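- Andrews (1991) plug-in (Bartlett kernel): Fit an AR(1) to the residuals, estimate the first-order autocorrelation $\hat{\rho}$, then use $L^{*} = 1.1447 \left( \hat{a} \, T \right)^{1/3}$ with $\hat{a} = \frac{4\hat{\rho}^2}{(1 - \hat{\rho}^2)^2}$. More persistent residuals (larger $|\hat{\rho}|$) call for a longer bandwidth.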
- Practical floor: If you are testing return predictability at a 5-minute horizon, set $L$ to at least the number of intraday periods (e.g., $L = 78$ for a full trading day of 5-minute bars). You do not want to miss within-day autocorrelation.
Common mistake: picking $L = 1$ or $L = 5$ because it "seems reasonable" without checking the residual autocorrelation function. Always plot the residual ACF and ensure your $L$ covers the significant lags. If autocorrelation decays slowly (long memory), consider Newey-West with a larger bandwidth or a Parzen kernel.
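The two automatic rules can be sketched as below. The function names are hypothetical, and the Andrews constant assumed here is the AR(1), Bartlett-kernel version, $\hat{a} = 4\hat{\rho}^2 / (1-\hat{\rho}^2)^2$. The Andrews result is a real number that you would still round and check against the residual ACF.

```python
import math
import numpy as np

def nw_rule_of_thumb(T):
    """Floor of 4 * (T / 100)^(2/9)."""
    return int(math.floor(4.0 * (T / 100.0) ** (2.0 / 9.0)))

def ar1_coefficient(x):
    """Least-squares AR(1) coefficient of the demeaned series x."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(x[1:] @ x[:-1] / (x[:-1] @ x[:-1]))

def andrews_bartlett_bandwidth(rho, T):
    """1.1447 * (a * T)^(1/3) with a = 4 rho^2 / (1 - rho^2)^2.

    Assumes |rho| < 1; returns a real-valued bandwidth (not yet rounded).
    """
    a = 4.0 * rho ** 2 / (1.0 - rho ** 2) ** 2
    return 1.1447 * (a * T) ** (1.0 / 3.0)
```

A typical use would be `andrews_bartlett_bandwidth(ar1_coefficient(resid), len(resid))`, where `resid` is the vector of OLS residuals. Applying the same AR(1) fit to the score series $q_t \hat{\varepsilon}_t$ instead is a reasonable variation.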
Part 3: Out-of-Sample Rolling Window Test
In-sample regression $t$-stats are notorious for over-fitting. The gold standard in empirical finance is an out-of-sample test.
*Setup:* Divide the sample into an initial estimation window of $T_0$ observations and a forecast evaluation period of $T_1 = T - T_0$ periods. At each time $s$ in the evaluation period:
1. Estimate $\hat{\alpha}_s, \hat{\beta}_s$ using the most recent $T_0$ observation pairs $(q_t, r_{t+1})$, $t \in \{s - T_0, \ldots, s-1\}$, so that every return in the window is already known at time $s$.
2. Form the one-step-ahead forecast: $\hat{r}_{s+1} = \hat{\alpha}_s + \hat{\beta}_s q_s$.
3. Record the benchmark forecast: $\bar{r}_{s+1} = \bar{r}_s$ (the rolling historical mean -- the "no predictability" null).
This produces $T_1$ pairs of forecast errors: $\hat{e}_s = r_{s+1} - \hat{r}_{s+1}$ and $\tilde{e}_s = r_{s+1} - \bar{r}_{s+1}$.
*Forecast $R^2$ (Campbell-Thompson):*
$R^2_{\text{OOS}} = 1 - \frac{\sum_{s} \hat{e}_s^2}{\sum_{s} \tilde{e}_s^2}$
If $R^2_{\text{OOS}} > 0$, the order imbalance model has lower mean squared prediction error than the historical mean. Note: $R^2_{\text{OOS}}$ can be negative even when the in-sample $R^2$ is positive -- this is the price of estimation error and overfitting.
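The rolling scheme and the forecast $R^2$ can be sketched as follows. The function names and the array layout are assumptions for illustration: `r_next[t]` is taken to be the return realized after `q[t]`, so the window ending at index `s - 1` only uses returns already observed at time `s`, and the benchmark is the mean return over the same window.

```python
import numpy as np

def rolling_oos_errors(r_next, q, T0):
    """Rolling-window forecast errors for the model and the mean benchmark.

    For each s >= T0, fit OLS on the pairs (q[t], r_next[t]) with
    t = s - T0, ..., s - 1, then forecast r_next[s] from q[s].
    Returns (e_model, e_bench), each of length len(q) - T0.
    """
    r_next = np.asarray(r_next, dtype=float)
    q = np.asarray(q, dtype=float)
    T = len(q)
    e_model = np.empty(T - T0)
    e_bench = np.empty(T - T0)
    for i, s in enumerate(range(T0, T)):
        q_win = q[s - T0:s]
        r_win = r_next[s - T0:s]
        X = np.column_stack([np.ones(T0), q_win])
        alpha, beta = np.linalg.lstsq(X, r_win, rcond=None)[0]
        e_model[i] = r_next[s] - (alpha + beta * q[s])  # model error
        e_bench[i] = r_next[s] - r_win.mean()           # benchmark error
    return e_model, e_bench

def r2_oos(e_model, e_bench):
    """1 - SSE(model) / SSE(benchmark); negative if the model is worse."""
    return 1.0 - np.sum(e_model ** 2) / np.sum(e_bench ** 2)
```

Refitting with `lstsq` at every step is the simplest thing that works. For long intraday samples you would likely replace it with recursive updates of the window sums, but the forecast errors would be the same.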
*Diebold-Mariano Test:*
Define the loss differential $d_s = \tilde{e}_s^2 - \hat{e}_s^2$ (positive when the benchmark loses more). The DM statistic tests $H_0: E[d_s] = 0$ (equal predictive accuracy):
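$DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{d} / T_1}}$
where $\bar{d} = T_1^{-1} \sum_s d_s$ and $\hat{\sigma}^2_{d}$ is a HAC (Newey-West) estimate of the long-run variance of $d_s$ (because loss differentials are themselves autocorrelated), so that $\hat{\sigma}^2_{d} / T_1$ estimates $\text{Var}(\bar{d})$. Under $H_0$, $DM \xrightarrow{d} N(0,1)$. Because the historical-mean benchmark is nested in the regression model, the normal approximation is usually justified by the fixed-length rolling window (Giacomini and White, 2006); with an expanding window, adjusted tests such as Clark-West are commonly used instead.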
Rejection of $H_0$ in the direction $\bar{d} > 0$ means the order imbalance model has statistically significantly lower MSE than the historical mean -- genuine out-of-sample predictability.
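A minimal sketch of the DM statistic, taking the two forecast-error series as given. The function name and the one-sided normal p-value are choices made here for illustration; the long-run variance uses the same Bartlett weights as in Part 1, applied to the demeaned loss differential.

```python
import math
import numpy as np

def dm_statistic(e_model, e_bench, L):
    """Diebold-Mariano statistic for squared-error loss.

    d_s = e_bench_s^2 - e_model_s^2, so a positive statistic favors the model.
    Returns (dm, p_one_sided), where p_one_sided = P(Z > dm) for Z ~ N(0, 1).
    """
    e_model = np.asarray(e_model, dtype=float)
    e_bench = np.asarray(e_bench, dtype=float)
    d = e_bench ** 2 - e_model ** 2
    T1 = len(d)
    dc = d - d.mean()
    lrv = dc @ dc / T1                        # lag-0 autocovariance
    for j in range(1, L + 1):
        w = 1.0 - j / (L + 1.0)               # Bartlett weight
        lrv += 2.0 * w * (dc[j:] @ dc[:-j]) / T1
    dm = d.mean() / math.sqrt(lrv / T1)
    p_one_sided = 0.5 * math.erfc(dm / math.sqrt(2.0))
    return dm, p_one_sided
```

The one-sided p-value matches the alternative of interest here, that the order imbalance model has lower MSE than the benchmark. For a two-sided test, use `math.erfc(abs(dm) / math.sqrt(2.0))` instead.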
Answer: (i) $t_{\text{NW}} = \hat{\beta} / \sqrt{\widehat{\text{Var}}_{\text{NW}}(\hat{\beta})}$ where the HAC variance uses Bartlett-weighted autocovariance matrices up to lag $L$. (ii) $L \approx 4(T/100)^{2/9}$ or Andrews plug-in; always verify against residual ACF. (iii) Rolling-window OOS test: report $R^2_{\text{OOS}} = 1 - \text{MSPE}_{\text{model}} / \text{MSPE}_{\text{benchmark}}$ and DM test of equal predictive accuracy with HAC standard errors on loss differentials.
Intuition
The Newey-West correction is a lesson in what OLS actually guarantees. OLS still gives you a consistent estimate of $\beta$ when the residuals are autocorrelated and heteroskedastic -- losing the Gauss-Markov conditions costs efficiency, not consistency. What breaks immediately is the inference: the conventional standard errors ignore the autocovariance terms in $\hat{\Omega}$ and, when the scores $x_t \hat{\varepsilon}_t$ are positively autocorrelated, come out too small.
The out-of-sample test is the harder discipline. In-sample $R^2$ in return prediction is almost always meaningless -- you have enough degrees of freedom to fit noise. The $R^2_{\text{OOS}}$ is a real economic test: can you actually trade on this signal? Even a small positive $R^2_{\text{OOS}}$ (say 0.3%) can be economically significant at high frequency, while a large in-sample $R^2$ that collapses out-of-sample tells you the model was mining the data. The DM test adds the statistical rigor: it tells you whether the outperformance of the order imbalance model over the historical mean is large enough to reject the null of equal accuracy, accounting for the serial correlation in the loss differential series.