HAC vs Two-Way Clustered Standard Errors

Regression · Hard · Free problem

You run a panel regression of returns $y_{i,t}$ on features $x_{i,t}$ across $N$ assets and $T$ time periods. The residuals exhibit both serial correlation (within each asset over time) and cross-sectional correlation (across assets at each date).

Explain why standard OLS standard errors can be badly wrong when both serial correlation and cross-sectional correlation are present. What goes wrong with the usual $\hat{\sigma}^2 (X'X)^{-1}$ formula?

Define a two-way clustered covariance estimator that clusters by both asset and time. Write down the sandwich form and state precisely which dependence structures it is robust to (and which it is not).

Give one concrete failure mode of two-way clustering when $T$ is small (say $T = 20$), and propose a practical remedy.

Hints

Think about what the OLS variance formula $\sigma^2(X'X)^{-1}$ assumes about the residuals, and what happens to $X'\Omega X$ when that assumption is violated in two dimensions.
The two-way clustered estimator uses inclusion-exclusion: add the asset-clustered and time-clustered sandwich estimators, then subtract the White (diagonal) estimator to avoid double-counting.
Clustered inference requires the number of clusters $G$ to be large. When the smaller dimension is small, consider a degrees-of-freedom correction using $t_{G-1}$ critical values or switch to a HAC-style estimator like Driscoll-Kraay.

Worked Solution

How to Think About It: This is one of the most practically important topics in empirical finance, and getting it wrong is a common source of inflated t-statistics in published research. The core issue is simple: OLS point estimates are fine (they remain unbiased under exogenous regressors, whatever the error covariance structure), but the reported standard errors are only valid if the assumed covariance matrix of the residuals matches the true one. When you have a panel of asset returns, residuals are correlated in two dimensions -- over time for the same asset (momentum, mean-reversion) and across assets at the same date (common factor exposure). Ignoring either one typically makes your standard errors too small, which means you reject nulls too often.

Key Insight: The two-way clustered estimator handles arbitrary within-cluster dependence in both dimensions simultaneously, but it requires the number of clusters in each dimension to be large for reliable inference.

The Method:

(i) Why OLS standard errors fail:

The true variance of the OLS estimator $\hat{\beta}$ is:

$\text{Var}(\hat{\beta}) = (X'X)^{-1} X' \Omega X (X'X)^{-1}$

where $\Omega = E[\epsilon \epsilon']$ is the full $NT \times NT$ covariance matrix of residuals. The standard OLS formula assumes $\Omega = \sigma^2 I$, collapsing this to $\sigma^2 (X'X)^{-1}$.

With serial correlation, the off-diagonal entries of $\Omega$ linking different dates for the same asset are nonzero. With cross-sectional correlation, the off-diagonal entries linking different assets at the same date are nonzero. When the regressors are themselves positively correlated along the same dimensions (as is typical for persistent features and common factor exposures), both effects add positive mass to $X' \Omega X$ that the naive formula ignores. The result: OLS standard errors are biased downward, often severely. In typical equity panels, the true standard error can be 3-5x larger than the naive one, turning a "significant" $t = 3$ into an insignificant $t = 0.8$.
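To make the gap between the two formulas concrete, here is a minimal NumPy sketch that builds a known $\Omega$ and compares the sandwich variance with the naive one. Everything in it is an illustrative assumption: the panel dimensions, a single regressor with no intercept, and an error of the form "common date shock plus idiosyncratic noise" (cross-sectional correlation only; serial correlation is left out for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20, 10                  # hypothetical panel dimensions
sigma_g, sigma_u = 1.0, 1.0    # std of the common date shock and of idiosyncratic noise

# Single regressor with a common date component, stacked date-major (row = t * N + i).
f = rng.standard_normal(T)
x = (np.repeat(f, N) + 0.5 * rng.standard_normal(N * T)).reshape(-1, 1)

# Assumed error model eps_{i,t} = g_t + u_{i,t}: errors on the same date share sigma_g^2.
omega = sigma_g**2 * np.kron(np.eye(T), np.ones((N, N))) + sigma_u**2 * np.eye(N * T)

bread = np.linalg.inv(x.T @ x)
v_true = bread @ x.T @ omega @ x @ bread       # (X'X)^{-1} X' Omega X (X'X)^{-1}
v_naive = (sigma_g**2 + sigma_u**2) * bread    # sigma^2 (X'X)^{-1}

se_ratio = float(np.sqrt(v_true[0, 0] / v_naive[0, 0]))
print(f"true SE / naive SE = {se_ratio:.2f}")
```

The ratio exceeds one here because both the regressor and the error share a date-level component; if the regressor had no common component, the extra terms in $X' \Omega X$ would roughly cancel.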

Intuitively, correlated residuals reduce the effective sample size. If 500 stocks all move together on the same day, that cross-section is not 500 independent observations -- it might be worth 5.
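The "500 is worth 5" figure can be checked with the standard effective-sample-size formula for equicorrelated observations, $n_{\text{eff}} = n / (1 + (n-1)\rho)$, which follows from the variance of a mean of $n$ variables with common pairwise correlation $\rho$. The value $\rho = 0.2$ below is an assumption chosen for illustration, not a number from the problem.

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Independent-observation equivalent of n observations with pairwise correlation rho."""
    return n / (1.0 + (n - 1) * rho)

# 500 stocks with an assumed pairwise residual correlation of 0.2 on a given date.
print(round(effective_sample_size(500, 0.2), 2))  # 4.96
```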

(ii) Two-way clustered covariance estimator:

The two-way clustered estimator uses the inclusion-exclusion formula:

$\hat{V}_{\text{two-way}} = \hat{V}_{\text{asset}} + \hat{V}_{\text{time}} - \hat{V}_{\text{white}}$

where each component is a sandwich estimator. For clustering by asset:

$\hat{V}_{\text{asset}} = (X'X)^{-1} \left( \sum_{i=1}^{N} X_i' \hat{\epsilon}_i \hat{\epsilon}_i' X_i \right) (X'X)^{-1}$

Here $X_i$ and $\hat{\epsilon}_i$ are the data and residuals for asset $i$ (a $T \times k$ matrix and $T \times 1$ vector). This allows arbitrary correlation across time within asset $i$, but assumes independence across assets.

Symmetrically, $\hat{V}_{\text{time}}$ clusters by date, allowing arbitrary cross-sectional correlation at each $t$ but assuming independence across dates.

The subtraction of $\hat{V}_{\text{white}}$ (the White heteroskedasticity-robust estimator, which is the diagonal-only version) corrects for the double-counting of the diagonal terms.
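
Written out in the same notation, and taking $X_t$ and $\hat{\epsilon}_t$ to be the $N \times k$ data matrix and $N \times 1$ residual vector for date $t$, and $x_{i,t}$ the $k \times 1$ regressor vector for a single observation (symbols introduced here for illustration), the other two components would be:

$\hat{V}_{\text{time}} = (X'X)^{-1} \left( \sum_{t=1}^{T} X_t' \hat{\epsilon}_t \hat{\epsilon}_t' X_t \right) (X'X)^{-1}$

$\hat{V}_{\text{white}} = (X'X)^{-1} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} \hat{\epsilon}_{i,t}^2 \, x_{i,t} x_{i,t}' \right) (X'X)^{-1}$

The White term is exactly what the two cluster sums share: the intersection of an asset cluster and a date cluster is the single observation $(i, t)$, so its squared score appears once in $\hat{V}_{\text{asset}}$ and once in $\hat{V}_{\text{time}}$.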

This estimator is robust to: arbitrary serial correlation within each asset, arbitrary cross-sectional correlation at each date, and heteroskedasticity. It is NOT robust to: dependence between asset $i$ at time $t$ and asset $j$ at time $s$ where $i \neq j$ and $t \neq s$ (i.e., lagged cross-asset dependence). In practice, this "off-diagonal" dependence is usually weak enough to ignore, but it is an assumption.
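The inclusion-exclusion formula can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `cluster_meat` and `two_way_cluster_cov` are made up here, no finite-sample scaling is applied, each (asset, date) pair is assumed to appear at most once, and the simulated panel exists only to exercise the code.

```python
import numpy as np

def cluster_meat(X, resid, groups):
    """Sum over clusters g of (X_g' e_g)(X_g' e_g)'."""
    scores = X * resid[:, None]                      # row-wise x_{i,t} * e_{i,t}
    _, codes = np.unique(np.asarray(groups), return_inverse=True)
    sums = np.zeros((codes.max() + 1, X.shape[1]))
    np.add.at(sums, codes, scores)                   # cluster-level score sums
    return sums.T @ sums

def two_way_cluster_cov(X, y, asset_ids, time_ids):
    """OLS coefficients and V_asset + V_time - V_white (no small-sample scaling)."""
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    meat = (cluster_meat(X, resid, asset_ids)
            + cluster_meat(X, resid, time_ids)
            - cluster_meat(X, resid, np.arange(len(y))))  # each obs is its own cluster
    return beta, bread @ meat @ bread

# Simulated panel with asset effects and common date shocks (illustrative only).
rng = np.random.default_rng(1)
N, T = 50, 30
asset_ids = np.repeat(np.arange(N), T)
time_ids = np.tile(np.arange(T), N)
x = rng.standard_normal(N * T) + rng.standard_normal(T)[time_ids]
X = np.column_stack([np.ones(N * T), x])
eps = (rng.standard_normal(N * T) + rng.standard_normal(T)[time_ids]
       + rng.standard_normal(N)[asset_ids])
y = X @ np.array([0.0, 1.0]) + eps

beta, V = two_way_cluster_cov(X, y, asset_ids, time_ids)
print("coefficients:", beta)
print("two-way clustered SEs:", np.sqrt(np.diag(V)))
```

Because one term is subtracted, the resulting matrix is not guaranteed to be positive semi-definite in finite samples, so it is worth checking the diagonal before taking square roots.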

(iii) Failure mode when $T$ is small:

When $T$ is small (say 20 monthly observations), the time-cluster dimension has only 20 clusters. Clustered standard errors have a finite-sample bias that scales roughly as $1/G$, where $G$ is the number of clusters. With only 20 time clusters, the time-clustered component $\hat{V}_{\text{time}}$ is imprecisely estimated and can be severely downward-biased. The usual rule of thumb is that you need at least 40-50 clusters for clustered inference to be reliable.

The resulting t-statistics are over-dispersed relative to the standard normal, meaning the test over-rejects.

Remedy: Use the small-cluster correction associated with Cameron, Gelbach, and Miller (CGM), which is a simple degrees-of-freedom adjustment: compare the t-statistic to a $t_{G-1}$ distribution rather than a standard normal, where $G = \min(N, T)$ is the smaller cluster dimension. For $T = 20$, this means using $t_{19}$ critical values (2.09 instead of 1.96 at the 5% level, two-sided), which partially corrects for the over-rejection.
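The 2.09 figure can be reproduced without any third-party dependency. The sketch below inverts the Student-t CDF by Simpson integration and bisection, using only the standard library; the helper names are made up for illustration, and in everyday work `scipy.stats.t.ppf(0.975, 19)` would be the usual route.

```python
import math

def t_pdf(x: float, df: float) -> float:
    """Student-t density with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x: float, df: float, steps: int = 2000) -> float:
    """CDF for x >= 0: 0.5 plus Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * t_pdf(j * h, df)
    return 0.5 + total * h / 3

def t_critical(df: float, alpha: float = 0.05) -> float:
    """Two-sided critical value: solve t_cdf(c) = 1 - alpha / 2 by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

G = 20                                 # number of time clusters in the example
print(round(t_critical(G - 1), 3))     # 2.093
```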

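Alternatively, if $N$ is large, a Driscoll-Kraay (1998) estimator uses a HAC-style kernel over the time dimension applied to the cross-sectional averages $\bar{e}_t = N^{-1} \sum_i x_{i,t} \hat{\epsilon}_{i,t}$. Its consistency relies on $T \to \infty$ with no restriction on $N$, and the kernel also picks up lagged cross-asset dependence up to the chosen bandwidth, which two-way clustering ignores. It can be more reliable when $T$ is moderate, though with $T$ as small as 20 the kernel estimate is itself noisy.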
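A Driscoll-Kraay-style covariance can be sketched as a Newey-West estimator applied to the date-level score sums. The version below is an illustration under stated assumptions: Bartlett kernel weights, equally spaced dates that sort correctly, a user-chosen `max_lag`, no finite-sample scaling, and a made-up function name. It works with sums rather than the averages $\bar{e}_t$; the $N^{-1}$ factor is absorbed because the bread is $(X'X)^{-1}$.

```python
import numpy as np

def driscoll_kraay_cov(X, resid, time_ids, max_lag):
    """Sandwich covariance with a Bartlett-kernel HAC meat on date-level score sums."""
    scores = X * resid[:, None]
    times, codes = np.unique(np.asarray(time_ids), return_inverse=True)
    h = np.zeros((len(times), X.shape[1]))
    np.add.at(h, codes, scores)                 # h_t = sum_i x_{i,t} e_{i,t}, dates sorted
    meat = h.T @ h                              # lag-0 term
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1)      # Bartlett weight
        gamma = h[lag:].T @ h[:-lag]            # sum_t h_t h_{t-lag}'
        meat += weight * (gamma + gamma.T)
    bread = np.linalg.inv(X.T @ X)
    return bread @ meat @ bread

# Simulated panel with a common date shock (illustrative only).
rng = np.random.default_rng(2)
N, T = 100, 40
time_ids = np.repeat(np.arange(T), N)
x = rng.standard_normal(N * T) + rng.standard_normal(T)[time_ids]
X = np.column_stack([np.ones(N * T), x])
y = X @ np.array([0.0, 1.0]) + rng.standard_normal(T)[time_ids] + rng.standard_normal(N * T)
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

V_dk = driscoll_kraay_cov(X, resid, time_ids, max_lag=3)
print("Driscoll-Kraay SEs:", np.sqrt(np.diag(V_dk)))
```

With `max_lag=0` the kernel drops out and the result coincides with the unscaled time-clustered estimator, which is a convenient sanity check.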

Answer: Standard OLS standard errors fail because they assume $\Omega = \sigma^2 I$, ignoring both serial and cross-sectional correlation that inflates $X'\Omega X$. The two-way clustered estimator combines asset-clustered and time-clustered sandwich estimators via $\hat{V}_{\text{two-way}} = \hat{V}_{\text{asset}} + \hat{V}_{\text{time}} - \hat{V}_{\text{white}}$, handling arbitrary within-cluster dependence in both dimensions. When $T$ is small, the time-cluster component is imprecise; use CGM small-cluster corrections or switch to a Driscoll-Kraay HAC estimator.

Intuition

The fundamental issue in panel regression inference is that correlated residuals shrink your effective sample size. If you have 500 stocks observed over 60 months, your nominal sample size is 30,000 -- but if all stocks load on a common factor, each cross-section might carry the information content of just a few independent observations. The two-way clustered estimator is the standard fix because it lets you be agnostic about the within-cluster dependence structure: you don't need to model the correlation, you just need enough clusters for the law of large numbers to kick in on the cluster-level "scores" $X_i' \hat{\epsilon}_i$.

In practice, the binding constraint is almost always the time dimension. Most equity panels have hundreds or thousands of assets but only tens to hundreds of months. This means the cross-sectional clustering works fine, but the time clustering is shaky. This is one reason Fama-MacBeth (1973) regressions remain so popular in finance -- they handle cross-sectional correlation by construction (you run a cross-sectional regression each month and then do inference on the time series of coefficient estimates), replacing the cluster-robust covariance estimate with a simple t-test on $T$ coefficient estimates. That test still has only $T - 1$ degrees of freedom and, in its basic form, treats the monthly estimates as serially uncorrelated, so it does not address within-asset serial correlation. Knowing when to use two-way clustering vs. Fama-MacBeth vs. Driscoll-Kraay is one of the practical judgment calls that separates careful empirical work from naive number-crunching.
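The Fama-MacBeth procedure described above can be sketched as follows. This is a minimal version under stated assumptions: a balanced simulated panel, plain OLS in each cross-section, no Newey-West adjustment of the coefficient time series, and a function name chosen here for illustration.

```python
import numpy as np

def fama_macbeth(X, y, time_ids):
    """Per-date cross-sectional OLS, then mean and standard error of the coefficient series."""
    times = np.unique(time_ids)
    betas = np.array([
        np.linalg.lstsq(X[time_ids == t], y[time_ids == t], rcond=None)[0]
        for t in times
    ])                                               # shape (T, k)
    mean_beta = betas.mean(axis=0)
    se = betas.std(axis=0, ddof=1) / np.sqrt(len(times))
    return mean_beta, se

# Simulated panel with a common date shock (illustrative only).
rng = np.random.default_rng(3)
N, T = 30, 24
time_ids = np.repeat(np.arange(T), N)
X = np.column_stack([np.ones(N * T), rng.standard_normal(N * T)])
y = (X @ np.array([0.5, 2.0])
     + rng.standard_normal(T)[time_ids]
     + rng.standard_normal(N * T))

mean_beta, se = fama_macbeth(X, y, time_ids)
print("mean coefficients:", mean_beta)
print("t-statistics (compare to t with T - 1 df):", mean_beta / se)
```

The common date shock is absorbed by each month's intercept, which is why the slope's standard error is not distorted by cross-sectional correlation in this setup.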
