Wald Test for Sharpe Ratio Equality Under Serial Correlation
You have two daily return series $r_t^A$ and $r_t^B$, both of which may exhibit short-memory serial correlation. Define their sample Sharpe ratios as
$\hat{S}^A = \frac{\bar{r}_A}{\hat{\sigma}_A}, \qquad \hat{S}^B = \frac{\bar{r}_B}{\hat{\sigma}_B}$
where $\bar{r}$ and $\hat{\sigma}$ are the sample mean and standard deviation of each series.
- Using the delta method and HAC (Newey-West) consistent variance estimation, derive a large-sample Wald test for $H_0: S^A = S^B$.
- Give an explicit formula for $\text{Var}(\hat{S}^A - \hat{S}^B)$ in terms of the means, variances, and cross-covariances of the two return series up to lag $L$.
- Explain how using overlapping returns (e.g., multi-day holding periods computed from daily data) affects the HAC kernel choice and bandwidth selection.
Hints
- The Sharpe ratio is a nonlinear function of two moments (mean and variance). What standard technique linearizes a nonlinear function of estimated parameters for asymptotic inference?
- Stack the four quantities $(\bar{r}_A, \hat{\sigma}_A^2, \bar{r}_B, \hat{\sigma}_B^2)$ into a vector, compute the gradient of $g(\theta) = \mu_A/\sigma_A - \mu_B/\sigma_B$, and apply the delta method. The long-run covariance $\Omega$ of this vector captures serial and cross-correlations.
- Write out the influence function $\psi_t^k = (r_t^k - \mu_k)/\sigma_k - (S^k/2)((r_t^k - \mu_k)^2 - \sigma_k^2)/\sigma_k^2$ for each series $k$. The HAC variance of $\hat{S}^A - \hat{S}^B$ is the Newey-West estimator applied to $\hat{\psi}_t^A - \hat{\psi}_t^B$.
Worked Solution
How to Think About It: Comparing Sharpe ratios sounds simple -- just take the difference and check if it is zero. The catch is that each Sharpe ratio is a nonlinear function of two estimated quantities (mean and standard deviation), and the returns themselves may be autocorrelated. So even under the null, the sampling distribution of $\hat{S}^A - \hat{S}^B$ has a variance that depends on serial correlations and cross-correlations of both return series. The delta method converts the problem from "nonlinear function of correlated estimators" into a linear approximation whose variance you can estimate with HAC methods.
Key Insight: The Sharpe ratio $S = \mu / \sigma$ is a function of two moments: $\mu = E[r_t]$ and $\sigma^2 = \text{Var}(r_t)$. Stack all the moments you need, apply the delta method to the function $g(\mu_A, \sigma_A^2, \mu_B, \sigma_B^2) = \mu_A/\sigma_A - \mu_B/\sigma_B$, and then estimate the covariance matrix of the moment vector using a HAC estimator.
Formal Derivation:
*Step 1: Define the moment vector.* Let $\theta = (\mu_A, \sigma_A^2, \mu_B, \sigma_B^2)'$ be the population moments. The sample analogue is $\hat{\theta} = (\bar{r}_A, \hat{\sigma}_A^2, \bar{r}_B, \hat{\sigma}_B^2)'$. By a CLT for stationary, weakly dependent processes (which requires finite fourth moments of returns, since $\hat{\theta}$ contains sample variances):
$\sqrt{T}(\hat{\theta} - \theta) \xrightarrow{d} N(0, \Omega)$
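where $\Omega = \sum_{l=-\infty}^{\infty} \text{Cov}(m_t, m_{t-l})$ is the long-run covariance matrix of the moment series $m_t = (r_t^A, (r_t^A - \mu_A)^2, r_t^B, (r_t^B - \mu_B)^2)'$, incorporating all auto- and cross-covariances.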
*Step 2: Apply the delta method.* Define $g(\theta) = \mu_A / \sigma_A - \mu_B / \sigma_B$ where $\sigma_k = \sqrt{\sigma_k^2}$. Compute the gradient:
$\nabla g = \left(\frac{1}{\sigma_A},\; -\frac{\mu_A}{2\sigma_A^3},\; -\frac{1}{\sigma_B},\; \frac{\mu_B}{2\sigma_B^3}\right)'$
Then under $H_0$:
$\sqrt{T}(\hat{S}^A - \hat{S}^B) \xrightarrow{d} N(0, \nabla g' \, \Omega \, \nabla g)$
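The gradient in Step 2 is easy to get wrong by a sign or a power of $\sigma$, so a quick numerical check can help. The sketch below compares the analytic gradient with central finite differences; the function names and the moment values in `theta0` are made up for illustration and are not part of the problem.

```python
import numpy as np

def sharpe_diff(theta):
    """g(theta) = mu_A/sigma_A - mu_B/sigma_B for theta = (mu_A, var_A, mu_B, var_B)."""
    mu_a, var_a, mu_b, var_b = theta
    return mu_a / np.sqrt(var_a) - mu_b / np.sqrt(var_b)

def sharpe_diff_gradient(theta):
    """Analytic gradient of g with respect to (mu_A, var_A, mu_B, var_B)."""
    mu_a, var_a, mu_b, var_b = theta
    sd_a, sd_b = np.sqrt(var_a), np.sqrt(var_b)
    return np.array([
        1.0 / sd_a,
        -mu_a / (2.0 * sd_a**3),
        -1.0 / sd_b,
        mu_b / (2.0 * sd_b**3),
    ])

def numerical_gradient(f, theta, rel_step=1e-6):
    """Central finite differences, one coordinate at a time."""
    theta = np.asarray(theta, dtype=float)
    grad = np.empty_like(theta)
    for i in range(theta.size):
        h = rel_step * abs(theta[i])
        up, down = theta.copy(), theta.copy()
        up[i] += h
        down[i] -= h
        grad[i] = (f(up) - f(down)) / (2.0 * h)
    return grad

# Made-up daily moments, used only to check the algebra.
theta0 = np.array([5e-4, 1e-4, 3e-4, 2.25e-4])
analytic = sharpe_diff_gradient(theta0)
numeric = numerical_gradient(sharpe_diff, theta0)
print(np.max(np.abs(analytic / numeric - 1.0)))  # largest relative discrepancy
```

Note that the gradient is taken with respect to the variances $\sigma_k^2$, not the standard deviations, which is where the factor $1/(2\sigma_k^3)$ comes from.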
*Step 3: The Wald test statistic.* Let $V = \nabla g' \, \hat{\Omega} \, \nabla g$, with $\nabla g$ evaluated at $\hat{\theta}$ and $\hat{\Omega}$ the HAC estimate of $\Omega$. The Wald statistic is:
$W = \frac{T(\hat{S}^A - \hat{S}^B)^2}{V} \xrightarrow{d} \chi^2_1 \quad \text{under } H_0$
Equivalently, the signed statistic $z = \sqrt{T}(\hat{S}^A - \hat{S}^B)/\sqrt{V}$ is asymptotically standard normal with $z^2 = W$, so you can use a two-sided $z$-test.
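Steps 1 to 3 can be sketched end to end as follows. This is a minimal illustration, not a reference implementation: `bartlett_lrv` and `wald_sharpe_test` are hypothetical names, the Bartlett weights use the common $1 - l/(L+1)$ convention, variances use the $1/T$ normalization, and the return series are simulated purely for demonstration.

```python
import math
import numpy as np

def bartlett_lrv(x, L):
    """Newey-West long-run covariance of the columns of x (T x k),
    with Bartlett weights 1 - l/(L+1) on lags l = 1..L."""
    x = np.asarray(x, dtype=float)
    T = x.shape[0]
    xc = x - x.mean(axis=0)
    omega = xc.T @ xc / T
    for l in range(1, L + 1):
        gamma = xc[l:].T @ xc[:-l] / T          # lag-l autocovariance matrix
        omega += (1.0 - l / (L + 1.0)) * (gamma + gamma.T)
    return omega

def wald_sharpe_test(r_a, r_b, L):
    """Return (W, z, p) for H0: S^A = S^B using the delta method and HAC."""
    r_a = np.asarray(r_a, dtype=float)
    r_b = np.asarray(r_b, dtype=float)
    assert r_a.shape == r_b.shape, "series must be aligned and of equal length"
    T = r_a.size
    mu_a, mu_b = r_a.mean(), r_b.mean()
    var_a, var_b = r_a.var(), r_b.var()         # 1/T normalization (assumption)
    sd_a, sd_b = math.sqrt(var_a), math.sqrt(var_b)
    # Moment series whose sample means are (mean_A, var_A, mean_B, var_B).
    m = np.column_stack([r_a, (r_a - mu_a) ** 2, r_b, (r_b - mu_b) ** 2])
    omega = bartlett_lrv(m, L)
    grad = np.array([1 / sd_a, -mu_a / (2 * sd_a**3),
                     -1 / sd_b, mu_b / (2 * sd_b**3)])
    V = float(grad @ omega @ grad)
    z = math.sqrt(T) * (mu_a / sd_a - mu_b / sd_b) / math.sqrt(V)
    W = z * z
    p = math.erfc(abs(z) / math.sqrt(2.0))      # equals P(chi2_1 > W)
    return W, z, p

# Simulated, correlated daily returns (illustrative numbers only).
rng = np.random.default_rng(0)
common = rng.standard_normal(1500)
r_a = 0.0006 + 0.010 * (0.6 * common + 0.8 * rng.standard_normal(1500))
r_b = 0.0004 + 0.012 * (0.6 * common + 0.8 * rng.standard_normal(1500))
W, z, p = wald_sharpe_test(r_a, r_b, L=int(1500 ** (1 / 3)))
print(f"W = {W:.3f}, z = {z:.3f}, p = {p:.3f}")
```

The p-value uses the identity $P(\chi^2_1 > W) = \text{erfc}(\sqrt{W/2})$, which avoids a dependency on a statistics library. The test breaks down when the two series are exact scalar multiples of each other, because $V$ is then zero.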
*Step 4: Explicit variance formula.* To write out $V = \nabla g' \, \Omega \, \nabla g$ explicitly, define the influence function for each observation. Let $\psi_t = (\psi_t^A, \psi_t^B)'$ where:
$\psi_t^A = \frac{r_t^A - \mu_A}{\sigma_A} - \frac{S^A}{2} \cdot \frac{(r_t^A - \mu_A)^2 - \sigma_A^2}{\sigma_A^2}$
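and similarly for $\psi_t^B$. Replacing the population moments with their sample estimates gives $\hat{\psi}_t^A$ and $\hat{\psi}_t^B$, and the estimated variance is:
$\widehat{\text{Var}}(\hat{S}^A - \hat{S}^B) = \frac{1}{T} \sum_{l=-L}^{L} k\left(\frac{l}{L+1}\right) \hat{\Gamma}(l)$
where $\hat{\Gamma}(l) = \frac{1}{T}\sum_{t=|l|+1}^{T} (\hat{\psi}_t^A - \hat{\psi}_t^B)(\hat{\psi}_{t-|l|}^A - \hat{\psi}_{t-|l|}^B)$ is the sample autocovariance of the difference $\hat{\psi}_t^A - \hat{\psi}_t^B$ at lag $l$, and $k(\cdot)$ is the Newey-West (Bartlett) kernel $k(x) = 1 - |x|$ for $|x| \leq 1$ and $0$ otherwise. Scaling the kernel argument by $L + 1$ rather than $L$ gives lag $L$ the positive weight $1/(L+1)$, so all lags up to $L$ contribute. The sum over $l$ equals $V$ from Step 3, because $\hat{\psi}_t^A - \hat{\psi}_t^B$ is exactly $\nabla g$ applied to the demeaned moment series.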
Expanding $\hat{\Gamma}(l)$ in terms of the original return series, the terms involve:
- $\text{Cov}(r_t^A, r_{t-l}^A)$ and $\text{Cov}(r_t^B, r_{t-l}^B)$ -- the auto-covariances of each series
- $\text{Cov}(r_t^A, r_{t-l}^B)$ and $\text{Cov}(r_t^B, r_{t-l}^A)$ -- the cross-covariances between series
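- Higher-order terms: covariances among the squared deviations $(r_t^k - \mu_k)^2$, and between returns and squared deviations, which capture how the variance estimates co-move with each other and with the mean estimates across time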
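As a sketch of that expansion, introduce shorthand that is not part of the original statement: $u_t^k = r_t^k - \mu_k$, $v_t^k = (u_t^k)^2 - \sigma_k^2$, $a_k = 1/\sigma_k$, $b_k = S^k/(2\sigma_k^2)$, and $\gamma_{jk}^{xy}(l) = \text{Cov}(x_t^j, y_{t-l}^k)$ for $x, y \in \{u, v\}$. Since $\psi_t^A - \psi_t^B = a_A u_t^A - b_A v_t^A - a_B u_t^B + b_B v_t^B$, the population autocovariance $\Gamma(l)$ of the difference splits into three groups that match the list above:
$\Gamma(l) = \Gamma_{uu}(l) + \Gamma_{vv}(l) + \Gamma_{uv}(l)$
$\Gamma_{uu}(l) = a_A^2 \gamma_{AA}^{uu}(l) + a_B^2 \gamma_{BB}^{uu}(l) - a_A a_B \left[\gamma_{AB}^{uu}(l) + \gamma_{BA}^{uu}(l)\right]$
$\Gamma_{vv}(l) = b_A^2 \gamma_{AA}^{vv}(l) + b_B^2 \gamma_{BB}^{vv}(l) - b_A b_B \left[\gamma_{AB}^{vv}(l) + \gamma_{BA}^{vv}(l)\right]$
$\Gamma_{uv}(l) = -a_A b_A \left[\gamma_{AA}^{uv}(l) + \gamma_{AA}^{vu}(l)\right] - a_B b_B \left[\gamma_{BB}^{uv}(l) + \gamma_{BB}^{vu}(l)\right] + a_A b_B \left[\gamma_{AB}^{uv}(l) + \gamma_{BA}^{vu}(l)\right] + a_B b_A \left[\gamma_{AB}^{vu}(l) + \gamma_{BA}^{uv}(l)\right]$
Here $\gamma_{jk}^{uu}(l) = \text{Cov}(r_t^j, r_{t-l}^k)$ are the auto- and cross-covariances of the returns themselves, and the sample version replaces every population quantity with its estimate.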
The lag truncation $L$ should grow with $T$, but more slowly than $T$, for the estimator to be consistent; $L \sim T^{1/3}$ is a typical rate for the Bartlett kernel.
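The influence-function formula above can be sketched directly on a scalar series, which avoids building the full $4 \times 4$ matrix. The helper names below are hypothetical, the Bartlett weights follow the common $1 - l/(L+1)$ convention, the default $L = \lfloor T^{1/3} \rfloor$ is only a rule of thumb, and the data are simulated for illustration.

```python
import numpy as np

def influence(r):
    """Estimated influence function of the Sharpe ratio for one series,
    using sample moments with 1/T normalization."""
    r = np.asarray(r, dtype=float)
    mu, var = r.mean(), r.var()
    sd = np.sqrt(var)
    u = r - mu
    return u / sd - 0.5 * (mu / sd) * (u**2 - var) / var

def newey_west_scalar(d, L):
    """Bartlett-weighted long-run variance of a scalar series d."""
    d = np.asarray(d, dtype=float)
    T = d.size
    d = d - d.mean()                 # already zero-mean for influence values
    lrv = d @ d / T
    for l in range(1, L + 1):
        lrv += 2.0 * (1.0 - l / (L + 1.0)) * (d[l:] @ d[:-l]) / T
    return lrv

def sharpe_diff_variance(r_a, r_b, L=None):
    """Estimated Var(S_hat^A - S_hat^B) = (1/T) * HAC variance of psi^A - psi^B."""
    T = len(r_a)
    if L is None:
        L = int(np.floor(T ** (1.0 / 3.0)))   # rule-of-thumb bandwidth (assumption)
    d = influence(r_a) - influence(r_b)
    return newey_west_scalar(d, L) / T

# Simulated AR(1)-type daily returns (illustrative parameters only).
rng = np.random.default_rng(0)
T = 1000
eps_a, eps_b = rng.standard_normal(T), rng.standard_normal(T)
r_a, r_b = np.empty(T), np.empty(T)
r_a[0], r_b[0] = eps_a[0], eps_b[0]
for t in range(1, T):
    r_a[t] = 0.2 * r_a[t - 1] + eps_a[t]
    r_b[t] = 0.1 * r_b[t - 1] + 0.5 * eps_a[t] + eps_b[t]
r_a = 0.0005 + 0.01 * r_a
r_b = 0.0003 + 0.01 * r_b

se = np.sqrt(sharpe_diff_variance(r_a, r_b))
print("standard error of the Sharpe difference:", se)
```

Working with the scalar series is algebraically the same as computing $\nabla g' \hat{\Omega} \nabla g$ with the same kernel and bandwidth, and with the Bartlett kernel the resulting variance estimate cannot be negative.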
*Step 5: Overlapping returns and kernel choice.* When returns are computed over overlapping windows of length $q$ (e.g., weekly returns from daily data using a rolling 5-day sum), the return series acquires an MA($q-1$) structure even if the underlying daily returns are i.i.d. This has three consequences:
- Bandwidth must increase. The induced autocorrelation extends to at least lag $q - 1$, so the bandwidth $L$ must satisfy $L \geq q - 1$. Data-driven bandwidth selectors (Andrews 1991, Newey-West 1994) will automatically pick a larger $L$, but if using a fixed rule, you must account for the overlap.
- Kernel choice matters more. The Bartlett kernel guarantees a positive semi-definite $\hat{\Omega}$ but converges more slowly than smoother kernels. With heavily overlapping returns (large $q$), the Quadratic Spectral or Parzen kernels, which are also positive semi-definite, offer better bias-variance tradeoffs because they downweight distant lags more smoothly. In practice, the Bartlett kernel with $L = q - 1$ is a minimum rather than a target: its weights shrink the overlap-induced autocovariances at lags $1, \dots, q-1$ and so bias the variance estimate downward. A small multiple of $q - 1$ is a safer default for moderate overlap.
- Variance scaling. Under overlap, the effective sample size is closer to $T/q$ than $T$, so confidence intervals are wider by roughly a factor of $\sqrt{q}$ than ones that treat the overlapping observations as independent. Many practitioners forget this and over-reject.
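A small simulation can make the overlap effect concrete. The sketch below builds rolling $q$-day sums from i.i.d. daily returns (treating returns as additive, i.e. log returns, which is an assumption), compares the sample autocorrelations with the MA($q-1$) pattern $1 - j/q$, and shows how the Bartlett long-run variance estimate responds to the bandwidth. All names and parameters are illustrative.

```python
import numpy as np

def overlapping_sums(daily, q):
    """Rolling q-day sums of daily returns (additive / log-return convention)."""
    daily = np.asarray(daily, dtype=float)
    c = np.concatenate(([0.0], np.cumsum(daily)))
    return c[q:] - c[:-q]

def autocorr(x, lag):
    """Sample autocorrelation at a positive lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(x[lag:] @ x[:-lag] / (x @ x))

def bartlett_lrv(x, L):
    """Scalar Newey-West long-run variance with weights 1 - l/(L+1)."""
    x = np.asarray(x, dtype=float)
    T = x.size
    x = x - x.mean()
    lrv = x @ x / T
    for l in range(1, L + 1):
        lrv += 2.0 * (1.0 - l / (L + 1.0)) * (x[l:] @ x[:-l]) / T
    return float(lrv)

rng = np.random.default_rng(0)
q, sigma = 5, 0.01
daily = sigma * rng.standard_normal(20000)   # i.i.d. daily returns (simulated)
r_q = overlapping_sums(daily, q)

# Autocorrelation induced purely by the overlap: about 1 - j/q for j < q.
for lag in range(1, q + 2):
    print(lag, round(autocorr(r_q, lag), 3), "theory:", max(0.0, 1 - lag / q))

# With i.i.d. daily returns the long-run variance of the q-day sums is q^2 sigma^2.
target = q**2 * sigma**2
for L in (q - 1, 2 * (q - 1), 4 * (q - 1)):
    print("L =", L, "estimate / target =", bartlett_lrv(r_q, L) / target)
```

In this setup the ratio at $L = q - 1$ should sit noticeably below one, because the Bartlett weights taper exactly the lags where the overlap-induced autocovariance lives, and it should move closer to one as $L$ grows.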
Answer: The Wald test statistic is $W = T(\hat{S}^A - \hat{S}^B)^2 / V$ where $V = \nabla g' \hat{\Omega} \nabla g$, with $\nabla g = (1/\sigma_A, -S^A/(2\sigma_A^2), -1/\sigma_B, S^B/(2\sigma_B^2))'$ evaluated at the sample moments and $\hat{\Omega}$ the Newey-West HAC estimator of the long-run covariance of $(\bar{r}_A, \hat{\sigma}_A^2, \bar{r}_B, \hat{\sigma}_B^2)$. Under $H_0$, $W \xrightarrow{d} \chi^2_1$. The variance of the Sharpe difference is computed via the influence function approach, summing kernel-weighted autocovariances of $\hat{\psi}_t^A - \hat{\psi}_t^B$ up to lag $L$. Overlapping returns with window $q$ raise the required bandwidth to at least $q - 1$ and favor smoother kernels.
Intuition
At its core, this problem is about the fact that comparing two estimated ratios is harder than comparing two estimated means. Each Sharpe ratio is a ratio of two estimated quantities, so even the sampling distribution of a single Sharpe ratio is complicated -- it depends on the skewness and kurtosis of returns, not just the mean and variance. When you add positive serial correlation on top, the effective sample size shrinks and the long-run variance picks up autocovariance and cross-covariance terms at every relevant lag. The delta method is the workhorse that tames this complexity: linearize the nonlinear function, then let HAC estimation handle the temporal dependence.
This comes up constantly in practice when evaluating whether one strategy genuinely outperforms another. The naive approach -- compute both Sharpe ratios and eyeball the difference -- ignores estimation error and serial dependence, leading to wildly overconfident conclusions. The Ledoit-Wolf (2008) and Opdyke (2007) papers formalize exactly this test. The overlapping-returns twist is especially relevant because portfolio managers often report monthly or quarterly Sharpe ratios computed from daily data, and failing to adjust the HAC bandwidth for the induced autocorrelation is one of the most common errors in performance evaluation.