You observe transaction prices $P_{t_i}$ at irregular times throughout a trading day and want to estimate the daily integrated variance $IV = \int_0^T \sigma^2_t \, dt$.
Define realized variance based on regularly sampled log prices. What happens to this estimator as you increase the sampling frequency in the presence of bid-ask bounce?
Describe a noise-robust estimator -- pick one of: realized kernel, pre-averaging, or subsampling. Explain at a high level how it works and identify the key tuning parameter you must choose.
How would you select that tuning parameter in practice using data-driven diagnostics?
Hints
Think about what happens to each squared return when observed prices contain additive noise -- how does the noise contribution scale with the number of observations?
The realized kernel estimator generalizes $RV$ by including autocovariances at multiple lags, weighted by a kernel function. The bandwidth $H$ controls how many lags to include.
To choose the bandwidth in practice, estimate $\omega^2$ from the first-order return autocovariance and use the plug-in formula $H^{*} \propto n^{3/5}$, calibrated by the noise-to-signal ratio.
Worked Solution
How to Think About It: This is one of the most important practical problems in quantitative finance. Every vol trader, risk manager, and systematic strategy that uses intraday data has to deal with this. The core tension is simple: you want to sample as frequently as possible to get a precise volatility estimate, but high-frequency prices are contaminated by market microstructure effects (bid-ask bounce, discrete tick sizes, latency). If you naively sample too fast, your volatility estimate explodes. If you sample too slowly, you throw away information. The whole field of realized volatility estimation is about navigating this trade-off.
Key Insight: Microstructure noise induces a bias in realized variance that grows linearly with the number of observations. A noise-robust estimator must either filter the noise (pre-averaging, kernel smoothing) or cancel it (subsampling), and each method requires a tuning parameter that balances bias against variance.
The Method:
*Part (i): Realized Variance and the Noise Problem*
Sample log prices on an equally spaced grid $t_0, t_1, \ldots, t_n$ (for example, by taking the last transaction price at or before each grid time) and compute the $n$ log returns $r_i = \log P_{t_i} - \log P_{t_{i-1}}$. The realized variance is:
$RV_n = \sum_{i=1}^{n} r_i^2$
In a frictionless world (no noise), as $n \to \infty$, $RV_n$ converges in probability to the integrated variance $IV$ -- this is the quadratic variation result.
Now add bid-ask bounce. Model the observed log price as $\log P^{*}_{t_i} = \log P_{t_i} + \epsilon_i$, where $\epsilon_i$ is i.i.d. noise with mean zero and variance $\omega^2$, independent of the efficient price. Each observed return picks up a noise component $\epsilon_i - \epsilon_{i-1}$, and since $E[(\epsilon_i - \epsilon_{i-1})^2] = 2\omega^2$, the noise contributes $2\omega^2$ in expectation to each squared return. So:
$E[RV_n] \approx IV + 2n\omega^2$
As $n$ increases, the $2n\omega^2$ term dominates. The realized variance diverges linearly with sampling frequency. You also see the signature: returns at very high frequency exhibit negative first-order autocorrelation (because $\epsilon_i - \epsilon_{i-1}$ and $\epsilon_{i+1} - \epsilon_i$ share $\epsilon_i$ with opposite signs).
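The bias formula can be checked with a small simulation. The sketch below is illustrative only: it assumes constant volatility and Gaussian i.i.d. noise, and the parameter values (`iv`, `omega2`) and both function names are made up for this example rather than taken from the problem.

```python
import numpy as np


def realized_variance(log_prices):
    """Sum of squared log returns on the given grid."""
    returns = np.diff(np.asarray(log_prices, dtype=float))
    return float(np.sum(returns ** 2))


def simulate_noisy_log_prices(n, iv=1e-4, omega2=1e-8, seed=0):
    """Hypothetical setup: constant-volatility efficient log price on n
    intervals, plus i.i.d. Gaussian noise with variance omega2."""
    rng = np.random.default_rng(seed)
    efficient_returns = rng.normal(0.0, np.sqrt(iv / n), size=n)
    efficient = np.concatenate(([0.0], np.cumsum(efficient_returns)))
    noise = rng.normal(0.0, np.sqrt(omega2), size=n + 1)
    return efficient + noise


# Compare RV with the approximation IV + 2 * n * omega^2 at several frequencies.
for n in (78, 390, 4680, 23400):
    rv = realized_variance(simulate_noisy_log_prices(n))
    print(f"n={n:6d}  RV={rv:.3e}  IV+2n*omega^2={1e-4 + 2 * n * 1e-8:.3e}")
```

Under these assumed values, 78 intervals corresponds to 5-minute bars over a 6.5-hour session and 23,400 to 1-second bars. The noise term should be negligible at the first and dominant at the last.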
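*Part (ii): Noise-Robust Estimator -- Realized Kernel*
The realized kernel adds kernel-weighted return autocovariances to the realized variance:
$RK = \hat{\gamma}_0 + 2 \sum_{h=1}^{H} k\left(\frac{h}{H+1}\right) \hat{\gamma}_h$
where $\hat{\gamma}_h = \sum_{i} r_i \, r_{i+h}$ is the sample autocovariance at lag $h$, $k(\cdot)$ is a kernel weight function with $k(0) = 1$, and $H$ is the bandwidth.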
The intuition: RV only uses the lag-0 autocovariance ($\hat{\gamma}_0 = \sum r_i^2$). But the noise creates spurious negative autocorrelation at lag 1: under i.i.d. noise, $\hat{\gamma}_0$ is inflated by about $2n\omega^2$ in expectation while $\hat{\gamma}_1$ is pushed down by about $n\omega^2$. By including nearby autocovariances with appropriate weights, the kernel "cancels out" the noise bias. Common kernel choices include the Parzen and Tukey-Hanning kernels; the Parzen kernel is a popular default because it guarantees a non-negative estimate.
The key tuning parameter is the bandwidth $H$:
- Too small ($H \to 0$): you get back RV, which is biased upward.
- Too large ($H \to n$): you over-smooth and increase variance.
- The optimal bandwidth balances bias and variance. For the Parzen kernel, the MSE-optimal rate is $H \propto n^{3/5}$.
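A minimal sketch of a Parzen-weighted realized kernel may make the role of $H$ concrete. It assumes the form $\hat{\gamma}_0 + 2\sum_{h=1}^{H} k(h/(H+1))\,\hat{\gamma}_h$; the $h/(H+1)$ scaling, the function names, and the simulated data are choices made for this illustration, not a reference implementation.

```python
import numpy as np


def parzen_weight(x):
    """Parzen kernel weight for x >= 0; equals 1 at x = 0 and 0 for x >= 1."""
    if x <= 0.5:
        return 1.0 - 6.0 * x ** 2 + 6.0 * x ** 3
    if x <= 1.0:
        return 2.0 * (1.0 - x) ** 3
    return 0.0


def realized_kernel(returns, bandwidth):
    """gamma_0 + 2 * sum_{h=1}^{H} k(h / (H + 1)) * gamma_h, with
    gamma_h = sum_i r_i * r_{i+h}. The h / (H + 1) scaling is an assumption."""
    r = np.asarray(returns, dtype=float)
    H = min(int(bandwidth), len(r) - 1)
    estimate = float(np.sum(r ** 2))  # lag 0 is plain realized variance
    for h in range(1, H + 1):
        gamma_h = float(np.sum(r[h:] * r[:-h]))
        estimate += 2.0 * parzen_weight(h / (H + 1)) * gamma_h
    return estimate


# Illustrative data: constant volatility plus i.i.d. noise (made-up values).
rng = np.random.default_rng(1)
n, iv, omega2 = 23400, 1e-4, 1e-8
efficient = np.cumsum(rng.normal(0.0, np.sqrt(iv / n), size=n + 1))
observed = efficient + rng.normal(0.0, np.sqrt(omega2), size=n + 1)
r = np.diff(observed)

for H in (0, 5, 40, 400):
    print(f"H={H:4d}  RK={realized_kernel(r, H):.3e}  (true IV={iv:.1e})")
```

With `bandwidth=0` the function returns plain RV, so sweeping `H` upward shows the noise bias shrinking before the estimate becomes noisier at very large bandwidths.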
*Part (iii): Data-Driven Bandwidth Selection*
The practical approach uses the volatility signature plot and noise estimation:
Volatility signature plot: Compute $RV_n$ at many sampling frequencies (e.g., 1 sec, 5 sec, 15 sec, ..., 30 min). Plot $RV_n$ vs. sampling interval. At very high frequency, $RV$ is inflated by noise. As sampling interval increases, $RV$ drops and eventually plateaus near the true $IV$. The shape of this curve tells you about the noise level.
Estimate the noise variance $\omega^2$: Use the relation $E[RV_n] \approx IV + 2n\omega^2$. Regress $RV_n$ on $n$ across different sampling frequencies (the slope estimates $2\omega^2$), or use $\hat{\omega}^2 = -\hat{\gamma}_1 / n$, the negative of the average first-order autocovariance of high-frequency returns.
Plug-in bandwidth: Given $\hat{\omega}^2$ and a pilot estimate of $IV$ (from a moderate sampling frequency), compute the MSE-optimal bandwidth:
$H^{*} = c \cdot \xi^{4/5} \cdot n^{3/5}$
where $\xi^2 = \omega^2 / \sqrt{IQ}$ is the noise-to-signal ratio and $c$ depends on the kernel choice. $IQ$ (integrated quarticity) can be estimated from the data using a robust quarticity estimator.
Stability check: Split the day into sub-periods, estimate $H$ on each, and compare. If the estimated bandwidth varies wildly across sub-periods, the noise model may be misspecified (e.g., noise may be correlated or time-varying).
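The diagnostics above can be strung together in a few lines. This is a sketch under stated assumptions: prices are already on a regular 1-second grid, the pilot quarticity is a plain realized quarticity on 5-minute returns (one possible choice, not necessarily the robust estimator intended above), and `PARZEN_C` is the constant often quoted for the Parzen kernel, which you should verify for your own kernel. All function names are hypothetical.

```python
import numpy as np

PARZEN_C = 3.5134  # constant often quoted for the Parzen kernel; treat as an assumption


def signature_curve(log_prices, steps):
    """RV from every `step`-th point of a regular grid, for each step size."""
    p = np.asarray(log_prices, dtype=float)
    return {step: float(np.sum(np.diff(p[::step]) ** 2)) for step in steps}


def noise_variance(returns):
    """Negative average first-order autocovariance, floored at zero."""
    r = np.asarray(returns, dtype=float)
    return max(-float(np.mean(r[1:] * r[:-1])), 0.0)


def realized_quarticity(returns):
    """(m / 3) * sum r^4: a simple pilot for IQ, meant for sparse returns."""
    r = np.asarray(returns, dtype=float)
    return len(r) / 3.0 * float(np.sum(r ** 4))


def plug_in_bandwidth(omega2, iq, n, c=PARZEN_C):
    """H* = c * xi^(4/5) * n^(3/5), with xi^2 = omega2 / sqrt(iq)."""
    xi2 = omega2 / np.sqrt(iq)
    return int(np.ceil(c * xi2 ** 0.4 * n ** 0.6))


# Illustrative 1-second grid: constant volatility plus i.i.d. noise (made-up values).
rng = np.random.default_rng(2)
n, iv, omega2 = 23400, 1e-4, 1e-8
log_prices = (np.cumsum(rng.normal(0.0, np.sqrt(iv / n), size=n + 1))
              + rng.normal(0.0, np.sqrt(omega2), size=n + 1))

curve = signature_curve(log_prices, steps=[1, 5, 15, 60, 300, 900])
omega2_hat = noise_variance(np.diff(log_prices))
iq_hat = realized_quarticity(np.diff(log_prices[::300]))  # 5-minute pilot
H_star = plug_in_bandwidth(omega2_hat, iq_hat, n)

for step, rv in curve.items():
    print(f"every {step:4d}s  RV={rv:.3e}")
print(f"omega2_hat={omega2_hat:.2e}  IQ_hat={iq_hat:.2e}  H*={H_star}")
```

Because $H^{*}$ depends on $IQ$ only through a one-fifth power, a fairly crude pilot quarticity tends to move the bandwidth by only a few lags; the noise estimate matters more.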
Practical Considerations:
The i.i.d. noise assumption is a simplification. Real microstructure noise is often correlated (e.g., due to stale quotes or order flow dynamics). The realized kernel handles mild dependence, but heavy dependence may require more lags.
Pre-averaging is an alternative that is often easier to implement: take weighted averages of returns over overlapping blocks of size $K$, compute a realized variance from the averaged returns, and subtract a bias correction for the noise that survives the averaging. The tuning parameter is $K$, with optimal rate $K \propto n^{1/2}$.
In practice, many desks use the "5-minute rule" as a simple default: sample at 5-minute intervals, which is coarse enough to avoid most noise but fine enough to get a reasonable estimate. The formal methods above let you do better, especially for liquid assets where sampling at 5-15 seconds with noise correction beats 5-minute RV.
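For comparison, the pre-averaging alternative mentioned above can be sketched as follows. The weight function $g(x) = \min(x, 1-x)$, the finite-sample rescaling, the constant 0.5 in the block size, and the simulated data are all assumptions of this illustration; treat it as one plausible variant rather than a definitive implementation.

```python
import numpy as np


def pre_averaged_variance(returns, block_size):
    """Pre-averaging estimate of IV with weight g(x) = min(x, 1 - x).
    The weight function and finite-sample scaling are assumptions of this sketch."""
    r = np.asarray(returns, dtype=float)
    n, K = len(r), int(block_size)
    if K < 2 or K > n:
        raise ValueError("block_size must satisfy 2 <= block_size <= len(returns)")

    grid = np.arange(K + 1) / K
    g = np.minimum(grid, 1.0 - grid)           # g(0), g(1/K), ..., g(1)
    psi1 = K * float(np.sum(np.diff(g) ** 2))  # discrete version of int g'(x)^2 dx
    psi2 = float(np.sum(g ** 2)) / K           # discrete version of int g(x)^2 dx

    # Weighted sums of K - 1 consecutive returns over every overlapping window.
    y_bar = np.convolve(r, g[1:K][::-1], mode="valid")

    # Averaged-return variance, rescaled for the n - K + 2 available windows.
    signal = (n / len(y_bar)) * float(np.sum(y_bar ** 2)) / (K * psi2)
    # Residual noise left in the averages, estimated from plain RV.
    bias = psi1 * float(np.sum(r ** 2)) / (2.0 * K ** 2 * psi2)
    return signal - bias


# Illustrative data: constant volatility plus i.i.d. noise (made-up values).
rng = np.random.default_rng(3)
n, iv, omega2 = 23400, 1e-4, 1e-8
observed = (np.cumsum(rng.normal(0.0, np.sqrt(iv / n), size=n + 1))
            + rng.normal(0.0, np.sqrt(omega2), size=n + 1))
r = np.diff(observed)
K = int(np.ceil(0.5 * np.sqrt(n)))  # K proportional to sqrt(n); 0.5 is arbitrary
print(f"K={K}  pre-averaged={pre_averaged_variance(r, K):.3e}  RV={np.sum(r ** 2):.3e}")
```

The bias term reuses plain RV as a proxy for $2n\omega^2$, which is reasonable only when sampling is fast enough for noise to dominate RV.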
Answer: Realized variance $RV_n = \sum r_i^2$ is a consistent estimator of integrated variance in the absence of noise, but diverges as $O(n)$ in the presence of bid-ask bounce. The realized kernel estimator corrects this by including weighted autocovariances with bandwidth $H$, optimally chosen at rate $n^{3/5}$ via plug-in methods that estimate the noise variance from the data (e.g., from the first-order autocovariance of high-frequency returns or from the volatility signature plot).
Intuition
The fundamental tension in realized volatility estimation is an instance of a broader principle: more data is not always better when the data is noisy. In a frictionless market, quadratic variation gives you integrated variance for free as you sample faster. But real markets have friction -- every trade crosses a spread, prices tick in discrete increments, and quotes go stale. These effects add a noise component to each observed return, and since you square returns to get variance, the noise contribution accumulates linearly with the number of observations. This is why the "volatility signature plot" is one of the first diagnostics any practitioner learns: plot realized variance against sampling frequency and watch it blow up on the left side.
The noise-robust estimators (realized kernel, pre-averaging, subsampling) all share the same core idea: use the autocorrelation structure of observed returns to separate signal from noise. In a frictionless world, high-frequency returns are approximately uncorrelated. Microstructure noise introduces negative autocorrelation at short lags. By measuring and correcting for this autocorrelation, you can recover an estimate much closer to the true integrated variance. The practical skill is not choosing the "right" estimator -- they all converge at similar rates -- but tuning the bandwidth or block size sensibly. A data-driven approach (estimate the noise, plug into the optimal formula, sanity-check with the signature plot) is far more reliable than picking an arbitrary sampling frequency like "just use 5-minute bars."