Realized Volatility and Microstructure Noise

Time Series · Hard · Free problem

You observe transaction prices $P_{t_i}$ at irregular times throughout a trading day and want to estimate the daily integrated variance $IV = \int_0^T \sigma^2_t \, dt$.

1. Define realized variance based on regularly sampled log prices. What happens to this estimator as you increase the sampling frequency in the presence of bid-ask bounce?

2. Describe a noise-robust estimator -- pick one of: realized kernel, pre-averaging, or subsampling. Explain at a high level how it works and identify the key tuning parameter you must choose.

3. How would you select that tuning parameter in practice using data-driven diagnostics?

Hints

Think about what happens to each squared return when observed prices contain additive noise -- how does the noise contribution scale with the number of observations?
The realized kernel estimator generalizes $RV$ by including autocovariances at multiple lags, weighted by a kernel function. The bandwidth $H$ controls how many lags to include.
To choose the bandwidth in practice, estimate $\omega^2$ from the first-order return autocovariance and use the plug-in rule $H^{*} \propto n^{3/5}$ (the rate for a smooth kernel such as Parzen), with the constant of proportionality set by the estimated noise-to-signal ratio.

Worked Solution

How to Think About It: This is one of the most important practical problems in quantitative finance. Every vol trader, risk manager, and systematic strategy that uses intraday data has to deal with this. The core tension is simple: you want to sample as frequently as possible to get a precise volatility estimate, but high-frequency prices are contaminated by market microstructure effects (bid-ask bounce, discrete tick sizes, latency). If you naively sample too fast, your volatility estimate explodes. If you sample too slowly, you throw away information. The whole field of realized volatility estimation is about navigating this trade-off.

Key Insight: Microstructure noise induces a bias in realized variance that grows linearly with the number of observations. A noise-robust estimator must either filter the noise (pre-averaging, kernel smoothing) or cancel it (subsampling), and each method requires a tuning parameter that balances bias against variance.

The Method:

*Part (i): Realized Variance and the Noise Problem*

Sample log prices at $n$ equally spaced times $t_0, t_1, \ldots, t_n$ and compute log returns $r_i = \log P_{t_i} - \log P_{t_{i-1}}$. The realized variance is:

$RV_n = \sum_{i=1}^{n} r_i^2$

In a frictionless world (no noise), as $n \to \infty$, $RV_n$ converges in probability to the integrated variance $IV$ -- this is the quadratic variation result.
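Because the transaction times are irregular, the prices first have to be placed on the regular grid. A minimal sketch, assuming previous-tick sampling (take the last trade at or before each grid time); the function names `previous_tick_sample` and `realized_variance` and the toy data are illustrative, not part of the problem statement:

```python
import numpy as np

def previous_tick_sample(times, prices, grid):
    """Return the last observed price at or before each grid time."""
    times = np.asarray(times, dtype=float)
    prices = np.asarray(prices, dtype=float)
    # side="right" minus 1 gives the index of the last tick with time <= grid time
    idx = np.searchsorted(times, grid, side="right") - 1
    if np.any(idx < 0):
        raise ValueError("grid starts before the first observation")
    return prices[idx]

def realized_variance(times, prices, n, t_start, t_end):
    """RV_n: sum of n squared log returns on an equally spaced grid."""
    grid = np.linspace(t_start, t_end, n + 1)
    log_p = np.log(previous_tick_sample(times, prices, grid))
    returns = np.diff(log_p)
    return float(np.sum(returns ** 2))

# Toy data (assumed): five irregular ticks, times in seconds since the open
times = [0.0, 1.5, 2.0, 7.0, 10.0]
prices = [100.0, 100.5, 100.2, 101.0, 100.8]
rv = realized_variance(times, prices, n=2, t_start=0.0, t_end=10.0)
print(rv)
```

With $n = 2$ the grid is $\{0, 5, 10\}$, so the sampled prices are 100.0, 100.2 and 100.8; the ticks at 1.5 and 7.0 are skipped. That is the information lost by sampling sparsely.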

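Now add bid-ask bounce. Model the observed log price as $\log P^{*}_{t_i} = \log P_{t_i} + \epsilon_i$, where $P_{t_i}$ is now the latent efficient price and $\epsilon_i$ is i.i.d. noise with mean zero and variance $\omega^2$, independent of the efficient price. Each observed return picks up a noise component $\epsilon_i - \epsilon_{i-1}$, which contributes $E[(\epsilon_i - \epsilon_{i-1})^2] = 2\omega^2$ to the expected squared return. Summing over $n$ returns gives $E[RV_n] \approx IV + 2n\omega^2$: the bias grows linearly in $n$, so $RV_n$ diverges as the sampling frequency increases instead of converging to $IV$.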
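A quick Monte Carlo sketch can make the noise contribution concrete. Under i.i.d. additive noise on log prices, each squared return picks up $2\omega^2$ in expectation, so the average $RV_n$ should track $IV + 2n\omega^2$. The parameter values below (constant volatility, Gaussian noise, the specific `sigma` and `omega`) are assumptions chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 0.01     # assumed constant daily volatility, so IV = sigma**2
omega = 0.0005   # assumed noise standard deviation in log-price units
iv = sigma ** 2
n_paths = 500    # number of simulated days per sampling frequency

def mean_noisy_rv(n):
    """Average RV_n over simulated days with i.i.d. noise on log prices."""
    # Efficient log returns: n increments whose variances sum to sigma**2
    eff_returns = rng.normal(0.0, sigma / np.sqrt(n), size=(n_paths, n))
    # One noise draw per observation time t_0, ..., t_n
    eps = rng.normal(0.0, omega, size=(n_paths, n + 1))
    obs_returns = eff_returns + np.diff(eps, axis=1)
    return float(np.mean(np.sum(obs_returns ** 2, axis=1)))

results = {}
for n in (78, 390, 2340):
    # (simulated mean RV_n, theoretical IV + 2 n omega^2)
    results[n] = (mean_noisy_rv(n), iv + 2 * n * omega ** 2)
    print(n, results[n])
```

The simulated mean and the theoretical value should agree closely at every $n$, and both pull further away from $IV$ as $n$ grows.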

Intuition

The fundamental tension in realized volatility estimation is an instance of a broader principle: more data is not always better when the data is noisy. In a frictionless market, quadratic variation gives you integrated variance for free as you sample faster. But real markets have friction -- every trade crosses a spread, prices tick in discrete increments, and quotes go stale. These effects add a noise component to each observed return, and since you square returns to get variance, the noise contribution accumulates linearly with the number of observations. This is why the "volatility signature plot" is one of the first diagnostics any practitioner learns: plot realized variance against the sampling interval and watch it blow up on the left side, where the interval is shortest.

The noise-robust estimators (realized kernel, pre-averaging, subsampling) all share the same core idea: use the autocorrelation structure of observed returns to separate signal from noise. In a frictionless world, high-frequency returns are approximately uncorrelated. Microstructure noise introduces negative autocorrelation at short lags. By measuring and correcting for this autocorrelation, you can recover an estimate much closer to the true integrated variance. The practical skill is not choosing the "right" estimator -- their convergence rates are comparable, between $n^{1/6}$ and $n^{1/4}$ depending on the variant -- but tuning the bandwidth or block size sensibly. A data-driven approach (estimate the noise, plug into the optimal formula, sanity-check with the signature plot) is far more reliable than picking an arbitrary sampling frequency like "just use 5-minute bars."
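The "estimate the noise, plug into the formula" recipe can be sketched for the realized kernel as follows. This is a minimal illustration under stated assumptions, not a production estimator: it uses Parzen weights, estimates $\omega^2$ from the first-order autocovariance, approximates the noise-to-signal ratio by $\xi^2 = \omega^2 / IV$ with a sparse-sampled pilot $IV$, and takes the constant `c_star = 3.5134` as the value commonly quoted for the Parzen kernel in the realized-kernel literature (worth verifying before relying on it). End-point adjustments are omitted, and all function names are hypothetical.

```python
import numpy as np

def parzen(x):
    """Parzen weight function on [0, 1]."""
    x = abs(x)
    if x <= 0.5:
        return 1.0 - 6.0 * x ** 2 + 6.0 * x ** 3
    if x <= 1.0:
        return 2.0 * (1.0 - x) ** 3
    return 0.0

def autocov(r, h):
    """Unnormalized h-th realized autocovariance: sum_j r_j * r_{j-h}."""
    return float(np.dot(r[h:], r[: len(r) - h]))

def noise_variance(r):
    """Estimate omega^2 using E[r_i r_{i-1}] = -omega^2 under i.i.d. noise."""
    return max(-autocov(r, 1) / (len(r) - 1), 0.0)

def plug_in_bandwidth(r, sparse_step=300, c_star=3.5134):
    """H* = c_star * xi^(4/5) * n^(3/5), with xi^2 = omega^2 / pilot IV."""
    n = len(r)
    omega2 = noise_variance(r)
    # Pilot IV: RV of returns aggregated over blocks of sparse_step ticks
    sparse = r[: n - n % sparse_step].reshape(-1, sparse_step).sum(axis=1)
    iv_pilot = float(np.sum(sparse ** 2))
    xi2 = omega2 / iv_pilot
    return max(int(np.ceil(c_star * xi2 ** 0.4 * n ** 0.6)), 1)

def realized_kernel(r, H):
    """Kernel-weighted sum of realized autocovariances at lags -H, ..., H."""
    rk = autocov(r, 0)
    for h in range(1, H + 1):
        rk += 2.0 * parzen(h / (H + 1)) * autocov(r, h)
    return rk

# Simulated one-second returns for a 6.5-hour day (all values assumed)
rng = np.random.default_rng(1)
n, sigma, omega = 23400, 0.01, 0.0001
eff = rng.normal(0.0, sigma / np.sqrt(n), size=n)
eps = rng.normal(0.0, omega, size=n + 1)
r = eff + np.diff(eps)

H = plug_in_bandwidth(r)
rv_raw = autocov(r, 0)        # plain RV at the highest frequency
rk = realized_kernel(r, H)
print(H, rv_raw, rk, sigma ** 2)
```

In this setup the raw $RV$ should come out several times larger than the true $IV = \sigma^2$, while the kernel estimate should land close to it. Rerunning with $H$ halved and doubled is a simple sensitivity check: an estimate that moves a lot suggests the bandwidth is too small.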