HV-Block Cross-Validation for Dependent Data

Machine Learning · Hard · Free problem

You are building a model to forecast $r_{t+1}$ (next-period returns) using features that may include overlapping or lagged variables. Standard K-fold cross-validation is inappropriate because the data is serially correlated.

  1. Design. Describe an hv-block cross-validation scheme. In this scheme, the training set for each fold excludes $h$ observations before and $v$ observations after the validation block to prevent information leakage. Draw or describe the structure clearly.
  2. Parameter choices. How should $h$ and $v$ be chosen in terms of the dependence structure of the data? What happens if they are too small or too large?
  3. Bias-variance trade-offs. Compare the bias and variance of out-of-sample error estimates from hv-block CV to those from naive K-fold CV. Why does naive K-fold fail, and what does the blocking fix?

Hints

  1. Naive K-fold treats observations as exchangeable. What goes wrong when neighboring observations in time carry nearly the same information?
  2. The fix is to create a dead zone around each validation block. How wide should this zone be, and what determines the right width?
  3. The exclusion parameters $h$ (before) and $v$ (after) should be at least as large as the autocorrelation range $L$ of the data. Too small means residual leakage; too large means you waste training data and increase variance.

Worked Solution

How to Think About It: The fundamental problem with naive K-fold CV on time series is information leakage. If observation $t$ is in the validation fold and observation $t-1$ is in the training fold, the model effectively gets to peek at near-future information through autocorrelation. This makes the CV error estimate optimistically biased -- your model looks better in backtesting than it will perform live. The hv-block scheme fixes this by creating a buffer zone (embargo) around each validation block, ensuring that no training observation is close enough in time to leak information.

Key Insight: The buffer sizes $h$ and $v$ should match the range of temporal dependence in the data. If the autocorrelation in your features or returns dies out after $L$ lags, you need $h \ge L$ and $v \ge L$ to fully prevent leakage.

The Method:

Part 1 -- The HV-Block Scheme:

Suppose you have $T$ observations indexed $t = 1, \ldots, T$ and you split them into $K$ contiguous validation blocks $B_1, \ldots, B_K$.

For fold $k$ with validation block $B_k = \{t_k, t_k+1, \ldots, t_k + w - 1\}$ (where $w$ is the block width):

- Excluded zone before: observations $\{t_k - h, \ldots, t_k - 1\}$ are removed from training
- Validation block: observations $B_k$ are used for evaluation
- Excluded zone after: observations $\{t_k + w, \ldots, t_k + w + v - 1\}$ are removed from training
- Training set: all remaining observations outside the validation block and both exclusion zones

Schematically for one fold: $[\text{Train}] \;\underbrace{|\; \text{gap } h \;}_{\text{excluded}} |\; \underbrace{\text{Val block}}_{\text{evaluate}} \;| \underbrace{\;\text{gap } v \;|}_{\text{excluded}} [\text{Train}]$

The key difference from walk-forward validation: hv-block CV uses training data on both sides of the validation block (past and future), whereas walk-forward only trains on past data. HV-block is appropriate when your goal is to estimate generalization error; walk-forward is appropriate when you want to simulate realistic trading.
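The fold structure in Part 1 can be sketched as a small index generator. This is a minimal illustration, not a reference implementation: the function name `hv_block_splits` is hypothetical, indices are 0-based (the text above uses 1-based $t$), and the blocks are assumed to be equal-width contiguous slices.

```python
import numpy as np

def hv_block_splits(T, K, h, v):
    """Yield (train_idx, val_idx) pairs for K contiguous validation blocks.

    Hypothetical helper: drops h observations before and v observations
    after each validation block from that fold's training set.
    """
    t = np.arange(T)
    for block in np.array_split(t, K):
        start, stop = block[0], block[-1] + 1      # validation block is [start, stop)
        # keep only points outside the block and outside both exclusion zones
        keep = (t < start - h) | (t >= stop + v)
        yield t[keep], block

for train_idx, val_idx in hv_block_splits(T=20, K=4, h=2, v=3):
    print(f"val {val_idx[0]:2d}..{val_idx[-1]:2d}  train size {len(train_idx)}")
```

Blocks at the edges of the sample lose only one exclusion zone, so their training sets are slightly larger than those of interior blocks.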

Part 2 -- Choosing $h$ and $v$:

Set $h$ and $v$ based on the autocorrelation range of the data:

- Compute the autocorrelation function (ACF) of the features and the response $r_{t+1}$
- Let $L$ be the lag at which the ACF becomes negligible (e.g., drops below $2/\sqrt{T}$)
- Set $h \ge L$ and $v \ge L$
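One way to turn the ACF rule into a number is sketched below. The helper names `acf` and `autocorr_range` are hypothetical, the cutoff of $2/\sqrt{T}$ is the usual approximate 95% white-noise band, and "becomes negligible" is interpreted as the first lag whose sample autocorrelation falls inside that band, which is only one of several reasonable readings.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - j], x[j:]) / denom
                     for j in range(max_lag + 1)])

def autocorr_range(x, max_lag=50):
    """First lag at which |ACF| drops below 2/sqrt(T).

    Returns max_lag if the ACF never enters the band, which should be read
    as "the range is at least max_lag", not as a precise estimate.
    """
    rho = acf(x, max_lag)
    band = 2.0 / np.sqrt(len(x))
    below = np.flatnonzero(np.abs(rho[1:]) < band)
    return int(below[0]) + 1 if below.size else max_lag

# Example on simulated data: a 10-period overlapping sum of i.i.d. noise
rng = np.random.default_rng(0)
x = np.convolve(rng.standard_normal(5000), np.ones(10), mode="valid")
print("estimated L:", autocorr_range(x))
```

The sample ACF is noisy, so in practice it is safer to run this on each feature and on the response and take the largest value, possibly with a margin. `statsmodels.tsa.stattools.acf` provides a library ACF estimator if a dependency is acceptable.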

For overlapping features specifically:

- If you use $k$-day overlapping returns as a feature to predict 1-day returns (here $k$ is the overlap horizon, not the fold index), the overlap induces autocorrelation of order $k-1$ in the features. Set $h \ge k - 1$.
- The $v$ parameter handles forward leakage: if $r_{t+1}$ is correlated with features at time $t + v$, you need $v$ large enough to break that link.
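To see where the order $k-1$ comes from, consider the idealized case (an assumption made here for illustration) in which one-day returns $r_t$ are i.i.d. with variance $\sigma^2$ and the feature is the overlapping sum $X_t = \sum_{i=0}^{k-1} r_{t-i}$. Two such sums $j$ days apart share $k - j$ terms, so

$$
\operatorname{Cov}(X_t, X_{t+j}) = (k - |j|)\,\sigma^2 \quad \text{for } |j| < k,
\qquad
\rho_X(j) = \max\!\left(0,\; 1 - \frac{|j|}{k}\right).
$$

The autocorrelation is nonzero up to lag $k-1$ and exactly zero from lag $k$ onward, which is why a buffer of at least $k-1$ observations removes the overlap-induced dependence. If the underlying returns are themselves autocorrelated, the range is longer and the ACF-based choice of $L$ should take precedence.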

- If $h$ and $v$ are too small: leakage persists, and the CV error estimate is still optimistically biased (though less so than with naive K-fold).
- If $h$ and $v$ are too large: you discard too much training data in each fold, which increases the variance of the CV estimate and can introduce pessimistic bias (less training data means worse model fits).
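The cost of wide buffers is easy to quantify. The sketch below uses a hypothetical helper, `interior_train_size`, and assumes an interior fold with equal-width blocks, so that both exclusion zones fit fully inside the sample.

```python
def interior_train_size(T, K, h, v):
    """Training observations left for an interior fold of width T // K."""
    w = T // K
    return max(0, T - w - h - v)

T, K = 1000, 5   # illustrative values, not from the problem statement
for gap in (0, 10, 50, 200):
    n = interior_train_size(T, K, h=gap, v=gap)
    print(f"h = v = {gap:3d}: {n} training points ({n / T:.0%} of the sample)")
```

With these illustrative numbers, moving from no buffer to $h = v = 200$ halves the training set relative to plain K-fold, which is the regime where the pessimistic bias and extra variance described above start to matter.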

Part 3 -- Bias-Variance Comparison with Naive K-Fold:

| Aspect | Naive K-Fold | HV-Block CV |
|---|---|---|
| Bias | Optimistic (too low) -- training data is correlated with validation data, model appears to generalize better than it does | Approximately unbiased if $h, v \ge L$ -- buffer zones break the dependence between training and validation |
| Variance | Artificially low -- folds are correlated, so averaging across folds underestimates true variability | Higher variance -- fewer effective training points per fold due to exclusion zones, and fewer independent folds |
| Net effect | Dangerously overconfident -- both bias and variance are wrong in the optimistic direction | Honest but noisier -- the estimate is centered correctly but has more spread |

Why naive K-fold fails: when observation $t$ is in the validation set, nearby observations $t \pm 1, t \pm 2, \ldots$ in the training set carry nearly the same information. The model effectively memorizes the validation data through autocorrelation, producing a CV error that is much lower than true out-of-sample error. This is exactly the same phenomenon that causes backtested Sharpe ratios to be inflated.
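The leakage mechanism can be reproduced in a deliberately artificial toy. Everything here is an assumption made for illustration: the target is a forward $k$-period overlapping sum of i.i.d. noise (so nothing is genuinely predictable), the "model" simply copies the target of the training point nearest in time, and naive K-fold is taken to mean shuffled folds.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, K = 2000, 20, 5          # sample size, overlap horizon, folds (toy values)
h = v = k                      # buffers at least as wide as the overlap

# y[t] = r[t+1] + ... + r[t+k]: targets closer than k steps apart share terms
r = rng.standard_normal(T + k)
c = np.concatenate(([0.0], np.cumsum(r)))
t = np.arange(T)
y = c[t + k + 1] - c[t + 1]

def nn_in_time_mse(train_idx, val_idx):
    """MSE of predicting y[t] by the target of the nearest-in-time training point."""
    dist = np.abs(val_idx[:, None] - train_idx[None, :])
    nearest = train_idx[dist.argmin(axis=1)]
    return float(np.mean((y[val_idx] - y[nearest]) ** 2))

# Naive K-fold: shuffled folds, so validation points have adjacent training points
naive_folds = np.array_split(rng.permutation(T), K)
mse_naive = np.mean([nn_in_time_mse(np.setdiff1d(t, f), f) for f in naive_folds])

# hv-block: contiguous validation blocks with h/v exclusion zones
hv_scores = []
for block in np.array_split(t, K):
    start, stop = block[0], block[-1] + 1
    train = t[(t < start - h) | (t >= stop + v)]
    hv_scores.append(nn_in_time_mse(train, block))
mse_hv = np.mean(hv_scores)

print(f"naive K-fold MSE: {mse_naive:.1f}")
print(f"hv-block MSE:     {mse_hv:.1f}")
print(f"no-skill level (2 * Var[y] = 2k): {2 * k}")
```

Because the data contain no real signal, an honest error estimate should sit near the no-skill level. The shuffled K-fold score should come out far below it, since adjacent targets differ in only two of their $k$ terms, while the hv-block score should be of the same order as the no-skill level, though noisy in this toy because each block is predicted from just two training points.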

Practical Considerations:

- In practice, $v$ is often more important than $h$ for financial forecasting, because the main leakage channel is the model using future returns (embedded in overlapping features) to predict the current return
- Marcos Lopez de Prado's "purged K-fold" is a closely related idea with the same motivation
- If $L$ is very large (e.g., monthly autocorrelation in macro factors), hv-block CV may discard so much data that walk-forward validation becomes preferable

Answer: HV-block CV creates buffer zones of $h$ lags before and $v$ lags after each validation block, sized to match the autocorrelation range. This eliminates the optimistic bias of naive K-fold (which leaks information through serial correlation) at the cost of higher variance in the error estimate. Choose $h, v \ge L$ where $L$ is the effective autocorrelation length of the data.

Intuition

HV-block CV is the time-series analyst's version of the same insight behind purging and embargoing in financial backtesting: if your training and test data are correlated, your performance estimates are lying to you. The buffer zones serve the same function as an embargo period in a backtest -- they create a clean separation between what the model has seen and what it is being evaluated on. The bias-variance trade-off in choosing $h$ and $v$ mirrors a fundamental tension in all of applied statistics: more aggressive debiasing (larger buffers) means less data and noisier estimates. In practice, getting the buffer sizes right matters enormously -- a buffer that is too small by even one lag can let enough leakage through to make a worthless strategy look profitable.
