Feature Selection for Return Prediction

Machine Learning · Medium · Free problem

You have $p$ candidate features and a limited sample of $n$ daily observations for predicting next-day stock returns. Design a feature-selection and validation protocol that addresses all of the following:

Overfitting control -- how do you keep a large $p$ and a small $n$ from blowing up your model?

Multiple comparisons -- you are testing many features. How do you avoid fooling yourself with lucky ones?

Trading-relevant evaluation -- your out-of-sample metric should be something a PM actually cares about (e.g., Sharpe ratio or information ratio), not just $R^2$.

Your protocol must specify:

The train/test split scheme (e.g., walk-forward, expanding window)
The selection criterion (e.g., penalized regression, stability selection)
How you control selection bias (e.g., nested cross-validation)
How you report uncertainty around your performance estimates (e.g., block bootstrap confidence intervals)

Hints

Think about why standard K-fold cross-validation fails for time series, and what the right analog is when observations are not exchangeable.
Consider using stability selection on top of LASSO -- a feature that only shows up in a small fraction of subsamples is probably noise, not signal.
Your protocol needs two nested loops: an inner loop for model/feature selection and an outer loop for honest performance estimation. Make sure the outer test data never touches any selection decision.

Worked Solution

How to Think About It: This is a question about building a disciplined research pipeline -- the kind of thing that separates quant teams that produce real alpha from ones that just overfit noise. The core tension is simple: you have way more things to try than you have data to test them on, and the data has serial dependence that breaks standard statistical tools. Every shortcut -- peeking at test data, using standard K-fold, ignoring multiple testing -- leads to the same outcome: a strategy that looks great in backtest and bleeds money live. The interviewer wants to see that you understand why each piece of the pipeline exists, not just that you can name the techniques.

Key Insight: The entire protocol is built around one principle: every decision you make (which features, which hyperparameters, which model) must be evaluated on data that was never used to make that decision. This means nested loops: an inner loop for model selection and an outer loop for honest performance estimation.

The Method:

1. Walk-forward split scheme. Divide your timeline into sequential blocks. For each evaluation point $t$, train on all data before $t - g$ (where $g$ is a gap/embargo period to prevent leakage from autocorrelated returns), and test on the block starting at $t$. Expand the training window as you move forward. Never shuffle time. The gap $g$ should be at least as long as your features' lookback window -- if you use 20-day momentum as a feature, embargo at least 20 days.
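A minimal sketch of the expanding-window split with an embargo, assuming observations are indexed 0..n-1 in time order. The function name `walk_forward_splits` and its parameters are illustrative, not from any library.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_obs: int, first_test_start: int, test_size: int, gap: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs for an expanding-window scheme.

    For each evaluation point t, training uses [0, t - gap) and testing
    uses [t, t + test_size). The gap rows are dropped entirely.
    """
    if first_test_start - gap <= 0:
        raise ValueError("first_test_start must exceed gap")
    t = first_test_start
    while t + test_size <= n_obs:
        yield range(0, t - gap), range(t, t + test_size)
        t += test_size  # next test block starts where this one ended


# Hypothetical sizes: 1000 daily observations, 125-day test blocks, 20-day embargo.
splits = list(walk_forward_splits(n_obs=1000, first_test_start=500, test_size=125, gap=20))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

The training window grows by one test block per step, and the last `gap` observations before each test block are never used for fitting.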

2. Nested cross-validation for selection bias control. Within each training fold from Step 1, run an inner walk-forward loop to select features and tune hyperparameters. Concretely:
- Inner loop: further split the training data into sub-train and sub-validation (again walk-forward, respecting time order).
- Use the inner loop to choose your feature set and regularization strength.
- Refit on the full training fold with the chosen features, then evaluate on the held-out outer test fold.
- This ensures your reported performance is never contaminated by selection decisions.
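The nested structure can be sketched as follows. To keep it short, this sketch makes several assumptions that are not part of the protocol itself: the "selection decision" is just a ridge penalty chosen from a grid (a stand-in for feature selection), the trading rule is sign-of-forecast, and the data are synthetic. All function names are hypothetical.

```python
import numpy as np


def expanding_splits(n, first_test_start, test_size, gap):
    """Expanding-window (train, test) index arrays with an embargo gap."""
    t = first_test_start
    while t + test_size <= n:
        yield np.arange(0, t - gap), np.arange(t, t + test_size)
        t += test_size


def fit_ridge(X, y, lam):
    """Closed-form ridge regression (stand-in for the real model)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def strategy_pnl(X, y, beta):
    """Daily P&L of a toy rule: go long or short by the sign of the forecast."""
    return np.sign(X @ beta) * y


def sharpe(pnl):
    sd = pnl.std(ddof=1)
    return 0.0 if sd == 0 else float(pnl.mean() / sd * np.sqrt(252))


def nested_walk_forward(X, y, lams, outer, inner):
    oos_pnl, chosen = [], []
    for tr, te in expanding_splits(len(y), *outer):
        X_tr, y_tr = X[tr], y[tr]
        # Inner loop: sees only the outer training fold.
        scores = []
        for lam in lams:
            val = [strategy_pnl(X_tr[v], y_tr[v], fit_ridge(X_tr[s], y_tr[s], lam))
                   for s, v in expanding_splits(len(y_tr), *inner)]
            scores.append(sharpe(np.concatenate(val)))
        best = lams[int(np.argmax(scores))]
        # Refit on the full outer training fold, then score the untouched test fold.
        beta = fit_ridge(X_tr, y_tr, best)
        oos_pnl.append(strategy_pnl(X[te], y[te], beta))
        chosen.append(best)
    return np.concatenate(oos_pnl), chosen


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = 0.05 * X[:, 0] + rng.standard_normal(1000)  # synthetic, weak signal
lams = [0.1, 10.0, 1000.0]
oos_pnl, chosen = nested_walk_forward(
    X, y, lams, outer=(500, 125, 20), inner=(240, 60, 5)
)
print("penalty chosen per outer fold:", chosen)
print("OOS Sharpe:", round(sharpe(oos_pnl), 2))
```

The outer test rows appear only in the last two lines of the outer loop body, after every selection decision for that fold has been made.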

3. Feature selection criterion. Use one of:
- LASSO / Elastic Net: Penalized regression with $L_1$ regularization naturally zeros out weak features. Tune $\lambda$ on the inner loop using the trading metric (not MSE). Elastic Net is preferred when features are correlated (which they usually are in finance).
- Stability selection: Run LASSO on many subsamples and at several regularization levels. Keep only features selected in more than a threshold fraction (e.g., 60-80%) of runs. This is more conservative and controls false discovery -- a feature that only shows up in 20% of subsamples is probably noise.
- In practice, stability selection on top of LASSO is the gold standard for controlling false positives.
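A rough sketch of stability selection on top of LASSO. Assumptions to note: the LASSO solver is a small hand-written coordinate descent (so the example needs only NumPy), subsamples are random halves of the data as in the original stability-selection recipe (for strongly serially dependent data, contiguous blocks may be a better choice), and the synthetic signal is far stronger than anything seen in real returns. `lasso_cd` and `stability_selection` are hypothetical names.

```python
import numpy as np


def lasso_cd(X, y, alpha, n_iter=200, tol=1e-8):
    """Coordinate descent for (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # equals y - X @ beta throughout
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(p):
            if col_sq[j] == 0:
                continue
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]  # soft threshold
            max_delta = max(max_delta, abs(new - beta[j]))
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
        if max_delta < tol:
            break
    return beta


def stability_selection(X, y, alphas, n_subsamples=50, threshold=0.7, seed=0):
    """Return indices of features selected in >= threshold of subsample fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((len(alphas), p))
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        Xs = X[idx]
        Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)  # assumes no constant columns
        ys = y[idx] - y[idx].mean()
        for a, alpha in enumerate(alphas):
            freq[a] += lasso_cd(Xs, ys, alpha) != 0
    freq /= n_subsamples
    max_freq = freq.max(axis=0)  # best selection frequency over the alpha grid
    return np.flatnonzero(max_freq >= threshold), max_freq


rng = np.random.default_rng(0)
X = rng.standard_normal((400, 20))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + rng.standard_normal(400)  # 2 real, 18 noise
selected, max_freq = stability_selection(X, y, alphas=[0.3, 0.2])
print("selected features:", selected.tolist())
```

In the full protocol this would run inside the inner loop on training data only, with the threshold and the penalty grid treated as tuning choices.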

4. Multiple comparisons adjustment. Even after penalized regression, if you test many model specifications, you need to correct for the number of configurations tried. Apply a Bonferroni or Benjamini-Hochberg correction to any hypothesis tests on feature significance. For overall strategy evaluation, use White's (2000) Reality Check bootstrap or Hansen's (2005) SPA test, both of which directly test whether the best-performing specification is significantly better than a benchmark after accounting for the full search.
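To make the two corrections concrete, here is a small sketch of the Benjamini-Hochberg step-up procedure next to Bonferroni. The p-values are made up for illustration, and BH in this form assumes the tests are independent or positively dependent, which is an assumption worth checking for correlated features.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of hypotheses rejected at false discovery rate q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m  # k/m * q for the k-th smallest p
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.flatnonzero(passed).max()  # largest rank that clears its threshold
        reject[order[: k + 1]] = True     # reject everything up to that rank
    return reject


def bonferroni(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected with family-wise error rate alpha."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / len(p)


# Illustrative p-values for 10 candidate features (not real data).
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bh_mask = benjamini_hochberg(p_vals, q=0.05)
bf_mask = bonferroni(p_vals, alpha=0.05)
print("BH rejects:", int(bh_mask.sum()), " Bonferroni rejects:", int(bf_mask.sum()))
```

Five of these p-values are below 0.05 on their own, but only two survive BH and one survives Bonferroni, which is the point of correcting.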

5. Trading-relevant metric. Evaluate on out-of-sample Sharpe ratio or information ratio, not $R^2$ or RMSE. In alpha research, $R^2$ of 1-2% can be extremely valuable if it is stable. The OOS Sharpe ratio captures both signal strength and consistency. If you want to get fancy, report the information coefficient (IC) -- the rank correlation between predicted and realized returns -- averaged across time, which is more robust to outliers.
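A short sketch of the two metrics, under a few assumptions: daily data annualized with 252 periods, a zero risk-free rate, and a cross-sectional IC computed per day over a panel whose rows are days and columns are stocks. Ties in the ranks are broken by position rather than averaged, which is a simplification.

```python
import numpy as np


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Mean over standard deviation of daily returns, annualized."""
    r = np.asarray(daily_returns, dtype=float)
    return float(r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year))


def _ranks(x):
    """Ordinal ranks 0..n-1 (ties broken by position, a simplification)."""
    return np.argsort(np.argsort(x)).astype(float)


def rank_ic(predicted, realized):
    """Spearman-style rank correlation between forecasts and outcomes."""
    return float(np.corrcoef(_ranks(predicted), _ranks(realized))[0, 1])


def mean_daily_ic(pred_panel, realized_panel):
    """Average the cross-sectional rank IC over days (rows = days)."""
    return float(np.mean([rank_ic(p, r) for p, r in zip(pred_panel, realized_panel)]))


# Toy check: a forecast that gets the ordering exactly right has IC = 1.
print(rank_ic([0.1, 0.3, 0.2, 0.9], [1.0, 3.0, 2.0, 50.0]))
print(round(annualized_sharpe([0.01, -0.01, 0.02, 0.0]), 3))
```

Note that the outlier 50.0 in the toy example does not affect the rank IC, which is the robustness property mentioned above.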

6. Uncertainty quantification via block bootstrap. To report confidence intervals on your OOS Sharpe (or IC):
- Use a stationary block bootstrap (Politis and Romano, 1994) on the time series of daily OOS returns.
- Draw $B \geq 1000$ bootstrap samples, compute the Sharpe ratio on each, and report the 2.5th-97.5th percentile interval.
- Block length should match the autocorrelation structure of your returns (typically 5-20 days for daily data; use an automatic selection rule like Politis-White).
- This respects temporal dependence and gives you honest uncertainty bands.
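The bootstrap steps above can be sketched as follows. This is an illustrative implementation of the stationary bootstrap idea (geometric block lengths with wrap-around), with a fixed mean block length of 10 days chosen by hand rather than by an automatic rule, and synthetic returns standing in for real OOS P&L.

```python
import numpy as np


def stationary_bootstrap_indices(n, mean_block, rng):
    """Resampled time indices: blocks of geometric length, wrapping at the end."""
    p_restart = 1.0 / mean_block
    restart = rng.random(n) < p_restart   # start a new block at this step?
    fresh = rng.integers(n, size=n)       # candidate new block starts
    idx = np.empty(n, dtype=int)
    idx[0] = fresh[0]
    for t in range(1, n):
        idx[t] = fresh[t] if restart[t] else (idx[t - 1] + 1) % n
    return idx


def sharpe_ci(returns, mean_block=10, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for the annualized Sharpe ratio."""
    r = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        s = r[stationary_bootstrap_indices(len(r), mean_block, rng)]
        stats[b] = s.mean() / s.std(ddof=1) * np.sqrt(252)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic daily OOS returns (about two years), for illustration only.
rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0005, scale=0.01, size=500)
lo, hi = sharpe_ci(returns)
print(f"95% CI for annualized Sharpe: [{lo:.2f}, {hi:.2f}]")
```

With only two years of daily data the interval is typically wide, which is itself useful information for whoever reads the backtest.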

Practical Considerations:

Data snooping is cumulative. Even if each individual step is clean, the fact that you tried 5 different feature sets, 3 models, and 4 regularization schemes means you have effectively tested 60 configurations. Track everything you try.
Transaction costs matter. A high-turnover feature set might have a great gross Sharpe but terrible net Sharpe. Evaluate net of estimated costs.
Regime dependence. A feature that works in low-vol regimes might fail in crises. Walk-forward handles this somewhat, but inspect performance by regime.
Beware short test periods. If each OOS fold is only 20 days, your Sharpe estimates are extremely noisy. You need at least 6-12 months of OOS data per fold to say anything meaningful.
Feature preprocessing. Standardize features within each training window (z-score using training-window mean and std). Never use full-sample statistics -- that is information leakage.
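The preprocessing rule in the last item is easy to get wrong, so here is a minimal sketch of it. The helper names `fit_zscore` and `apply_zscore` are hypothetical, and the data are synthetic.

```python
import numpy as np


def fit_zscore(X_train):
    """Estimate per-feature mean and std from the training window only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0, ddof=1)
    sigma = np.where(sigma == 0, 1.0, sigma)  # leave constant columns unscaled
    return mu, sigma


def apply_zscore(X, mu, sigma):
    """Standardize any window with statistics fitted elsewhere."""
    return (X - mu) / sigma


rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(600, 4))
X_train, X_test = X[:480], X[500:]  # 20-row embargo between the windows

mu, sigma = fit_zscore(X_train)
Z_train = apply_zscore(X_train, mu, sigma)
Z_test = apply_zscore(X_test, mu, sigma)  # test rows never influence mu or sigma
print(Z_train.mean(axis=0).round(6), Z_test.mean(axis=0).round(2))
```

The test-window means will not be exactly zero, and that is expected: forcing them to zero would require statistics the model could not have known at the time.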

Answer: The recommended protocol is: (1) walk-forward expanding-window splits with an embargo gap, (2) nested walk-forward CV where the inner loop selects features (via stability selection or LASSO) and tunes hyperparameters, (3) outer loop evaluates on OOS Sharpe or information ratio with multiple-testing corrections (White's Reality Check or Hansen's SPA test), and (4) block bootstrap confidence intervals on the OOS performance metric to quantify uncertainty. The key discipline is that no decision -- feature choice, hyperparameter, model specification -- is ever evaluated on data that influenced that decision.

Intuition

The deeper lesson here is that in quantitative finance, the research process itself is a source of overfitting -- and no single technique fixes it. LASSO alone does not save you if you try 50 variants of LASSO and pick the best one. Walk-forward alone does not save you if you peek at the OOS results and adjust your approach. The only real defense is a disciplined pipeline where every decision is evaluated on data it never saw, and the total number of decisions is honestly accounted for. This is why serious quant firms track every experiment in a research log and apply multiple-testing corrections to the full history of things they tried, not just the final model.

In practice, most alpha signals in equities have out-of-sample $R^2$ in the low single digits. This means your feature selection protocol has to be ruthlessly conservative -- a feature that adds 0.5% of $R^2$ but is not stable across subsamples and time periods will cost you money after transaction costs and slippage. The firms that do this well tend to use stability selection or similar consensus methods, evaluate on IC or Sharpe rather than statistical significance, and treat the entire pipeline as a hypothesis to be tested rather than a model to be optimized.
