Mid-Price Direction Forecasting from Limit Order Book Data

Machine Learning · Hard · Free problem

You are given a 45-minute take-home data challenge. The dataset contains event-level limit order book (LOB) snapshots for a single symbol: each row has a timestamp, best bid price, best ask price, bid size and ask size at the top of book, and trade size (zero for quote updates, nonzero for trades).

Your task is to build a binary classifier that predicts whether the mid-price -- defined as $(\text{bid} + \text{ask}) / 2$ -- will be higher or lower 20 events from now. Walk through the following:

What features would you construct from the raw LOB data, and why?
How do you set up a proper train/validation/test split that avoids lookahead bias?
What model would you choose given the 45-minute constraint, and what are the main failure modes?
How would you adapt this offline model for real-time streaming deployment?

Hints

The biggest risk in this problem is not model choice -- it is letting future information leak into your training features or your validation fold. Think carefully about what information is available at event $t$ versus what only becomes known later.
Order book imbalance -- $(\text{bid\_size} - \text{ask\_size}) / (\text{bid\_size} + \text{ask\_size})$ -- is typically one of the most predictive causal features for short-horizon mid-price direction. Start there, then layer in signed trade flow.
For walk-forward CV, impose an embargo of at least 20 events between your training cutoff and the start of each test window -- because the label for the last event in training depends on the next 20 events, which overlap with your test set.

Worked Solution

How to Think About It: The naive approach -- treat each event as an i.i.d. sample and run K-fold CV -- is catastrophically wrong here. LOB data is serially correlated and non-stationary. If you train on events from 2:00pm and test on 1:00pm events in the same fold, you have looked at the future. Worse, the label for event $t$ is determined by the mid-price at event $t+20$, so the label windows of nearby events overlap almost entirely, and a training label near a fold boundary can reach into the test window. The entire problem is really three subproblems glued together: feature engineering (what to compute), validation design (how to measure performance honestly), and deployment (how to run this live without blowing up latency).

Key Insight: The label for each event is a function of future data. This means any feature constructed from a window ending at event $t$ is safe -- but any accidental forward reference (e.g., normalizing by the session mean, which is not known until end of day) will poison your model and make backtested accuracy meaningless.

The Method:

1. Define the target cleanly. For event $t$, compute $m_t = (\text{bid}_t + \text{ask}_t) / 2$ and $m_{t+20}$. Label is $y_t = \mathbf{1}[m_{t+20} > m_t]$. Note that this indicator, as written, silently lumps ties in with down-moves, so handle ties (no change) explicitly, either by dropping them or assigning a separate class. Ties are common whenever the top-of-book prices are unchanged over the horizon -- inspect the class balance before modeling.
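
The labeling step can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `make_labels` and the choice to mark dropped events with `None` are assumptions made here, and ties are dropped (one of the two options mentioned above).

```python
from typing import List, Optional, Sequence

HORIZON = 20  # events ahead, as in the problem statement


def make_labels(bid: Sequence[float], ask: Sequence[float],
                horizon: int = HORIZON) -> List[Optional[int]]:
    """Return 1 (mid up), 0 (mid down), or None (tie / no future) per event."""
    mid = [(b + a) / 2.0 for b, a in zip(bid, ask)]
    labels: List[Optional[int]] = []
    for t in range(len(mid)):
        if t + horizon >= len(mid):
            labels.append(None)   # last `horizon` events: the future is unknown
        elif mid[t + horizon] > mid[t]:
            labels.append(1)
        elif mid[t + horizon] < mid[t]:
            labels.append(0)
        else:
            labels.append(None)   # tie: dropped in this sketch
    return labels
```

Counting the `None`, `0`, and `1` entries afterwards gives the class balance check suggested above.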

2. Construct features using only causal information. Good feature families:
- Order book imbalance: $\text{OBI}_t = (\text{bid\_size}_t - \text{ask\_size}_t) / (\text{bid\_size}_t + \text{ask\_size}_t)$. Strong short-horizon predictor -- more resting size on the bid than on the ask signals buy-side pressure, so the next mid move is more likely to be up.
- Trade flow: signed trade size over the last $k$ events (positive for buyer-initiated, negative for seller-initiated). Requires inferring the aggressor side. The simplest quote rule: if trade price $\geq$ ask, buyer-initiated; if $\leq$ bid, seller-initiated. The Lee-Ready algorithm extends this by comparing the trade price to the prevailing midpoint and falling back to a tick test for trades at the mid. The dataset as described lists trade size but no trade price, so confirm one is available or fall back to a proxy.
- Spread: $(\text{ask}_t - \text{bid}_t)$ normalized by mid. Wide spreads indicate uncertainty and often precede larger moves.
- Mid-price momentum: rolling change in mid over the last 5, 10, 20 events. Captures short-term autocorrelation.
- Trade intensity: number of trades in last $k$ events. Elevated activity often precedes directional moves.
- Avoid features that require future data: session-level z-scores, end-of-day returns, etc.
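
A few of these features can be sketched in plain Python as below. The function name `causal_features`, the dictionary keys, and the default window `k=10` are illustrative assumptions. Signed trade flow is left out because it needs an aggressor-side inference that the columns listed in the problem do not directly support. The loop is O(n·k) for readability, not speed.

```python
from typing import Dict, List, Sequence


def causal_features(bid: Sequence[float], ask: Sequence[float],
                    bid_size: Sequence[float], ask_size: Sequence[float],
                    trade_size: Sequence[float], k: int = 10) -> List[Dict[str, float]]:
    """Per-event features that use only events up to and including t."""
    mid = [(b + a) / 2.0 for b, a in zip(bid, ask)]
    rows: List[Dict[str, float]] = []
    for t in range(len(mid)):
        depth = bid_size[t] + ask_size[t]
        lo = max(0, t - k + 1)                      # window is [lo, t], never past t
        rows.append({
            # order book imbalance in [-1, 1]; 0.0 if the book is empty
            "obi": (bid_size[t] - ask_size[t]) / depth if depth > 0 else 0.0,
            # spread relative to mid
            "rel_spread": (ask[t] - bid[t]) / mid[t],
            # change in mid over the last k events (shorter at the start)
            "momentum": mid[t] - mid[max(0, t - k)],
            # number of trade events (nonzero trade size) in the window
            "n_trades": sum(1 for s in trade_size[lo:t + 1] if s > 0),
        })
    return rows
```

Because every window ends at `t`, changing any later event leaves the row for `t` untouched, which is the property the leakage discussion relies on.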

3. Set up a walk-forward validation scheme. Never shuffle. The correct structure:
- Sort events by timestamp.
- Choose a burn-in window (e.g., first 20% of events) to train your initial model. Do not test on this.
- Walk forward in chunks: train on events $[0, T_1]$, test on $[T_1 + \text{gap}, T_1 + \text{gap} + \Delta]$, then expand and repeat.
- The gap matters. Events at the seam between train and test are contaminated -- the label for event $T_1$ involves events up to $T_1 + 20$, so at minimum embargo 20 events before the test window. A larger embargo (50-100 events) is safer if features use long lookback windows.
- Report accuracy, log-loss, and directional PnL (does correct classification actually make money?) across all out-of-sample folds.
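
The fold structure above can be sketched as a small generator. The name `walk_forward_splits` and its parameters are assumptions for illustration; it uses half-open index ranges over events already sorted by timestamp, with an expanding training window.

```python
from typing import Iterator, Tuple


def walk_forward_splits(n_events: int, burn_in: int, test_size: int,
                        embargo: int = 20) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges with an embargo gap between them."""
    train_end = burn_in                          # train on [0, train_end)
    while train_end + embargo + test_size <= n_events:
        test_start = train_end + embargo         # skip the contaminated seam
        yield range(0, train_end), range(test_start, test_start + test_size)
        train_end = test_start + test_size       # expand past the tested chunk


# Example: 100 events, 20-event burn-in, 20-event test chunks, 20-event embargo.
folds = list(walk_forward_splits(100, burn_in=20, test_size=20, embargo=20))
```

With these numbers the sketch yields two folds: train on events 0-19 and test on 40-59, then train on 0-59 and test on 80-99. The last training event (19) has a label that looks at event 39, which stays outside the test window.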

4. Choose a model appropriate for 45 minutes. Do not attempt a deep learning model -- you will not tune it in time. Reasonable choices:
- Gradient boosted trees (XGBoost/LightGBM): Fast to train, handles nonlinear interactions, robust to correlated features. Default choice.
- Logistic regression with regularization: Interpretable, fast, good baseline. Use $L_2$ regularization; $L_1$ if you want sparse features.
- Random forest: Slightly slower than GBM but less prone to overfitting on small datasets.
- Set aside 15 minutes to inspect feature importances and sanity-check predictions before presenting.

5. Flag common failure modes:
- Lookahead in normalization: If you normalize features by session statistics (mean, std) computed over the full day, you have leaked the future. Always use expanding-window or rolling statistics.
- Stale labels near session edges: The last 20 events have no valid label (the future does not exist). Drop them.
- Non-stationarity: The LOB regime at open is different from midday. Recency-weighted training or regime-aware models help.
- Transaction costs: A model with 52% directional accuracy sounds good -- but if the spread is 2 basis points and your signal is worth 0.5 bps, you lose money on average once you pay the spread. Always sanity-check against the spread.
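
To make the normalization point concrete, here is a minimal sketch of an expanding-window z-score that scores each value against statistics from strictly earlier events only. The function name `expanding_zscore`, the `min_obs` parameter, and the choice to emit 0.0 until enough history exists are assumptions of this sketch; the running mean and variance use Welford's update.

```python
import math
from typing import List, Sequence


def expanding_zscore(x: Sequence[float], min_obs: int = 2) -> List[float]:
    """z-score x[t] using the mean/std of x[0..t-1] only."""
    out: List[float] = []
    n, mean, m2 = 0, 0.0, 0.0       # count, running mean, sum of squared deviations
    for v in x:
        if n >= min_obs and m2 > 0:
            std = math.sqrt(m2 / (n - 1))
            out.append((v - mean) / std)
        else:
            out.append(0.0)         # not enough history yet
        # Update statistics AFTER scoring, so v never sees itself or the future.
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
    return out
```

A quick leakage check is to change a later value and confirm that earlier outputs do not move; a full-session z-score fails that check.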

6. Adapt for real-time deployment. The online model must produce a prediction at every new event within a hard latency budget (typically sub-millisecond for HFT, sub-second for signal-based systematic strategies):
- Precompute incremental feature updates -- do not recompute rolling sums from scratch on each tick. Maintain a circular buffer of the last $k$ events and update the sum as new events arrive.
- Serialize the trained model (pickle, ONNX, or native binary) and load it at startup. Do not retrain in the hot path.
- Gate on data quality: if the feed drops events or the spread is abnormal, suppress predictions rather than serving stale or garbage inputs.
- Log every prediction and the subsequent outcome for ongoing monitoring. Concept drift in LOB data is fast -- a model trained on Monday morning may be stale by Friday afternoon.
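
The circular-buffer idea can be sketched with `collections.deque`, which evicts the oldest element automatically once `maxlen` is reached. The class name `RollingSum` is an assumption for illustration; a production system would likely implement the same logic in a lower-latency language.

```python
from collections import deque


class RollingSum:
    """O(1)-per-event sum over the last k values."""

    def __init__(self, k: int) -> None:
        self.buf: deque = deque(maxlen=k)
        self.total = 0.0

    def update(self, x: float) -> float:
        if len(self.buf) == self.buf.maxlen:
            self.total -= self.buf[0]   # subtract the value about to be evicted
        self.buf.append(x)              # deque drops the oldest item when full
        self.total += x
        return self.total


# Example: rolling signed volume over the last 3 events.
rs = RollingSum(3)
sums = [rs.update(v) for v in [1, 2, 3, 4, 5]]
```

With floating-point inputs the running total can drift over very long sessions, so periodically recomputing it from the buffer is a reasonable safeguard.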

Practical Considerations: The 45-minute constraint forces prioritization. In a real interview, spending 30 minutes on perfect feature engineering and 5 minutes on broken validation is a red flag -- the interviewer is watching whether you know where the risks are. Validation correctness is non-negotiable; feature sophistication is nice-to-have. A logistic regression with a correct walk-forward scheme beats a neural net with K-fold CV every time.

Answer: Build a binary classifier (GBM or logistic regression preferred) on causal LOB features -- order book imbalance, signed trade flow, spread, and mid-price momentum. Use walk-forward cross-validation with an embargo of at least 20 events at each fold boundary. Avoid any feature normalization that uses future data. Deploy with incremental feature computation, a serialized model, and a data-quality gate. Monitor for concept drift.

Intuition

The core tension in any supervised learning problem built on financial time series is causality: your labels are defined by future prices, so the boundary between what you know and what you are predicting is sharp, and crossing it accidentally is trivially easy. The most common mistake practitioners make is not using a fancy wrong model -- it is using correct-looking code that silently references the future. Session-level normalization, improper fold boundaries, and sliding windows that straddle the train/test seam are the usual culprits. The model is almost irrelevant compared to this.

The second lesson is that short-horizon LOB forecasting is genuinely hard in a signal-to-noise sense. Mid-price over 20 events is partially predictable (order book imbalance has real predictive content), but accuracy in the region of 53-55% is a realistic expectation for most signals at this horizon -- and that narrow edge evaporates quickly if transaction costs are not accounted for. The practitioner mindset is to frame the question not as 'can I predict direction?' but as 'does my predicted edge exceed the cost of acting on it?' That reframing connects the ML model to actual profitability and is what separates a quant interview answer from a data science homework answer.
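
The edge-versus-cost question can be made concrete with a deliberately simplified model. It is an assumption of this sketch, not something stated in the problem: a correct call earns `move_bps`, a wrong call loses the same amount, and every trade pays `cost_bps`. Under that model the expected net edge is $(2p - 1) \cdot \text{move} - \text{cost}$, and the break-even accuracy is $0.5 + \text{cost} / (2 \cdot \text{move})$.

```python
def expected_net_edge_bps(p_correct: float, move_bps: float, cost_bps: float) -> float:
    """Expected PnL per trade in bps under the symmetric win/loss assumption."""
    return (2.0 * p_correct - 1.0) * move_bps - cost_bps


def breakeven_accuracy(move_bps: float, cost_bps: float) -> float:
    """Accuracy at which expected net edge is zero under the same assumption."""
    return 0.5 + cost_bps / (2.0 * move_bps)


# Hypothetical numbers: 55% accuracy, 5 bps typical move, 1 bp cost per trade.
edge = expected_net_edge_bps(0.55, move_bps=5.0, cost_bps=1.0)
needed = breakeven_accuracy(move_bps=5.0, cost_bps=1.0)
```

With these hypothetical inputs the gross edge is 0.5 bps against a 1 bp cost, so the net edge is about -0.5 bps and the break-even accuracy is 60%, well above the 55% assumed.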
