Predicting Closing-Auction Volume from the Day's Order Flow
For each stock-day you are given a pandas DataFrame sampled through the continuous session:
| column | meaning |
|---|---|
| time | intraday timestamp (e.g. 1-minute bars) |
| traded_volume | shares traded in the bar |
| signed_order_flow | buy-initiated minus sell-initiated volume |
plus the exchange closing-imbalance feed — a short series of (time, imbalance_shares, imbalance_side) updates published in the last ~10 minutes before the close. Your target is the number of shares that cross in the closing auction (the official closing print size), which the exchange reports after the close.
Discuss your approach:
- The auction print is dominated by a handful of huge days. Pull up a histogram of closing-auction volume across days — what do you notice, and how does it shape your loss function and features?
- What features would you build from the intraday flow and from the imbalance feed?
- You must predict *before* the close, using only information available at decision time. What leaks if you are not careful?
Hints
- Closing-auction volume is heavily right-skewed and spans orders of magnitude. Modelling log-volume, or volume as a fraction of the day's total, tames it.
- The biggest auctions are not random: index-rebalance days, quarterly expiries (triple witching), and month-end cluster on specific known dates. A calendar feature is doing real work here.
- The imbalance feed is your strongest same-day signal, but it arrives late and is revised. Use only the snapshot available at your prediction time, and beware that the final imbalance is itself a function of the auction you are predicting.
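The point-in-time discipline in the last hint can be sketched as a small helper. This is a minimal sketch, not a definitive implementation: the function name `imbalance_snapshot` is hypothetical, and the encoding of `imbalance_side` as `"B"`/`"S"` is an assumption, since the feed format is not specified.

```python
import pandas as pd

def imbalance_snapshot(feed: pd.DataFrame, decision_time: pd.Timestamp) -> dict:
    """Imbalance features built only from updates published at or before decision_time."""
    # Drop every update that arrives after the decision time: using it would be look-ahead.
    visible = feed[feed["time"] <= decision_time].sort_values("time")
    if visible.empty:
        return {"signed_imbalance": 0.0, "n_updates": 0, "imbalance_growth": 0.0}
    # ASSUMPTION: side is encoded "B" (buy) / "S" (sell); anything else is treated as 0.
    sign = visible["imbalance_side"].map({"B": 1.0, "S": -1.0}).fillna(0.0)
    signed = visible["imbalance_shares"] * sign
    # Growth across the visible updates: latest signed imbalance minus the first one.
    growth = float(signed.iloc[-1] - signed.iloc[0]) if len(signed) > 1 else 0.0
    return {
        "signed_imbalance": float(signed.iloc[-1]),
        "n_updates": int(len(visible)),
        "imbalance_growth": growth,
    }
```

Calling it with a decision time of, say, five minutes before the close returns the latest visible update and ignores any later revision, including the final one that is tied to the auction itself.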
Worked Solution
How to Think About It: Closing-auction volume is not a smooth function of intraday flow. It is a fat-tailed quantity whose extremes are *scheduled*. The right framing is a baseline level set by the stock's liquidity and recent regime, scaled or shifted by known event effects, with the late imbalance feed as the sharpest same-day refinement. Get the target transform and the leakage discipline right and the modelling is straightforward.
Key Insight: The predictable tail is a calendar. Index reconstitution days, quarterly triple-witching, and month-end rebalancing produce the giant auctions, clustered on dates you know in advance. A flow-only model cannot see them.
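Because these dates are known in advance, the calendar can be turned into features with a few lines of pandas. The sketch below is illustrative only: `calendar_flags` and its column names are hypothetical, the third-Friday rule is used as a proxy for expiry dates, the month-end flag counts weekdays and ignores exchange holidays, and index-rebalance dates are assumed to be supplied by the caller from an external schedule.

```python
import pandas as pd

def calendar_flags(dates, rebalance_dates=()) -> pd.DataFrame:
    """Event flags for each trading date (weekday-based; exchange holidays are ignored)."""
    idx = pd.DatetimeIndex(pd.to_datetime(list(dates)))
    # Third Friday of the month: a Friday whose day-of-month falls in 15..21.
    third_friday = (idx.dayofweek == 4) & (idx.day >= 15) & (idx.day <= 21)
    # Quarterly expiry months (March, June, September, December).
    quarter_month = idx.month.isin([3, 6, 9, 12])
    # Last weekday of the month: BMonthEnd(0) leaves a date unchanged if it already is one.
    last_bday = pd.DatetimeIndex([d + pd.offsets.BMonthEnd(0) for d in idx])
    # ASSUMPTION: rebalance effective dates come from an external, caller-supplied list.
    rebalance = idx.isin(pd.DatetimeIndex(pd.to_datetime(list(rebalance_dates))))
    return pd.DataFrame(
        {
            "is_monthly_expiry": third_friday,
            "is_triple_witching": third_friday & quarter_month,
            "is_month_end": idx == last_bday,
            "is_rebalance": rebalance,
        },
        index=idx,
    )
```

A production version would swap the weekday logic for the exchange's actual holiday calendar, since a holiday on the last weekday of a month shifts the true month-end session.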
The Method:
1. Transform the target. Model $\log(\text{auction shares})$ or auction shares as a fraction of the day's total volume; both compress the heavy tail and stabilise the loss.
2. Flow features. Day's cumulative volume vs. its trailing median (ADV ratio), volume-profile shape (how back-loaded the day is), late-session flow acceleration, realised volatility, and the sign/persistence of signed_order_flow.
3. Event features. A calendar of index-rebalance effective dates, expiry Fridays, and month/quarter-end; plus an "is this stock in an index being reconstituted" flag. These carry most of the tail.
4. Imbalance-feed features. From the *snapshot available at prediction time*: current imbalance shares (signed), its growth rate across updates, and imbalance as a fraction of ADV.
5. Model & validate. Gradient-boosted trees on the log target; validate with a strictly time-ordered split (train on past, test on future) and never shuffle.
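A few of the flow features in step 2 might be computed as below. This is a sketch under stated assumptions: `flow_features`, the 30-minute `late_window`, and the output names are hypothetical choices, and `trailing_median_volume` is assumed to be precomputed by the caller from prior days measured up to the same time of day.

```python
import pandas as pd

def flow_features(bars: pd.DataFrame, decision_time: pd.Timestamp,
                  trailing_median_volume: float, late_window: str = "30min") -> dict:
    """Intraday flow features using only bars stamped at or before decision_time."""
    seen = bars[bars["time"] <= decision_time].sort_values("time")
    total = float(seen["traded_volume"].sum())
    if total <= 0:
        return {"adv_ratio": 0.0, "late_volume_share": 0.0,
                "net_flow_ratio": 0.0, "late_net_flow_ratio": 0.0}
    # Bars in the final window before the decision time (ASSUMPTION: 30 minutes by default).
    late = seen[seen["time"] > decision_time - pd.Timedelta(late_window)]
    late_total = float(late["traded_volume"].sum())
    return {
        # Today's cumulative volume relative to its trailing median.
        "adv_ratio": total / trailing_median_volume,
        # How back-loaded the day is: share of volume traded in the late window.
        "late_volume_share": late_total / total,
        # Net signed flow as a fraction of volume, full session and late window.
        "net_flow_ratio": float(seen["signed_order_flow"].sum()) / total,
        "late_net_flow_ratio": (
            float(late["signed_order_flow"].sum()) / late_total if late_total > 0 else 0.0
        ),
    }
```

Normalising by volume keeps the features comparable across stocks of very different liquidity, which matters when one model is fit across the whole universe.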
Practical Considerations: The cardinal sin is look-ahead leakage: the *final* imbalance is mechanically tied to the auction you are predicting, so use only the imbalance as of your decision time. Second, an MSE on raw shares is dominated by a few rebalance days, so either model logs or weight by liquidity. Third, if a single model underfits the event days, handle them separately. Finally, state explicitly what horizon you predict at (e.g. T-5 minutes).
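The second point can be made concrete with a toy calculation. The numbers below are synthetic and chosen purely for illustration (98 ordinary days and 2 giant ones, with every prediction 20% too low); `top_k_error_share` is a hypothetical helper that reports how much of the total squared error comes from the `k` largest days.

```python
import numpy as np

def top_k_error_share(actual, predicted, k=2, use_log=False) -> float:
    """Fraction of total squared error contributed by the k largest-volume days."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = np.log(actual) - np.log(predicted) if use_log else actual - predicted
    sq = err ** 2
    top = np.argsort(actual)[-k:]  # indices of the k biggest auctions
    return float(sq[top].sum() / sq.sum())

# SYNTHETIC data: 98 days at 100k shares, plus two event days at 5M and 8M shares.
actual = np.array([1e5] * 98 + [5e6, 8e6])
predicted = 0.8 * actual  # same 20% relative miss on every day

raw_share = top_k_error_share(actual, predicted, k=2)                # about 0.99
log_share = top_k_error_share(actual, predicted, k=2, use_log=True)  # about 0.02
```

With an identical relative miss on every day, the two event days account for nearly all of the raw squared error but only their proportional 2% share in log space, which is why the raw-share loss ends up fitting the event days at the expense of everything else.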
Answer: Predict log auction volume from (a) liquidity/flow features, (b) a rebalance/expiry/month-end calendar that captures the fat tail, and (c) the point-in-time imbalance snapshot, validated on a forward time split with strict no-look-ahead on the imbalance feed.
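The forward time split named in the answer could look like the following. It is a minimal sketch assuming a panel with one row per stock-day and a hypothetical `date` column; `forward_split` and `test_frac` are illustrative names, not part of the problem statement.

```python
import pandas as pd

def forward_split(df: pd.DataFrame, date_col: str = "date", test_frac: float = 0.2):
    """Split stock-day rows by date: earliest dates train, latest dates test, no shuffling."""
    d = pd.to_datetime(df[date_col])
    uniq = d.drop_duplicates().sort_values()
    n_test = max(1, int(round(len(uniq) * test_frac)))
    # Every row on or after the cutoff date goes to the test set, so all stocks
    # observed on the same day land on the same side of the split.
    cutoff = uniq.iloc[-n_test]
    return df[d < cutoff], df[d >= cutoff]
```

Splitting on dates rather than rows avoids a subtler leak: if the same day appeared in both sets, market-wide event effects on that day would be visible to the model during training.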
Intuition
Auction size is a fat-tailed, calendar-driven quantity. Most of the predictable variance lives in two places: the slowly-varying 'normal' auction size (a function of the stock's ADV and recent regime) and a small set of scheduled events — index rebalances, expiries, month-end — that blow it up. A model trained on flow alone will chase noise and miss the events; a model with a rebalance/expiry calendar plus the late imbalance snapshot captures both. The interview signal is recognising the event-clustering and the look-ahead trap in the imbalance feed.