Modeling Bike-Share Demand
You receive hourly bike-share demand counts $Y_t$ from a single station over one full year (roughly 8,760 observations).
- Feature engineering and baseline model. Propose a set of features (calendar variables, lagged weather, etc.), describe a train/validation/test split that respects temporal ordering, and specify a baseline model (e.g., a Poisson GLM with a log link). Justify your choices.
- Evaluation. Specify at least two evaluation metrics (e.g., out-of-sample log-likelihood, RMSE, MAPE). Explain how you would compute a test-set $t$-statistic comparing your model's improvement over a simple seasonal baseline (e.g., same-hour-last-week average).
- Residual diagnostics. Describe one rigorous residual diagnostic you would run on the fitted model and explain how its results would inform whether and how you retrain.
Hints
- The most important constraint is respecting temporal ordering -- how would you split data so that no future information leaks into training?
- Since the response is a count, what GLM family naturally handles non-negative integers? Think about what happens when variance exceeds the mean.
- For the $t$-test comparing models, standard errors are wrong if residuals are autocorrelated. Use Newey-West (HAC) standard errors with bandwidth matching the dependence range.
Worked Solution
How to Think About It: This is a bread-and-butter applied ML question that tests whether you can build a forecasting pipeline end-to-end without committing the cardinal sin of time-series work: leaking future information into training. The core tension is that bike demand has strong daily and weekly seasonality, weather dependence, and potentially non-stationary trends -- and your evaluation must respect the arrow of time. A Poisson GLM is a natural starting point because the response is a count, but you should be ready to discuss overdispersion and when to upgrade to a negative binomial or tree-based model.
Key Insight: The most common mistake is using standard K-fold cross-validation on time series data. Every split must place all training data before the validation/test data it is scored on, either as a single chronological split or as a walk-forward scheme.
The Method:
Part 1 -- Features and Baseline Model
Features to include:
- Calendar: hour of day (one-hot or cyclic encoding via $\sin/\cos$), day of week, month, holiday indicator, weekend flag
- Weather (lagged): temperature, precipitation, wind speed -- use values from the previous hour or the same hour on the prior day to avoid look-ahead bias
- Lagged demand: $Y_{t-1}$, $Y_{t-24}$ (same hour yesterday), $Y_{t-168}$ (same hour last week)
- Trend: a linear time index or month-level indicator to capture drift. With only one year of data, indicators for months that fall entirely in the validation or test period cannot be estimated from the training set, so a smooth trend or a weather-based seasonal proxy is safer for those periods.
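The feature list above can be sketched in a few lines of numpy. This is a minimal illustration, not a full pipeline: the function name `build_features`, the single weather input `temp`, the log-transform of lagged demand (chosen to match the log link), and the assumption that index 0 is midnight on a Monday are all hypothetical choices made for the example.

```python
import numpy as np

def build_features(y, temp):
    """Build a design matrix from hourly counts y and hourly temperature temp.

    Row t uses only information available before hour t.
    Assumption (hypothetical): index 0 is 00:00 on a Monday.
    """
    y = np.asarray(y, dtype=float)
    temp = np.asarray(temp, dtype=float)
    n = len(y)
    t = np.arange(n)
    hour = t % 24
    dow = (t // 24) % 7

    max_lag = 168                 # the weekly lag needs one week of history
    idx = t[max_lag:]             # keep only rows with a full set of lags

    cols = {
        "hour_sin": np.sin(2 * np.pi * hour[idx] / 24),   # cyclic hour encoding
        "hour_cos": np.cos(2 * np.pi * hour[idx] / 24),
        "weekend": (dow[idx] >= 5).astype(float),
        "temp_lag1": temp[idx - 1],                       # previous hour's weather
        "log_y_lag1": np.log1p(y[idx - 1]),
        "log_y_lag24": np.log1p(y[idx - 24]),             # same hour yesterday
        "log_y_lag168": np.log1p(y[idx - 168]),           # same hour last week
        "trend": idx / n,                                 # linear time index
    }
    X = np.column_stack([np.ones(len(idx))] + list(cols.values()))
    names = ["intercept"] + list(cols.keys())
    return X, y[idx], names
```

Dropping the first 168 rows costs one week of data but guarantees that every lag is observed rather than imputed.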
Train/validation/test split:
- Train: first 8 months (Jan--Aug)
- Validation: next 2 months (Sep--Oct) -- used for hyperparameter tuning
- Test: final 2 months (Nov--Dec) -- used only for final evaluation
- No shuffling across time boundaries
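A chronological split of this kind amounts to slicing an index array at two cut points. The sketch below assumes equal-length months for simplicity (a real pipeline would cut on timestamps), and the function name `chronological_split` is hypothetical.

```python
import numpy as np

def chronological_split(n, train_months=8, val_months=2, total_months=12):
    """Return (train, val, test) index arrays in time order, with no shuffling.

    Assumption: months are treated as equal-length blocks of the series.
    """
    train_end = n * train_months // total_months
    val_end = n * (train_months + val_months) // total_months
    idx = np.arange(n)
    return idx[:train_end], idx[train_end:val_end], idx[val_end:]

train, val, test = chronological_split(8760)
```

Because the three blocks are contiguous and ordered, every training index precedes every validation index, which in turn precedes every test index.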
Baseline model: Poisson GLM with log link: $\log E[Y_t] = \beta_0 + \sum_j \beta_j x_{tj}$ where $x_{tj}$ are the features above. This is a natural choice because $Y_t$ is a non-negative count. If the variance substantially exceeds the mean (overdispersion), upgrade to a negative binomial GLM or a quasi-Poisson.
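To make the baseline concrete, here is a minimal sketch of fitting the Poisson GLM by iteratively reweighted least squares, together with a Pearson dispersion statistic for spotting overdispersion. The function names are hypothetical and the code assumes column 0 of `X` is an intercept; in practice a library routine (for example, the GLM class in statsmodels with a Poisson family) would normally be used instead.

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=50, tol=1e-8):
    """Fit log E[y] = X @ beta by iteratively reweighted least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 1e-8)    # assumes column 0 is the intercept
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu          # working response
        WX = X * mu[:, None]             # Poisson working weights equal mu
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (mu * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def pearson_dispersion(X, y, beta):
    """Pearson chi-square divided by residual degrees of freedom."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = np.exp(X @ beta)
    return float(np.sum((y - mu) ** 2 / mu) / (len(y) - X.shape[1]))
```

A dispersion estimate near 1 is consistent with the Poisson assumption; values well above 1 suggest moving to a negative binomial or quasi-Poisson model.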
Part 2 -- Evaluation Metrics and Statistical Testing
Metrics:
- Out-of-sample log-likelihood (or the average Poisson deviance, which ranks models identically with the sign flipped): a proper scoring rule for count data that rewards well-calibrated predictive distributions
- RMSE: penalizes large errors, easy to interpret in units of bike counts
- MAPE: useful for relative error, but beware division-by-zero when $Y_t = 0$ (use SMAPE or exclude zeros)
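These metrics are short enough to write out directly. The sketch below uses hypothetical function names and assumes every predicted mean `mu` is strictly positive; SMAPE terms where both the observation and the prediction are zero are counted as zero error, which is one convention among several.

```python
import math
import numpy as np

def poisson_log_lik(y, mu):
    """Mean Poisson log-likelihood per observation (higher is better)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    log_fact = np.array([math.lgamma(v + 1.0) for v in y])   # log(y!)
    return float(np.mean(y * np.log(mu) - mu - log_fact))

def rmse(y, pred):
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((y - pred) ** 2)))

def smape(y, pred):
    """Symmetric MAPE; terms with y = pred = 0 contribute zero error."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    denom = (np.abs(y) + np.abs(pred)) / 2.0
    safe = np.where(denom > 0, denom, 1.0)
    return float(np.mean(np.where(denom > 0, np.abs(y - pred) / safe, 0.0)))
```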
To test improvement over the seasonal baseline, compute the per-observation loss difference $d_t = L_{\text{baseline}}(t) - L_{\text{model}}(t)$, where $L$ is, say, squared error, so that $\bar{d} > 0$ favors the model. The test statistic is $t = \frac{\bar{d}}{\hat{\sigma}_d / \sqrt{n_{\text{test}}}}$. Because the $d_t$ are serially correlated, $\hat{\sigma}_d^2$ must be a Newey-West (HAC) estimate of the long-run variance of $d_t$ (with bandwidth set to the autocorrelation range, e.g., 168 hours for weekly dependence) rather than the naive sample variance, which understates the standard error when autocorrelation is positive. This is the Diebold-Mariano approach to comparing forecasts, and it gives an asymptotically valid $t$-test of whether the model outperforms the baseline.
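The HAC-corrected statistic can be sketched as follows, assuming a Bartlett kernel (the standard Newey-West weighting) and autocovariances normalized by $n$. The function name `hac_t_stat` is hypothetical, and `d` is the vector of per-observation loss differences on the test set.

```python
import numpy as np

def hac_t_stat(d, bandwidth):
    """t-statistic for mean(d) = 0 with a Newey-West (Bartlett) variance."""
    d = np.asarray(d, dtype=float)
    n = len(d)
    e = d - d.mean()
    lrv = np.dot(e, e) / n                        # lag-0 autocovariance
    for k in range(1, min(bandwidth, n - 1) + 1):
        gamma_k = np.dot(e[k:], e[:-k]) / n       # lag-k autocovariance
        lrv += 2.0 * (1.0 - k / (bandwidth + 1)) * gamma_k
    return float(d.mean() / np.sqrt(lrv / n))     # lrv = long-run variance
```

With `bandwidth=0` this reduces to the naive $t$-statistic, which makes it easy to see how much the autocorrelation correction changes the conclusion.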
Part 3 -- Residual Diagnostic
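Compute randomized quantile residuals (Dunn-Smyth residuals). For a Poisson model with fitted mean $\hat{\mu}_t$ and CDF $F(\cdot\,; \hat{\mu}_t)$, draw $u_t \sim \text{Uniform}\big(F(Y_t - 1; \hat{\mu}_t),\, F(Y_t; \hat{\mu}_t)\big)$ and set $r_t = \Phi^{-1}(u_t)$. The randomization smooths out the discreteness of the counts. If the model is correctly specified, these residuals are approximately i.i.d. $N(0,1)$.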
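A minimal sketch of this construction, using only numpy and the standard library, is shown below. The function names are hypothetical, the Poisson CDF is computed by direct summation (adequate for illustration, though a library CDF such as the one in scipy.stats would be faster for large counts), and the uniform draw is clipped away from 0 and 1 so the inverse normal CDF stays finite.

```python
import math
from statistics import NormalDist
import numpy as np

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu) with mu > 0; returns 0 for k < 0."""
    if k < 0:
        return 0.0
    total = sum(math.exp(j * math.log(mu) - mu - math.lgamma(j + 1))
                for j in range(int(k) + 1))
    return min(1.0, total)

def quantile_residuals(y, mu, seed=0):
    """Randomized quantile (Dunn-Smyth) residuals for a Poisson fit."""
    rng = np.random.default_rng(seed)
    std_normal = NormalDist()
    out = np.empty(len(y))
    for i, (yi, mi) in enumerate(zip(y, mu)):
        lo = poisson_cdf(yi - 1, mi)          # F(y - 1)
        hi = poisson_cdf(yi, mi)              # F(y)
        u = rng.uniform(lo, hi)               # randomize within the jump
        u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep inv_cdf finite
        out[i] = std_normal.inv_cdf(u)
    return out
```

Fixing the seed keeps the residuals reproducible between runs, which matters when the same diagnostic is compared before and after retraining.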
Diagnostic checks:
- Plot the ACF of the residuals. Significant autocorrelation at lag 24 or 168 signals that the model is not capturing daily or weekly patterns -- add more lag features or seasonal dummies.
- Plot residuals vs. fitted values. A fan shape indicates overdispersion -- switch to negative binomial.
- Run a Ljung-Box test at lag 24. A significant result triggers retraining with additional autoregressive terms.
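The ACF and Ljung-Box checks can be sketched as below. The function names are hypothetical, and the hard-coded critical value is the 5% point of a chi-square distribution with 24 degrees of freedom, with no adjustment for the number of fitted parameters; a statistics library would normally supply the p-value directly.

```python
import numpy as np

CHI2_95_DF24 = 36.415   # 5% critical value, chi-square with 24 d.o.f.

def acf(x, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    e = x - x.mean()
    denom = np.dot(e, e)
    return np.array([np.dot(e[k:], e[:-k]) / denom
                     for k in range(1, max_lag + 1)])

def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(x)
    rho = acf(x, max_lag)
    lags = np.arange(1, max_lag + 1)
    return float(n * (n + 2) * np.sum(rho ** 2 / (n - lags)))

def needs_more_lags(residuals, max_lag=24, critical=CHI2_95_DF24):
    """True when the residuals show significant autocorrelation."""
    return ljung_box_q(residuals, max_lag) > critical
```

Individual autocorrelations can also be compared against the approximate white-noise band $\pm 1.96/\sqrt{n}$ to see which specific lags are responsible.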
This diagnostic directly informs retraining: if the ACF shows spikes at specific lags, add those lags as features; if variance is non-constant, change the distributional assumption.
Practical Considerations:
- Walk-forward validation (retrain monthly, predict next month) is more realistic than a single static split, but more expensive
- Feature engineering is where most of the value lives -- the choice between Poisson GLM and XGBoost matters less than getting the features and evaluation right
- Always check for data quality issues: missing hours, station outages, zero-inflated periods
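Walk-forward validation reduces to a generator of expanding-window index ranges. The sketch below uses a hypothetical function name and measures both the initial training window and the forecast horizon in observations (hours).

```python
def walk_forward_splits(n, initial_train, horizon):
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Each test block starts where the training window ends, so no future
    observation is ever available at fit time.
    """
    start = initial_train
    while start < n:
        end = min(start + horizon, n)
        yield range(0, start), range(start, end)
        start = end
```

For example, `walk_forward_splits(8760, 5840, 730)` would start from roughly eight months of training data and step forward in blocks of about one month, refitting the model at each step.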
Answer: Use a Poisson GLM with calendar, lagged weather, and lagged demand features, trained on a time-respecting split. Evaluate with out-of-sample log-likelihood and RMSE, using a Newey-West corrected $t$-test against a seasonal baseline. Run Dunn-Smyth residual diagnostics with ACF plots to identify missing temporal structure and inform retraining decisions.
Intuition
This problem tests the full pipeline of applied time-series modeling, and the interviewer is looking for practical judgment more than any single formula. The key lesson is that evaluation design matters more than model complexity: a fancy model evaluated with naive K-fold CV will give you a wildly overoptimistic picture, while a simple Poisson GLM evaluated with proper walk-forward validation gives you honest performance estimates. In real quant work, this same pattern shows up everywhere -- backtesting trading strategies, validating risk models, evaluating alpha signals. The methodology (time-respecting splits, HAC standard errors, proper residual diagnostics) is the same regardless of whether you are predicting bike counts or stock returns.
The residual diagnostic piece is where many candidates fall short. Knowing to use randomized quantile residuals rather than raw Pearson residuals, and knowing that autocorrelation in residuals directly tells you which lag features are missing, separates someone who has built real models from someone who has only read about them.