Predicting Marathon Finish Time from a Training Log

Question

Each runner has a 16-week training log as a pandas DataFrame, one row per run:

| column | meaning |
|---|---|
| date | date of the run |
| distance_km | logged distance |
| avg_pace | minutes per km |
| avg_hr | average heart rate |

The target is the runner's actual marathon finish time. Train on r…

Accepted Answer

How to Think About It: Finish time is well predicted by training *volume*, *consistency*, and *fitness*, so the modelling is gentle; the craft is summarising an irregular, self-reported time series into robust features and not being fooled by dirty inputs. Treat the GPS/self-report noise as a first-class concern.

Key Insight: The prettiest columns (distance_km, avg_pace) are the dirtiest: self-reported and GPS-derived. Robust features beat raw ones.

The Method:

1. Volume & consistency. Total and peak weekly mileage, number of runs, longest single run, fraction of planned weeks actually completed, and the mileage ramp (slope of weekly volume).
2. Fitness proxy. Regress pace on heart rate within-runner and use the fitted pace at a reference HR (say 150 bpm); this normalises effort and is robust to GPS pace noise.
3. Taper. Depth and timing of the volume drop in the final 2–3 weeks.
4. Target transform. Predict log finish time (or speed = distance/time), which is better-behaved than raw minutes.
5. Model & validate. A regularised linear model or gradient-boosted trees; validate with…
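As a rough illustration of steps 1 to 3, here is a minimal sketch of how one runner's log might be collapsed into a feature row. It is not the answer's own code. The helper name `summarise_log`, the `race_date` argument, the 3-week taper window, and the use of "weeks with at least one run" as a stand-in for "planned weeks completed" (the log contains no plan) are all assumptions made for the example. The fitness proxy uses an ordinary least-squares line via `np.polyfit`; a robust fit could be swapped in.

```python
import numpy as np
import pandas as pd

REF_HR = 150.0   # reference heart rate (bpm) suggested in the answer
N_WEEKS = 16     # length of the training log
TAPER_WEEKS = 3  # assumed taper window; the answer says "final 2-3 weeks"


def summarise_log(log: pd.DataFrame, race_date: pd.Timestamp) -> dict:
    """Collapse one runner's log into a flat feature dict (hypothetical helper)."""
    days_out = (race_date - log["date"]).dt.days
    # Keep runs from the 16 weeks before race day; week 0 is the final week.
    in_window = (days_out >= 1) & (days_out <= 7 * N_WEEKS)
    log = log.loc[in_window].assign(week=(days_out[in_window] - 1) // 7)

    # Step 1: weekly volume, with run-free weeks filled in as zero.
    weekly = (
        log.groupby("week")["distance_km"].sum()
        .reindex(range(N_WEEKS), fill_value=0.0)
    )
    taper, build = weekly.iloc[:TAPER_WEEKS], weekly.iloc[TAPER_WEEKS:]

    # Mileage ramp: least-squares slope of build-phase volume (km per week).
    # Negating the week index makes time run forward, so positive = growing.
    ramp = np.polyfit(
        -build.index.to_numpy(dtype=float), build.to_numpy(dtype=float), 1
    )[0]

    # Step 2: fitted pace at the reference HR from a within-runner line.
    hp = log[["avg_hr", "avg_pace"]].dropna()
    if hp["avg_hr"].nunique() >= 2:
        slope, intercept = np.polyfit(hp["avg_hr"], hp["avg_pace"], 1)
        pace_at_ref = slope * REF_HR + intercept
    else:
        pace_at_ref = np.nan  # too little HR variation to fit a line

    # Step 3: taper depth as the fractional drop from peak weekly volume.
    peak = weekly.max()
    taper_depth = 1.0 - taper.mean() / peak if peak > 0 else np.nan

    return {
        "total_km": weekly.sum(),
        "peak_week_km": peak,
        "n_runs": len(log),
        "longest_run_km": log["distance_km"].max(),
        "active_week_frac": (weekly > 0).mean(),  # proxy for consistency
        "ramp_km_per_week": ramp,
        "pace_at_ref_hr": pace_at_ref,
        "taper_depth": taper_depth,
    }
```

Calling `summarise_log` once per runner and stacking the dicts with `pd.DataFrame` gives a feature matrix. For step 4, the target could then be `np.log(finish_minutes)`, with predictions mapped back through `np.exp`.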

Hints

Worked Solution

Intuition