Predicting Marathon Finish Time from a Training Log

Regression · Medium · Free problem

Each runner has a 16-week training log as a pandas DataFrame, one row per run:

| column | meaning |
|---|---|
| date | date of the run |
| distance_km | logged distance |
| avg_pace | minutes per km |
| avg_hr | average heart rate |

The target is the runner's actual marathon finish time. Train on runners with known results, predict for new ones.

Talk through:

  1. What features summarise a training block? The log is an irregular time series of varying length.
  2. distance_km is self-reported and avg_pace comes from consumer GPS watches. Plot the distribution of logged distances and think about how clean these inputs really are.
  3. How would you build a validation scheme, and what target transform makes sense?

Hints

  1. Finish time is driven by fitness and by training volume/consistency. Weekly mileage, longest run, taper depth, and a fitness proxy (pace at a fixed heart rate) are strong features.
  2. Self-reported distances bunch at round numbers (5, 10, 21.1 km) and GPS pace drifts under tree cover and in tunnels. The cleanest-looking feature can be the most corrupted.
  3. Pace-at-given-HR is more trustworthy than raw pace because it normalises effort; prefer features that are robust to GPS noise.
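The pace-at-a-given-HR idea in hint 3 can be sketched as a small within-runner fit. This is a minimal illustration rather than a definitive implementation: the function name `pace_at_reference_hr` and the 150 bpm default are assumptions of the sketch, and an ordinary least-squares line is used for simplicity (a robust fit may be preferable when pace outliers are common).

```python
import numpy as np
import pandas as pd

def pace_at_reference_hr(log: pd.DataFrame, ref_hr: float = 150.0) -> float:
    """Fitted pace (min/km) at ref_hr from a linear fit of avg_pace on avg_hr.

    Hypothetical helper: name and default reference HR are assumptions.
    """
    runs = log.dropna(subset=["avg_pace", "avg_hr"])
    # A line needs at least two distinct heart-rate values.
    if runs["avg_hr"].nunique() < 2:
        return float("nan")
    slope, intercept = np.polyfit(runs["avg_hr"], runs["avg_pace"], deg=1)
    return float(slope * ref_hr + intercept)

# Toy log, for illustration only: pace gets faster (lower) as heart rate rises.
toy_log = pd.DataFrame({
    "avg_hr": [130, 140, 150, 160],
    "avg_pace": [6.5, 6.0, 5.5, 5.0],
})
fitness = pace_at_reference_hr(toy_log)
print(round(fitness, 3))  # about 5.5 min/km, since the toy data is exactly linear
```

Because the feature is read off a fitted line rather than any single run, one badly tracked run shifts it less than it would shift a raw pace statistic such as the fastest logged pace.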

Worked Solution

How to Think About It: Finish time is well predicted by training *volume*, *consistency*, and *fitness*, so the modelling itself is simple. The craft lies in summarising an irregular, self-reported time series into robust features and not being fooled by dirty inputs. Treat the GPS and self-report noise as a first-class concern.

Key Insight: The prettiest columns are the dirtiest: distance_km is self-reported and avg_pace is GPS-derived. Robust features beat raw ones.

The Method:

  1. Volume and consistency. Total and peak weekly mileage, number of runs, longest single run, fraction of planned weeks actually completed, and the mileage ramp (slope of weekly volume).
  2. Fitness proxy. Regress pace on heart rate within-runner and use the fitted pace at a reference HR (say 150 bpm). This normalises effort and is robust to GPS pace noise.
  3. Taper. Depth and timing of the volume drop in the final 2–3 weeks.
  4. Target transform. Predict log finish time (or speed = distance/time), which is better-behaved than raw minutes.
  5. Model and validate. A regularised linear model or gradient-boosted trees, validated with K-fold over *runners*. Each runner contributes one row of features, so plain K-fold is fine; if you model at the week level instead, never split a runner's weeks across folds.
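Steps 1 and 3 above can be sketched as a single function that collapses one runner's log into a row of features. This is a rough sketch under stated assumptions: `summarise_block`, its `race_date` argument, and the 3-week taper window are hypothetical choices for illustration (the problem only gives the four log columns), and the race itself is assumed not to be in the log.

```python
import numpy as np
import pandas as pd

N_WEEKS = 16  # block length stated in the problem

def summarise_block(log: pd.DataFrame, race_date, taper_weeks: int = 3) -> dict:
    """Collapse one runner's log into volume, consistency, and taper features.

    race_date and taper_weeks are assumptions of this sketch, not given inputs.
    """
    dates = pd.to_datetime(log["date"])
    # Whole weeks before the race: 0 = final 7 days, 15 = first week of the block.
    weeks_out = (pd.Timestamp(race_date) - dates).dt.days // 7
    log = log.assign(weeks_out=weeks_out)
    log = log[log["weeks_out"].between(0, N_WEEKS - 1)]

    # Weekly distance, with run-free weeks filled in as zero.
    weekly = (
        log.groupby("weeks_out")["distance_km"].sum()
        .reindex(range(N_WEEKS), fill_value=0.0)
    )
    chrono = weekly.to_numpy(dtype=float)[::-1]  # oldest week first
    build, taper = chrono[:-taper_weeks], chrono[-taper_weeks:]
    peak = build.max()
    # Slope of weekly volume over the build phase (km per week).
    ramp = np.polyfit(np.arange(len(build)), build, deg=1)[0]
    return {
        "total_km": float(chrono.sum()),
        "peak_week_km": float(peak),
        "n_runs": int(len(log)),
        "longest_run_km": float(log["distance_km"].max()),
        "frac_weeks_run": float((chrono > 0).mean()),
        "ramp_km_per_week": float(ramp),
        # 0 = no taper, 1 = no running at all in the taper window.
        "taper_depth": float(1 - taper.mean() / peak) if peak > 0 else float("nan"),
    }

# Toy log, for illustration only: one run per week, 20 km in build weeks,
# 10 km in each of the final three weeks.
race = pd.Timestamp("2024-04-21")
toy = pd.DataFrame({
    "date": [race - pd.Timedelta(days=7 * k + 1) for k in range(N_WEEKS)],
    "distance_km": [10.0 if k < 3 else 20.0 for k in range(N_WEEKS)],
})
feats = summarise_block(toy, race)
print(feats["total_km"], feats["taper_depth"])
```

Reindexing onto all 16 weeks is the important design choice here: a week with no runs must count as zero volume, not silently disappear, or consistency and ramp are both overstated.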

Practical Considerations: Inspect the distance_km histogram — it bunches at round numbers and race distances, evidence of self-reporting; consider deriving volume from duration if available, or down-weighting reported distance. GPS pace drifts, so prefer HR-normalised features. Beware survivorship: runners who logged a full block and raced are not a random sample. Cold-start runners with sparse logs need wider uncertainty.
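As a complement to eyeballing the histogram, the bunching can be quantified with a rough heuristic: the share of runs logged within a small tolerance of a whole kilometre or a race distance. The function name `round_number_share`, the 0.01 km tolerance, and the choice of 21.1 and 42.2 km as race distances are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

def round_number_share(distance_km: pd.Series, tol: float = 0.01) -> float:
    """Fraction of runs logged within tol km of a whole km or of 21.1 / 42.2 km.

    Hypothetical helper: the tolerance and race distances are assumptions.
    """
    d = distance_km.dropna().to_numpy(dtype=float)
    if d.size == 0:
        return float("nan")
    near_whole = np.abs(d - np.round(d)) <= tol
    # Compare every distance against each race distance, absolute tolerance only.
    near_race = np.isclose(d[:, None], [21.1, 42.2], rtol=0.0, atol=tol).any(axis=1)
    return float((near_whole | near_race).mean())

# Toy distances, for illustration only: four "round" entries and two that are not.
toy_distances = pd.Series([5.0, 10.0, 21.1, 7.43, 12.87, 8.0])
share = round_number_share(toy_distances)
print(round(share, 3))
```

Device-measured distances should land near a whole kilometre only occasionally, so a share far above what the tolerance alone would produce hints at manual entry. The threshold for "far above" is a judgment call and worth checking per dataset.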

Answer: Summarise the block with volume, consistency, taper, and an HR-normalised fitness proxy; predict log finish time with a regularised model; validate per-runner; and explicitly down-weight the self-reported, GPS-noisy raw distance/pace in favour of robust features.
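The "predict log finish time, validate per-runner" recipe can be sketched with NumPy alone. This is an illustrative sketch, not a prescribed implementation: `cv_mae_minutes` is a hypothetical name, ridge regression stands in for "a regularised model", and the data below is synthetic. In practice scikit-learn's `KFold` and `Ridge` would do the same job with less code.

```python
import numpy as np

def cv_mae_minutes(X, finish_min, k=5, alpha=1.0, seed=0):
    """K-fold CV over runners (one row per runner) for ridge on log finish time.

    Returns the mean absolute error, in minutes, on held-out runners.
    """
    X = np.asarray(X, dtype=float)
    y = np.log(np.asarray(finish_min, dtype=float))
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        # Standardise with training-fold statistics only, to avoid leakage.
        mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
        sd[sd == 0] = 1.0
        X_tr, X_te = (X[train] - mu) / sd, (X[fold] - mu) / sd
        y_mean = y[train].mean()
        # Closed-form ridge on the centred target; the intercept is not penalised.
        gram = X_tr.T @ X_tr + alpha * np.eye(X.shape[1])
        w = np.linalg.solve(gram, X_tr.T @ (y[train] - y_mean))
        # Back-transform predictions from log-minutes to minutes before scoring.
        pred_min = np.exp(X_te @ w + y_mean)
        errors.append(np.abs(pred_min - np.exp(y[fold])))
    return float(np.concatenate(errors).mean())

# Synthetic toy data, for illustration only: 40 runners, 2 features,
# finish time exactly log-linear in the features.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
finish_min = 240.0 * np.exp(-0.10 * X[:, 0] + 0.05 * X[:, 1])
mae = cv_mae_minutes(X, finish_min)
print(f"held-out MAE: {mae:.2f} min")
```

Scoring in minutes after back-transforming keeps the error interpretable, even though the model is fitted on the log scale.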

Intuition

Finish time is mostly explained by a few robust aggregates of the training block — total volume, the longest run, and a fitness proxy like pace at a reference heart rate. The trap is data quality: logged distances are self-reported and round-number-bunched, and GPS pace is noisy. A candidate who builds noise-robust features (HR-normalised pace, volume from time rather than reported distance) and validates without leaking correlated weeks beats one who trusts the rawest, prettiest columns.
