Predicting Hiking-Route Difficulty from an Elevation Profile

Regression · Medium · Free problem

You are given a dataset of hiking routes. Each route is a pandas DataFrame of points along the trail:

| column | meaning |
|---|---|
| distance_km | cumulative distance from the trailhead, in km |
| altitude_m | elevation at that point, in m |

So a route is an ordered sequence of (distance, altitude) coordinates — an elevation *profile*, not a time series. The training set pairs each route with a difficulty score that hikers gave it in a survey; predict that score for held-out routes.

How the labels were made (this is the crux). Routes are first sorted into integer difficulty tiers — tier 6, tier 7, and so on — and that tier is essentially just *length in kilometres*: every extra kilometre bumps the route up a tier. Hikers then survey-rate each route, but they rate it relative to the other routes in its own tier. A $3.99\,\text{km}$ route is the *longest, hardest* route in tier 6, so the survey scores it near the top of that tier; a $4.01\,\text{km}$ route has just crossed into tier 7, where it is now the *shortest, easiest* one, so it scores near the bottom. The route barely changed but the rating dropped.

Walk through how you would attack this:

  1. What features would you engineer from the raw (distance, altitude) sequence? A variable-length sequence can't go straight into a model.
  2. Plot the survey score against total route length (each route's final distance_km value). On top of the overall upward trend you will see a sawtooth that repeats every ~1 km: the score climbs toward each integer-km tier boundary and then drops. Explain this seasonality, and decide how it changes your features and your target.
  3. Is the seasonal sawtooth real terrain difficulty, or an artifact of how the survey was collected? Are you trying to predict the *survey* score or the *true* difficulty — and how does that choice change the model?

Assume you can ship an interactive elevation-profile viewer alongside the model so a user can brush a route and watch the engineered features (and the prediction) update.

Hints

  1. A variable-length sequence has to become a fixed-length feature vector: summary statistics of altitude and its derivative (slope), plus shape descriptors like roughness and longest sustained ascent.
  2. Plot the survey score against total route length (each route's final distance_km value). There is an overall upward trend AND a sawtooth that repeats every ~1 km; that periodic component is seasonality, not terrain.
  3. The seasonality comes from peer-relative rating inside integer-km difficulty tiers: a 3.99 km route is the hardest in tier 6 (rated high), a 4.01 km route is the easiest in tier 7 (rated low). Engineer a within-tier position feature, e.g. distance_km mod 1 or sin/cos(2*pi*distance_km).
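A minimal sketch of the within-tier position feature from hint 3, assuming each route has already been reduced to its total length in km. The helper name `within_tier_features` and the `n_harmonics` parameter are illustrative, not part of the problem statement.

```python
import numpy as np


def within_tier_features(route_length_km, n_harmonics=1):
    """Hypothetical helper: position of a route inside its 1 km tier."""
    length = np.asarray(route_length_km, dtype=float)
    # Fractional part: about 0 just past a tier boundary, close to 1 just before the next.
    feats = {"tier_frac": np.mod(length, 1.0)}
    # Periodic encoding with period 1 km; higher harmonics sharpen the shape.
    for k in range(1, n_harmonics + 1):
        feats[f"tier_sin_{k}"] = np.sin(2 * np.pi * k * length)
        feats[f"tier_cos_{k}"] = np.cos(2 * np.pi * k * length)
    return feats


feats = within_tier_features([3.99, 4.01, 4.50])
```

Between 3.99 km and 4.01 km the `tier_frac` value jumps from about 0.99 to about 0.01, while the sin/cos pair barely moves. The fractional part therefore carries the sharp reset directly, and the periodic encoding is a smooth approximation of it.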

Worked Solution

How to Think About It: The regression itself is easy — the whole problem is (1) turning a variable-length elevation sequence into features and (2) understanding that the label is a peer-relative human survey with a built-in seasonality, not ground truth. Naively padding the sequence, throwing it at a tree model, and reporting an $R^2$ will look fine and miss the entire point. The structure to recognise is *trend + seasonality*, exactly like decomposing a time series.

Key Insight: The survey score decomposes into a trend (rising with length, because the integer tier is essentially length in km) plus a 1-km-period seasonal sawtooth (within each tier, hikers rate a route by its *rank among its tier-mates*, so the score climbs toward the top of the tier and resets at every integer-km boundary). The seasonal component is a reference-point artifact of how the survey was run — not extra terrain difficulty.
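To make the decomposition concrete, here is a toy simulation of the labelling process. It is only a sketch under stated assumptions: one tier per whole kilometre (the tier numbering offset is ignored), evenly spaced route lengths, and arbitrary constants 0.5 and 2.0 for the trend step and the sawtooth height. None of these numbers come from the real survey.

```python
import numpy as np

# Toy label generator; every constant here is an illustrative assumption.
length_km = np.linspace(2.0, 8.0, 600, endpoint=False)
tier = np.floor(length_km).astype(int)  # assumed: one tier per whole kilometre

score = np.empty_like(length_km)
for t in np.unique(tier):
    in_tier = tier == t
    # Peer-relative rating: rank by length inside the tier, scaled to [0, 1].
    ranks = np.argsort(np.argsort(length_km[in_tier]))
    relative = ranks / max(in_tier.sum() - 1, 1)
    score[in_tier] = 0.5 * t + 2.0 * relative  # trend step + seasonal sawtooth

# Compare routes on either side of the 4 km tier boundary.
just_below = score[(length_km > 3.9) & (length_km < 4.0)].mean()
just_above = score[(length_km > 4.0) & (length_km < 4.1)].mean()
print(f"mean score just below 4 km: {just_below:.2f}, just above: {just_above:.2f}")
```

In this toy model the routes just below the boundary score higher than the slightly longer routes just above it, even though the per-tier average still rises with length. That is the trend-plus-sawtooth shape described above.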

The Method:

  1. Featurise the profile. Compute slope $s_i = \Delta\text{alt}_i/\Delta\text{dist}_i$ for each segment of the route, then fixed-length features: total distance, total elevation gain (sum of positive $\Delta\text{alt}$) and loss, mean/median/max slope, slope variance ("roughness"), number of distinct climbs (runs of positive slope), longest sustained ascent, fraction of distance above a steep-grade threshold. Optionally add a few low-frequency spline/Fourier coefficients of the profile shape.
  2. Model the trend. Length and elevation gain carry the slow rise in difficulty across tiers.
  3. Model the seasonality explicitly. Add a within-tier position feature: the fractional position inside the current km bucket, total route length mod 1 (or the route's rank within its tier). The raw fractional part is itself a sawtooth, so a tree model and a linear model can both use it directly. A periodic encoding such as $\sin/\cos(2\pi\cdot\text{distance\_km})$ is smooth across the boundary, so a single pair captures only the fundamental of the sawtooth; add higher harmonics if a linear model has to reproduce the sharp drop. This is the feature a strong candidate adds *after* seeing the sawtooth in the residuals.
  4. Fit & validate. Gradient-boosted trees handle the interactions; a regularised linear model on the engineered + seasonal features is a strong, interpretable baseline. Use region/trail-grouped cross-validation so near-duplicate segments of the same trail don't leak across folds.
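Step 1 can be sketched roughly as follows. This is one possible featuriser, not a reference implementation: the function name `featurise_route`, the 10% default for `steep_grade`, and the exact definitions of "climb" and "sustained ascent" are assumptions chosen for illustration. It expects a route with at least two points at distinct distances.

```python
import numpy as np
import pandas as pd


def featurise_route(route: pd.DataFrame, steep_grade: float = 0.10) -> dict:
    """Hypothetical helper: one elevation profile -> fixed-length feature dict."""
    route = route.sort_values("distance_km")
    dist_m = route["distance_km"].to_numpy(dtype=float) * 1000.0
    alt_m = route["altitude_m"].to_numpy(dtype=float)

    d_dist = np.diff(dist_m)
    d_alt = np.diff(alt_m)
    keep = d_dist > 0  # skip repeated points so the slope never divides by zero
    d_dist, d_alt = d_dist[keep], d_alt[keep]
    slope = d_alt / d_dist  # grade as rise over run (metres per metre)

    uphill = slope > 0
    # A climb starts wherever an uphill segment follows a non-uphill one.
    n_climbs = int(uphill[0]) + int(np.sum(uphill[1:] & ~uphill[:-1]))

    # Longest sustained ascent: longest unbroken run of uphill segments.
    longest_m, run_m = 0.0, 0.0
    for seg_m, is_up in zip(d_dist, uphill):
        run_m = run_m + seg_m if is_up else 0.0
        longest_m = max(longest_m, run_m)

    return {
        "total_km": float(dist_m[-1] / 1000.0),  # cumulative distance at the last point
        "gain_m": float(d_alt[d_alt > 0].sum()),
        "loss_m": float(-d_alt[d_alt < 0].sum()),
        "slope_mean": float(slope.mean()),
        "slope_median": float(np.median(slope)),
        "slope_max": float(slope.max()),
        "roughness": float(slope.var()),
        "n_climbs": n_climbs,
        "longest_ascent_km": float(longest_m / 1000.0),
        "steep_frac": float(d_dist[slope > steep_grade].sum() / d_dist.sum()),
    }


example = pd.DataFrame({
    "distance_km": [0.0, 1.0, 2.0, 3.0, 4.0],
    "altitude_m": [100.0, 200.0, 150.0, 250.0, 300.0],
})
features = featurise_route(example)
```

Calling the function once per route and stacking the resulting dicts into a DataFrame gives the fixed-length design matrix. The slope statistics here are unweighted over segments; if the points are unevenly spaced, distance-weighted versions may be a better choice.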

Practical Considerations: The dominant failure mode is seasonality blindness — not plotting score-vs-length, missing the sawtooth, and watching the model fight an artifact it can't see. Be explicit about the target: if you must reproduce the *survey* score (e.g. to match how routes are marketed), include the within-tier seasonal feature; if you want *true* difficulty, regress the seasonal component out and keep only the trend + terrain features. Watch for length confounding (a model can score well by reading length alone — check performance within a single tier) and CV leakage from correlated trail segments.
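One way to regress the seasonal component out, shown on a toy label rather than real survey data. The label formula, its constants, and the use of a plain least-squares fit are all assumptions for illustration; in practice the terrain features would be added as extra columns of `X`.

```python
import numpy as np

# Toy survey label: a per-tier step plus a within-tier sawtooth (assumed form).
length_km = np.linspace(2.0, 8.0, 600, endpoint=False)
frac = np.mod(length_km, 1.0)  # within-tier position
survey = 0.5 * np.floor(length_km) + 2.0 * frac

# Fit survey ~ intercept + length trend + seasonal term.
X = np.column_stack([np.ones_like(length_km), length_km, frac])
coef, *_ = np.linalg.lstsq(X, survey, rcond=None)
intercept, trend_coef, seasonal_coef = coef

# Remove only the seasonal part, centred so the overall mean is unchanged.
deseasonalised = survey - seasonal_coef * (frac - frac.mean())
```

On this toy label the de-seasonalised score increases steadily with length, with no reset at the tier boundaries. To predict the survey score instead, keep the seasonal column in the model and skip the subtraction.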

Answer: Decompose the survey label into a length-driven trend plus a 1-km-period seasonal sawtooth, engineer terrain features from the elevation profile *and* an explicit within-tier position feature for the seasonality, fit a gradient-boosted (or regularised linear) regressor, and validate with trail-grouped CV. State up front whether you are predicting the tier-relative survey score or the de-seasonalised true difficulty.
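The trail-grouped validation can be sketched with a small splitter like the one below. It assumes each route carries a trail identifier, which the problem statement does not list as a column, so `trail_id` and the helper `grouped_folds` are hypothetical. scikit-learn's `GroupKFold` provides a library version of the same idea.

```python
import numpy as np


def grouped_folds(groups, n_splits=5, seed=0):
    """Hypothetical helper: yield (train_idx, test_idx) with no group on both sides.

    Assumes there are at least n_splits distinct groups.
    """
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)  # randomise which trails end up in which fold
    for held_out in np.array_split(unique, n_splits):
        test_mask = np.isin(groups, held_out)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Assumed grouping column: one trail identifier per route.
trail_id = np.array(["a", "a", "b", "b", "c", "c", "d", "e", "e", "f"])
folds = list(grouped_folds(trail_id, n_splits=3))
```

Every route from a given trail lands in exactly one test fold, so near-duplicate segments of the same trail cannot appear in both the training and the validation data.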

Intuition

Think of the label the way you'd decompose a time series: a trend plus a seasonal component. The trend is real: longer, higher routes are harder. The seasonality is an artifact of the survey: difficulty tiers are integer kilometres, and hikers rate each route relative to its tier-mates, so the score sawtooths up toward every km boundary and resets just past it (the 3.99 km route is the hardest in tier 6; at 4.01 km it's the easiest in tier 7). A candidate who plots score-vs-length, names the 1-km seasonality, and adds a within-tier position feature has the insight; one who fits XGBoost on raw features and reports $R^2$ has missed it.
