Predicting Hiking-Route Difficulty from an Elevation Profile
You are given a dataset of hiking routes. Each route is a pandas DataFrame of points along the trail:
| column | meaning |
|---|---|
| distance_km | cumulative distance from the trailhead |
| altitude_m | elevation at that point |
So a route is an ordered sequence of (distance, altitude) coordinates — an elevation *profile*, not a time series. The training set pairs each route with a difficulty score that hikers gave it in a survey; predict that score for held-out routes.
How the labels were made (this is the crux). Routes are first sorted into integer difficulty tiers — tier 6, tier 7, and so on — and that tier is essentially just *length in kilometres*: every extra kilometre bumps the route up a tier. Hikers then survey-rate each route, but they rate it relative to the other routes in its own tier. A $3.99\,\text{km}$ route is the *longest, hardest* route in tier 6, so the survey scores it near the top of that tier; a $4.01\,\text{km}$ route has just crossed into tier 7, where it is now the *shortest, easiest* one, so it scores near the bottom. The route barely changed but the rating dropped.
Walk through how you would attack this:
- What features would you engineer from the raw (distance, altitude) sequence? A variable-length sequence can't go straight into a model.
- Plot the survey score against total route length (the final distance_km value of each route). On top of the overall upward trend you will see a sawtooth that repeats every ~1 km: the score climbs toward each integer-km tier boundary and then drops. Explain this seasonality, and decide how it changes your features and your target.
- Is the seasonal sawtooth real terrain difficulty, or an artifact of how the survey was collected? Are you trying to predict the *survey* score or the *true* difficulty, and how does that choice change the model?
Assume you can ship an interactive elevation-profile viewer alongside the model so a user can brush a route and watch the engineered features (and the prediction) update.
Hints
- A variable-length sequence has to become a fixed-length feature vector: summary statistics of altitude and its derivative (slope), plus shape descriptors like roughness and longest sustained ascent.
- Plot the survey score against total route length (the final distance_km value of each route). There is an overall upward trend AND a sawtooth that repeats every ~1 km; that periodic component is seasonality, not terrain.
- The seasonality comes from peer-relative rating inside integer-km difficulty tiers: a 3.99 km route is the hardest in tier 6 (rated high), a 4.01 km route is the easiest in tier 7 (rated low). Engineer a within-tier position feature from the total route length, e.g. length mod 1 or sin/cos(2*pi*length).
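The within-tier position feature from the last hint can be sketched in a few lines. This is a minimal illustration, assuming tier boundaries sit exactly on integer kilometres; the function name `within_tier_features` and the feature names are hypothetical, not part of the dataset.

```python
import math


def within_tier_features(length_km):
    """Position of a route inside its 1 km tier bucket (hypothetical helper)."""
    # Fractional part: near 0 just past a boundary, near 1 just before the next.
    frac = length_km % 1.0
    # Periodic encoding with a 1 km period.
    angle = 2.0 * math.pi * length_km
    return {
        "tier_frac": frac,
        "tier_sin": math.sin(angle),
        "tier_cos": math.cos(angle),
    }


just_below = within_tier_features(3.99)  # top of its tier
just_above = within_tier_features(4.01)  # bottom of the next tier
```

The two routes end up at opposite ends of `tier_frac` but almost on top of each other in `tier_cos`, with only a small sign flip in `tier_sin`. The raw fractional part therefore exposes the drop at the boundary directly, while the sin/cos pair is continuous across it.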
Worked Solution
How to Think About It: The regression itself is easy — the whole problem is (1) turning a variable-length elevation sequence into features and (2) understanding that the label is a peer-relative human survey with a built-in seasonality, not ground truth. Naively padding the sequence, throwing it at a tree model, and reporting an $R^2$ will look fine and miss the entire point. The structure to recognise is *trend + seasonality*, exactly like decomposing a time series.
Key Insight: The survey score decomposes into a trend (rising with length, because the integer tier is essentially length in km) plus a 1-km-period seasonal sawtooth (within each tier, hikers rate a route by its *rank among its tier-mates*, so the score climbs toward the top of the tier and resets at every integer-km boundary). The seasonal component is a reference-point artifact of how the survey was run — not extra terrain difficulty.
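One way to write this decomposition down, using notation introduced here purely for illustration ($\ell$ for total route length in km, $y$ for the survey score, $g$ and $h$ for unspecified trend and seasonal functions):

$$
y \;=\; \underbrace{g(\ell,\ \text{terrain})}_{\text{trend}} \;+\; \underbrace{h\big(\ell - \lfloor \ell \rfloor\big)}_{\text{seasonal, period } 1\,\text{km}} \;+\; \varepsilon,
\qquad \text{tier} = \lfloor \ell \rfloor + c .
$$

Here $h$ rises across the unit interval and resets at every integer $\ell$, and $c$ is a constant offset. If tiers are assigned by the floor of the length, the example of a $3.99\,\text{km}$ route sitting in tier 6 would correspond to $c = 3$; that value is inferred from the example, not stated in the problem.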
The Method:
1. Featurise the profile. Compute the slope $s_i = \Delta\text{alt}/\Delta\text{dist}$ of each segment along the route (m per km with the given columns; divide by 1000 for a dimensionless grade), then fixed-length features: total distance, total elevation gain (sum of positive $\Delta\text{alt}$) and loss, mean/median/max slope, slope variance ("roughness"), number of distinct climbs (counted from slope sign changes), longest sustained ascent, fraction of distance above a steep-grade threshold. Optionally add a few low-frequency spline/Fourier coefficients of the profile shape.
2. Model the trend. Length and elevation gain carry the slow rise in difficulty across tiers.
3. Model the seasonality explicitly. Add a within-tier position feature: the fractional position inside the current km bucket, i.e. the route's total length (final distance_km) mod 1, or the route's rank within its tier. The raw fractional part is itself a sawtooth, so a tree model can split on it and a linear model can fit the ramp directly if the within-tier rise is roughly linear. A periodic encoding (e.g. $\sin/\cos(2\pi\cdot\text{length})$) gives a smooth term that only approximates the sawtooth, because a single harmonic is continuous across the tier boundary and cannot reproduce the sharp drop. This is the feature a strong candidate adds *after* seeing the sawtooth in the residuals.
4. Fit & validate. Gradient-boosted trees handle the interactions; a regularised linear model on the engineered + seasonal features is a strong, interpretable baseline. Use region/trail-grouped cross-validation so near-duplicate segments of the same trail don't leak across folds.
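Step 1 can be sketched as follows. This is a standard-library illustration rather than a reference implementation: the function name `featurise`, the feature names, and the `steep_grade=0.15` threshold are all assumptions made for the example, and slopes are converted to a dimensionless grade (metres of rise per metre of run).

```python
from statistics import mean, median, pvariance


def featurise(distance_km, altitude_m, steep_grade=0.15):
    """Turn one variable-length profile into a fixed-length feature dict.

    distance_km: cumulative distance per point, strictly increasing.
    altitude_m:  elevation per point, same length as distance_km.
    steep_grade: hypothetical threshold (rise/run) for a "steep" segment.
    """
    seg_len_m = []  # horizontal length of each segment, in metres
    d_alt = []      # altitude change over each segment, in metres
    for i in range(1, len(distance_km)):
        seg_len_m.append((distance_km[i] - distance_km[i - 1]) * 1000.0)
        d_alt.append(altitude_m[i] - altitude_m[i - 1])
    slopes = [da / dl for da, dl in zip(d_alt, seg_len_m)]

    # Longest run of consecutive uphill segments, accumulated in metres.
    longest_ascent_m = run_m = 0.0
    for s, dl in zip(slopes, seg_len_m):
        run_m = run_m + dl if s > 0 else 0.0
        longest_ascent_m = max(longest_ascent_m, run_m)

    # A climb starts wherever an uphill segment follows a non-uphill one.
    n_climbs = sum(
        1 for i, s in enumerate(slopes)
        if s > 0 and (i == 0 or slopes[i - 1] <= 0)
    )

    total_m = sum(seg_len_m)
    steep_m = sum(dl for s, dl in zip(slopes, seg_len_m) if abs(s) > steep_grade)
    return {
        "length_km": distance_km[-1] - distance_km[0],
        "gain_m": sum(da for da in d_alt if da > 0),
        "loss_m": -sum(da for da in d_alt if da < 0),
        "slope_mean": mean(slopes),
        "slope_median": median(slopes),
        "slope_max": max(slopes),
        "roughness": pvariance(slopes),  # variance of segment slopes
        "n_climbs": n_climbs,
        "longest_ascent_km": longest_ascent_m / 1000.0,
        "steep_fraction": steep_m / total_m,
    }


# Toy profile, invented for illustration only.
example = featurise([0.0, 0.5, 1.0, 1.5, 2.0], [100, 150, 140, 240, 300])
```

With a route stored as a pandas DataFrame, the same function could be called as `featurise(route["distance_km"].tolist(), route["altitude_m"].tolist())`. Statistics here are taken per segment; if points are unevenly spaced, weighting by segment length may be a better choice.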
Practical Considerations: The dominant failure mode is seasonality blindness — not plotting score-vs-length, missing the sawtooth, and watching the model fight an artifact it can't see. Be explicit about the target: if you must reproduce the *survey* score (e.g. to match how routes are marketed), include the within-tier seasonal feature; if you want *true* difficulty, regress the seasonal component out and keep only the trend + terrain features. Watch for length confounding (a model can score well by reading length alone — check performance within a single tier) and CV leakage from correlated trail segments.
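If the target is true difficulty, one simple way to "regress the seasonal component out" is a classical trend-plus-seasonal decomposition with a 1 km period. The sketch below is an assumption-laden illustration, not the only approach: it fits a straight-line trend on length, averages the detrended residuals by within-tier position bin, and subtracts that average. The function name `deseasonalise`, the bin count, and the synthetic scores are all made up for the example.

```python
from collections import defaultdict


def deseasonalise(lengths_km, scores, n_bins=20):
    """Remove a 1 km-period seasonal component from survey scores.

    Returns (adjusted_scores, seasonal_by_bin).
    """
    n = len(lengths_km)
    mean_x = sum(lengths_km) / n
    mean_y = sum(scores) / n
    # Ordinary least-squares line for the trend: score ~ intercept + slope * length.
    sxx = sum((x - mean_x) ** 2 for x in lengths_km)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths_km, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Bin each route by its fractional position inside its km bucket.
    bins = [min(int((x % 1.0) * n_bins), n_bins - 1) for x in lengths_km]

    # Seasonal estimate: mean detrended residual in each position bin.
    residuals = defaultdict(list)
    for x, y, b in zip(lengths_km, scores, bins):
        residuals[b].append(y - (intercept + slope * x))
    seasonal = {b: sum(r) / len(r) for b, r in residuals.items()}

    adjusted = [y - seasonal[b] for y, b in zip(scores, bins)]
    return adjusted, seasonal


# Synthetic scores (NOT real data): linear trend plus a sawtooth of period 1 km.
lengths = [3 + (k + 0.5) / 20 for k in range(80)]  # 3.025 ... 6.975 km
scores = [2 * x + 3 * ((x % 1.0) - 0.5) for x in lengths]
adjusted, seasonal = deseasonalise(lengths, scores)

raw_jump = scores[20] - scores[19]           # 3.975 km -> 4.025 km, raw label
adjusted_jump = adjusted[20] - adjusted[19]  # same pair after adjustment
```

On this synthetic example the raw score drops sharply across the 4 km boundary, while the adjusted score changes only slightly. The adjustment is not perfect, because the fitted trend slope absorbs part of the within-tier rise; with real labels it would be worth checking the adjusted target against terrain features within a single tier.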
Answer: Decompose the survey label into a length-driven trend plus a 1-km-period seasonal sawtooth, engineer terrain features from the elevation profile *and* an explicit within-tier position feature for the seasonality, fit a gradient-boosted (or regularised linear) regressor, and validate with trail-grouped CV — stating whether you are predicting the gamed survey score or de-seasonalised true difficulty.
Intuition
Think of the label the way you'd decompose a time series: a trend plus a seasonal component. The trend is real: longer, higher routes are harder. The seasonality is an artifact of the survey: difficulty tiers are integer kilometres, and hikers rate each route relative to its tier-mates, so the score sawtooths up toward every km boundary and resets just past it (the 3.99 km route is the hardest in tier 6; at 4.01 km it's the easiest in tier 7). A candidate who plots score-vs-length, names the 1-km seasonality, and adds a within-tier position feature has the insight; one who fits XGBoost on raw features and reports $R^2$ has missed it.