Forecasting an Article's 7-Day Pageviews from its First Six Hours
Each article gives you an early-traffic pandas DataFrame plus its headline:
| column | meaning |
|---|---|
| hour | hours since publication (0–6) |
| cumulative_views | total views so far |
and a headline string. The target is total views at day 7. Train on older articles whose 7-day totals are known.
Discuss:
- How do you turn a 6-hour view curve into features that extrapolate to day 7?
- Some articles show a sudden jump in the curve partway through. Investigate what causes those jumps before you trust them as organic signal.
- What target transform and validation scheme would you use?
Hints
- Early-curve shape extrapolates: the level at hour 6, the slope, and the curvature (is growth accelerating or saturating?) are the core features. Views are heavy-tailed, so model logs.
- Step jumps in the curve are usually editorial promotion (front-page or newsletter placement), an exogenous treatment unrelated to content quality. Attributing them to the headline is a mistake.
- Publication time-of-day and day-of-week drive strong seasonality in the early curve; normalise for them.
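The curve-shape hint can be made concrete with a small sketch. It assumes the `hour` and `cumulative_views` columns described above; the function name `curve_features` and the returned keys are illustrative choices, not part of the problem statement. Level is taken as the log of the last observed cumulative count, slope as the log-log slope, and curvature as the quadratic coefficient of a log-log fit (negative suggests saturating growth, positive suggests accelerating growth).

```python
import numpy as np
import pandas as pd


def curve_features(curve: pd.DataFrame) -> dict:
    """Level, slope, and curvature of an early cumulative-views curve.

    Assumes columns `hour` and `cumulative_views`; the output key names
    are hypothetical.
    """
    curve = curve.sort_values("hour")
    # Level: log of the last observed cumulative count (hour 6 if present).
    log_level = float(np.log1p(curve["cumulative_views"].iloc[-1]))

    # Logs are undefined at hour 0 or zero views, so drop those rows.
    pos = curve[(curve["hour"] > 0) & (curve["cumulative_views"] > 0)]
    slope = curvature = float("nan")
    if len(pos) >= 3:
        log_t = np.log(pos["hour"].to_numpy(dtype=float))
        log_v = np.log(pos["cumulative_views"].to_numpy(dtype=float))
        # Slope b of views ~ a * hour**b, fitted in log-log space.
        slope = float(np.polyfit(log_t, log_v, 1)[0])
        # Quadratic coefficient in log-log space as a curvature proxy.
        curvature = float(np.polyfit(log_t, log_v, 2)[0])

    return {
        "log_level_h6": log_level,
        "loglog_slope": slope,
        "loglog_curvature": curvature,
    }
```

Calling `curve_features` once per article and stacking the resulting dicts into a DataFrame gives one feature row per article. Working in log space keeps the features comparable between small and very large articles.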
Worked Solution
How to Think About It: This is early-trajectory extrapolation. The first six hours contain most of the predictive signal, so the work is (1) shape features that generalise to day 7 and (2) recognising that some early traffic is *exogenous promotion*, not organic demand. Headline text is a weak secondary signal.
Key Insight: Step jumps in the early curve are usually editorial promotion — a treatment effect, not content quality. Mis-attributing them inflates predictions and corrupts the headline features.
The Method:
1. Curve-shape features. Cumulative views at hour 6, the early growth rate, and curvature (fit a simple power-law or log-log slope to see if growth is accelerating or saturating). The shape of organic attention decay is fairly universal, so the early slope extrapolates.
2. Seasonality. Publication hour-of-day and day-of-week; normalise the early curve by the expected diurnal pattern.
3. Promotion detection. Flag discontinuities/step jumps in cumulative_views (a large residual against a smooth fit) and add a "was promoted" feature, rather than letting the jump leak into other features.
4. Text features. Headline length, sentiment, presence of numbers/questions (modest, regularised).
5. Model & validate. Predict $\log(\text{7-day views})$ with gradient-boosted trees; validate on a forward time split so trend/seasonality regimes don't leak.
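The promotion-detection step could be sketched roughly as follows. This is one simple variant rather than a definitive detector: instead of fitting a smooth curve and thresholding residuals, it compares the largest hourly increment against the median of the remaining increments. The function name `detect_promotion_jump` and the default `ratio_threshold=3.0` are assumptions for illustration and would need tuning on real data.

```python
import numpy as np
import pandas as pd


def detect_promotion_jump(curve: pd.DataFrame, ratio_threshold: float = 3.0) -> dict:
    """Flag a step jump in `cumulative_views`.

    Heuristic (an assumption, not the only option): the largest hourly
    increment is compared with the median of the other increments.
    """
    curve = curve.sort_values("hour")
    hours = curve["hour"].to_numpy(dtype=float)
    increments = np.diff(curve["cumulative_views"].to_numpy(dtype=float))

    # Too few increments to estimate a baseline.
    if increments.size < 3:
        return {"was_promoted": False, "jump_hour": None, "jump_ratio": float("nan")}

    i = int(np.argmax(increments))
    # Baseline excludes the candidate jump so it cannot mask itself.
    baseline = float(np.median(np.delete(increments, i)))
    ratio = float(increments[i] / max(baseline, 1.0))  # guard against zero baseline
    promoted = ratio >= ratio_threshold

    return {
        "was_promoted": bool(promoted),
        # Hour at which the jump has landed (end of the jumping interval).
        "jump_hour": float(hours[i + 1]) if promoted else None,
        "jump_ratio": ratio,
    }
```

Keeping `was_promoted` and `jump_ratio` as explicit features lets the model learn a separate effect for promoted articles. Where possible, confirm flagged jumps against actual front-page or newsletter placement logs rather than relying on the heuristic alone.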
Practical Considerations: Views are heavy-tailed, so model logs and expect a few viral outliers to dominate raw-scale error. The main bias is treating promotion as organic; detect and isolate it. Avoid leakage from any feature computed using post-6-hour data. Headline style drifts over time, so a forward split is essential.
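A minimal sketch of the forward split and log-scale scoring is given below. It assumes a per-article table with a publication timestamp column and a known 7-day total; the column names `published_at` and `views_7d` and both function names are hypothetical. The model itself is left out: any regressor (for example a gradient-boosted tree model) can be fitted on the training part with `np.log1p(views_7d)` as the target.

```python
import numpy as np
import pandas as pd


def forward_split(articles: pd.DataFrame, time_col: str = "published_at",
                  train_frac: float = 0.8):
    """Split by publication time: older articles train, newer ones validate."""
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")
    ordered = articles.sort_values(time_col).reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


def log_rmse(true_views, predicted_log_views) -> float:
    """RMSE on the log1p scale, so viral outliers do not dominate the score."""
    true_log = np.log1p(np.asarray(true_views, dtype=float))
    pred_log = np.asarray(predicted_log_views, dtype=float)
    return float(np.sqrt(np.mean((true_log - pred_log) ** 2)))
```

A typical use would be `train, valid = forward_split(articles)`, fitting on `train`, and scoring with `log_rmse(valid["views_7d"], predictions)`. Predictions on the raw scale are recovered with `np.expm1`.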
Answer: Extrapolate from early-curve level/slope/curvature plus diurnal normalisation, explicitly detect and flag editorial-promotion jumps, add light headline-text features, and predict log 7-day views with a forward-validated tree model.
Intuition
Most of the 7-day total is already encoded in the shape of the first few hours — level, slope, and whether growth is accelerating or saturating. The subtlety is that some early curves are inflated by editorial promotion, an exogenous jump that says nothing about the content. A model that treats promotion jumps as organic will over-predict and mis-credit the headline. Detecting the jumps (and seasonality) is the real skill.