Forecasting an Article's 7-Day Pageviews from its First Six Hours

Time Series · Medium · Free problem

Each article gives you an early-traffic pandas DataFrame plus its headline:

| column | meaning |
|---|---|
| hour | hours since publication (0–6) |
| cumulative_views | total views so far |

The headline is supplied as a plain string. The target is the article's total views at day 7. Train on older articles whose 7-day totals are known.

Discuss:

  1. How do you turn a 6-hour view curve into features that extrapolate to day 7?
  2. Some articles show a sudden jump in the curve partway through. Investigate what causes those jumps before you trust them as organic signal.
  3. What target transform and validation scheme would you use?

Hints

  1. Early-curve shape extrapolates: the level at hour 6, the slope, and the curvature (is growth accelerating or saturating?) are the core features. Views are heavy-tailed, so model logs.
  2. Step jumps in the curve are usually editorial promotion (front-page or newsletter placement), an exogenous treatment unrelated to content quality. Attributing them to the headline is a mistake.
  3. Publication time-of-day and day-of-week drive strong seasonality in the early curve; normalise for them.
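The curve-shape features from hint 1 can be sketched as below. This is a minimal illustration, not a reference implementation: the function name `curve_features`, the feature names, and the choice of a polynomial fit in log-log space are assumptions made for the example. Curvature here is measured relative to a pure power law, so a value near zero means the curve is well described by a power law, a negative value suggests saturation, and a positive value suggests acceleration.

```python
import numpy as np


def curve_features(hours, cumulative_views):
    """Summarise an early cumulative-view curve as level, slope, and curvature.

    Hypothetical helper: the names and the log-log polynomial fit are
    illustrative assumptions, not part of the problem statement.
    """
    h = np.asarray(hours, dtype=float)
    v = np.asarray(cumulative_views, dtype=float)

    # log1p keeps hour 0 and zero view counts finite.
    log_h = np.log1p(h)
    log_v = np.log1p(v)

    # Straight-line fit in log-log space: the slope is the power-law exponent.
    slope, _ = np.polyfit(log_h, log_v, deg=1)

    # Quadratic fit: the leading coefficient measures departure from a pure
    # power law (negative = saturating, positive = accelerating).
    quad, _, _ = np.polyfit(log_h, log_v, deg=2)

    return {
        "log_level": float(log_v[-1]),      # log views at the last observed hour
        "loglog_slope": float(slope),
        "loglog_curvature": float(quad),
    }
```

With the DataFrame described above, the call would look like `curve_features(df["hour"], df["cumulative_views"])`, since `np.asarray` accepts pandas Series.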

Worked Solution

How to Think About It: This is early-trajectory extrapolation. The first six hours contain most of the predictive signal, so the work is (1) shape features that generalise to day 7 and (2) recognising that some early traffic is *exogenous promotion*, not organic demand. Headline text is a weak secondary signal.

Key Insight: Step jumps in the early curve are usually editorial promotion — a treatment effect, not content quality. Mis-attributing them inflates predictions and corrupts the headline features.

The Method:

  1. Curve-shape features. Cumulative views at hour 6, the early growth rate, and curvature (fit a simple power-law or log-log slope to see if growth is accelerating or saturating). The shape of organic attention decay is fairly universal, so the early slope extrapolates.
  2. Seasonality. Publication hour-of-day and day-of-week; normalise the early curve by the expected diurnal pattern.
  3. Promotion detection. Flag discontinuities/step jumps in cumulative_views (a large residual against a smooth fit) and add a "was promoted" feature, rather than letting the jump leak into other features.
  4. Text features. Headline length, sentiment, presence of numbers/questions; keep these modest and regularised.
  5. Model & validate. Predict $\log(\text{7-day views})$ with gradient-boosted trees; validate on a forward time split so trend/seasonality regimes don't leak.
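Promotion detection (step 3) might look something like the sketch below. Instead of a residual against a smooth fit, it uses a simpler stand-in: a step in the cumulative curve shows up as an hourly gain that is much larger than the previous hour's gain, which is unusual when organic gains tend to decay. The function name `detect_promotion` and both thresholds are hypothetical and would need tuning on real data; a genuinely viral article could also trip this rule.

```python
import numpy as np


def detect_promotion(cumulative_views, ratio_threshold=3.0, min_gain=50.0):
    """Flag a step jump in a cumulative-view curve.

    Heuristic sketch: an hour is flagged when its gain exceeds
    `ratio_threshold` times the previous hour's gain and is at least
    `min_gain` views. Both thresholds are illustrative assumptions.
    """
    v = np.asarray(cumulative_views, dtype=float)
    gains = np.diff(v)                 # views gained during each hour
    prev, curr = gains[:-1], gains[1:]

    # Adding 1 to both sides avoids division by zero for quiet hours.
    ratio = (curr + 1.0) / (prev + 1.0)
    jump = (ratio > ratio_threshold) & (curr >= min_gain)

    if not jump.any():
        return {"was_promoted": False, "jump_index": -1, "jump_excess": 0.0}

    j = int(np.argmax(jump))           # first flagged position in `curr`
    return {
        "was_promoted": True,
        "jump_index": j + 2,           # row of cumulative_views where the jump lands
        "jump_excess": float(curr[j] - prev[j]),  # rough size of the step
    }
```

The `was_promoted` flag and `jump_excess` can then be passed to the model as separate features, and the excess can be subtracted from the curve before computing shape features so the jump does not distort the slope.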

Practical Considerations: Views are heavy-tailed — model logs and expect a few viral outliers to dominate raw error. The main bias is treating promotion as organic; detect and isolate it. Avoid leakage from any feature computed using post-6-hour data. Headlines drift over time, so a forward split is essential.
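The log target and forward split could be set up roughly as follows. This is a sketch under stated assumptions: publication times are given as numeric days, the helper names `forward_split`, `to_log_target`, `from_log_target`, and `log_rmse` are hypothetical, and the 7-day label delay reflects the requirement that training articles already have known 7-day totals when the validation window starts.

```python
import numpy as np


def to_log_target(views):
    """log1p transform for the heavy-tailed 7-day totals."""
    return np.log1p(np.asarray(views, dtype=float))


def from_log_target(log_views):
    """Inverse of to_log_target."""
    return np.expm1(np.asarray(log_views, dtype=float))


def forward_split(publish_day, train_fraction=0.8, label_delay_days=7.0):
    """Split article indices by publication time (assumes a non-empty input).

    The latest articles form the validation set. Training articles are kept
    only if their 7-day total was already observable when validation starts.
    """
    t = np.asarray(publish_day, dtype=float)
    order = np.argsort(t, kind="stable")
    cut = int(len(order) * train_fraction)

    valid_idx = order[cut:]
    valid_start = t[valid_idx].min()

    train_idx = order[:cut]
    train_idx = train_idx[t[train_idx] + label_delay_days <= valid_start]
    return train_idx, valid_idx


def log_rmse(true_views, predicted_log_views):
    """RMSE in log space, so a few viral outliers do not dominate the score."""
    err = to_log_target(true_views) - np.asarray(predicted_log_views, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))
```

Dropping the training articles published in the last seven days before the validation window costs some data, but it mirrors what would actually be known at prediction time.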

Answer: Extrapolate from early-curve level/slope/curvature plus diurnal normalisation, explicitly detect and flag editorial-promotion jumps, add light headline-text features, and predict log 7-day views with a forward-validated tree model.

Intuition

Most of the 7-day total is already encoded in the shape of the first few hours — level, slope, and whether growth is accelerating or saturating. The subtlety is that some early curves are inflated by editorial promotion, an exogenous jump that says nothing about the content. A model that treats promotion jumps as organic will over-predict and mis-credit the headline. Detecting the jumps (and seasonality) is the real skill.
