Streaming Quantile Approximation with T-Digest
You need to track approximate quantiles on a stream of numbers in a single pass. Your data structure should give higher resolution in the tails (near quantile 0 and quantile 1) and coarser resolution in the middle.
Design a T-Digest-style structure that maintains $C$ cluster centroids, where each cluster has a mean and a weight. The key constraint is that cluster sizes must shrink near quantiles 0 and 1 -- specifically, the maximum weight a cluster is allowed to have depends on how close its quantile rank is to the edges.
1. Define the data structure and its invariants. What does each cluster store, and what size-limit function controls how large a cluster can grow as a function of its quantile position $q \in [0,1]$?
2. Describe the **insert** procedure for a new data point and the **merge** procedure for combining two T-Digests. Both should maintain the size-limit invariant. Target $O(\log C)$ time per insert and $O(C)$ total space.
3. Prove a bound on the rank error of a quantile query as a function of the total number of clusters $C$.
4. Explain why this variable-size clustering gives better tail accuracy than uniform binning (i.e., splitting the data range into $C$ equal-width or equal-count bins).
Open the full interactive solver, hints, and worked solution →