Online Outlier Detection in a Data Stream

Statistics · Medium · Free problem

You are receiving a stream of 1D numerical observations one at a time. You need to flag outliers in real time -- you cannot store all past data, and the stream may run indefinitely.

Design a simple algorithm for online outlier detection. Specifically:

  1. Define precisely what you consider an "outlier."
  2. Describe an algorithm that processes each new observation in $O(1)$ time and $O(1)$ space, deciding immediately whether to flag it.
  3. Justify why your method works under reasonable assumptions, and discuss when it breaks down.

Hints

  1. You need statistics that update incrementally -- think about what summary statistics you can maintain with $O(1)$ storage.
  2. Welford's algorithm lets you compute exact running mean and variance with a single pass. How would you use these to define an outlier?
  3. Test each new point against the statistics computed *before* including it -- otherwise the outlier contaminates its own test. Consider what happens when the data distribution is non-stationary.

Worked Solution

How to Think About It: The core challenge is that you cannot look at the whole dataset -- you only see one point at a time and need to decide right now whether it is anomalous. The classic offline approach (compute the full-sample mean and standard deviation, flag anything beyond 3 standard deviations) is not available because you do not have the full sample. So you need running statistics that update incrementally. The natural idea is to maintain a running mean and variance, then flag a new point if it is too far from the current mean relative to the current spread. This is essentially a sequential z-score test.

Key Insight: Welford's online algorithm lets you maintain exact running mean and variance in $O(1)$ time and space per update, without storing any past data. Combined with a z-score threshold, this gives you a simple, principled outlier detector.

The Method:

  1. Initialize: Set $n = 0$, $\mu_0 = 0$, $M_0 = 0$. Choose a threshold $k$ (typically $k = 3$).

2. On receiving observation $x_n$: - Compute the z-score against the *previous* statistics: $z = |x_n - \mu_{n-1}| \,/\, \sigma_{n-1}$, where $\sigma_{n-1} = \sqrt{M_{n-1} / (n-1)}$. - If $z > k$, flag $x_n$ as an outlier. - Update the running statistics using Welford's recurrence:

$\mu_n = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{n}$

$M_n = M_{n-1} + (x_n - \mu_{n-1})(x_n - \mu_n)$

$\sigma_n^2 = \frac{M_n}{n}$

  1. Outlier definition: A point $x_n$ is an outlier if it falls more than $k$ standard deviations from the running mean, i.e., $|x_n - \mu_{n-1}| > k \cdot \sigma_{n-1}$.
  1. Important detail: Test against the *previous* mean and variance (before incorporating $x_n$). If you update first and then test, the outlier itself contaminates the statistics you are testing against, reducing your detection power.

Practical Considerations:

  • Warm-up period: You need at least 20-30 observations before the running variance stabilizes. During the warm-up, either skip detection or use a wider threshold.
  • Non-stationarity: If the data distribution shifts over time (common in financial data), the running mean and variance computed over all history will lag. Fix: use an exponentially weighted moving average (EWMA) with decay factor $\alpha \in (0, 1)$:

$\mu_n = \alpha \, x_n + (1 - \alpha) \, \mu_{n-1}$

$\sigma_n^2 = \alpha \, (x_n - \mu_n)^2 + (1 - \alpha) \, \sigma_{n-1}^2$

This gives more weight to recent data and adapts to regime changes. Typical $\alpha$ values are 0.01-0.05. - Masking by outliers: If you include flagged outliers in the running statistics, they inflate $\sigma$ and make future outliers harder to detect. Consider excluding flagged points from the update step. - Heavy tails: The 3-sigma rule assumes approximate normality. For heavy-tailed distributions (e.g., financial returns), $k = 3$ will flag too many points. Use a higher threshold ($k = 4$ or $5$) or switch to a median/MAD-based detector. - Choice of $k$: Under normality, $k = 3$ gives a false positive rate of about 0.27%. Tighten to $k = 3.5$ or $k = 4$ if false alarms are costly.

Answer: Maintain a running mean and variance via Welford's algorithm. Flag observation $x_n$ as an outlier if $|x_n - \mu_{n-1}| > k \cdot \sigma_{n-1}$ (test before updating). This runs in $O(1)$ time and space. For non-stationary streams, replace the cumulative statistics with EWMA. The method assumes approximate normality -- for heavy-tailed data, increase $k$ or use robust alternatives like median absolute deviation.

Intuition

Online outlier detection is fundamentally about maintaining a compact summary of "normal" behavior and flagging deviations from it. The z-score approach works because it reduces the problem to a simple hypothesis test at each step: is this new point consistent with the distribution I have seen so far? Welford's algorithm is the engine that makes this possible in constant time and space -- it is one of those numerical tricks every practitioner should know by heart.

The deeper lesson is about the tension between adaptability and stability. If you weight all history equally (cumulative mean/variance), you are robust to noise but slow to adapt to regime changes. If you weight recent data heavily (small EWMA window), you adapt quickly but become jittery. In practice -- especially on a trading desk -- you often run multiple detectors at different time scales and trigger an alert when any of them fires. The other subtlety people miss is contamination: if outliers leak into your running statistics, they inflate your variance estimate and mask future outliers, creating a vicious cycle.

Open the full interactive solver →