Two-Sample Kolmogorov-Smirnov Test

Statistics · Hard · Free problem

You have two independent samples: $\{X_i\}_{i=1}^{m}$ drawn from distribution $F$ and $\{Y_j\}_{j=1}^{n}$ drawn from distribution $G$. You want to test whether $F = G$ -- no parametric assumptions, no moment conditions, just the raw empirical data.

Define the empirical CDFs:

$F_m(x) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[X_i \leq x], \quad G_n(x) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}[Y_j \leq x]$

The two-sample Kolmogorov-Smirnov statistic is the maximum gap between these two step functions:

$D_{m,n} = \sup_{x \in \mathbb{R}} |F_m(x) - G_n(x)|$

Address the following:

  1. State the asymptotic null distribution of $D_{m,n}$ (under $H_0: F = G$) and give the p-value approximation formula.
  2. Outline an $O((m+n)\log(m+n))$ algorithm to compute $D_{m,n}$.
  3. Discuss the practical limitations of the KS test -- when does it work well, when does it break down, and what should you use instead?

Hints

  1. Under $H_0$, the two empirical CDFs track the same underlying $F$. Think about how fast the max gap $D_{m,n}$ shrinks as the sample sizes grow, and what rescaling would give you a non-degenerate limit.
  2. The Kolmogorov distribution is the limiting distribution of $\sup_{t}|B(t)|$ where $B$ is a Brownian bridge -- the same object that appears in the one-sample KS test, just with effective sample size $n_{\text{eff}} = mn/(m+n)$ replacing $n$.
  3. For the algorithm: the supremum of $|F_m(x) - G_n(x)|$ is only attained at one of the $m+n$ data points. Merge-sort the combined sample and sweep through, tracking $c_X/m - c_Y/n$ at each step.

Worked Solution

How to Think About It: The KS test is the nonparametric go-to for checking whether two samples come from the same distribution. Before any formulas, think about what $D_{m,n}$ actually measures: it is the worst-case discrepancy between two step functions. If both samples come from the same distribution, those two step functions should track each other closely -- the max gap should shrink as $m, n \to \infty$. The question is how fast and with what distribution.

The key insight is that under $H_0$, the KS statistic (after appropriate scaling) converges to a distribution that depends only on the sample sizes, not on the underlying $F$ -- that is the miracle of distribution-free tests. The algorithm part is just a merge-sort observation: the supremum of $|F_m - G_n|$ is only achieved at data points, so you only need to evaluate the gap at $m + n$ locations.

Part (a): Asymptotic Null Distribution and P-Value

Under $H_0: F = G$, both samples are effectively drawn from the same distribution. The scaled statistic converges in distribution:

$\sqrt{\frac{mn}{m+n}} \, D_{m,n} \xrightarrow{d} K \quad \text{as } m, n \to \infty$

where $K$ follows the Kolmogorov distribution, with CDF:

$P(K \leq t) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 t^2}$

The p-value approximation uses the effective sample size $n_{\text{eff}} = mn/(m+n)$ (the harmonic-mean analog):

$p\text{-value} \approx 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 \lambda^2}, \quad \lambda = \sqrt{n_{\text{eff}}} \cdot D_{m,n}$

In practice the series converges very quickly -- even one or two terms gives good accuracy for $\lambda > 0.5$. The common single-term approximation is:

$p\text{-value} \approx 2 e^{-2\lambda^2}$

This is exact only in the large-sample limit and for continuous $F$. For small samples, use tabulated critical values or permutation tests.

Part (b): $O((m+n)\log(m+n))$ Algorithm

The supremum $\sup_x |F_m(x) - G_n(x)|$ is only achieved at a jump point of $F_m$ or $G_n$ -- that is, at one of the $m+n$ observed data values. Between data points, both empirical CDFs are flat, so the gap is constant.

Algorithm:

1. Merge and sort all $m+n$ observations, tagging each with its source ($X$ or $Y$). Cost: $O((m+n)\log(m+n))$. 2. Sweep left to right. Maintain running counts $c_X$ (number of $X$ values seen) and $c_Y$ (number of $Y$ values seen). At each data point, update the appropriate counter and compute: $\text{gap} = \left|\frac{c_X}{m} - \frac{c_Y}{n}\right|$ 3. Track the maximum gap across all $m+n$ evaluation points. 4. Return $D_{m,n} = \max \text{ gap}$.

The sort dominates at $O((m+n)\log(m+n))$; the sweep is $O(m+n)$. No integration, no grid search needed.

Ties: If $X_i = Y_j$ for some $i, j$, evaluate the gap both before and after the tied value (left and right limits) and take the max. In practice: when sorting, process all $X$ values at a tie before all $Y$ values, then all $Y$ before all $X$, take the max of both orderings.

Part (c): Practical Limitations

  • Continuous distributions only. The asymptotic Kolmogorov distribution assumes $F$ is continuous. With discrete data (counts, rounded prices), the test becomes conservative -- the p-value is overstated. Use permutation KS or a discrete-adjusted version.
  • Power is weakest in the tails. The KS statistic weights all parts of the CDF equally. Tests targeting the center of the distribution (like the Kuiper test) or the tails specifically will have more power if that is where the distributions differ.
  • Sensitive to location, less to shape. A shift in the mean will be caught easily. Differences in variance or higher moments that preserve the median region can slip through, especially at moderate sample sizes.
  • Small samples. The asymptotic approximation is unreliable for $m, n < 25$ or so. Use exact critical values or permutation tests.
  • Alternatives: The Anderson-Darling test weights the tails more heavily and has better power for tail differences. The Cramer-von Mises test uses an integrated squared difference. For multivariate data, the two-sample KS test does not generalize cleanly -- use energy distance tests or maximum mean discrepancy (MMD) instead.

Answer: Under $H_0$, the scaled statistic $\sqrt{mn/(m+n)} \cdot D_{m,n}$ converges to the Kolmogorov distribution. The p-value is approximated by

\sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^2\lambda^2}$ with $\lambda = \sqrt{mn/(m+n)} \cdot D_{m,n}$. Computing $D_{m,n}$ requires only a sort plus a linear sweep over merged data. Use with caution on discrete distributions, small samples, or when you care about tail behavior specifically.

Intuition

The two-sample KS test is one of the cleanest examples of a distribution-free test: the null distribution of the rescaled statistic does not depend on $F$ at all, as long as $F$ is continuous. This is because, under $H_0$, you can reduce the problem to the uniform distribution by the probability integral transform -- both samples, after transformation by the true CDF, become $\text{Uniform}(0,1)$, and the KS statistic only cares about ranks, not actual values. That rank-based invariance is what makes the asymptotic distribution universal.

In practice, the KS test has a well-known Achilles heel: it is most sensitive near the median and least sensitive in the tails. This matters a lot in finance, where you often care most about tail behavior -- whether two return distributions agree in the extremes, not at the center. For that use case, Anderson-Darling (which weights the tails more) or tail-specific tests are better choices. The other common trap is applying the continuous-$F$ asymptotic p-value to discrete data (tick counts, rounded prices), which makes the test overly conservative and causes you to miss real differences.

Open the full interactive solver →