Two-Sample Kolmogorov-Smirnov Test
You have two independent samples: $\{X_i\}_{i=1}^{m}$ drawn from distribution $F$ and $\{Y_j\}_{j=1}^{n}$ drawn from distribution $G$. You want to test whether $F = G$ -- no parametric assumptions, no moment conditions, just the raw empirical data.
Define the empirical CDFs:
$F_m(x) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[X_i \leq x], \quad G_n(x) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}[Y_j \leq x]$
The two-sample Kolmogorov-Smirnov statistic is the maximum gap between these two step functions:
$D_{m,n} = \sup_{x \in \mathbb{R}} |F_m(x) - G_n(x)|$
Address the following:
- State the asymptotic null distribution of $D_{m,n}$ (under $H_0: F = G$) and give the p-value approximation formula.
- Outline an $O((m+n)\log(m+n))$ algorithm to compute $D_{m,n}$.
- Discuss the practical limitations of the KS test -- when does it work well, when does it break down, and what should you use instead?
Hints
- Under $H_0$, the two empirical CDFs track the same underlying $F$. Think about how fast the max gap $D_{m,n}$ shrinks as the sample sizes grow, and what rescaling would give you a non-degenerate limit.
- The Kolmogorov distribution is the limiting distribution of $\sup_{t}|B(t)|$ where $B$ is a Brownian bridge -- the same object that appears in the one-sample KS test, just with effective sample size $n_{\text{eff}} = mn/(m+n)$ replacing $n$.
- For the algorithm: the supremum of $|F_m(x) - G_n(x)|$ is only attained at one of the $m+n$ data points. Merge-sort the combined sample and sweep through, tracking $c_X/m - c_Y/n$ at each step.
Worked Solution
How to Think About It: The KS test is the nonparametric go-to for checking whether two samples come from the same distribution. Before any formulas, think about what $D_{m,n}$ actually measures: it is the worst-case discrepancy between two step functions. If both samples come from the same distribution, those two step functions should track each other closely -- the max gap should shrink as $m, n \to \infty$. The question is how fast and with what distribution.
The key insight is that under $H_0$, the KS statistic (after appropriate scaling) converges to a distribution that depends only on the sample sizes, not on the underlying $F$ -- that is the miracle of distribution-free tests. The algorithm part is just a merge-sort observation: the supremum of $|F_m - G_n|$ is only achieved at data points, so you only need to evaluate the gap at $m + n$ locations.
Part (a): Asymptotic Null Distribution and P-Value
Under $H_0: F = G$, both samples are effectively drawn from the same distribution. The scaled statistic converges in distribution:
$\sqrt{\frac{mn}{m+n}} \, D_{m,n} \xrightarrow{d} K \quad \text{as } m, n \to \infty$
where $K$ follows the Kolmogorov distribution, with CDF:
$P(K \leq t) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 t^2}$
The p-value approximation uses the effective sample size $n_{\text{eff}} = mn/(m+n)$ (the harmonic-mean analog):
$p\text{-value} \approx 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 \lambda^2}, \quad \lambda = \sqrt{n_{\text{eff}}} \cdot D_{m,n}$
In practice the series converges very quickly -- even one or two terms gives good accuracy for $\lambda > 0.5$. The common single-term approximation is:
$p\text{-value} \approx 2 e^{-2\lambda^2}$
This is exact only in the large-sample limit and for continuous $F$. For small samples, use tabulated critical values or permutation tests.
Part (b): $O((m+n)\log(m+n))$ Algorithm
The supremum $\sup_x |F_m(x) - G_n(x)|$ is only achieved at a jump point of $F_m$ or $G_n$ -- that is, at one of the $m+n$ observed data values. Between data points, both empirical CDFs are flat, so the gap is constant.
Algorithm:
1. Merge and sort all $m+n$ observations, tagging each with its source ($X$ or $Y$). Cost: $O((m+n)\log(m+n))$. 2. Sweep left to right. Maintain running counts $c_X$ (number of $X$ values seen) and $c_Y$ (number of $Y$ values seen). At each data point, update the appropriate counter and compute: $\text{gap} = \left|\frac{c_X}{m} - \frac{c_Y}{n}\right|$ 3. Track the maximum gap across all $m+n$ evaluation points. 4. Return $D_{m,n} = \max \text{ gap}$.
The sort dominates at $O((m+n)\log(m+n))$; the sweep is $O(m+n)$. No integration, no grid search needed.
Ties: If $X_i = Y_j$ for some $i, j$, evaluate the gap both before and after the tied value (left and right limits) and take the max. In practice: when sorting, process all $X$ values at a tie before all $Y$ values, then all $Y$ before all $X$, take the max of both orderings.
Part (c): Practical Limitations
- Continuous distributions only. The asymptotic Kolmogorov distribution assumes $F$ is continuous. With discrete data (counts, rounded prices), the test becomes conservative -- the p-value is overstated. Use permutation KS or a discrete-adjusted version.
- Power is weakest in the tails. The KS statistic weights all parts of the CDF equally. Tests targeting the center of the distribution (like the Kuiper test) or the tails specifically will have more power if that is where the distributions differ.
- Sensitive to location, less to shape. A shift in the mean will be caught easily. Differences in variance or higher moments that preserve the median region can slip through, especially at moderate sample sizes.
- Small samples. The asymptotic approximation is unreliable for $m, n < 25$ or so. Use exact critical values or permutation tests.
- Alternatives: The Anderson-Darling test weights the tails more heavily and has better power for tail differences. The Cramer-von Mises test uses an integrated squared difference. For multivariate data, the two-sample KS test does not generalize cleanly -- use energy distance tests or maximum mean discrepancy (MMD) instead.
Answer: Under $H_0$, the scaled statistic $\sqrt{mn/(m+n)} \cdot D_{m,n}$ converges to the Kolmogorov distribution. The p-value is approximated by