Sample Size Derivation via CLT

You want to estimate the mean $\mu$ of a population by taking $n$ i.i.d. samples. Your goal is to choose $n$ so that the sample mean $\bar{X}$ is within $\epsilon$ of $\mu$ with probability at least $1 - \alpha$. That is, you need:

$P(|\bar{X} - \mu| \le \epsilon) \ge 1 - \alpha$

Derive the required sample size in each of the following settings:

(i) Known variance: The population variance $\sigma^2$ is known.

(ii) Unknown variance: The population variance is unknown and must be estimated from the data.

(iii) Autocorrelated data: The observations are serially correlated, so the naive variance of $\bar{X}$ is wrong. Use the long-run variance (Newey-West style) to get the correct sample size.

Hints

Start from the CLT: $\bar{X}$ is approximately normal. What does the confidence requirement $P(|\bar{X} - \mu| \le \epsilon) \ge 1 - \alpha$ translate to in terms of the standard normal quantile?
For parts (ii) and (iii), think about what changes in $\text{Var}(\bar{X})$. Unknown variance replaces $\sigma$ with $s$ and swaps the normal for a $t$-distribution. Autocorrelation replaces $\sigma^2$ with the long-run variance $\sigma_{LR}^2$.
For the autocorrelated case, write out $\text{Var}(\bar{X})$ by expanding the covariances: $\text{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i,j} \text{Cov}(X_i, X_j)$. This leads to the long-run variance formula and the design effect DEFF $= \sigma_{LR}^2 / \sigma^2$.

Worked Solution

How to Think About It: This is one of the most fundamental calculations in statistics, and it comes up constantly in practice -- sizing a backtest, deciding how much data you need for a signal study, or figuring out if your track record is long enough to be meaningful. The logic is always the same: the CLT tells you $\bar{X}$ is approximately normal, so you just need the standard deviation of $\bar{X}$ to be small enough relative to your tolerance $\epsilon$. The three parts progressively strip away simplifying assumptions.

Quick Estimate: Suppose you want 95% confidence ($z_{0.025} = 1.96$) and your data has standard deviation $\sigma = 10$, and you want $\epsilon = 1$. Then $n \ge (1.96 \times 10 / 1)^2 = 384.16$, so you need $n \ge 385$. If the data is positively autocorrelated with, say, a design effect (DEFF) of 3, you would need roughly $385 \times 3 = 1{,}155$ observations. That is why autocorrelation is such a killer for precision -- it can triple your data requirements.

Approach: Apply the CLT in each setting, invert the probability statement to solve for $n$.

Formal Solution:

(i) Known variance $\sigma^2$:

By the CLT, $\bar{X} \approx N(\mu, \sigma^2/n)$, so $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \approx N(0,1)$. The confidence requirement becomes:

$P\left(\left|\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right| \le \frac{\epsilon \sqrt{n}}{\sigma}\right) \ge 1 - \alpha$

This holds when $\frac{\epsilon \sqrt{n}}{\sigma} \ge z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal. Solving for $n$:

$\boxed{n \ge \left(\frac{z_{\alpha/2} \, \sigma}{\epsilon}\right)^2}$

This is the textbook formula. Notice $n$ scales as $\sigma^2 / \epsilon^2$ -- halving your error tolerance quadruples your sample size.
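As a minimal sketch of the boxed formula, the snippet below rounds the bound up to the next integer. The function name `required_n_known_sigma` is illustrative and not from the original problem; the normal quantile comes from Python's standard-library `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def required_n_known_sigma(sigma: float, eps: float, alpha: float) -> int:
    """Smallest integer n with n >= (z_{alpha/2} * sigma / eps)^2."""
    if sigma <= 0 or eps <= 0 or not 0 < alpha < 1:
        raise ValueError("need sigma > 0, eps > 0 and 0 < alpha < 1")
    # Upper alpha/2 quantile of the standard normal, e.g. about 1.96 for alpha = 0.05.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / eps) ** 2)


print(required_n_known_sigma(10, 1, 0.05))    # 385, matching the quick estimate
print(required_n_known_sigma(10, 0.5, 0.05))  # 1537: halving eps roughly quadruples n
```

Rounding up rather than to the nearest integer keeps the coverage requirement satisfied under the normal approximation.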

(ii) Unknown variance:

When $\sigma$ is unknown, you replace it with the sample standard deviation $s$ and use the $t$-distribution with $n - 1$ degrees of freedom:

$n \ge \left(\frac{t_{\alpha/2, \, n-1} \cdot s}{\epsilon}\right)^2$

This is an implicit equation because the $t$-quantile depends on $n$. In practice, you handle this in one of two ways:

Pilot study approach: Collect a small pilot sample of size $n_0$, compute $s$ from it, then use the formula with $z_{\alpha/2}$ instead of $t_{\alpha/2, n-1}$ (valid when $n$ is large enough that $t \approx z$).
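Iterative approach: Start with the normal-based guess $n^{(0)} = (z_{\alpha/2} \cdot s / \epsilon)^2$, compute $t_{\alpha/2, \, n^{(0)} - 1}$, update $n^{(1)} = (t_{\alpha/2, \, n^{(0)} - 1} \cdot s / \epsilon)^2$ (rounded up), and repeat until $n$ stops changing. This typically converges in 2-3 iterations.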

For large $n$ the distinction is small: at $n = 30$, $t_{0.025, \, 29} \approx 2.045$ versus $z_{0.025} = 1.96$, which inflates the required $n$ by about 9%, and the gap shrinks as $n$ grows.
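The iterative approach can be sketched in Python as follows. To stay within the standard library, the $t$-quantile is computed by a hand-rolled helper (Simpson's rule for the CDF plus bisection); the helpers `t_pdf`, `t_cdf`, `t_quantile` and the function `required_n_unknown_sigma` are illustrative names, not part of the original problem. In practice a library routine such as `scipy.stats.t.ppf` would be the usual choice. The value of $s$ is assumed to come from a pilot sample.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int, steps: int = 1000) -> float:
    """CDF for x >= 0: 0.5 plus Simpson's-rule integral of the pdf on [0, x]."""
    if x == 0:
        return 0.5
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p: float, df: int) -> float:
    """Quantile for 0.5 < p < 1, found by bracketing and bisection."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # grow the bracket until it contains the quantile
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def required_n_unknown_sigma(s: float, eps: float, alpha: float,
                             max_iter: int = 50) -> int:
    """Fixed-point iteration on n >= (t_{alpha/2, n-1} * s / eps)^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = max(2, math.ceil((z * s / eps) ** 2))  # normal-based starting guess
    for _ in range(max_iter):
        t = t_quantile(1 - alpha / 2, n - 1)
        n_new = max(2, math.ceil((t * s / eps) ** 2))
        if n_new == n:  # converged: n is consistent with its own t-quantile
            break
        n = n_new
    return n


print(required_n_unknown_sigma(10, 1, 0.05))  # 387, versus 385 from the z-based formula
```

With $s = 10$ and $\epsilon = 1$, the correction adds only two observations; it matters more when the required $n$ is small.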

(iii) Autocorrelated data:

When observations are serially correlated, the variance of $\bar{X}$ is no longer $\sigma^2 / n$. Instead:

$\text{Var}(\bar{X}) = \frac{1}{n} \left(\gamma_0 + 2 \sum_{k=1}^{n-1} \left(1 - \frac{k}{n}\right) \gamma_k \right) \approx \frac{\sigma_{LR}^2}{n}$

where $\gamma_k = \text{Cov}(X_t, X_{t+k})$ and the long-run variance is:

$\sigma_{LR}^2 = \gamma_0 + 2 \sum_{k=1}^{\infty} \gamma_k = \sigma^2 \left(1 + 2 \sum_{k=1}^{\infty} \rho_k \right)$
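As a sketch of where the lag form of $\text{Var}(\bar{X})$ comes from, assume the series is covariance-stationary, so that $\text{Cov}(X_i, X_j) = \gamma_{|i-j|}$ depends only on the lag. The double sum then contains $n$ terms with $i = j$ and, for each lag $k \ge 1$, $n - k$ pairs in each direction:

$\text{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \gamma_{|i-j|} = \frac{1}{n^2} \left( n \gamma_0 + 2 \sum_{k=1}^{n-1} (n - k) \, \gamma_k \right) = \frac{1}{n} \left( \gamma_0 + 2 \sum_{k=1}^{n-1} \left(1 - \frac{k}{n}\right) \gamma_k \right)$

If the autocovariances are absolutely summable, the weights $1 - k/n$ tend to 1 for each fixed $k$ as $n$ grows, which gives the approximation $\text{Var}(\bar{X}) \approx \sigma_{LR}^2 / n$.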

The ratio $\text{DEFF} = \sigma_{LR}^2 / \sigma^2$ is called the design effect. The required sample size is:

$\boxed{n \ge \left(\frac{z_{\alpha/2} \, \sigma_{LR}}{\epsilon}\right)^2 = \text{DEFF} \cdot \left(\frac{z_{\alpha/2} \, \sigma}{\epsilon}\right)^2}$

Equivalently, $n$ autocorrelated observations carry about as much information about $\mu$ as $n_{\text{eff}} = n / \text{DEFF}$ independent ones, so the requirement is that $n_{\text{eff}}$ reach the i.i.d. sample size from part (i). In practice, $\sigma_{LR}^2$ is estimated using the Newey-West HAC estimator with a kernel bandwidth $M$ (a common rule of thumb is $M \approx n^{1/3}$):

$\hat{\sigma}_{LR}^2 = \hat{\gamma}_0 + 2 \sum_{k=1}^{M} w(k, M) \, \hat{\gamma}_k$

where $w(k, M) = 1 - k/(M+1)$ is the Bartlett kernel weight.

For a stationary AR(1) process with autocorrelation $\rho$, the DEFF simplifies to $(1+\rho)/(1-\rho)$. For $\rho = 0.5$, DEFF $= 3$, so you need 3x as many observations as the i.i.d. formula suggests.
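The AR(1) result follows from $\rho_k = \rho^k$: the geometric sum gives $1 + 2\rho/(1-\rho) = (1+\rho)/(1-\rho)$. The sketch below pairs that closed form with a plain-Python version of the Bartlett-weighted estimator above. All function names are illustrative, and the default bandwidth is just the $n^{1/3}$ rule of thumb from the text, not a tuned choice.

```python
import math
from statistics import NormalDist


def autocovariance(x, k):
    """Sample autocovariance at lag k, dividing by n (not n - k)."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n


def newey_west_lrv(x, bandwidth=None):
    """Long-run variance estimate with Bartlett weights w(k, M) = 1 - k/(M+1)."""
    n = len(x)
    # Assumed default: the n^(1/3) rule of thumb, capped at n - 1 lags.
    m = round(n ** (1 / 3)) if bandwidth is None else bandwidth
    m = min(m, n - 1)
    lrv = autocovariance(x, 0)
    for k in range(1, m + 1):
        lrv += 2 * (1 - k / (m + 1)) * autocovariance(x, k)
    return lrv


def ar1_deff(rho):
    """Design effect of a stationary AR(1): (1 + rho) / (1 - rho)."""
    if not -1 < rho < 1:
        raise ValueError("stationarity requires |rho| < 1")
    return (1 + rho) / (1 - rho)


def required_n_autocorrelated(sigma, eps, alpha, deff):
    """i.i.d. sample size inflated by the design effect, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(deff * (z * sigma / eps) ** 2)


print(ar1_deff(0.5))                                         # 3.0
print(required_n_autocorrelated(10, 1, 0.05, ar1_deff(0.5)))  # 1153
```

On real data, an estimated design effect would be `newey_west_lrv(x) / autocovariance(x, 0)`. Because the Bartlett kernel downweights the higher lags, this estimate tends to sit somewhat below the true DEFF when the autocorrelation is positive and the bandwidth is short.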

Answer: The required sample sizes are: (i) $n \ge (z_{\alpha/2} \sigma / \epsilon)^2$ for known variance; (ii) the same formula with $s$ replacing $\sigma$ and iterating on the $t$-quantile for unknown variance; (iii) $n \ge \text{DEFF} \cdot (z_{\alpha/2} \sigma / \epsilon)^2$ for autocorrelated data, where $\text{DEFF} = \sigma_{LR}^2 / \sigma^2$ captures the inflation due to serial correlation.

Intuition

The core insight is that sample size formulas are just inversions of the CLT. You want $\bar{X}$ to land within $\epsilon$ of $\mu$ with high probability, and the CLT tells you the spread of $\bar{X}$ shrinks like $1/\sqrt{n}$. Invert that relationship and you get $n \propto \sigma^2 / \epsilon^2$. Everything else is about getting the right $\sigma^2$.

This matters enormously in practice. A common mistake in quantitative finance is using the i.i.d. formula when the data is autocorrelated. Strategy PnL time series can have substantial serial correlation from overlapping holding periods or slow mean-reversion signals, and volatility clustering (GARCH effects) makes squared returns persistently autocorrelated, which inflates the long-run variance whenever the quantity being averaged is a volatility-type statistic. If you ignore this, you underestimate how much data you need by a factor of DEFF, leading to overconfident conclusions from short backtests. The design effect DEFF is the correction factor -- it tells you how many times more data you need compared to the i.i.d. world. Always check for autocorrelation before quoting a sample size.
