Size-Biased Sampling and the Harmonic Mean Correction

Statistics · Medium · Free problem

You want to estimate the average number of residents per building in a city. Your plan: sample 100 people on the street and ask each one how many people live in their building. Call these responses $X_1, X_2, \ldots, X_{100}$.

  1. Explain why the naive sample average $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is a biased estimator of the true average building size. Which direction does the bias go?

  2. Derive the correct estimator. Show that the harmonic mean of the sample, $\hat{\mu} = \frac{n}{\sum_{i=1}^n 1/X_i}$, is a consistent estimator of the true population average building size.

  3. Give an intuitive explanation for why the harmonic mean works here.

Hints

  1. Think about who you are actually sampling. Are you drawing buildings at random, or people at random? How does building size affect who shows up in your sample?
  2. Under size-biased sampling with distribution $g(s) \propto s \cdot f(s)$, compute $E_g[1/X]$. The cancellation is the key to the correction.
  3. Each person represents $1/X_i$ of a building. Sum these fractions to get the effective number of buildings in your sample, then compute people per building.

Worked Solution

How to Think About It: The trap in this problem is that you are not sampling buildings -- you are sampling people. A building with 500 residents sends 500 people onto the street, while a building with 2 residents sends 2. So your sample massively overrepresents large buildings. This is called size-biased sampling (or length-biased sampling), and it shows up everywhere in practice: surveying bus passengers overestimates average bus occupancy, surveying students overestimates average class size, and so on. The fix is always the same: you need to "undo" the size bias by weighting each observation inversely by its size.

Quick Estimate: Suppose the city has just three types of buildings: 100 buildings with 5 residents, 50 buildings with 20 residents, and 10 buildings with 100 residents. The true average building size is $\mu = \frac{100 \cdot 5 + 50 \cdot 20 + 10 \cdot 100}{160} = \frac{500 + 1000 + 1000}{160} = \frac{2500}{160} = 15.625$. But the total number of people is 2500, and of those, 500 live in small buildings, 1000 in medium, and 1000 in large. So if you sample a random person, the probability they report $X = 5$ is $500/2500 = 0.2$, $X = 20$ is $1000/2500 = 0.4$, and $X = 100$ is $1000/2500 = 0.4$. The naive sample average converges to $E[X_{\text{biased}}] = 0.2 \cdot 5 + 0.4 \cdot 20 + 0.4 \cdot 100 = 1 + 8 + 40 = 49$. That is more than three times the true average of 15.6. The bias is severe and always upward.

Now check the harmonic mean estimator. In large samples, $\sum 1/X_i \approx n \cdot E[1/X_{\text{biased}}]$, and $E\left[\frac{1}{X_{\text{biased}}}\right] = 0.2 \cdot \frac{1}{5} + 0.4 \cdot \frac{1}{20} + 0.4 \cdot \frac{1}{100} = 0.04 + 0.02 + 0.004 = 0.064$. So the harmonic mean gives $1/0.064 = 15.625$, exactly the true average.
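The arithmetic above can be checked with exact fractions. This is a minimal sketch; the dictionary `buildings` and the other variable names are illustrative choices, not part of the original problem.

```python
from fractions import Fraction

# Toy city from the Quick Estimate: {building size: number of buildings}.
buildings = {5: 100, 20: 50, 100: 10}

n_buildings = sum(buildings.values())
n_people = sum(size * count for size, count in buildings.items())

# True mean building size: total people divided by total buildings.
true_mean = Fraction(n_people, n_buildings)

# Probability that a randomly chosen person reports each building size.
person_prob = {size: Fraction(size * count, n_people)
               for size, count in buildings.items()}

# Large-sample limit of the naive average of reported sizes.
naive_limit = sum(size * p for size, p in person_prob.items())

# Large-sample limit of the harmonic mean: 1 / E[1/X] under person sampling.
harmonic_limit = 1 / sum(p / size for size, p in person_prob.items())

print(float(true_mean), float(naive_limit), float(harmonic_limit))
# prints: 15.625 49.0 15.625
```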

Approach: Formalize the size-biased distribution, show the bias in the sample mean, and derive the harmonic mean correction.

Formal Solution:

Part 1 -- Why the sample average is biased:

Let $f(s)$ be the fraction of buildings with $s$ residents. When you sample a person at random on the street, the probability they come from a building of size $s$ is proportional to $s \cdot f(s)$, because larger buildings contribute more people. The size-biased sampling distribution is: $g(s) = \frac{s \cdot f(s)}{\mu}, \quad \text{where } \mu = \sum_s s \cdot f(s) = E[S]$ is the true mean building size. Under this biased distribution, the expected value of a sampled $X_i$ is: $E_g[X] = \sum_s s \cdot g(s) = \frac{1}{\mu}\sum_s s^2 \cdot f(s) = \frac{E[S^2]}{E[S]}$ By Jensen's inequality (since $s^2$ is strictly convex), $E[S^2] > (E[S])^2$ whenever $S$ is not constant, so: $E_g[X] = \frac{E[S^2]}{E[S]} > E[S] = \mu$ The bias is always upward. The amount of overestimation is $\text{Var}(S)/E[S]$, since $E[S^2]/E[S] = E[S] + \text{Var}(S)/E[S]$.
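As a sanity check on Part 1, the sketch below builds the size-biased distribution $g(s) = s \cdot f(s)/\mu$ for the toy city from the Quick Estimate and confirms that the overestimate equals $\text{Var}(S)/E[S]$. The distribution `f` and the variable names are assumptions made for illustration.

```python
from fractions import Fraction

# f(s): fraction of buildings with s residents (toy city, 160 buildings).
f = {5: Fraction(100, 160), 20: Fraction(50, 160), 100: Fraction(10, 160)}

mu = sum(s * p for s, p in f.items())                 # E[S]
second_moment = sum(s**2 * p for s, p in f.items())   # E[S^2]
variance = second_moment - mu**2                      # Var(S)

# Size-biased distribution seen when sampling people: g(s) = s f(s) / mu.
g = {s: s * p / mu for s, p in f.items()}

# Expected reported size under g, which is what the naive average estimates.
biased_mean = sum(s * p for s, p in g.items())

# Overestimate of the naive average relative to the true mean.
bias = biased_mean - mu

print(float(mu), float(biased_mean), float(bias), float(variance / mu))
# prints: 15.625 49.0 33.375 33.375
```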

Part 2 -- The harmonic mean correction:

Under the size-biased distribution $g(s) = s \cdot f(s) / \mu$, consider the random variable $1/X$: $E_g\left[\frac{1}{X}\right] = \sum_s \frac{1}{s} \cdot g(s) = \sum_s \frac{1}{s} \cdot \frac{s \cdot f(s)}{\mu} = \frac{1}{\mu}\sum_s f(s) = \frac{1}{\mu}$ Therefore: $\mu = \frac{1}{E_g[1/X]}$ By the law of large numbers, $\frac{1}{n}\sum_{i=1}^n 1/X_i \xrightarrow{p} E_g[1/X] = 1/\mu$, so: $\hat{\mu} = \frac{n}{\sum_{i=1}^n 1/X_i} = \frac{1}{\frac{1}{n}\sum 1/X_i} \xrightarrow{p} \frac{1}{1/\mu} = \mu$ The harmonic mean is a consistent estimator of the true average building size.
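The consistency claim can be illustrated with a small simulation. This is a sketch under the assumption that every resident of the toy city from the Quick Estimate is equally likely to be stopped on the street; the helper names `sample_reported_sizes` and `harmonic_mean` are hypothetical.

```python
import random

def sample_reported_sizes(buildings, n, rng):
    """Draw n people at random; each reports the size of their building.

    A person from a size-s building is picked with probability
    proportional to s * (number of buildings of size s).
    """
    sizes = list(buildings)
    weights = [size * buildings[size] for size in sizes]
    return rng.choices(sizes, weights=weights, k=n)

def harmonic_mean(xs):
    """n divided by the sum of reciprocals."""
    return len(xs) / sum(1 / x for x in xs)

# Toy city: {building size: number of buildings}; true mean is 15.625.
buildings = {5: 100, 20: 50, 100: 10}

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (100, 10_000, 100_000):
    xs = sample_reported_sizes(buildings, n, rng)
    naive = sum(xs) / n
    print(f"n={n:>7}  naive={naive:8.3f}  harmonic={harmonic_mean(xs):8.3f}")
```

As $n$ grows, the naive column should settle near 49 and the harmonic column near 15.625. With only 100 respondents, as in the problem statement, the harmonic mean can still be noticeably off.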

Part 3 -- Intuitive explanation:

Each sampled person tells you their building size $X_i$. That person represents $1/X_i$ of a building (one building shared among $X_i$ residents). Summing $1/X_i$ over all $n$ people gives you the effective number of distinct buildings represented in the sample. Dividing the number of people $n$ by the number of buildings $\sum 1/X_i$ gives the average building size, which is exactly the harmonic mean.
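The "fraction of a building" picture becomes exact if you imagine surveying every resident rather than a random sample. The sketch below does this for the toy city from the Quick Estimate; the census setup is a hypothetical illustration, not part of the original problem.

```python
from fractions import Fraction

# One entry per building in the toy city.
building_sizes = [5] * 100 + [20] * 50 + [100] * 10

# Census: every resident reports the size of their own building.
responses = [size for size in building_sizes for _ in range(size)]

# Each response counts as 1/X_i of a building; the fractions add up to
# the number of buildings.
effective_buildings = sum(Fraction(1, x) for x in responses)

# People per building.
people_per_building = len(responses) / effective_buildings

print(len(responses), effective_buildings, float(people_per_building))
# prints: 2500 160 15.625
```

In a random sample the same sum only estimates the number of buildings represented, which is why the harmonic mean is consistent rather than exact.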

Answer: The naive sample average is biased upward by a factor of $E[S^2]/(E[S])^2$ because larger buildings contribute more people to the sample (size-biased sampling). The correct estimator is the harmonic mean $\hat{\mu} = n / \sum_{i=1}^n (1/X_i)$, which inverts the size bias by weighting each observation by $1/X_i$. This converges to the true population average $\mu = E[S]$.

Intuition

Size-biased sampling is one of the most common traps in applied statistics, and it shows up constantly in quant work. Whenever your sampling probability is proportional to the thing you are trying to measure, your naive average will be biased upward. Surveying passengers about bus fullness, asking students about class size, or weighting portfolio returns by market cap all have this structure. The bias is not small -- it equals $\text{Var}(S)/E[S]$, so the more heterogeneous the population, the worse the overestimate.

The harmonic mean fix is a special case of importance weighting: you are sampling from distribution $g$ but want the mean under distribution $f$, so you reweight by $f(s)/g(s) = \mu/s$, which is proportional to $1/s$. This same inverse-weighting idea is the foundation of importance sampling in Monte Carlo, inverse probability weighting in causal inference, and the Horvitz-Thompson estimator in survey statistics. Recognizing when your data comes from a biased sampling mechanism, and knowing how to correct for it, is one of the most practically valuable skills in quantitative work.
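To make the importance-weighting connection concrete, here is a minimal sketch of a self-normalized weighted average. Because $\mu$ is unknown, the weights are taken as $1/X_i$ and normalized by their sum, which reproduces the harmonic mean. The function name `self_normalized_mean` and the five-person sample are assumptions chosen for illustration.

```python
import math

def self_normalized_mean(values, weights):
    """Weighted average sum(w * v) / sum(w); weights need not sum to 1."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical street sample of reported building sizes.
reported = [5, 20, 100, 100, 20]

# Importance weights f(s)/g(s) = mu/s. The unknown mu cancels in the
# ratio, so weights of 1/s are enough.
weights = [1 / x for x in reported]

mean_size = self_normalized_mean(reported, weights)
harmonic = len(reported) / sum(1 / x for x in reported)

# The same weights estimate other building-level quantities, for example
# the share of buildings with at least 50 residents.
is_large = [1 if x >= 50 else 0 for x in reported]
share_large = self_normalized_mean(is_large, weights)

print(math.isclose(mean_size, harmonic), mean_size, share_large)
```

Normalizing by the sum of the weights is a design choice that avoids needing $\mu$ up front. The price is that the estimator is consistent but not exactly unbiased in finite samples.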
