Dirichlet-Multinomial Posterior and Prediction

Statistics · Medium · Free problem

You have $K$ categories with unknown probabilities $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$, where $\boldsymbol{\theta}$ has a symmetric Dirichlet prior $\text{Dir}(\alpha_1, \ldots, \alpha_K)$ with all $\alpha_i = 1$ (i.e., a uniform prior over the simplex).

You observe $N$ draws from the multinomial, yielding counts $n_1, \ldots, n_K$ (so $\sum_i n_i = N$).

(i) What is the posterior distribution of $\boldsymbol{\theta}$ given the observed counts?

(ii) Derive the posterior predictive probability that the next $m$ independent draws produce counts $c_1, \ldots, c_K$ (with $\sum_i c_i = m$).

(iii) For a single next draw ($m = 1$), compute the predictive mean $E[X_i]$ and the predictive covariance $\text{Cov}(X_i, X_j)$, where $X_i$ is the indicator that the draw lands in category $i$.

Hints

The Dirichlet distribution is conjugate to the multinomial -- so the posterior has the same distributional form as the prior, just with updated parameters.
To get the posterior predictive, integrate the multinomial likelihood against the Dirichlet posterior. The key identity is $\int \prod_i \theta_i^{a_i - 1} \, d\boldsymbol{\theta} = B(\mathbf{a})$, where the integral runs over the probability simplex and $B$ is the multivariate Beta function.
For a single future draw ($m = 1$), the predictive probability of category $i$ is just the posterior mean $E[\theta_i \mid \mathbf{n}]$. Use the law of total covariance to get $\text{Cov}(X_i, X_j)$.

Worked Solution

How to Think About It: This is the multivariate generalization of the Beta-Binomial story you already know. With $K = 2$ categories, Dirichlet reduces to Beta and multinomial reduces to binomial -- so everything here is the multi-category version of "coin with unknown bias." The Dirichlet is conjugate to the multinomial, so the posterior is another Dirichlet with updated parameters. The posterior predictive integrates out the unknown $\boldsymbol{\theta}$, which produces a Dirichlet-Multinomial distribution -- the multivariate analog of Beta-Binomial. If you can do the Beta-Binomial case, you can do this one by pattern-matching.

Quick Estimate: With a uniform prior ($\alpha_i = 1$) and observed counts, the posterior mean for category $i$ is $(n_i + 1)/(N + K)$, which is just add-one (Laplace) smoothing. For a single next draw, the predictive probability of category $i$ is this posterior mean. So if $K = 4$ and you observed counts $(10, 5, 3, 2)$ with $N = 20$, the predictive probabilities are $(11/24, 6/24, 4/24, 3/24) \approx (0.458, 0.250, 0.167, 0.125)$. The covariance between the indicators for categories $i$ and $j$ should be negative (they compete for probability mass). It does not vanish as $N$ grows: it settles at the ordinary categorical value $-\theta_i \theta_j$. What shrinks toward zero, at rate roughly $1/N$, is the part due to uncertainty about $\boldsymbol{\theta}$, namely $\text{Cov}(\theta_i, \theta_j)$.
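The add-one estimate is easy to reproduce. Here is a minimal sketch using exact fractions; the function name `laplace_predictive` is an illustrative choice, not something from the problem statement.

```python
from fractions import Fraction

def laplace_predictive(counts):
    """Add-one predictive probabilities under a uniform Dirichlet prior."""
    total = sum(counts)   # N
    k = len(counts)       # K
    return [Fraction(n + 1, total + k) for n in counts]

probs = laplace_predictive([10, 5, 3, 2])
print([str(p) for p in probs])              # fractions in lowest terms
print([round(float(p), 3) for p in probs])  # decimal approximations
```

Note that `Fraction` reduces to lowest terms, so $6/24$ is shown as $1/4$.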

Formal Derivation:

Part (i): Posterior of $\boldsymbol{\theta}$

The likelihood of observing counts $(n_1, \ldots, n_K)$ from a $\text{Multinomial}(N, \boldsymbol{\theta})$ is:

$L(\boldsymbol{\theta}) \propto \prod_{i=1}^{K} \theta_i^{n_i}$

The Dirichlet prior with $\alpha_i = 1$ is uniform on the simplex:

$p(\boldsymbol{\theta}) \propto \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} = 1$

Multiplying likelihood by prior:

$p(\boldsymbol{\theta} \mid \mathbf{n}) \propto \prod_{i=1}^{K} \theta_i^{n_i}$

Since $\theta_i^{n_i} = \theta_i^{(n_i + 1) - 1}$, this is the kernel of a Dirichlet distribution with parameters $n_i + 1$, so:

$\boldsymbol{\theta} \mid \mathbf{n} \sim \text{Dir}(n_1 + 1, \, n_2 + 1, \, \ldots, \, n_K + 1)$

More generally, for any Dirichlet prior $\text{Dir}(\alpha_1, \ldots, \alpha_K)$, the posterior is $\text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$. This is the standard Dirichlet-Multinomial conjugacy result.
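The conjugate update is nothing more than element-wise addition. The sketch below (helper names are hypothetical) also illustrates a handy consequence: updating on the data in two batches gives the same posterior as updating on all of it at once.

```python
def dirichlet_posterior(alpha, counts):
    """Return posterior Dirichlet parameters: prior parameters plus observed counts."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [1, 1, 1, 1]                       # uniform prior, K = 4
one_shot = dirichlet_posterior(prior, [10, 5, 3, 2])

# Same data split into two batches that add up to (10, 5, 3, 2).
step = dirichlet_posterior(prior, [4, 5, 0, 1])
two_shot = dirichlet_posterior(step, [6, 0, 3, 1])

print(one_shot, two_shot)
print(dirichlet_mean(one_shot))
```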

Part (ii): Posterior Predictive for $m$ Draws

We need to integrate out $\boldsymbol{\theta}$:

$P(\mathbf{c} \mid \mathbf{n}) = \int P(\mathbf{c} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathbf{n}) \, d\boldsymbol{\theta}$

The multinomial likelihood for the future counts is:

$P(\mathbf{c} \mid \boldsymbol{\theta}) = \binom{m}{c_1, \ldots, c_K} \prod_{i=1}^{K} \theta_i^{c_i}$

Writing the posterior in normalized form and combining the $\theta_i$ exponents:

$P(\mathbf{c} \mid \mathbf{n}) = \binom{m}{c_1, \ldots, c_K} \cdot \frac{1}{B(\tilde{\boldsymbol{\alpha}})} \int \prod_{i=1}^{K} \theta_i^{\tilde{\alpha}_i + c_i - 1} \, d\boldsymbol{\theta}$

where $\tilde{\alpha}_i = n_i + 1$ are the posterior parameters. The integral is just the multivariate Beta function $B(\tilde{\boldsymbol{\alpha}} + \mathbf{c})$, giving:

$P(\mathbf{c} \mid \mathbf{n}) = \binom{m}{c_1, \ldots, c_K} \frac{B(\tilde{\boldsymbol{\alpha}} + \mathbf{c})}{B(\tilde{\boldsymbol{\alpha}})}$

where $B(\mathbf{a}) = \frac{\prod_{i=1}^{K} \Gamma(a_i)}{\Gamma(\sum_i a_i)}$. Expanding with $\tilde{\alpha}_i = n_i + 1$ and $\tilde{\alpha}_0 = \sum_i \tilde{\alpha}_i = N + K$:

$P(\mathbf{c} \mid \mathbf{n}) = \frac{m!}{\prod_i c_i!} \cdot \frac{\Gamma(N+K)}{\Gamma(N+K+m)} \cdot \prod_{i=1}^{K} \frac{\Gamma(n_i + 1 + c_i)}{\Gamma(n_i + 1)}$

Since $\Gamma(n+1) = n!$, this simplifies to:

$P(\mathbf{c} \mid \mathbf{n}) = \frac{m!}{\prod_i c_i!} \cdot \frac{(N+K-1)!}{(N+K+m-1)!} \cdot \prod_{i=1}^{K} \frac{(n_i + c_i)!}{n_i!}$

This is the Dirichlet-Multinomial (or Polya) distribution -- the multivariate generalization of Beta-Binomial.
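To sanity-check the formula numerically, here is a small sketch that evaluates the pmf through log-Gamma functions (which avoids overflow for large counts) and confirms that it sums to 1 over all count vectors. The function names and the example posterior $(11, 6, 4, 3)$ are illustrative assumptions.

```python
from math import exp, lgamma

def log_multivariate_beta(a):
    """log B(a) = sum_i log Gamma(a_i) - log Gamma(sum_i a_i)."""
    return sum(lgamma(x) for x in a) - lgamma(sum(a))

def dirichlet_multinomial_pmf(c, alpha_post):
    """P(c | n) for future counts c, given posterior Dirichlet parameters."""
    m = sum(c)
    log_coef = lgamma(m + 1) - sum(lgamma(ci + 1) for ci in c)  # multinomial coefficient
    shifted = [a + ci for a, ci in zip(alpha_post, c)]
    return exp(log_coef + log_multivariate_beta(shifted) - log_multivariate_beta(alpha_post))

def compositions(m, k):
    """Yield every length-k tuple of non-negative integers summing to m."""
    if k == 1:
        yield (m,)
        return
    for first in range(m + 1):
        for rest in compositions(m - first, k - 1):
            yield (first,) + rest

alpha_post = [11, 6, 4, 3]  # n + 1 for counts (10, 5, 3, 2)
total = sum(dirichlet_multinomial_pmf(c, alpha_post) for c in compositions(3, 4))
print(total)                                             # should be 1 up to rounding
print(dirichlet_multinomial_pmf((1, 0, 0, 0), alpha_post))  # m = 1 case, equals 11/24
```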

Part (iii): Predictive Mean and Covariance for a Single Draw

For $m = 1$, let $X_i \in \{0, 1\}$ indicate the draw lands in category $i$. Define $\hat{p}_i = \tilde{\alpha}_i / \tilde{\alpha}_0 = (n_i + 1)/(N + K)$.

*Mean:* By the tower property, $E[X_i] = E[E[X_i \mid \boldsymbol{\theta}]] = E[\theta_i] = \hat{p}_i$.

$E[X_i] = \frac{n_i + 1}{N + K}$

This is Laplace smoothing -- each category gets one pseudocount added.

*Covariance ($i \neq j$):* Apply the law of total covariance:

$\text{Cov}(X_i, X_j) = E[\text{Cov}(X_i, X_j \mid \boldsymbol{\theta})] + \text{Cov}(E[X_i \mid \boldsymbol{\theta}], E[X_j \mid \boldsymbol{\theta}])$

Conditional on $\boldsymbol{\theta}$, a single draw is categorical, so $\text{Cov}(X_i, X_j \mid \boldsymbol{\theta}) = -\theta_i \theta_j$ and $E[X_i \mid \boldsymbol{\theta}] = \theta_i$. Therefore:

$\text{Cov}(X_i, X_j) = -E[\theta_i \theta_j] + \text{Cov}(\theta_i, \theta_j)$

From the standard Dirichlet moment formulas (for $i \neq j$):

$E[\theta_i \theta_j] = \frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0(\tilde{\alpha}_0 + 1)}, \qquad \text{Cov}(\theta_i, \theta_j) = -\frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0^{2}(\tilde{\alpha}_0 + 1)}$

Substituting:

$\text{Cov}(X_i, X_j) = -\frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0(\tilde{\alpha}_0 + 1)} - \frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0^{2}(\tilde{\alpha}_0 + 1)}$

$= -\frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0(\tilde{\alpha}_0 + 1)} \left(1 + \frac{1}{\tilde{\alpha}_0}\right) = -\frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0^{2}} = -\hat{p}_i \, \hat{p}_j$

*Variance:* Similarly, $\text{Var}(X_i) = E[\theta_i(1-\theta_i)] + \text{Var}(\theta_i)$. Using Dirichlet moments this gives $\text{Var}(X_i) = \hat{p}_i(1 - \hat{p}_i)$.
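Spelling out the variance step as a short sketch: expanding both terms shows that the $E[\theta_i^2]$ pieces cancel, so only the posterior mean survives.

$\text{Var}(X_i) = \left(E[\theta_i] - E[\theta_i^2]\right) + \left(E[\theta_i^2] - (E[\theta_i])^2\right) = \hat{p}_i - \hat{p}_i^{2} = \hat{p}_i(1 - \hat{p}_i)$

For the running example with counts $(10, 5, 3, 2)$, category 1 has $\hat{p}_1 = 11/24$, so $\text{Var}(X_1) = \frac{11}{24} \cdot \frac{13}{24} = \frac{143}{576} \approx 0.248$.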

These are the same formulas as a plain categorical distribution with probabilities $\hat{p}_i$. This makes sense: the Dirichlet-Multinomial over-dispersion factor is $(\tilde{\alpha}_0 + m)/(\tilde{\alpha}_0 + 1)$, which equals 1 when $m = 1$. Over-dispersion only kicks in when you predict multiple future draws.
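The claim about the over-dispersion factor can be checked exactly by enumerating the predictive pmf with rational arithmetic. This is a brute-force sketch for small $m$ and $K$ only; it assumes integer posterior parameters, and all function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def rising(a, n):
    """Rising factorial a (a+1) ... (a+n-1), equal to Gamma(a+n)/Gamma(a)."""
    out = Fraction(1)
    for t in range(n):
        out *= a + t
    return out

def dm_pmf(c, alpha):
    """Exact Dirichlet-Multinomial pmf for integer parameters alpha."""
    m = sum(c)
    prob = Fraction(factorial(m))
    for a, ci in zip(alpha, c):
        prob *= rising(a, ci) / factorial(ci)
    return prob / rising(sum(alpha), m)

def compositions(m, k):
    """Yield every length-k tuple of non-negative integers summing to m."""
    if k == 1:
        yield (m,)
        return
    for first in range(m + 1):
        for rest in compositions(m - first, k - 1):
            yield (first,) + rest

def predictive_moments(alpha, m):
    """Exact mean vector and covariance matrix of the m-draw predictive counts."""
    k = len(alpha)
    mean = [Fraction(0)] * k
    second = [[Fraction(0)] * k for _ in range(k)]
    for c in compositions(m, k):
        p = dm_pmf(c, alpha)
        for i in range(k):
            mean[i] += p * c[i]
            for j in range(k):
                second[i][j] += p * c[i] * c[j]
    cov = [[second[i][j] - mean[i] * mean[j] for j in range(k)] for i in range(k)]
    return mean, cov

alpha_post = [11, 6, 4, 3]
a0 = sum(alpha_post)
p0 = Fraction(alpha_post[0], a0)
for m in (1, 3):
    mean, cov = predictive_moments(alpha_post, m)
    factor = Fraction(a0 + m, a0 + 1)
    # Compare the enumerated variance with m * p(1-p) * over-dispersion factor.
    print(m, factor, cov[0][0] == m * p0 * (1 - p0) * factor)
```

For $m = 1$ the factor is exactly 1, matching the plain categorical formulas above.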

Answer:

(i) $\boldsymbol{\theta} \mid \mathbf{n} \sim \text{Dir}(n_1+1, \ldots, n_K+1)$

(ii) The posterior predictive is Dirichlet-Multinomial: $P(\mathbf{c} \mid \mathbf{n}) = \binom{m}{c_1,\ldots,c_K} \frac{B(\mathbf{n}+\mathbf{1}+\mathbf{c})}{B(\mathbf{n}+\mathbf{1})}$

(iii) With $\hat{p}_i = (n_i+1)/(N+K)$: $E[X_i] = \hat{p}_i$, $\text{Var}(X_i) = \hat{p}_i(1-\hat{p}_i)$, $\text{Cov}(X_i,X_j) = -\hat{p}_i \hat{p}_j$ for $i \neq j$.

Intuition

The Dirichlet-Multinomial is the workhorse model for "unknown category probabilities" problems, and it shows up constantly in quant work: market-making on multi-outcome events, modeling sector allocation uncertainty, or anywhere you need to reason about a probability vector you don't fully know. The core pattern is simple -- Dirichlet is conjugate to multinomial, so you just add observed counts to the prior parameters. Everything else (posterior predictive, moments, over-dispersion) follows from that conjugacy.

The subtle point people miss is the over-dispersion story. When you average over uncertainty in $\boldsymbol{\theta}$, draws that are independent given $\boldsymbol{\theta}$ become positively correlated marginally: if you saw a lot of category 1, category 1 is more likely next time (the "Polya urn" effect). For a single next draw this does not inflate the variance beyond the standard categorical, but for $m > 1$ future draws the variance of each count $c_i$ is inflated by a factor $(\tilde{\alpha}_0 + m)/(\tilde{\alpha}_0 + 1)$ compared to a multinomial with fixed probabilities $\hat{p}_i$. This is the multi-category version of the Beta-Binomial over-dispersion, and it matters whenever you are sizing bets on multi-outcome events with limited data.
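The urn effect can be made concrete with a short sketch. It builds the probability of an ordered sequence of draws one step at a time, treating the posterior parameters as ball counts in an urn and adding a ball of the drawn colour after each draw. Integer parameters and the name `sequence_probability` are assumptions made for illustration.

```python
from fractions import Fraction

def sequence_probability(seq, alpha):
    """Probability of an ordered sequence of category indices under the urn scheme."""
    weights = list(alpha)  # copy so the caller's parameters are not modified
    prob = Fraction(1)
    for cat in seq:
        prob *= Fraction(weights[cat], sum(weights))  # one-step predictive probability
        weights[cat] += 1                             # reinforce the drawn category
    return prob

alpha_post = [11, 6, 4, 3]

p_first = sequence_probability([0], alpha_post)
p_repeat = sequence_probability([0, 0], alpha_post) / p_first
print(p_first, p_repeat)  # a repeat of category 0 is more likely than the first hit

# Exchangeability: only the counts matter, not the order of the draws.
print(sequence_probability([0, 1, 0], alpha_post) == sequence_probability([0, 0, 1], alpha_post))
```

Seeing category 0 once raises its probability on the following draw from $11/24$ to $12/25$, which is exactly the positive dependence that drives over-dispersion for $m > 1$.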
