Dirichlet-Multinomial Posterior and Prediction
You have $K$ categories with unknown probabilities $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$, where $\boldsymbol{\theta}$ has a symmetric Dirichlet prior $\text{Dir}(\alpha_1, \ldots, \alpha_K)$ with all $\alpha_i = 1$ (i.e., a uniform prior over the simplex).
You observe $N$ draws from the multinomial, yielding counts $n_1, \ldots, n_K$ (so $\sum_i n_i = N$).
- What is the posterior distribution of $\boldsymbol{\theta}$ given the observed counts?
- Derive the posterior predictive probability that the next $m$ draws (independent given $\boldsymbol{\theta}$) produce counts $c_1, \ldots, c_K$ (with $\sum_i c_i = m$).
- For a single next draw ($m = 1$), compute the predictive mean $E[X_i]$ and the predictive covariance $\text{Cov}(X_i, X_j)$, where $X_i$ is the indicator that the draw lands in category $i$.
Hints
- The Dirichlet distribution is conjugate to the multinomial -- so the posterior has the same distributional form as the prior, just with updated parameters.
- To get the posterior predictive, integrate the multinomial likelihood against the Dirichlet posterior. The key identity is $\int \prod \theta_i^{a_i - 1} d\boldsymbol{\theta} = B(\mathbf{a})$, the multivariate Beta function.
- For a single future draw ($m = 1$), the predictive probability of category $i$ is just the posterior mean $E[\theta_i \mid \mathbf{n}]$. Use the law of total covariance to get $\text{Cov}(X_i, X_j)$.
Worked Solution
How to Think About It: This is the multivariate generalization of the Beta-Binomial story you already know. With $K = 2$ categories, Dirichlet reduces to Beta and multinomial reduces to binomial -- so everything here is the multi-category version of "coin with unknown bias." The Dirichlet is conjugate to the multinomial, so the posterior is another Dirichlet with updated parameters. The posterior predictive integrates out the unknown $\boldsymbol{\theta}$, which produces a Dirichlet-Multinomial distribution -- the multivariate analog of Beta-Binomial. If you can do the Beta-Binomial case, you can do this one by pattern-matching.
Quick Estimate: With a uniform prior ($\alpha_i = 1$) and observed counts, the posterior mean for category $i$ is $(n_i + 1)/(N + K)$, which is add-one (Laplace) smoothing. For a single next draw, the predictive probability of category $i$ is this posterior mean. So if $K = 4$ and you observed counts $(10, 5, 3, 2)$ with $N = 20$, the predictive probabilities are $(11/24, 6/24, 4/24, 3/24) \approx (0.458, 0.250, 0.167, 0.125)$. The covariance between categories $i$ and $j$ should be negative (they compete for probability mass). For a single draw it should not vanish as $N$ grows: it settles near $-\theta_i \theta_j$, the covariance of a categorical draw with known probabilities. Only the parameter-uncertainty piece $\text{Cov}(\theta_i, \theta_j)$ shrinks toward zero as you become more certain about $\boldsymbol{\theta}$.
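The quick estimate above can be reproduced in a few lines. This is a minimal sketch; the function name `laplace_predictive` is made up for illustration and assumes the uniform prior ($\alpha_i = 1$) from the problem statement.

```python
from fractions import Fraction

def laplace_predictive(counts):
    """Add-one predictive probabilities (n_i + 1) / (N + K) under a uniform Dirichlet prior."""
    total = sum(counts)   # N
    k = len(counts)       # K
    return [Fraction(n + 1, total + k) for n in counts]

probs = laplace_predictive([10, 5, 3, 2])
print([str(p) for p in probs])               # ['11/24', '1/4', '1/6', '1/8']
print([round(float(p), 3) for p in probs])   # [0.458, 0.25, 0.167, 0.125]
```

Exact fractions are used so the result can be compared directly with the hand calculation; the fractions print in lowest terms, so $6/24$ appears as $1/4$.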
Formal Derivation:
Part (i): Posterior of $\boldsymbol{\theta}$
The likelihood of observing counts $(n_1, \ldots, n_K)$ from a $\text{Multinomial}(N, \boldsymbol{\theta})$ is:
$L(\boldsymbol{\theta}) \propto \prod_{i=1}^{K} \theta_i^{n_i}$
The Dirichlet prior with $\alpha_i = 1$ is uniform on the simplex:
$p(\boldsymbol{\theta}) \propto \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} = 1$
Multiplying likelihood by prior:
$p(\boldsymbol{\theta} \mid \mathbf{n}) \propto \prod_{i=1}^{K} \theta_i^{n_i}$
Since $\theta_i^{n_i} = \theta_i^{(n_i + 1) - 1}$, this is the kernel of a Dirichlet distribution with parameters $n_i + 1$, so:
$\boldsymbol{\theta} \mid \mathbf{n} \sim \text{Dir}(n_1 + 1, \, n_2 + 1, \, \ldots, \, n_K + 1)$
More generally, for any Dirichlet prior $\text{Dir}(\alpha_1, \ldots, \alpha_K)$, the posterior is $\text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$. This is the standard Dirichlet-Multinomial conjugacy result.
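The conjugate update is just element-wise addition of counts to prior parameters. The sketch below uses a hypothetical helper `dirichlet_posterior` and an arbitrary split of the example counts into two batches to show that updating in stages gives the same posterior as one batch update.

```python
def dirichlet_posterior(alpha, counts):
    """Return posterior Dirichlet parameters alpha_i + n_i."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]

uniform_prior = [1, 1, 1, 1]
batch = dirichlet_posterior(uniform_prior, [10, 5, 3, 2])

# Two-stage update: the first posterior acts as the prior for the second batch.
first_stage = dirichlet_posterior(uniform_prior, [6, 2, 1, 1])
two_stage = dirichlet_posterior(first_stage, [4, 3, 2, 1])

print(batch, two_stage)  # [11, 6, 4, 3] [11, 6, 4, 3]
```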
Part (ii): Posterior Predictive for $m$ Draws
We need to integrate out $\boldsymbol{\theta}$:
$P(\mathbf{c} \mid \mathbf{n}) = \int P(\mathbf{c} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathbf{n}) \, d\boldsymbol{\theta}$
The multinomial likelihood for the future counts is:
$P(\mathbf{c} \mid \boldsymbol{\theta}) = \binom{m}{c_1, \ldots, c_K} \prod_{i=1}^{K} \theta_i^{c_i}$
Writing the posterior in normalized form and combining the $\theta_i$ exponents:
$P(\mathbf{c} \mid \mathbf{n}) = \binom{m}{c_1, \ldots, c_K} \cdot \frac{1}{B(\tilde{\boldsymbol{\alpha}})} \int \prod_{i=1}^{K} \theta_i^{\tilde{\alpha}_i + c_i - 1} \, d\boldsymbol{\theta}$
where $\tilde{\alpha}_i = n_i + 1$ are the posterior parameters. The integral is just the multivariate Beta function $B(\tilde{\boldsymbol{\alpha}} + \mathbf{c})$, giving:
$P(\mathbf{c} \mid \mathbf{n}) = \binom{m}{c_1, \ldots, c_K} \frac{B(\tilde{\boldsymbol{\alpha}} + \mathbf{c})}{B(\tilde{\boldsymbol{\alpha}})}$
where $B(\mathbf{a}) = \frac{\prod_{i=1}^{K} \Gamma(a_i)}{\Gamma(\sum_i a_i)}$. Expanding with $\tilde{\alpha}_i = n_i + 1$ and $\tilde{\alpha}_0 = \sum_i \tilde{\alpha}_i = N + K$:
$P(\mathbf{c} \mid \mathbf{n}) = \frac{m!}{\prod_i c_i!} \cdot \frac{\Gamma(N+K)}{\Gamma(N+K+m)} \cdot \prod_{i=1}^{K} \frac{\Gamma(n_i + 1 + c_i)}{\Gamma(n_i + 1)}$
Since $\Gamma(n+1) = n!$, this simplifies to:
$P(\mathbf{c} \mid \mathbf{n}) = \frac{m!}{\prod_i c_i!} \cdot \frac{(N+K-1)!}{(N+K+m-1)!} \cdot \prod_{i=1}^{K} \frac{(n_i + c_i)!}{n_i!}$
This is the Dirichlet-Multinomial (or Polya) distribution -- the multivariate generalization of Beta-Binomial.
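As a sanity check on the formula in Part (ii), the sketch below evaluates the predictive probability through log-gamma functions and confirms two things numerically: the probabilities sum to 1 over all possible future count vectors, and the $m = 1$ case reduces to $(n_i + 1)/(N + K)$. The function names and the example counts $(10, 5, 3, 2)$ are illustrative choices, not part of the derivation.

```python
from math import exp, lgamma

def log_multivariate_beta(a):
    """log B(a) = sum_i log Gamma(a_i) - log Gamma(sum_i a_i)."""
    return sum(lgamma(x) for x in a) - lgamma(sum(a))

def dirichlet_multinomial_pmf(future_counts, post_alpha):
    """P(c | n) = multinomial coefficient * B(post_alpha + c) / B(post_alpha)."""
    m = sum(future_counts)
    log_coef = lgamma(m + 1) - sum(lgamma(c + 1) for c in future_counts)
    shifted = [a + c for a, c in zip(post_alpha, future_counts)]
    return exp(log_coef + log_multivariate_beta(shifted) - log_multivariate_beta(post_alpha))

def compositions(m, k):
    """Yield every k-tuple of non-negative integers that sums to m."""
    if k == 1:
        yield (m,)
        return
    for first in range(m + 1):
        for rest in compositions(m - first, k - 1):
            yield (first,) + rest

post_alpha = [11, 6, 4, 3]  # n_i + 1 for observed counts (10, 5, 3, 2)
total = sum(dirichlet_multinomial_pmf(c, post_alpha) for c in compositions(3, 4))
single = dirichlet_multinomial_pmf((1, 0, 0, 0), post_alpha)
print(total, single)  # approximately 1.0 and 11/24
```

Working in log space avoids overflow in the factorials when $N$ or $m$ is large.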
Part (iii): Predictive Mean and Covariance for a Single Draw
For $m = 1$, let $X_i \in \{0, 1\}$ indicate that the draw lands in category $i$. Define $\hat{p}_i = \tilde{\alpha}_i / \tilde{\alpha}_0 = (n_i + 1)/(N + K)$.
*Mean:* By the tower property, $E[X_i] = E[E[X_i \mid \boldsymbol{\theta}]] = E[\theta_i] = \hat{p}_i$.
$E[X_i] = \frac{n_i + 1}{N + K}$
This is Laplace smoothing -- each category gets one pseudocount added.
*Covariance ($i \neq j$):* Apply the law of total covariance:
$\text{Cov}(X_i, X_j) = E[\text{Cov}(X_i, X_j \mid \boldsymbol{\theta})] + \text{Cov}(E[X_i \mid \boldsymbol{\theta}], E[X_j \mid \boldsymbol{\theta}])$
Conditional on $\boldsymbol{\theta}$, a single draw is categorical, so $\text{Cov}(X_i, X_j \mid \boldsymbol{\theta}) = -\theta_i \theta_j$ and $E[X_i \mid \boldsymbol{\theta}] = \theta_i$. Therefore:
$\text{Cov}(X_i, X_j) = -E[\theta_i \theta_j] + \text{Cov}(\theta_i, \theta_j)$
From standard Dirichlet moment formulas (valid for $i \neq j$):
$E[\theta_i \theta_j] = \frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0(\tilde{\alpha}_0 + 1)}, \qquad \text{Cov}(\theta_i, \theta_j) = -\frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0^{2}(\tilde{\alpha}_0 + 1)}$
Substituting:
$\text{Cov}(X_i, X_j) = -\frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0(\tilde{\alpha}_0 + 1)} - \frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0^{2}(\tilde{\alpha}_0 + 1)}$
$= -\frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0(\tilde{\alpha}_0 + 1)} \left(1 + \frac{1}{\tilde{\alpha}_0}\right) = -\frac{\tilde{\alpha}_i \tilde{\alpha}_j}{\tilde{\alpha}_0^{2}} = -\hat{p}_i \, \hat{p}_j$
*Variance:* Similarly, the law of total variance gives $\text{Var}(X_i) = E[\theta_i(1-\theta_i)] + \text{Var}(\theta_i)$. Expanding, $E[\theta_i] - E[\theta_i^2] + E[\theta_i^2] - E[\theta_i]^2 = \hat{p}_i(1 - \hat{p}_i)$.
These are the same formulas as a plain categorical distribution with probabilities $\hat{p}_i$. This makes sense: for $m$ future draws, each predictive count has variance $m \hat{p}_i (1 - \hat{p}_i)$ times the Dirichlet-Multinomial over-dispersion factor $(\tilde{\alpha}_0 + m)/(\tilde{\alpha}_0 + 1)$, which equals 1 when $m = 1$. Over-dispersion only kicks in when you predict multiple future draws.
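The algebra in Part (iii) can be checked with exact rational arithmetic. The sketch below plugs the standard Dirichlet moments into the law of total covariance and the law of total variance; the function name `single_draw_moments` and the example parameters are assumptions made for illustration.

```python
from fractions import Fraction

def single_draw_moments(post_alpha, i, j):
    """Predictive mean and variance of X_i, and Cov(X_i, X_j), for one draw (i != j)."""
    a0 = sum(post_alpha)
    ai, aj = post_alpha[i], post_alpha[j]
    mean_i = Fraction(ai, a0)

    # Dirichlet moments used in the derivation.
    e_cross = Fraction(ai * aj, a0 * (a0 + 1))          # E[theta_i theta_j]
    cov_theta = -Fraction(ai * aj, a0**2 * (a0 + 1))    # Cov(theta_i, theta_j)
    e_square = Fraction(ai * (ai + 1), a0 * (a0 + 1))   # E[theta_i^2]

    cov_x = -e_cross + cov_theta                        # law of total covariance
    var_x = (mean_i - e_square) + (e_square - mean_i**2)  # law of total variance
    return mean_i, var_x, cov_x

mean_0, var_0, cov_01 = single_draw_moments([11, 6, 4, 3], 0, 1)
p0, p1 = Fraction(11, 24), Fraction(6, 24)
print(mean_0 == p0, var_0 == p0 * (1 - p0), cov_01 == -p0 * p1)  # True True True
```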
Answer:
- (i) $\boldsymbol{\theta} \mid \mathbf{n} \sim \text{Dir}(n_1+1, \ldots, n_K+1)$
- (ii) The posterior predictive is Dirichlet-Multinomial: $P(\mathbf{c} \mid \mathbf{n}) = \binom{m}{c_1,\ldots,c_K} \frac{B(\mathbf{n}+\mathbf{1}+\mathbf{c})}{B(\mathbf{n}+\mathbf{1})}$
- (iii) With $\hat{p}_i = (n_i+1)/(N+K)$: $E[X_i] = \hat{p}_i$, $\text{Var}(X_i) = \hat{p}_i(1-\hat{p}_i)$, $\text{Cov}(X_i,X_j) = -\hat{p}_i \hat{p}_j$ for $i \neq j$.
Intuition
The Dirichlet-Multinomial is the workhorse model for "unknown category probabilities" problems, and it shows up constantly in quant work: market-making on multi-outcome events, modeling sector allocation uncertainty, or anywhere you need to reason about a probability vector you don't fully know. The core pattern is simple -- Dirichlet is conjugate to multinomial, so you just add observed counts to the prior parameters. Everything else (posterior predictive, moments, over-dispersion) follows from that conjugacy.
The subtle point people miss is the over-dispersion story. When you average over uncertainty in $\boldsymbol{\theta}$, draws are no longer independent: indicators of the same category on different draws become positively correlated (if you saw a lot of category 1, category 1 is more likely next time -- the "Polya urn" effect). For a single next draw this does not inflate the variance beyond the standard categorical, but for $m > 1$ future draws the variance of each count is inflated by a factor $(\tilde{\alpha}_0 + m)/(\tilde{\alpha}_0 + 1)$ compared to a multinomial with known probabilities $\hat{p}_i$. Since $\tilde{\alpha}_0 = N + K$, the factor decays toward 1 as data accumulates. This is the multi-category version of the Beta-Binomial over-dispersion, and it matters whenever you are sizing bets on multi-outcome events with limited data.
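The over-dispersion factor can be verified exactly for a small case. The sketch below relies on the fact that a single component of a Dirichlet-Multinomial vector is Beta-Binomial with parameters $(\tilde{\alpha}_i, \tilde{\alpha}_0 - \tilde{\alpha}_i)$. It computes the variance of one count by enumeration and compares it with $m \hat{p}_i (1 - \hat{p}_i)(\tilde{\alpha}_0 + m)/(\tilde{\alpha}_0 + 1)$. The helper names and the choice $m = 5$ are illustrative, and the rising-factorial form assumes integer parameters.

```python
from fractions import Fraction
from math import comb

def rising(x, n):
    """Rising factorial x (x + 1) ... (x + n - 1); equals 1 when n == 0."""
    out = 1
    for t in range(n):
        out *= x + t
    return out

def beta_binomial_pmf(c, m, a, b):
    """P(C = c) for m draws with Beta(a, b) mixing, integer a and b."""
    return Fraction(comb(m, c) * rising(a, c) * rising(b, m - c), rising(a + b, m))

def predictive_count_variance(m, a, b):
    """Variance of the count by direct enumeration over c = 0..m."""
    pmf = [beta_binomial_pmf(c, m, a, b) for c in range(m + 1)]
    mean = sum(c * p for c, p in enumerate(pmf))
    return sum((c - mean) ** 2 * p for c, p in enumerate(pmf))

a, a0, m = 11, 24, 5   # category 1 of the (10, 5, 3, 2) example, five future draws
p_hat = Fraction(a, a0)
exact = predictive_count_variance(m, a, a0 - a)
factor = Fraction(a0 + m, a0 + 1)
print(exact == m * p_hat * (1 - p_hat) * factor, factor)  # True 29/25
```

With $N = 20$ observations and $m = 5$ future draws, the variance is 16% larger than the known-probability multinomial would suggest.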