Cross-Entropy, KL Divergence, and Jensen's Inequality
Consider two discrete probability distributions $p(x)$ and $q(x)$ defined over the same sample space.
- Write down the formula for the cross-entropy $H(p, q)$. Explain what each part represents and give an information-theoretic interpretation.
- The KL divergence (Kullback-Leibler divergence) between $p$ and $q$ is defined as
$D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
Using Jensen's inequality, prove that $D_{KL}(p \| q) \geq 0$ for all valid distributions $p$ and $q$. When does equality hold?
- Explain how the non-negativity of KL divergence connects cross-entropy to entropy, and why this matters for machine learning loss functions.
Hints
- Think about what happens when you use the "wrong" codebook to compress data. How does cross-entropy relate to entropy plus some penalty term?
- To prove $D_{KL} \geq 0$, rewrite the divergence as $-E_p[\log(q(x)/p(x))]$ and recall that $-\log$ is a convex function. What does Jensen's inequality say about the expectation of a convex function?
- Compute $E_p[q(x)/p(x)]$ directly -- the $p(x)$ terms cancel, leaving $\sum_x q(x) = 1$. Now apply Jensen to get $-\log(1) = 0$ as the lower bound.
Worked Solution
How to Think About It: Cross-entropy and KL divergence are the workhorses of information theory that show up everywhere in ML and quantitative modeling. The core idea is simple: entropy $H(p)$ measures how many bits you need to encode data from $p$ using the *best possible* code. Cross-entropy $H(p, q)$ measures how many bits you need if you use a code optimized for the *wrong* distribution $q$ instead. The difference -- the "wasted bits" from using $q$ instead of $p$ -- is exactly the KL divergence. Jensen's inequality gives us the clean proof that you can never do *better* than the optimal code, i.e., the KL divergence is always non-negative.
Quick Estimate: Before any formulas, think about it intuitively. If $p$ and $q$ are close, the coding penalty should be small. If they are identical, the penalty should be zero. And it should never be negative -- you cannot beat the optimal code. That is exactly what $D_{KL} \geq 0$ says.
Formal Solution:
Part 1: Cross-Entropy Formula
The cross-entropy between $p$ and $q$ is:
$H(p, q) = -\sum_x p(x) \log q(x)$
Breaking this down: - $p(x)$ is the true distribution -- the actual probability of event $x$. - $\log q(x)$ is the log-probability assigned to $x$ by the model distribution $q$. This represents the "code length" that $q