Cross-Entropy, KL Divergence, and Jensen's Inequality

Statistics · Medium · Free problem

Consider two discrete probability distributions $p(x)$ and $q(x)$ defined over the same sample space.

  1. Write down the formula for the cross-entropy $H(p, q)$. Explain what each part represents and give an information-theoretic interpretation.
  1. The KL divergence (Kullback-Leibler divergence) between $p$ and $q$ is defined as

$D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$

Using Jensen's inequality, prove that $D_{KL}(p \| q) \geq 0$ for all valid distributions $p$ and $q$. When does equality hold?

  1. Explain how the non-negativity of KL divergence connects cross-entropy to entropy, and why this matters for machine learning loss functions.

Hints

  1. Think about what happens when you use the "wrong" codebook to compress data. How does cross-entropy relate to entropy plus some penalty term?
  2. To prove $D_{KL} \geq 0$, rewrite the divergence as $-E_p[\log(q(x)/p(x))]$ and recall that $-\log$ is a convex function. What does Jensen's inequality say about the expectation of a convex function?
  3. Compute $E_p[q(x)/p(x)]$ directly -- the $p(x)$ terms cancel, leaving $\sum_x q(x) = 1$. Now apply Jensen to get $-\log(1) = 0$ as the lower bound.

Worked Solution

How to Think About It: Cross-entropy and KL divergence are the workhorses of information theory that show up everywhere in ML and quantitative modeling. The core idea is simple: entropy $H(p)$ measures how many bits you need to encode data from $p$ using the *best possible* code. Cross-entropy $H(p, q)$ measures how many bits you need if you use a code optimized for the *wrong* distribution $q$ instead. The difference -- the "wasted bits" from using $q$ instead of $p$ -- is exactly the KL divergence. Jensen's inequality gives us the clean proof that you can never do *better* than the optimal code, i.e., the KL divergence is always non-negative.

Quick Estimate: Before any formulas, think about it intuitively. If $p$ and $q$ are close, the coding penalty should be small. If they are identical, the penalty should be zero. And it should never be negative -- you cannot beat the optimal code. That is exactly what $D_{KL} \geq 0$ says.

Formal Solution:

Part 1: Cross-Entropy Formula

The cross-entropy between $p$ and $q$ is:

$H(p, q) = -\sum_x p(x) \log q(x)$

Breaking this down: - $p(x)$ is the true distribution -- the actual probability of event $x$. - $\log q(x)$ is the log-probability assigned to $x$ by the model distribution $q$. This represents the "code length" that $q

s optimal code assigns to $x$. - The sum $-\sum_x p(x) \log q(x)$ is the expected code length when the true data comes from $p$ but you encode it using $q
s code.

Information-theoretic interpretation: if you design a compression scheme based on $q$ but the data actually follows $p$, cross-entropy is the average number of bits per symbol you will use.

Cross-entropy decomposes as:

$H(p, q) = H(p) + D_{KL}(p \| q)$

where $H(p) = -\sum_x p(x) \log p(x)$ is the entropy of $p$ (the irreducible minimum bits needed), and $D_{KL}(p \| q)$ is the extra cost from the mismatch.

Part 2: Non-Negativity of KL Divergence

We want to show $D_{KL}(p \| q) \geq 0$. Start by rewriting the KL divergence:

$D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = -\sum_x p(x) \log \frac{q(x)}{p(x)}$

Now apply Jensen's inequality. Since $-\log$ is a convex function, Jensen's inequality states that for a convex function $f$ and a random variable $Y$:

$f\bigl(E[Y]\bigr) \leq E\bigl[f(Y)\bigr]$

Let $Y = q(x)/p(x)$, where $x$ is drawn from $p$. Then $E_p[Y] = \sum_x p(x) \cdot \frac{q(x)}{p(x)} = \sum_x q(x) = 1$.

Applying Jensen's inequality to $f(t) = -\log(t)$:

$D_{KL}(p \| q) = E_p\!\left[-\log \frac{q(x)}{p(x)}\right] \geq -\log\!\left(E_p\!\left[\frac{q(x)}{p(x)}\right]\right) = -\log(1) = 0$

Therefore $D_{KL}(p \| q) \geq 0$. $\square$

Equality condition: Jensen's inequality is tight if and only if $q(x)/p(x)$ is constant $p$-almost everywhere, which (combined with both summing to 1) forces $q(x) = p(x)$ for all $x$.

Part 3: Connection to ML Loss Functions

Since $D_{KL}(p \| q) \geq 0$, we have:

$H(p, q) = H(p) + D_{KL}(p \| q) \geq H(p)$

Cross-entropy is minimized exactly when $q = p$, i.e., when the model distribution matches the true distribution. In classification, the true distribution $p$ is fixed (it is the label), so minimizing cross-entropy over $q$ is equivalent to minimizing $D_{KL}(p \| q)$. This is why cross-entropy loss is the standard objective for training classifiers -- it directly drives the model toward the true distribution.

Answer: The cross-entropy is $H(p, q) = -\sum_x p(x) \log q(x)$, measuring the expected bits to encode $p$-distributed data using $q

s code. KL divergence $D_{KL}(p \| q) \geq 0$ follows from Jensen's inequality applied to $-\log(q(x)/p(x))$ under $p$, using $E_p[q(x)/p(x)] = 1$. Equality holds iff $p = q$. This guarantees $H(p, q) \geq H(p)$, which is why minimizing cross-entropy loss forces the model distribution toward the truth.

Intuition

The non-negativity of KL divergence is one of the most useful results in information theory, and the proof via Jensen's inequality is a classic "one-liner" that every quant and ML practitioner should know cold. The core insight is that you cannot beat the optimal code: if nature generates data from $p$, any code designed for a different distribution $q$ will waste bits. The KL divergence measures exactly how many bits you waste, and Jensen's inequality -- applied to the convex function $-\log$ -- guarantees this waste is never negative.

In practice, this result underpins why cross-entropy is the go-to loss function for classification. When you train a neural network with cross-entropy loss, you are implicitly minimizing the KL divergence between the empirical label distribution and the model's predicted distribution. The same principle shows up in quantitative finance whenever you fit a model to observed probabilities -- for example, calibrating risk-neutral distributions from option prices or fitting mixture models to return data. Any time you measure the "distance" between a model and reality using log-likelihood, you are working with KL divergence under the hood.

Open the full interactive solver →