Rademacher Complexity: Scaling and Shift Invariance

Machine Learning · Medium · Free problem

Rademacher complexity is a fundamental tool in statistical learning theory for measuring the richness of a function class.

  1. Define the (empirical) Rademacher complexity $R_n(F)$ for a function class $F$ evaluated on a sample $x_1, \dots, x_n$.
  1. Suppose $F_2 = aF_1 + b$, meaning $F_2 = \{x \mapsto a f(x) + b : f \in F_1\}$ for constants $a, b \in \mathbb{R}$. Show that $R_n(F_2) = |a|\, R_n(F_1)$. In particular, explain why the Rademacher complexity is completely independent of the shift $b$.

Hints

  1. Start by writing out the definition of Rademacher complexity and substituting $af(x_i) + b$ in place of $f(x_i)$. What terms can you separate?
  2. The $b$ term does not depend on which $f$ you choose, so it factors out of the supremum. What is $\mathbb{E}[\sum \sigma_i]$?
  3. For the $|a|$ part when $a < 0$: pulling a negative constant out of a sup turns it into an inf. Use the fact that $-\sigma_i$ has the same distribution as $\sigma_i$ to convert back.

Worked Solution

How to Think About It: Rademacher complexity asks: how well can functions in $F$ correlate with random noise? You draw random signs $\sigma_i \in \{-1, +1\}$ and see how large $\sum \sigma_i f(x_i)$ can get by choosing the best $f \in F$. A richer class can fit noise better, so its Rademacher complexity is higher. The key intuition for this problem is that adding a constant $b$ to every function does not help you correlate with noise (because the noise signs are symmetric and sum to zero in expectation), while scaling by $a$ simply scales the correlation.

Key Insight: The Rademacher variables $\sigma_i$ are symmetric -- they are equally likely to be $+1$ or $-1$. This symmetry kills any constant shift and turns a negative scale factor into its absolute value.

Formal Derivation:

*Part 1 -- Definition:*

The empirical Rademacher complexity of a function class $F$ on a sample $x_1, \dots, x_n$ is:

$R_n(F) = \mathbb{E}_{\sigma}\left[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right]$

where $\sigma_1, \dots, \sigma_n$ are i.i.d. Rademacher random variables, each taking values $+1$ or $-1$ with equal probability.

*Part 2 -- Scaling and shift:*

Let $F_2 = \{x \mapsto a f(x) + b : f \in F_1\}$. Then:

$R_n(F_2) = \mathbb{E}_{\sigma}\left[\sup_{f \in F_1} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \bigl(a f(x_i) + b\bigr)\right]$

Split the sum:

$= \mathbb{E}_{\sigma}\left[\sup_{f \in F_1} \left(\frac{a}{n} \sum_{i=1}^{n} \sigma_i f(x_i) + \frac{b}{n} \sum_{i=1}^{n} \sigma_i\right)\right]$

The term $\frac{b}{n} \sum_{i=1}^{n} \sigma_i$ does not depend on $f$, so it factors out of the supremum:

$= \mathbb{E}_{\sigma}\left[\frac{b}{n} \sum_{i=1}^{n} \sigma_i + \sup_{f \in F_1} \frac{a}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right]$

By linearity of expectation, the first term becomes $\frac{b}{n} \sum_{i=1}^{n} \mathbb{E}[\sigma_i] = 0$, since each $\sigma_i$ has mean zero. So:

$R_n(F_2) = \mathbb{E}_{\sigma}\left[\sup_{f \in F_1} \frac{a}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right]$

This shows $R_n(F_2)$ is independent of $b$. The constant shift adds a term to the supremum that does not depend on the choice of $f$, and that term has mean zero by symmetry of the Rademacher variables.

Now handle the scale factor $a$. If $a \geq 0$, it pulls out of the supremum directly:

$R_n(F_2) = a \cdot \mathbb{E}_{\sigma}\left[\sup_{f \in F_1} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right] = a \cdot R_n(F_1)$

If $a < 0$, pulling $a$ out flips the supremum to an infimum:

$R_n(F_2) = |a| \cdot \mathbb{E}_{\sigma}\left[\inf_{f \in F_1} \frac{1}{n} \sum_{i=1}^{n} (-\sigma_i) f(x_i)\right]$

But since $\sigma_i$ and $-\sigma_i$ have the same distribution (Rademacher symmetry), we can replace $-\sigma_i$ with $\sigma_i$ inside the expectation. And $\inf_f g(f) = -\sup_f (-g(f))$... but more directly: the distribution of $\sup_f \frac{1}{n}\sum (-\sigma_i) f(x_i)$ is the same as the distribution of $\sup_f \frac{1}{n}\sum \sigma_i f(x_i)$, because $-\sigma \stackrel{d}{=} \sigma$. Therefore:

$R_n(F_2) = |a| \cdot R_n(F_1)$

Answer: The empirical Rademacher complexity satisfies $R_n(aF_1 + b) = |a|\, R_n(F_1)$. The shift $b$ drops out because it adds a term to the supremum that does not depend on $f$ and has zero mean by Rademacher symmetry. The scale factor $a$ comes out as $|a|$ because flipping the sign of $a$ is equivalent to flipping the signs of the $\sigma_i$, which does not change their distribution.

Intuition

Rademacher complexity measures how well a function class can fit pure noise. Adding a constant to every function does not help you fit noise -- it just shifts all outputs by the same amount, and random signs average that shift to zero. Scaling every function by $a$ scales your ability to correlate with noise proportionally, and flipping the sign does not matter because the noise itself is symmetric. This is why $R_n(aF + b) = |a| R_n(F)$: the complexity cares only about the "shape" of the function class and its scale, not about any global offset.

In practice, this result tells you that preprocessing steps like centering your data (subtracting a constant) or rescaling features do not change the fundamental complexity of your model class -- only the scale matters. This is reassuring: it means generalization bounds based on Rademacher complexity are invariant to the kinds of innocent transformations you routinely apply to data. The deeper lesson is that complexity in learning theory is about the diversity of patterns a class can express, not about the absolute level of its outputs.

Open the full interactive solver →