Linear MMSE Estimator Derivation

Expectation · Medium · Free problem

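You have two random variables $X$ and $Y$ with finite second moments. Assume $\mathrm{Var}(X) > 0$ and $\mathrm{Var}(Y) > 0$, so that the slope and the correlation used below are well defined. You want to predict $X$ using a linear function of $Y$ -- that is, a predictor of the form $\hat{X}_L = aY + b$ for some constants $a$ and $b$.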

1. Find the constants $a$ and $b$ that minimize the mean squared error $\mathrm{MSE} = E[(X - \hat{X}_L)^2]$, and show that the optimal linear estimator is $\hat{X}_L = E[X] + \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)}\,(Y - E[Y]).$

2. Show that the resulting minimum MSE satisfies $\mathrm{MSE} = (1 - \rho^2)\,\mathrm{Var}(X),$ where $\rho = \mathrm{Corr}(X, Y)$.

3. Prove the orthogonality principle: the estimation error $\epsilon = X - \hat{X}_L$ is uncorrelated with $Y$, i.e., $\mathrm{Cov}(\epsilon, Y) = 0$.

Hints

The MSE objective $E[(X - aY - b)^2]$ is a quadratic in both $a$ and $b$ -- optimize each separately. The optimal intercept $b$ depends only on means; once you substitute it in, the slope $a$ pops out from a single first-order condition.
After substituting the optimal $b$, the MSE becomes $\mathrm{Var}(X) - 2a\,\mathrm{Cov}(X,Y) + a^2\,\mathrm{Var}(Y)$. This is a quadratic in $a$ -- minimize it directly to get $a^{*} = \mathrm{Cov}(X,Y)/\mathrm{Var}(Y)$.
For orthogonality, compute $\mathrm{Cov}(\epsilon, Y) = \mathrm{Cov}(X - a^{*}Y - b^{*}, Y)$ directly. Expand using linearity of covariance and substitute the definition of $a^{*}$ -- two terms cancel exactly by construction.

Worked Solution

How to Think About It: You are fitting the best linear approximation to a conditional expectation. The key move is to split the problem into two steps: first nail down the intercept $b$ (which just centers things), then optimize the slope $a$ (which captures how much $Y$ tells you about $X$). This is the same structure as ordinary least squares regression -- the linear MMSE estimator IS the population regression of $X$ on $Y$. Before doing any algebra, the answer should feel plausible: the slope should scale with how strongly $X$ and $Y$ co-vary, relative to how much $Y$ varies. If $\mathrm{Cov}(X,Y) = 0$, the best you can do is predict $E[X]$ and ignore $Y$ entirely.

Quick Sanity Check: Consider the extremes. If $\rho = \pm 1$ (perfect linear correlation), the MSE should drop to zero, because $X$ is then an exact linear function of $Y$. Our formula gives $\mathrm{MSE} = (1 - 1)\,\mathrm{Var}(X) = 0$. Correct. If $\rho = 0$ (no correlation), $Y$ is useless to a linear predictor and $\mathrm{MSE} = \mathrm{Var}(X)$ -- we just predict the mean. Also correct. The slope $a = \mathrm{Cov}(X,Y)/\mathrm{Var}(Y)$ is exactly the population OLS coefficient, which confirms that it has the right form and units.

Approach: Jointly minimize $E[(X - aY - b)^2]$ over $(a, b)$ by taking partial derivatives and setting them to zero. The two first-order conditions decouple cleanly.

Formal Solution:

Part 1 -- Deriving $a$ and $b$:

Write the MSE as a function of $a$ and $b$: $f(a, b) = E[(X - aY - b)^2].$

First-order condition for $b$: differentiate and set to zero. $\frac{\partial f}{\partial b} = -2\,E[X - aY - b] = 0 \implies b = E[X] - a\,E[Y].$

Substitute this back. Define centered variables $\tilde{X} = X - E[X]$ and $\tilde{Y} = Y - E[Y]$. The objective reduces to: $f(a) = E[(\tilde{X} - a\tilde{Y})^2] = \mathrm{Var}(X) - 2a\,\mathrm{Cov}(X,Y) + a^2\,\mathrm{Var}(Y).$

This is a quadratic in $a$ that opens upward provided $\mathrm{Var}(Y) > 0$. Minimize by differentiating: $\frac{df}{da} = -2\,\mathrm{Cov}(X,Y) + 2a\,\mathrm{Var}(Y) = 0 \implies a = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)}.$
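
One step the derivation leaves implicit is that the stationary point is a joint minimum over $(a, b)$, not only a minimum along each coordinate. A short check, sketched here under the added assumption $\mathrm{Var}(Y) > 0$: the second partial derivatives of $f(a, b) = E[(X - aY - b)^2]$ are $\frac{\partial^2 f}{\partial a^2} = 2\,E[Y^2]$, $\frac{\partial^2 f}{\partial a\,\partial b} = 2\,E[Y]$, and $\frac{\partial^2 f}{\partial b^2} = 2$, so the Hessian determinant is $4\,E[Y^2] - 4\,(E[Y])^2 = 4\,\mathrm{Var}(Y) > 0$. With a positive diagonal entry and a positive determinant, the Hessian is positive definite, so $f$ is strictly convex and the first-order conditions identify the unique minimizer.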

Plugging both constants back in: $\hat{X}_L = E[X] + \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)}\,(Y - E[Y]).$
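
The closed-form constants can be checked numerically. The sketch below is an illustration only: the four-point joint pmf (`xs`, `ys`, `p`) and the helper names `E` and `mse` are hypothetical choices, not part of the problem. It computes $a^{*}$ and $b^{*}$ from exact moments and compares the resulting MSE with a brute-force search over a coarse grid of $(a, b)$ pairs.

```python
import numpy as np

# Hypothetical joint pmf on four (x, y) points, chosen only for illustration.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 0.0, 3.0, 2.0])
p = np.array([0.1, 0.2, 0.3, 0.4])  # probabilities, summing to 1


def E(values):
    """Exact expectation of a quantity tabulated on the support."""
    return float(np.dot(p, values))


mean_x, mean_y = E(xs), E(ys)
cov_xy = E((xs - mean_x) * (ys - mean_y))
var_y = E((ys - mean_y) ** 2)

# Closed-form linear MMSE coefficients from the derivation.
a_star = cov_xy / var_y
b_star = mean_x - a_star * mean_y


def mse(a, b):
    """Mean squared error of the predictor a*Y + b."""
    return E((xs - a * ys - b) ** 2)


# Brute-force comparison: no grid point should beat the closed form.
grid = np.linspace(-2.0, 2.0, 81)
best_on_grid = min(mse(a, b) for a in grid for b in grid)

print("a* =", a_star, " b* =", b_star)
print("closed-form MSE:", mse(a_star, b_star))
print("best grid MSE:  ", best_on_grid)
```

Exact moments are used instead of random samples so that the comparison is free of sampling noise; the grid search is only a sanity check, not a recommended way to fit the estimator.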

Part 2 -- Minimum MSE:

Substitute $a^{*} = \mathrm{Cov}(X,Y)/\mathrm{Var}(Y)$ into the quadratic expression for $f(a)$: $\mathrm{MSE} = \mathrm{Var}(X) - 2a^{*}\,\mathrm{Cov}(X,Y) + (a^{*})^2\,\mathrm{Var}(Y).$

Simplify using $a^{*}\,\mathrm{Var}(Y) = \mathrm{Cov}(X,Y)$: $\mathrm{MSE} = \mathrm{Var}(X) - 2\,\frac{[\mathrm{Cov}(X,Y)]^2}{\mathrm{Var}(Y)} + \frac{[\mathrm{Cov}(X,Y)]^2}{\mathrm{Var}(Y)} = \mathrm{Var}(X) - \frac{[\mathrm{Cov}(X,Y)]^2}{\mathrm{Var}(Y)}.$

Recall that $\rho^2 = [\mathrm{Cov}(X,Y)]^2 / (\mathrm{Var}(X)\,\mathrm{Var}(Y))$, so $[\mathrm{Cov}(X,Y)]^2/\mathrm{Var}(Y) = \rho^2\,\mathrm{Var}(X)$. Therefore: $\mathrm{MSE} = (1 - \rho^2)\,\mathrm{Var}(X).$
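
The identity $\mathrm{MSE} = (1 - \rho^2)\,\mathrm{Var}(X)$ also holds exactly for the empirical distribution of a sample, which makes it easy to check in code. The sketch below assumes a hypothetical data-generating model ($X = 0.5\,Y + \text{noise}$) purely for illustration; all variable names are illustrative, and moments use the population convention (divide by $n$).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical model used only to generate example data.
y = rng.normal(loc=1.0, scale=2.0, size=n)
x = 0.5 * y + rng.normal(scale=1.0, size=n)

# Sample moments with the 1/n convention (ndarray.var defaults to ddof=0).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
var_x = x.var()
var_y = y.var()
rho = cov_xy / np.sqrt(var_x * var_y)

# Plug-in linear MMSE coefficients.
a_star = cov_xy / var_y
b_star = x.mean() - a_star * y.mean()

# MSE computed directly from residuals versus the closed-form expression.
mse_direct = np.mean((x - (a_star * y + b_star)) ** 2)
mse_formula = (1.0 - rho**2) * var_x

print("direct MSE: ", mse_direct)
print("formula MSE:", mse_formula)
```

The two printed values should agree up to floating-point error, because the algebra in Part 2 applies to any distribution with positive variances, including the empirical one.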

Part 3 -- Orthogonality Principle:

The error is $\epsilon = X - \hat{X}_L = X - E[X] - a^{*}(Y - E[Y]) = \tilde{X} - a^{*}\tilde{Y}$.

Compute the covariance of the error with $Y$ (equivalently, with $\tilde{Y}$): $\mathrm{Cov}(\epsilon, Y) = E[\epsilon \tilde{Y}] = E[(\tilde{X} - a^{*}\tilde{Y})\tilde{Y}] = \mathrm{Cov}(X,Y) - a^{*}\,\mathrm{Var}(Y).$

Substituting $a^{*} = \mathrm{Cov}(X,Y)/\mathrm{Var}(Y)$: $\mathrm{Cov}(\epsilon, Y) = \mathrm{Cov}(X,Y) - \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)}\cdot\mathrm{Var}(Y) = 0. \quad \square$
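
A small exact example can make the orthogonality statement concrete and show its limits. The sketch below assumes a hypothetical setup, $Y$ uniform on $\{-1, 0, 1, 2\}$ and $X = Y^2$, chosen so the arithmetic is exact. The error of the best linear predictor is uncorrelated with $Y$, yet it is still correlated with $Y^2$, so uncorrelated with $Y$ does not mean independent of $Y$.

```python
import numpy as np

# Hypothetical example: Y uniform on four points, X a nonlinear function of Y.
ys = np.array([-1.0, 0.0, 1.0, 2.0])
p = np.full(4, 0.25)
xs = ys**2


def E(values):
    """Exact expectation over the four-point support."""
    return float(np.dot(p, values))


def cov(u, v):
    """Exact covariance of two quantities tabulated on the support."""
    return E(u * v) - E(u) * E(v)


# Linear MMSE coefficients and the resulting estimation error.
a_star = cov(xs, ys) / cov(ys, ys)
b_star = E(xs) - a_star * E(ys)
eps = xs - (a_star * ys + b_star)

cov_eps_y = cov(eps, ys)       # orthogonality principle: should be 0
cov_eps_y2 = cov(eps, ys**2)   # need not vanish for a nonlinear function of Y

print("Cov(eps, Y)   =", cov_eps_y)
print("Cov(eps, Y^2) =", cov_eps_y2)
```

The nonzero second covariance reflects that the linear predictor only removes the part of $X$ that lies along $Y$; a nonlinear predictor could reduce the error further in this example.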

Answer:

- Optimal slope: $a^{*} = \mathrm{Cov}(X,Y)/\mathrm{Var}(Y)$; optimal intercept: $b^{*} = E[X] - a^{*}E[Y]$
- Linear MMSE estimator: $\hat{X}_L = E[X] + \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)}(Y - E[Y])$
- Minimum MSE: $(1 - \rho^2)\,\mathrm{Var}(X)$
- Orthogonality: $\mathrm{Cov}(X - \hat{X}_L,\, Y) = 0$

Intuition

The linear MMSE estimator is the population version of ordinary least squares regression. When you run OLS of $X$ on $Y$ in a sample, you are estimating this formula -- and the orthogonality principle (residuals uncorrelated with regressors) is just the population analog of the normal equations. So this result is not a curiosity: it is the theoretical foundation for why OLS works as well as it does. The fraction $\rho^2$ of variance explained by the regression is exactly $1 - \mathrm{MSE}/\mathrm{Var}(X)$, which is the $R^2$ of the regression.

The orthogonality principle has a deep geometric interpretation: $\hat{X}_L$ is the orthogonal projection of $X$ onto the linear subspace spanned by $\{1, Y\}$ in $L^2$ (the space of square-integrable random variables, with inner product $\langle U, V \rangle = E[UV]$). The fact that the error is orthogonal to $Y$ is just the projection theorem -- the residual from a projection is always orthogonal to the subspace. This geometric view generalizes immediately: the best linear predictor using $n$ variables $Y_1, \ldots, Y_n$ is the projection onto the span of $\{1, Y_1, \ldots, Y_n\}$, which is why multiple regression inherits the same orthogonality structure.
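
The multi-variable generalization can be sketched in a few lines. The example below is an illustration under stated assumptions: the simulated data, the coefficient values, and all variable names are hypothetical. It solves the sample normal equations $\Sigma_{YY}\,a = \Sigma_{YX}$ for the slope vector, checks that the residual is uncorrelated with every predictor, and compares the result with `np.linalg.lstsq` applied to a design matrix with an intercept column.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical predictors (with some correlation between them) and target.
Y = rng.normal(size=(n, 3))
Y[:, 1] += 0.5 * Y[:, 0]
x = 2.0 + Y @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.3, size=n)

# Centered data and sample second moments (1/n convention).
Yc = Y - Y.mean(axis=0)
xc = x - x.mean()
S_yy = Yc.T @ Yc / n   # covariance matrix of the predictors
s_yx = Yc.T @ xc / n   # covariances between each predictor and x

# Normal equations for the slopes; the intercept matches the means.
a = np.linalg.solve(S_yy, s_yx)
b = x.mean() - Y.mean(axis=0) @ a

# Orthogonality: the residual is uncorrelated with every predictor.
resid = x - (Y @ a + b)
cov_resid_y = Yc.T @ (resid - resid.mean()) / n

# Same fit via least squares on the design matrix [1, Y_1, Y_2, Y_3].
design = np.column_stack([np.ones(n), Y])
coef, *_ = np.linalg.lstsq(design, x, rcond=None)

print("slopes:", a, " intercept:", b)
print("Cov(resid, Y_i):", cov_resid_y)
print("lstsq coefficients:", coef)
```

Solving the normal equations directly mirrors the projection argument; `lstsq` is included only as an independent cross-check and is generally the more numerically robust choice when predictors are nearly collinear.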
