OLS Estimator Derivation

Regression · Medium · Free problem

Consider the standard linear model $Y = X\beta + \varepsilon$, where $Y$ is an $n \times 1$ response vector, $X$ is an $n \times p$ design matrix with full column rank, $\beta$ is a $p \times 1$ parameter vector, and $\varepsilon \sim N(0, \sigma^2 I_n)$.

Derive the ordinary least squares (OLS) estimator $\hat{\beta}$ by minimizing the sum of squared residuals $\|Y - X\beta\|^2$. Specifically:

  1. Write the objective function in matrix form, take the gradient with respect to $\beta$, and solve the resulting normal equations.
  2. Show that $\hat{\beta}$ is unbiased and derive its covariance matrix.
  3. What goes wrong when $X^T X$ is singular, and how would you fix it in practice?

Hints

  1. Think geometrically: OLS projects $Y$ onto the column space of $X$. What does orthogonality of the residual to $\text{col}(X)$ imply?
  2. Expand $(Y - X\beta)^T(Y - X\beta)$ as a quadratic form in $\beta$ and use the matrix derivative $\nabla_{\beta}(\beta^T A \beta) = 2A\beta$ for symmetric $A$.
  3. After solving the normal equations $X^T X \hat{\beta} = X^T Y$, substitute $Y = X\beta + \varepsilon$ into the formula for $\hat{\beta}$ to show unbiasedness and compute $\text{Var}(\hat{\beta})$ using $\text{Var}(\varepsilon) = \sigma^2 I$.

Worked Solution

How to Think About It: OLS is the workhorse of quantitative finance -- you will use it constantly for factor models, signal construction, and regression-based hedging. The derivation is pure multivariable calculus: you have a convex quadratic in $\beta$, so you take the gradient, set it to zero, and solve. The key object is the $p \times p$ matrix $X^T X$ (the Gram matrix). If it's invertible, you get a unique closed-form solution. If it's not, you have collinear features and need to regularize. Know this derivation cold -- it comes up in interviews at every level.

Quick Estimate: Before any calculus, think geometrically. OLS projects $Y$ onto the column space of $X$. The fitted values $\hat{Y} = X\hat{\beta}$ are the closest point in $\text{col}(X)$ to $Y$, so the residual $Y - \hat{Y}$ must be orthogonal to every column of $X$. That orthogonality condition is exactly $X^T(Y - X\hat{\beta}) = 0$, which gives you the normal equations immediately -- no calculus needed.

Approach: We derive the estimator via calculus (expanding the quadratic form and differentiating), then verify properties.

Formal Solution:

*Part 1: The estimator*

The sum of squared residuals is:

$\text{SSR}(\beta) = (Y - X\beta)^T(Y - X\beta) = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta$

This is a convex quadratic in $\beta$ (since $X^T X$ is positive semi-definite). Taking the gradient:

$\nabla_{\beta} \text{SSR} = -2X^T Y + 2X^T X \beta$

Setting equal to zero gives the normal equations:

$X^T X \hat{\beta} = X^T Y$

Since $X$ has full column rank, $X^T X$ is positive definite and invertible, so:

$\hat{\beta} = (X^T X)^{-1} X^T Y$
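The closed form can be checked numerically. The following is a minimal sketch on synthetic data, assuming NumPy is available; the sample size, seed, noise level, and `beta_true` are illustrative choices, not part of the problem. It solves the normal equations with `np.linalg.solve` rather than forming $(X^T X)^{-1}$ explicitly, and compares the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, illustrative only
n, p = 200, 3
X = rng.normal(size=(n, p))               # synthetic design, full column rank
beta_true = np.array([1.0, -2.0, 0.5])    # hypothetical coefficients
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T Y without an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The gradient -2 X^T Y + 2 X^T X beta should vanish at beta_hat.
grad = -2 * X.T @ Y + 2 * X.T @ X @ beta_hat

print("normal equations:", beta_hat)
print("lstsq:           ", beta_lstsq)
print("max |gradient|:  ", np.abs(grad).max())
```

The two solutions should agree to floating-point precision. Solving the linear system is generally preferred to inverting $X^T X$, since it is cheaper and loses less accuracy when the Gram matrix is poorly conditioned.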

To confirm this is a minimum (not a maximum or saddle point), note that the Hessian is $\nabla_{\beta}^2 \text{SSR} = 2X^T X$, which is positive definite -- so the critical point is the unique global minimum.
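The second-order condition can also be checked numerically. This is a rough sketch, again on synthetic data with arbitrary sizes and seed; `ssr` is a helper defined here for illustration. It checks that the smallest eigenvalue of $2X^T X$ is positive and that randomly perturbing $\hat{\beta}$ never lowers the SSR.

```python
import numpy as np

rng = np.random.default_rng(5)            # fixed seed, illustrative only
n, p = 80, 3
X = rng.normal(size=(n, p))               # synthetic full-column-rank design
Y = rng.normal(size=n)

def ssr(beta):
    """Sum of squared residuals ||Y - X beta||^2."""
    r = Y - X @ beta
    return r @ r

hessian = 2 * X.T @ X
min_eig = np.linalg.eigvalsh(hessian).min()   # positive iff Hessian is PD

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# SSR at randomly perturbed coefficients should not fall below SSR(beta_hat).
perturbed = [ssr(beta_hat + rng.normal(size=p)) for _ in range(1000)]
is_minimum = min(perturbed) >= ssr(beta_hat)

print("smallest Hessian eigenvalue:", min_eig)
print("beta_hat beats all perturbations:", is_minimum)
```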

*Part 2: Unbiasedness and covariance*

Substitute $Y = X\beta + \varepsilon$:

$\hat{\beta} = (X^T X)^{-1} X^T (X\beta + \varepsilon) = \beta + (X^T X)^{-1} X^T \varepsilon$

Taking expectations (treating $X$ as fixed):

$E[\hat{\beta}] = \beta + (X^T X)^{-1} X^T E[\varepsilon] = \beta$

So $\hat{\beta}$ is unbiased. For the covariance:

$\text{Var}(\hat{\beta}) = (X^T X)^{-1} X^T \text{Var}(\varepsilon) \, X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$

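where the middle expression applies $\text{Var}(A\varepsilon) = A \, \text{Var}(\varepsilon) \, A^T$ with $A = (X^T X)^{-1} X^T$, and the simplification uses $\text{Var}(\varepsilon) = \sigma^2 I$. By the Gauss-Markov theorem, $\hat{\beta}$ is BLUE -- the Best Linear Unbiased Estimator -- meaning that for any other linear unbiased estimator $\tilde{\beta}$, the difference $\text{Var}(\tilde{\beta}) - \text{Var}(\hat{\beta})$ is positive semi-definite. Neither result needs normality of $\varepsilon$; mean zero and covariance $\sigma^2 I$ suffice.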
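Both properties can be sanity-checked by simulation. The sketch below holds a synthetic $X$ fixed, redraws $\varepsilon$ many times, and compares the empirical mean and covariance of $\hat{\beta}$ with $\beta$ and $\sigma^2 (X^T X)^{-1}$. The values of `n`, `sigma`, `beta`, and `n_sims` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)            # fixed seed, illustrative only
n, p, sigma = 100, 2, 0.5
X = rng.normal(size=(n, p))               # held fixed across replications
beta = np.array([2.0, -1.0])              # hypothetical true coefficients
n_sims = 20000

# A = (X^T X)^{-1} X^T, so that beta_hat = A Y.
A = np.linalg.solve(X.T @ X, X.T)         # shape (p, n)

# Each row of Y is one simulated response vector X beta + eps.
eps = sigma * rng.normal(size=(n_sims, n))
Y = X @ beta + eps                        # shape (n_sims, n)
beta_hats = Y @ A.T                       # shape (n_sims, p)

empirical_mean = beta_hats.mean(axis=0)
empirical_cov = np.cov(beta_hats, rowvar=False)
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)

print("mean of beta_hat:", empirical_mean)
print("empirical covariance:\n", empirical_cov)
print("sigma^2 (X^T X)^{-1}:\n", theoretical_cov)
```

With enough replications the empirical quantities should sit close to their theoretical counterparts, up to Monte Carlo error.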

*Part 3: Singular $X^T X$*

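When $X^T X$ is singular, the columns of $X$ are linearly dependent (perfect multicollinearity). The normal equations still have solutions, but they are not unique -- there is a family of $\hat{\beta}$ values that all produce the same fitted values $\hat{Y}$, so the individual coefficients are not identified. In practice, the standard fix is to regularize: ridge regression replaces $X^T X$ with $X^T X + \lambda I$ for some $\lambda > 0$, which is positive definite and therefore invertible even when $X^T X$ is not. Alternatives are to drop or combine the redundant columns, or to use LASSO or PCA-based dimension reduction.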

Answer: The OLS estimator is $\hat{\beta} = (X^T X)^{-1} X^T Y$, obtained by setting the gradient of the quadratic loss to zero. It is unbiased with $\text{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$, and is BLUE by Gauss-Markov. When $X^T X$ is singular, ridge regression $(X^T X + \lambda I)^{-1} X^T Y$ restores a unique, well-conditioned solution at the cost of a small bias.
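To make the singular case concrete, here is a minimal sketch with a design whose third column is the exact sum of the first two. The data, the seed, and the penalty `lam` are hypothetical choices. It shows two different coefficient vectors giving the same fit, and a ridge solve that succeeds where the plain normal equations are singular.

```python
import numpy as np

rng = np.random.default_rng(3)            # fixed seed, illustrative only
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2, x1 + x2])    # third column is linearly dependent
Y = x1 - x2 + 0.1 * rng.normal(size=n)

# lstsq returns the minimum-norm least-squares solution and the numerical rank.
beta_min_norm, _, rank, _ = np.linalg.lstsq(X, Y, rcond=None)

# Moving along the null direction (1, 1, -1) changes beta but not the fit.
null_vec = np.array([1.0, 1.0, -1.0])
beta_other = beta_min_norm + 3.0 * null_vec
same_fit = np.allclose(X @ beta_min_norm, X @ beta_other)

# Ridge: X^T X + lam * I is positive definite, so the system is solvable.
lam = 1e-2                                # hypothetical penalty strength
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ Y)

print("numerical rank of X:", rank)
print("two solutions, same fitted values:", same_fit)
print("ridge coefficients:", beta_ridge)
```

For small $\lambda$ the ridge solution stays close to the minimum-norm least-squares solution; larger $\lambda$ shrinks the coefficients further, trading more bias for lower variance.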

Intuition

OLS is really just an orthogonal projection. The fitted values $\hat{Y} = X(X^T X)^{-1}X^T Y$ are the projection of $Y$ onto the column space of $X$, and the hat matrix $H = X(X^T X)^{-1}X^T$ is a projection matrix ($H^2 = H$, $H^T = H$). The residual vector $e = Y - \hat{Y}$ is perpendicular to every column of $X$, which is exactly the content of the normal equations $X^T e = 0$. Once you see OLS as projection, everything else follows: unbiasedness is immediate, the covariance formula is a linear transformation of $\text{Var}(\varepsilon)$, and Gauss-Markov says no other linear unbiased estimator has smaller variance.
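The projection properties are easy to verify numerically. The following sketch builds $H$ for a small synthetic design (sizes and seed are arbitrary) and checks idempotence, symmetry, orthogonality of the residual to the columns of $X$, and the standard fact that $\text{tr}(H) = p$ for a full-column-rank design.

```python
import numpy as np

rng = np.random.default_rng(1)            # fixed seed, illustrative only
n, p = 50, 4
X = rng.normal(size=(n, p))               # synthetic full-column-rank design
Y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T, built with a solve instead of an inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y                             # projection of Y onto col(X)
e = Y - Y_hat                             # residual vector

idempotent = np.allclose(H @ H, H)                       # H^2 = H
symmetric = np.allclose(H, H.T)                          # H^T = H
orthogonal = np.allclose(X.T @ e, 0.0, atol=1e-10)       # X^T e = 0
trace_is_p = bool(np.isclose(np.trace(H), p))            # tr(H) = rank(X)

print(idempotent, symmetric, orthogonal, trace_is_p)
```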

In practice, the fragility of OLS shows up in the condition number of $X^T X$. When features are nearly collinear, $(X^T X)^{-1}$ has huge entries and $\hat{\beta}$ becomes wildly unstable -- tiny changes in $Y$ produce large swings in coefficients. This is why quant teams almost always regularize: ridge regression, LASSO, or PCA-based dimension reduction. The derivation above is the starting point, but knowing when and why it breaks is what separates a practitioner from someone who just memorized the formula.
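The instability can be illustrated with two nearly identical features. This is a sketch under made-up settings: the noise scales, the perturbation size, and the ridge penalty `lam = 1.0` are assumptions, and `ols` and `ridge` are small helpers defined here. It compares how far the OLS and ridge coefficients move when $Y$ is nudged slightly.

```python
import numpy as np

rng = np.random.default_rng(4)            # fixed seed, illustrative only
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)       # nearly a copy of x1
X = np.column_stack([x1, x2])
Y = x1 + x2 + 0.1 * rng.normal(size=n)

cond_gram = np.linalg.cond(X.T @ X)       # condition number of the Gram matrix

def ols(X, Y):
    """Least-squares coefficients via NumPy's lstsq."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def ridge(X, Y, lam):
    """Ridge coefficients (X^T X + lam I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Nudge Y slightly and measure how far each estimate moves.
Y_perturbed = Y + 0.01 * rng.normal(size=n)
lam = 1.0                                 # hypothetical penalty strength
ols_shift = np.linalg.norm(ols(X, Y_perturbed) - ols(X, Y))
ridge_shift = np.linalg.norm(ridge(X, Y_perturbed, lam) - ridge(X, Y, lam))

print("cond(X^T X):      ", cond_gram)
print("OLS coef shift:   ", ols_shift)
print("ridge coef shift: ", ridge_shift)
```

The OLS shift should come out orders of magnitude larger than the ridge shift, because the perturbation is amplified along the direction in which the two columns almost cancel.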
