Least Squares Solution in Linear Regression

Regression · Medium · Free problem

You have a standard linear regression setup: an outcome vector $Y \in \mathbb{R}^n$, a design matrix $X \in \mathbb{R}^{n \times p}$, and a coefficient vector $\beta \in \mathbb{R}^p$. The model is $Y = X\beta + \varepsilon$.

Derive the ordinary least squares (OLS) estimator $\hat{\beta}$ that minimizes $\|Y - X\beta\|^2$. Under what conditions does a unique solution exist?

Now suppose several of your predictors are highly correlated (multicollinearity). What goes wrong with OLS, and how do you fix it?

You are receiving data one observation at a time in a streaming fashion. How do you update $\hat{\beta}$ without re-solving the entire system from scratch each time? Derive the recursive least squares (RLS) update.

Hints

Start by writing out the squared loss $\|Y - X\beta\|^2$, expanding it, and taking the derivative with respect to $\beta$. Think about what it means geometrically to project $Y$ onto the column space of $X$.
For multicollinearity, think about what happens to $(X^TX)^{-1}$ when $X^TX$ has eigenvalues near zero. Adding $\lambda I$ to $X^TX$ is the simplest fix -- why does it work?
For the online update, you need to avoid recomputing $(X^TX)^{-1}$ from scratch. The Sherman-Morrison formula lets you update a matrix inverse after a rank-1 perturbation in $O(p^2)$.

Worked Solution

How to Think About It: This is bread-and-butter linear algebra that comes up constantly in quant work -- fitting models, building signals, running regressions on returns. The core idea is dead simple: project $Y$ onto the column space of $X$. Everything else -- ridge, RLS -- is a practical patch for when the vanilla projection breaks or when you need it to run fast. An interviewer asking this wants to see you connect the algebra to the geometry and to real-world failure modes.

Part 1: The OLS Estimator

We want to minimize the sum of squared residuals:

$L(\beta) = \|Y - X\beta\|^2 = (Y - X\beta)^T(Y - X\beta)$

Expand and take the gradient with respect to $\beta$:

$\nabla_\beta L = -2X^TY + 2X^TX\beta$

Setting this to zero gives the normal equations:

$X^TX\hat{\beta} = X^TY$

If $X^TX$ is invertible (i.e., $X$ has full column rank, meaning $\text{rank}(X) = p$), we get the unique closed-form solution:

$\hat{\beta} = (X^TX)^{-1}X^TY$

Geometrically, $X\hat{\beta}$ is the orthogonal projection of $Y$ onto the column space of $X$. The residual $Y - X\hat{\beta}$ is perpendicular to every column of $X$, which is exactly what $X^T(Y - X\hat{\beta}) = 0$ says.

A unique solution exists if and only if $X^TX$ is invertible, which requires $n \geq p$ and no exact linear dependencies among the columns of $X$.

Part 2: Handling Multicollinearity

When predictors are highly correlated, $X^TX$ is close to singular. Its smallest eigenvalues are near zero, so $(X^TX)^{-1}$ blows up. The practical consequence: $\hat{\beta}$ has enormous variance. Individual coefficients become wildly unstable -- flip one data point and the coefficients swing dramatically, even though the fitted values barely change.

The standard fix is Ridge regression (Tikhonov regularization). Add a penalty $\lambda\|\beta\|^2$ to the objective:

$L_{\text{ridge}}(\beta) = \|Y - X\beta\|^2 + \lambda\|\beta\|^2$

The solution becomes:

$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^TY$

The $\lambda I$ term adds $\lambda$ to every eigenvalue of $X^TX$, so even if $X^TX$ has eigenvalues near zero, $(X^TX + \lambda I)$ is well-conditioned. This introduces bias (the coefficients are shrunk toward zero) but dramatically reduces variance. The trade-off is controlled by $\lambda$, typically chosen by cross-validation.

Other approaches include: - LASSO ($\ell_1$ penalty): shrinks some coefficients exactly to zero, performing variable selection - PCA regression: project onto the top principal components of $X$, discarding the near-zero variance directions - Simply dropping one of the correlated predictors if you understand the data well enough

Part 3: Recursive Least Squares (RLS)

Suppose you have computed $\hat{\beta}_n$ from $n$ observations and you receive a new pair $(x_{n+1}, y_{n+1})$. Naive re-computation costs $O(np^2)$ every step. RLS gives you an $O(p^2)$ update.

Define $P_n = (X_n^TX_n)^{-1}$ (the inverse of the Gram matrix after $n$ observations). When observation $n+1$ arrives:

Step 1 -- Update the inverse using the Sherman-Morrison formula:

$P_{n+1} = P_n - \frac{P_n x_{n+1} x_{n+1}^T P_n}{1 + x_{n+1}^T P_n x_{n+1}}$

Step 2 -- Update the coefficients:

$\hat{\beta}_{n+1} = \hat{\beta}_n + P_{n+1} x_{n+1}(y_{n+1} - x_{n+1}^T \hat{\beta}_n)$

The term $(y_{n+1} - x_{n+1}^T \hat{\beta}_n)$ is the prediction error using the old estimate. The update nudges $\hat{\beta}$ in the direction of $x_{n+1}$, scaled by how wrong the prediction was and by $P_{n+1}$ (which captures how "informative" this new direction is relative to what we have already seen).

The key insight is that Sherman-Morrison lets you update a $p \times p$ matrix inverse in $O(p^2)$ instead of $O(p^3)$, so each new observation is cheap. This is the backbone of adaptive filtering and online signal estimation.

Answer:

OLS: $\hat{\beta} = (X^TX)^{-1}X^TY$, valid when $X$ has full column rank.
Multicollinearity: use Ridge regression $\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^TY$ to stabilize the inverse.
RLS: update $P_{n+1}$ via Sherman-Morrison, then $\hat{\beta}_{n+1} = \hat{\beta}_n + P_{n+1}x_{n+1}(y_{n+1} - x_{n+1}^T\hat{\beta}_n)$. Each update is $O(p^2)$.

Intuition

The OLS estimator is really just a projection -- you are finding the point in the column space of $X$ that is closest to $Y$. Every extension in this problem comes from asking "what if that projection is unstable or too expensive to recompute?" Multicollinearity means the column space is nearly degenerate in some direction, so a tiny change in $Y$ causes a huge swing in the projected coefficients. Ridge regression fattens the eigenvalues so the projection is well-behaved, trading a little bias for a lot of stability. Recursive least squares is about efficiency: if your column space grows by one observation at a time, you do not need to recompute the full inverse -- you patch it with a rank-1 update.

In practice, all three of these ideas come up constantly. Any time you are fitting a linear model to correlated financial features (returns of similar stocks, overlapping technical indicators), you are dealing with multicollinearity. And any time you are running a model in production where data streams in continuously -- live PnL attribution, adaptive hedging, real-time signal estimation -- you are doing some variant of RLS. The algebra is elementary, but the intuition for when each tool applies is what separates a textbook answer from a practitioner's answer.

Open the full interactive solver →