Distributed OLS Computation

Question

Consider the ordinary least squares (OLS) regression problem with data matrix $X$ ($n \times p$) and response vector $y$ ($n \times 1$).

(a) What is the closed-form solution for the OLS estimator $\hat{\beta}$?

(b) Suppose the data matrix $X$ has so many rows that it cannot fit in memory on a single machine (i.e., $n$ is very large, but $p$ is moderate). How can you compute the OLS solution in a distributed or streaming fashion?

Hint: Consider the dimensions of $X^T X$.

Accepted Answer

How to Think About It: The OLS normal equations require computing $X^T X$ and $X^T y$. If you look at these products carefully, you notice something crucial: $X^T X$ is $p \times p$ and $X^T y$ is $p \times 1$, regardless of how large $n$ is. Even if you have billions of observations, these sufficient statistics are tiny when $p$ is moderate. And both are sums over observations, so they can be computed in parallel chunks. This is the key insight that makes distributed OLS possible.

Key Insight: The normal equations reduce the entire dataset to two small aggregates, $X^T X$ ($p \times p$) and $X^T y$ ($p \times 1$), which are additive across observations.

The Method:

(a) Closed-form OLS solution: Minimize $\|y - X\beta\|^2$ by taking the gradient and setting it to zero:

$X^T X \hat{\beta} = X^T y$

$\hat{\beta} = (X^T X)^{-1} X^T y$

This requires $X^T X$ to be invertible ($X$ must have full column rank).

(b) Distributed computation: Split the $n$ rows of data across $K$ machines (or process in $K$ sequential chunks). Machine $k$ holds rows $X_k$ ($n_k \times p$) and $y_k$ ($n_…
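The additivity of $X^T X$ and $X^T y$ can be illustrated with a minimal Python sketch. It assumes NumPy and synthetic data; the function names `accumulate_chunk` and `distributed_ols` are hypothetical, chosen for illustration only, and $y$ is stored as a 1-D array rather than an $n \times 1$ column. The final step uses `np.linalg.solve` on the normal equations instead of forming $(X^T X)^{-1}$ explicitly, which is a common numerical substitution rather than something the answer prescribes.

```python
import numpy as np


def accumulate_chunk(X_k, y_k):
    """Per-chunk aggregates: X_k^T X_k (p x p) and X_k^T y_k (length p)."""
    return X_k.T @ X_k, X_k.T @ y_k


def distributed_ols(chunks, p):
    """Sum the per-chunk aggregates, then solve the p x p normal equations."""
    XtX = np.zeros((p, p))
    Xty = np.zeros(p)
    for X_k, y_k in chunks:
        A_k, b_k = accumulate_chunk(X_k, y_k)
        XtX += A_k  # only p x p state is kept, independent of n
        Xty += b_k
    # Solve X^T X beta = X^T y; assumes X has full column rank.
    return np.linalg.solve(XtX, Xty)


# Synthetic data (an assumption for the demo, not from the original answer).
rng = np.random.default_rng(0)
n, p, K = 10_000, 5, 8
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Simulate K machines (or K sequential chunks) by splitting the rows.
chunks = zip(np.array_split(X, K), np.array_split(y, K))
beta_chunked = distributed_ols(chunks, p)

# Reference: fit on the full data in one pass.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In this sketch the loop stands in for the $K$ machines: each iteration could run on a different worker, and only the small aggregates would need to be sent to a coordinator and summed. The chunked estimate should agree with the full-data fit up to floating-point error when $X^T X$ is well conditioned.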

Hints

Worked Solution

Intuition