Distributed OLS Computation

Regression · Medium · Free problem

Consider the ordinary least squares (OLS) regression problem with data matrix $X$ ($n \times p$) and response vector $y$ ($n \times 1$).

(a) What is the closed-form solution for the OLS estimator $\hat{\beta}$?

(b) Suppose the data matrix $X$ has so many rows that it cannot fit in memory on a single machine (i.e., $n$ is very large, but $p$ is moderate). How can you compute the OLS solution in a distributed or streaming fashion?

Hint: Consider the dimensions of $X^T X$.

Hints

  1. Write out $X^T X$ and $X^T y$ as sums over individual observations. What are the dimensions of these quantities?
  2. $X^T X$ is $p \times p$ -- it does not grow with $n$. This means the sufficient statistics for OLS are small even when $n$ is huge.
  3. Partition the data across machines. Each machine computes its local $X_k^T X_k$ and $X_k^T y_k$, then a central node aggregates and solves the $p \times p$ system.

Worked Solution

How to Think About It: The OLS normal equations require computing $X^T X$ and $X^T y$. If you look at these products carefully, you notice something crucial: $X^T X$ is $p \times p$ and $X^T y$ is $p \times 1$, regardless of how large $n$ is. Even if you have billions of observations, these sufficient statistics are tiny when $p$ is moderate. And both are sums over observations, so they can be computed in parallel chunks. This is the key insight that makes distributed OLS possible.

Key Insight: The normal equations reduce the entire dataset to two small aggregates -- $X^T X$ ($p \times p$) and $X^T y$ ($p \times 1$) -- which are additive across observations.

The Method:

(a) Closed-form OLS solution:

Minimize $\|y - X\beta\|^2$ by taking the gradient and setting it to zero: $X^T X \hat{\beta} = X^T y$ $\hat{\beta} = (X^T X)^{-1} X^T y$

This requires $X^T X$ to be invertible ($X$ must have full column rank).

(b) Distributed computation:

Split the $n$ rows of data across $K$ machines (or process in $K$ sequential chunks). Machine $k$ holds rows $X_k$ ($n_k \times p$) and $y_k$ ($n_k \times 1$).

Step 1 (local computation): Each machine computes: - $A_k = X_k^T X_k$ -- a $p \times p$ matrix - $b_k = X_k^T y_k$ -- a $p \times 1$ vector

Step 2 (aggregation): A central machine collects and sums: $A = \sum_{k=1}^K A_k = X^T X, \quad b = \sum_{k=1}^K b_k = X^T y$

Step 3 (solve): Compute $\hat{\beta} = A^{-1} b$.

This works because $X^T X$ and $X^T y$ are both sums over individual observations: $X^T X = \sum_{i=1}^n x_i x_i^T, \quad X^T y = \sum_{i=1}^n x_i y_i$

So they can be partitioned and summed in any order.

Practical Considerations:

  • Memory: Each machine only needs to hold its data chunk ($n_k \times p$) and the two aggregates ($p \times p + p \times 1$). For $p = 1000$, $A_k$ is only 8 MB (in double precision).
  • Communication cost: Each machine sends $O(p^2)$ data. With moderate $p$, this is negligible.
  • Streaming variant: You do not even need $K$ machines. A single machine can stream through the data in chunks, accumulating $A$ and $b$ incrementally, then solve at the end. This processes $n$ observations in $O(p)$ memory.
  • Numerical stability: For better conditioning, use a QR decomposition within each chunk or accumulate in a numerically stable way, rather than forming $X^T X$ explicitly (which squares the condition number).
  • Computation cost: Each machine does $O(n_k p^2 / K)$ work for the matrix product. The central inversion is $O(p^3)$.

Answer: The OLS solution is $\hat{\beta} = (X^T X)^{-1} X^T y$. When $n$ is too large for memory, exploit the fact that $X^T X$ ($p \times p$) and $X^T y$ ($p \times 1$) are sums over observations. Compute them in parallel chunks across machines (or stream sequentially), aggregate, and solve the $p \times p$ system. The key observation is that the sufficient statistics are $O(p^2)$ regardless of $n$.

Intuition

This problem illustrates a principle that comes up constantly in large-scale computing: look for sufficient statistics that are smaller than the raw data. In OLS, the entire dataset of $n$ observations collapses into a $p \times p$ matrix and a $p$-vector. Once you have these, you can throw away all the raw data and still compute the exact same answer. This is not an approximation -- it is exact.

The same principle underlies many distributed algorithms in quant finance. When you are computing portfolio risk across millions of scenarios, you often do not need to ship all the scenarios to one place -- you compute local covariance contributions and aggregate. The lesson is always the same: identify what is additive, compute it locally, and aggregate cheaply. The dimensions of the sufficient statistics ($p^2$ vs. $np$) determine whether distribution is practical.

Open the full interactive solver →