OLS Derivation, Bias, and Inverse Regression

Regression · Medium

Answer the following three questions about linear regression:

1. Derive the OLS estimator. Starting from the linear model $Y = X\beta + \varepsilon$, derive the closed-form formula for the ordinary least squares estimator $\hat{\beta}$ by minimizing the sum of squared residuals.

2. Define bias. What does it mean for the OLS estimator to be biased or unbiased? Under what conditions is OLS unbiased, and what can cause bias?

3. Inverse regression relationship. For one-dimensional data $(X, Y)$, consider two simple regressions (no intercept):

- $Y = W_1 X$
- $X = W_2 Y$

What is the relationship between $W_1$, $W_2$, and 1? Specifically, show that $W_1 \cdot W_2 \leq 1$ and characterize when equality holds.

Hints

For Part 1, expand $\|Y - X\beta\|^2$, differentiate with respect to $\beta$, and set to zero.
For Part 2, substitute $Y = X\beta + \varepsilon$ into $\hat{\beta} = (X^TX)^{-1}X^TY$ and take expectations. What assumption on $\varepsilon$ makes the bias term vanish?
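For Part 3, write out the no-intercept slopes $W_1 = \sum_i x_i y_i / \sum_i x_i^2$ and $W_2 = \sum_i x_i y_i / \sum_i y_i^2$. Their product is the squared cosine of the angle between the data vectors, which Cauchy-Schwarz bounds by 1. If the data are centered, the slopes reduce to $\text{Cov}(X,Y)/\text{Var}(X)$ and $\text{Cov}(X,Y)/\text{Var}(Y)$, and the product is $r^2$ by definition.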

Worked Solution

How to Think About It: These three parts cover the foundations of linear regression that any quant should know cold. Part 1 is pure calculus/linear algebra. Part 2 tests whether you understand the assumptions that make OLS work. Part 3 is the subtle one -- it reveals that regression is not symmetric. Regressing $Y$ on $X$ and $X$ on $Y$ give different slopes, and their product is bounded by 1. This is a favorite interview question because many candidates assume $W_2 = 1/W_1$, which is wrong.

Quick Estimate: For Part 3, the product $W_1 W_2$ is a normalized squared inner product (a squared cosine of the angle between the data vectors $X$ and $Y$), so it must lie in $[0, 1]$. It hits 1 only when the two vectors are perfectly aligned (linearly dependent), and shrinks toward 0 as they become orthogonal. So you should expect $W_1 W_2 \le 1$ with equality only in the degenerate, no-scatter case.
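The squared-cosine claim is easy to check numerically. The following is a minimal sketch using NumPy on synthetic data; the helper `no_intercept_slopes`, the slope 1.5, the sample size, and the seed are all illustrative choices, not part of the original problem.

```python
import numpy as np

def no_intercept_slopes(x, y):
    """Return (W1, W2) for the no-intercept fits y ~ W1*x and x ~ W2*y."""
    sxy = float(np.dot(x, y))
    return sxy / float(np.dot(x, x)), sxy / float(np.dot(y, y))

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
x = rng.normal(size=200)
y_noisy = 1.5 * x + rng.normal(size=200)  # scatter around a line through the origin
y_exact = 1.5 * x                         # no scatter: y is an exact multiple of x

w1_n, w2_n = no_intercept_slopes(x, y_noisy)
w1_e, w2_e = no_intercept_slopes(x, y_exact)

# Squared cosine of the angle between the data vectors x and y_noisy.
cos2_noisy = np.dot(x, y_noisy) ** 2 / (np.dot(x, x) * np.dot(y_noisy, y_noisy))

print("noisy: W1*W2 =", w1_n * w2_n, " cos^2 =", cos2_noisy)
print("exact: W1*W2 =", w1_e * w2_e)
```

With scatter, the product should land strictly between 0 and 1 and match the squared cosine; in the no-scatter case it should equal 1 up to floating-point rounding.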

Approach: Part 1 -- write the sum of squared residuals as a quadratic in $\beta$, differentiate, set to zero, solve the normal equations. Part 2 -- substitute the model into the estimator and take a conditional expectation; the bias term is governed by $E[\varepsilon\mid X]$. Part 3 -- because the regressions have no intercept, the slopes are the *uncentered* least-squares coefficients $W_1=\sum_i x_i y_i/\sum_i x_i^2$ and $W_2=\sum_i x_i y_i/\sum_i y_i^2$; multiply them and apply the Cauchy-Schwarz inequality to the data vectors.
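The Part 2 step can be illustrated by simulation. Substituting the model into the estimator gives $E[\hat{\beta}\mid X] = \beta + (X^TX)^{-1}X^T E[\varepsilon\mid X]$, so the bias vanishes when $E[\varepsilon\mid X] = 0$. The sketch below is one possible illustration under assumed settings: a single regressor with no intercept, a true slope of 2, and an "endogenous" case in which the error contains a term $0.5x$ (as an omitted variable correlated with $x$ might produce). The function name `mean_slope_estimate` and all numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
beta_true = 2.0
n, n_trials = 200, 2000

def mean_slope_estimate(endogenous):
    """Average the no-intercept OLS slope over many simulated datasets."""
    estimates = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.normal(size=n)
        noise = rng.normal(size=n)
        # Exogenous case: E[eps | x] = 0.
        # Endogenous case: eps depends on x, so E[eps | x] = 0.5 * x.
        eps = noise + (0.5 * x if endogenous else 0.0)
        y = beta_true * x + eps
        estimates[t] = np.dot(x, y) / np.dot(x, x)
    return estimates.mean()

mean_exog = mean_slope_estimate(endogenous=False)
mean_endog = mean_slope_estimate(endogenous=True)
print("exogenous errors:  mean estimate =", mean_exog)
print("endogenous errors: mean estimate =", mean_endog)
```

In the exogenous case the average estimate should sit close to the true slope; in the endogenous case it should be shifted by roughly the coefficient on $x$ hidden in the error term, and a larger sample does not remove that shift.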

Formal Solution:

Part 1: OLS Derivation

We want to minimize the sum of squared residuals:

$L(\beta) = \|Y - X\beta\|^2 = (Y - X\beta)^T(Y - X\beta)$

Expanding:

$L(\beta) = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta$

Taking the gradient with respect to $\beta$ and setting it to zero gives the normal equations:

$\nabla_\beta L = -2X^T Y + 2X^T X \beta = 0 \quad\Longrightarrow\quad X^T X\,\beta = X^T Y$

Solving (assuming $X^T X$ is invertible, i.e. the columns of $X$ are linearly independent):

$\hat{\beta} = (X^T X)^{-1} X^T Y$

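This is the OLS estimator. It is a minimum because the Hessian $\nabla^2_\beta L = 2X^T X$ is positive definite when the columns of $X$ are linearly independent, so $L$ is strictly convex.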
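As a sanity check on the Part 1 derivation, the normal-equations solution can be compared against a library least-squares solver. This is a sketch on synthetic data; the dimensions, coefficients, and noise level are assumptions chosen for illustration. It uses `np.linalg.solve` on the normal equations rather than forming $(X^TX)^{-1}$ explicitly, which is the usual numerical practice.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T Y without an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Gradient of L at the solution; it should vanish up to rounding.
grad_at_solution = -2.0 * X.T @ Y + 2.0 * X.T @ X @ beta_normal

# Eigenvalues of the Hessian 2 X^T X; all positive when X has full column rank.
hessian_eigs = np.linalg.eigvalsh(2.0 * X.T @ X)

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
print("smallest Hessian eigenvalue:", hessian_eigs.min())
```

The two solutions should agree to numerical precision, and the smallest Hessian eigenvalue should be strictly positive for this full-rank design.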

Intuition

The deep lesson in Part 3 is that regression is fundamentally asymmetric. Regressing $Y$ on $X$ minimizes vertical distances (residuals in $Y$), while regressing $X$ on $Y$ minimizes horizontal distances (residuals in $X$). These give different lines unless the points lie exactly on a single line through the origin. The product $W_1 W_2$ is a beautiful result because it ties two seemingly different regression coefficients to one measure of alignment: the squared cosine of the angle between the data vectors, which is the squared correlation $r^2$ when the data are centered. In practice, this asymmetry matters when you are deciding which variable to treat as the predictor. The choice is not arbitrary, and reversing the regression gives you a different (and generally biased) estimate of the inverse relationship. This is why, for example, calibrating a model by regressing predicted on actual is not the same as regressing actual on predicted.
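The common mistake of assuming $W_2 = 1/W_1$ can be checked directly. A minimal sketch, assuming synthetic data generated as $y = 2x + \text{noise}$ with unit-variance $x$ and noise (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
x = rng.normal(size=5000)
y = 2.0 * x + rng.normal(size=5000)

w1 = np.dot(x, y) / np.dot(x, x)  # slope of y on x, no intercept
w2 = np.dot(x, y) / np.dot(y, y)  # slope of x on y, no intercept

print("W1          =", w1)
print("1 / W2      =", 1.0 / w2)
print("W1 * W2     =", w1 * w2)
```

Here $W_1$ should come out near the generating slope of 2, while $1/W_2$ should come out noticeably larger, because the noise in $y$ inflates $\sum_i y_i^2$ and shrinks $W_2$. Inverting the reverse regression therefore does not recover the forward slope unless there is no scatter.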