Answer the following three questions about linear regression:
1. Derive the OLS estimator. Starting from the linear model $Y = X\beta + \varepsilon$, derive the closed-form formula for the ordinary least squares estimator $\hat{\beta}$ by minimizing the sum of squared residuals.
2. Define bias. What does it mean for the OLS estimator to be biased or unbiased? Under what conditions is OLS unbiased, and what can cause bias?
3. Inverse regression relationship. For one-dimensional data $(X, Y)$, consider two simple regressions (no intercept):
   - $Y = W_1 X$
   - $X = W_2 Y$
What is the relationship between $W_1$, $W_2$, and 1? Specifically, show that $W_1 \cdot W_2 \leq 1$ and characterize when equality holds.
Hints
For Part 1, expand $\|Y - X\beta\|^2$, differentiate with respect to $\beta$, and set to zero.
For Part 2, substitute $Y = X\beta + \varepsilon$ into $\hat{\beta} = (X^TX)^{-1}X^TY$ and take expectations. What assumption on $\varepsilon$ makes the bias term vanish?
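For Part 3, the regressions have no intercept, so write out the uncentered slopes $W_1 = \sum_i x_i y_i/\sum_i x_i^2$ and $W_2 = \sum_i x_i y_i/\sum_i y_i^2$, then bound their product with the Cauchy-Schwarz inequality. For centered data these reduce to $\text{Cov}(X,Y)/\text{Var}(X)$ and $\text{Cov}(X,Y)/\text{Var}(Y)$, whose product is $r^2$.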
Worked Solution
How to Think About It: These three parts cover the foundations of linear regression that any quant should know cold. Part 1 is pure calculus/linear algebra. Part 2 tests whether you understand the assumptions that make OLS work. Part 3 is the subtle one -- it reveals that regression is not symmetric. Regressing $Y$ on $X$ and $X$ on $Y$ give different slopes, and their product is bounded by 1. This is a favorite interview question because many candidates assume $W_2 = 1/W_1$, which is wrong.
Quick Estimate: For Part 3, the product $W_1 W_2$ is a normalized squared inner product (a squared cosine of the angle between the data vectors $X$ and $Y$), so it must lie in $[0, 1]$. It hits 1 only when the two vectors are perfectly aligned (linearly dependent), and shrinks toward 0 as they become orthogonal. So you should expect $W_1 W_2 \le 1$ with equality only in the degenerate, no-scatter case.
Approach: Part 1 -- write the sum of squared residuals as a quadratic in $\beta$, differentiate, set to zero, solve the normal equations. Part 2 -- substitute the model into the estimator and take a conditional expectation; the bias term is governed by $E[\varepsilon\mid X]$. Part 3 -- because the regressions have no intercept, the slopes are the *uncentered* least-squares coefficients $W_1=\sum_i x_i y_i/\sum_i x_i^2$ and $W_2=\sum_i x_i y_i/\sum_i y_i^2$; multiply them and apply the Cauchy-Schwarz inequality to the data vectors.
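Part 1: OLS Estimator
The sum of squared residuals is $L(\beta) = \|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta)$. Expanding:
$L(\beta) = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta$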
Taking the gradient with respect to $\beta$ and setting it to zero gives the normal equations:
$\nabla_\beta L = -2X^T Y + 2X^T X \beta = 0 \quad\Longrightarrow\quad X^T X\,\beta = X^T Y$
Solving (assuming $X^T X$ is invertible, i.e. the columns of $X$ are linearly independent):
$\hat{\beta} = (X^T X)^{-1} X^T Y$
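This is the OLS estimator. It is a minimum because the Hessian $\nabla^2_\beta L = 2X^T X$ is positive semi-definite, so $L$ is convex; when $X^T X$ is invertible the Hessian is positive definite and the minimizer is unique.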
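The derivation can be checked numerically. The following is a minimal sketch using only the Python standard library, assuming a hypothetical two-column design with no intercept and made-up true coefficients $(1.5, -2.0)$; it solves the $2\times 2$ normal equations explicitly and confirms that the gradient vanishes at $\hat{\beta}$.

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility only

# Hypothetical data: two regressors, no intercept, true beta = (1.5, -2.0).
n = 200
beta_true = (1.5, -2.0)
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
Y = [beta_true[0] * a + beta_true[1] * b + random.gauss(0, 0.1) for a, b in X]

# Entries of the symmetric 2x2 matrix X^T X and of the vector X^T Y.
s11 = sum(a * a for a, _ in X)
s12 = sum(a * b for a, b in X)
s22 = sum(b * b for _, b in X)
t1 = sum(a * y for (a, _), y in zip(X, Y))
t2 = sum(b * y for (_, b), y in zip(X, Y))

# Solve the normal equations X^T X beta = X^T Y with the explicit 2x2 inverse.
det = s11 * s22 - s12 * s12
beta_hat = ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

# Gradient of L at beta_hat: -2 X^T Y + 2 X^T X beta_hat (should be ~0).
grad = (
    -2 * t1 + 2 * (s11 * beta_hat[0] + s12 * beta_hat[1]),
    -2 * t2 + 2 * (s12 * beta_hat[0] + s22 * beta_hat[1]),
)

def loss(b):
    """Sum of squared residuals L(b) for a candidate coefficient pair b."""
    return sum((y - b[0] * a - b[1] * c) ** 2 for (a, c), y in zip(X, Y))

print("beta_hat:", beta_hat)
print("gradient at beta_hat:", grad)
```

Perturbing `beta_hat` in any direction and re-evaluating `loss` should give a larger value, which is the convexity argument in numerical form.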
Part 2: Bias
Bias of an estimator $\hat{\beta}$ is defined as $\text{Bias}(\hat{\beta}) = E[\hat{\beta}] - \beta$. The estimator is unbiased if $E[\hat{\beta}] = \beta$.
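Substituting $Y = X\beta + \varepsilon$ into the OLS formula:
$\hat{\beta} = (X^T X)^{-1} X^T (X\beta + \varepsilon) = \beta + (X^T X)^{-1} X^T \varepsilon$
Taking the expectation conditional on $X$:
$E[\hat{\beta} \mid X] = \beta + (X^T X)^{-1} X^T\, E[\varepsilon \mid X]$
So OLS is unbiased whenever $E[\varepsilon \mid X] = 0$ (the exogeneity assumption): the second term vanishes, and by the law of iterated expectations $E[\hat{\beta}] = \beta$. Common sources of bias: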
- Omitted variable bias: A relevant variable correlated with $X$ is left out
- Endogeneity: $X$ is correlated with $\varepsilon$ (e.g., simultaneous equations, measurement error)
- Model misspecification: The true relationship is nonlinear but we fit a linear model
Note that homoskedasticity and normality of the errors affect efficiency and exact-distribution inference, not unbiasedness; unbiasedness only requires $E[\varepsilon\mid X]=0$.
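A small Monte Carlo experiment illustrates the omitted-variable case. This is a sketch under assumed, made-up parameters: the true model is $y = \beta x + \gamma z + e$ with $\beta = 2$, $\gamma = 1$, and $z = \rho x + u$, and we regress $y$ on $x$ alone (no intercept). The function and parameter names are hypothetical. When $\rho = 0$ the omitted term is unrelated to $x$ and the average estimate should sit near $\beta$; when $\rho \neq 0$ it should drift toward $\beta + \gamma\rho$.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only

def slope_no_intercept(xs, ys):
    """Least-squares slope of ys on xs with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mean_estimate(rho, beta=2.0, gamma=1.0, n=100, reps=1000):
    """Average OLS slope of y on x over many samples when z is omitted."""
    total = 0.0
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(n):
            x = random.gauss(0, 1)
            z = rho * x + random.gauss(0, 1)  # z depends on x when rho != 0
            y = beta * x + gamma * z + random.gauss(0, 1)
            xs.append(x)
            ys.append(y)
        total += slope_no_intercept(xs, ys)
    return total / reps

exogenous = mean_estimate(rho=0.0)   # omitted z unrelated to x
endogenous = mean_estimate(rho=0.5)  # omitted z correlated with x

print("mean slope, rho = 0.0:", exogenous)
print("mean slope, rho = 0.5:", endogenous)
```

In both runs the error term has the same variance structure; only its relationship to $x$ changes, which is what moves the mean of the estimator.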
Part 3: Inverse Regression
The two regressions have no intercept, so we do not center the data. For $Y = W_1 X$ we minimize $\sum_i (y_i - W_1 x_i)^2$; differentiating and setting to zero gives the *uncentered* least-squares slope
$W_1 = \frac{\sum_i x_i y_i}{\sum_i x_i^2}.$
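Symmetrically, for $X = W_2 Y$ we minimize $\sum_i (x_i - W_2 y_i)^2$, giving
$W_2 = \frac{\sum_i x_i y_i}{\sum_i y_i^2}.$
Multiplying the two slopes:
$W_1 \cdot W_2 = \frac{\left(\sum_i x_i y_i\right)^2}{\left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right)} = \frac{\langle X, Y\rangle^2}{\|X\|^2\,\|Y\|^2} = \cos^2\theta,$
where $\theta$ is the angle between the data vectors $X=(x_1,\dots,x_n)$ and $Y=(y_1,\dots,y_n)$. By the Cauchy-Schwarz inequality $\langle X,Y\rangle^2 \le \|X\|^2\,\|Y\|^2$, so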
$W_1 \cdot W_2 = \cos^2\theta \le 1.$
Equality holds if and only if $X$ and $Y$ are linearly dependent as vectors, i.e. $y_i = c\,x_i$ for all observations (the data lie exactly on a line through the origin, with no scatter). This is the uncentered analogue of perfect correlation; the common mistake $W_2 = 1/W_1$ is wrong because the two regressions minimize different residuals (vertical for $Y$ on $X$, horizontal for $X$ on $Y$).
(Aside: if instead the data are first centered to mean zero -- equivalently, if the regressions included an intercept -- then $\sum_i x_i y_i = n\,\text{Cov}(X,Y)$, $\sum_i x_i^2 = n\,\text{Var}(X)$, etc., and the product becomes exactly $r^2$, the squared Pearson correlation. The structure is identical; only the centering differs.)
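The identities in Part 3 are easy to verify on synthetic data. The sketch below uses only the standard library and made-up data (the generating slope 0.8 and the noise level are arbitrary assumptions); it checks that the no-intercept product equals $\cos^2\theta$, that centering turns it into $r^2$, and that exactly proportional data give a product of 1.

```python
import math
import random

random.seed(2)  # arbitrary seed, for reproducibility only

n = 50
xs = [random.gauss(1.0, 1.0) for _ in range(n)]
ys = [0.8 * x + random.gauss(0.0, 0.5) for x in xs]  # noisy, so not collinear

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def no_intercept_product(u, v):
    """W1 * W2 for the regressions v = W1 * u and u = W2 * v (no intercept)."""
    w1 = dot(u, v) / dot(u, u)
    w2 = dot(u, v) / dot(v, v)
    return w1 * w2

# Uncentered case: the product is the squared cosine between the data vectors.
product = no_intercept_product(xs, ys)
cos_sq = dot(xs, ys) ** 2 / (dot(xs, xs) * dot(ys, ys))

# Centered case: the same formula gives the squared Pearson correlation.
mx, my = sum(xs) / n, sum(ys) / n
xc = [x - mx for x in xs]
yc = [y - my for y in ys]
product_centered = no_intercept_product(xc, yc)
r = dot(xc, yc) / math.sqrt(dot(xc, xc) * dot(yc, yc))

# Equality case: y_i = c * x_i exactly, so W2 = 1 / W1 and the product is 1.
ys_exact = [3.0 * x for x in xs]
product_exact = no_intercept_product(xs, ys_exact)

print("W1*W2 (uncentered):", product, "cos^2:", cos_sq)
print("W1*W2 (centered):", product_centered, "r^2:", r * r)
print("W1*W2 (exactly proportional):", product_exact)
```

Note that the uncentered and centered products generally differ on the same data, which is why the no-intercept problem calls for $\cos^2\theta$ rather than $r^2$.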
Answer:
1. $\hat{\beta} = (X^T X)^{-1} X^T Y$
2. OLS is unbiased whenever $E[\varepsilon \mid X] = 0$ (exogeneity). Bias arises from omitted variables, endogeneity, or model misspecification; homoskedasticity/normality affect efficiency and inference, not bias.
3. With no intercept, $W_1 = \sum_i x_i y_i/\sum_i x_i^2$ and $W_2 = \sum_i x_i y_i/\sum_i y_i^2$, so $W_1 W_2 = (\sum_i x_i y_i)^2/[(\sum_i x_i^2)(\sum_i y_i^2)] = \cos^2\theta \le 1$ by Cauchy-Schwarz, with equality iff $X$ and $Y$ are linearly dependent ($y_i = c\,x_i$ for all $i$).
Intuition
The deep lesson in Part 3 is that regression is fundamentally asymmetric. Regressing $Y$ on $X$ minimizes vertical distances (residuals in $Y$), while regressing $X$ on $Y$ minimizes horizontal distances (residuals in $X$). These give different lines unless the data lie exactly on a single line. The product $W_1 W_2 = \cos^2\theta$ (which becomes $r^2$ once the data are centered) is a beautiful result because it connects two seemingly different regression coefficients to the correlation. In practice, this asymmetry matters when you are deciding which variable to treat as the predictor -- the choice is not arbitrary, and reversing the regression gives you a different (and generally biased) estimate of the inverse relationship. This is why, for example, calibrating a model by regressing predicted on actual is not the same as regressing actual on predicted.