Linear Regression: Model, Estimation, and When to Use It

Regression · Medium · Free problem

Explain the linear regression model. Cover three things:

  1. What is the model structure and what assumptions does it make?
  2. How do you estimate the parameters? Derive the ordinary least squares (OLS) estimator.
  3. When should you use linear regression -- and when should you not?

Hints

  1. Start by writing the model in matrix form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ -- this makes the derivation much cleaner than working component by component.
  2. To derive OLS, minimize $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ by taking the gradient with respect to $\boldsymbol{\beta}$ and setting it to zero.
  3. The normal equations give $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$; solve for $\hat{\boldsymbol{\beta}}$ by inverting $\mathbf{X}^T\mathbf{X}$ (assuming full column rank).

Worked Solution

How to Think About It: Linear regression is the workhorse of quantitative modeling precisely because it is interpretable and has a closed-form solution. Before any derivation, ask yourself what the model is claiming: that the expected value of $y$ moves linearly with each feature, and that the noise is additive and behaves nicely. Whether those claims hold for your data is the practical question.

Key Insight: The OLS estimator minimizes the sum of squared residuals, a problem with a unique closed-form solution as long as the feature matrix has full column rank. The assumptions you need for good inference (unbiased estimates, correct standard errors) are separate from the one assumption you need for the formula to be computable.

The Method:

1. Model Structure.

The linear regression model posits: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$ where $\varepsilon$ is a noise term. In matrix form: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{X}$ is the $n \times (p+1)$ design matrix (with a column of ones for the intercept), $\boldsymbol{\beta}$ is the $(p+1)$-vector of coefficients, and $\boldsymbol{\varepsilon}$ is the $n$-vector of errors.

The Gauss-Markov assumptions, under which OLS is the best linear unbiased estimator (BLUE):

  - Linearity: $E[\mathbf{y}|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}$
  - Exogeneity: $E[\boldsymbol{\varepsilon}|\mathbf{X}] = \mathbf{0}$ (errors have zero mean given the features, which implies they are uncorrelated with them)
  - Homoscedasticity: $\text{Var}(\varepsilon_i|\mathbf{X}) = \sigma^2$ for all $i$
  - No perfect multicollinearity: $\mathbf{X}$ has full column rank
  - No autocorrelation: $\text{Cov}(\varepsilon_i, \varepsilon_j|\mathbf{X}) = 0$ for $i \neq j$

For normal-theory inference (t-tests, F-tests), we additionally assume $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$.
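To see which assumption does what, it can help to substitute the model into the OLS formula $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (derived in step 2). This is a sketch of the standard argument, conditioning on $\mathbf{X}$ and assuming full column rank:

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}
$$

$$
E[\hat{\boldsymbol{\beta}}|\mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\boldsymbol{\varepsilon}|\mathbf{X}] = \boldsymbol{\beta}
$$

$$
\text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \, \text{Var}(\boldsymbol{\varepsilon}|\mathbf{X}) \, \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}
$$

Unbiasedness uses only exogeneity. The final equality in the variance line uses homoscedasticity and no autocorrelation together, i.e. $\text{Var}(\boldsymbol{\varepsilon}|\mathbf{X}) = \sigma^2\mathbf{I}$; when either fails, the coefficients stay unbiased but the textbook variance formula no longer applies.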

2. OLS Derivation.

Minimize the sum of squared residuals: $\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$

Expand, take the gradient with respect to $\boldsymbol{\beta}$, and set it to zero: $\frac{\partial}{\partial \boldsymbol{\beta}}\left[\mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\right] = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$

Solving (assuming $\mathbf{X}^T\mathbf{X}$ is invertible): $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

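This solves the normal equations $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$. Because the Hessian $2\mathbf{X}^T\mathbf{X}$ is positive definite when $\mathbf{X}$ has full column rank, it is the unique minimizer. Geometrically, $\mathbf{X}\hat{\boldsymbol{\beta}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$.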
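The derivation can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the helper name `ols_fit`, the coefficients, and the noise level are illustrative assumptions, not part of the problem. It solves the normal equations with a linear solve rather than an explicit inverse, compares the result against `np.linalg.lstsq`, and confirms that the residuals are orthogonal to the columns of $\mathbf{X}$, which is the projection property.

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations X^T X beta = X^T y (no explicit inverse)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (hypothetical): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])  # column of ones for the intercept
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = ols_fit(X, y)

# Reference solution from NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Projection property: residuals are orthogonal to every column of X
residuals = y - X @ beta_hat
orthogonality = X.T @ residuals  # numerically close to the zero vector
```

In this well-conditioned example both routes should agree to floating-point precision; the choice between them matters more when the columns of $\mathbf{X}$ are nearly collinear.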

3. When to Use It.

Use linear regression when:

  - The relationship between $y$ and the features is plausibly linear (or can be made linear via feature transformations)
  - You need interpretable coefficients -- $\hat{\beta}_j$ is the marginal effect of $x_j$ holding others fixed
  - You are working with a moderately sized dataset where the closed form is computationally cheap
  - You want to understand marginal effects or run hypothesis tests on coefficients

Be cautious or avoid when:

  - The response variable is binary or count data (use logistic or Poisson regression instead)
  - Errors are clearly heteroscedastic or autocorrelated (financial time series almost always violate no-autocorrelation)
  - $p \gg n$ -- once there are more columns than observations, $\mathbf{X}^T\mathbf{X}$ is singular rather than merely ill-conditioned, so the OLS solution is not unique; use ridge or lasso
  - The true relationship is highly nonlinear
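The $p \gg n$ case can be illustrated with a small sketch. The dimensions, the random data, and the penalty value `lam` are arbitrary illustrative choices, and the intercept is omitted for brevity. With more columns than rows, $\mathbf{X}$ (and therefore $\mathbf{X}^T\mathbf{X}$) has rank at most $n$, so the normal equations have no unique solution, while adding a ridge penalty $\lambda \mathbf{I}$ makes the system solvable.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 10                      # more features than observations
X = rng.normal(size=(n, p))       # no intercept column, for brevity
y = rng.normal(size=n)

# rank(X^T X) equals rank(X), which cannot exceed min(n, p) = n
rank = np.linalg.matrix_rank(X)

# Ridge: (X^T X + lam * I) is positive definite for any lam > 0
lam = 1.0                         # illustrative penalty, not a tuned value
gram = X.T @ X
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)
```

In practice the penalty would be chosen by cross-validation or a similar procedure rather than fixed by hand.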

Practical Considerations: Always check residual plots (residuals vs. fitted values for heteroscedasticity, Q-Q plot for normality), report $R^2$ as a goodness-of-fit summary, and compute variance inflation factors (VIF) to detect multicollinearity. In financial applications, be especially alert to autocorrelated errors -- OLS coefficients remain unbiased as long as exogeneity holds, but the usual standard errors are wrong, which invalidates the usual t-tests.
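As one example of these checks, the VIF for feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features plus an intercept. The sketch below computes it directly with NumPy; the helper names `r_squared` and `vif` and the synthetic features are assumptions made for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (X is expected to include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def vif(features):
    """VIF_j = 1 / (1 - R_j^2) for each column of `features` (no intercept column)."""
    n, p = features.shape
    out = np.empty(p)
    for j in range(p):
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        out[j] = 1.0 / (1.0 - r_squared(design, features[:, j]))
    return out

# Synthetic features (hypothetical): x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
vifs = vif(np.column_stack([x1, x2, x3]))
```

Here the first two VIFs should come out large and the third close to 1, reflecting that only `x1` and `x2` are nearly collinear.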

Answer: The model is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$. OLS estimator: $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, valid when $\mathbf{X}$ has full column rank. Use it when linearity is reasonable and you need interpretable marginal effects; be cautious or avoid it when errors are heteroscedastic or serially correlated (at minimum, correct the standard errors), when $p \gg n$, when the response is binary or count data, or when the true relationship is highly nonlinear.

Intuition

Linear regression is worth understanding at a deep level because it is the building block for almost everything else in statistical modeling. Ridge regression, principal components regression, and generalized linear models are all extensions or modifications of the core OLS idea. When you derive $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, you are really computing an orthogonal projection -- the fitted values $\hat{\mathbf{y}}$ are the closest point in the column space of $\mathbf{X}$ to $\mathbf{y}$. That geometric view is the cleanest way to understand why OLS works and what goes wrong when $\mathbf{X}^T\mathbf{X}$ is near-singular.
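The near-singular case can be made concrete with condition numbers. In exact arithmetic the 2-norm condition number of $\mathbf{X}^T\mathbf{X}$ is the square of that of $\mathbf{X}$, so forming the normal equations roughly doubles the number of digits lost to collinearity. The sketch below uses a synthetic, nearly duplicated column; the noise scale `1e-3` is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)    # almost a copy of x1
X = np.column_stack([np.ones(n), x1, x2])

cond_X = np.linalg.cond(X)             # largest / smallest singular value of X
cond_gram = np.linalg.cond(X.T @ X)    # equals cond_X**2 in exact arithmetic
```

This is one reason least-squares routines that work on $\mathbf{X}$ directly, such as `np.linalg.lstsq`, are often preferred over explicitly inverting $\mathbf{X}^T\mathbf{X}$.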

In practice, the most common mistake quants make with linear regression in financial data is ignoring serial correlation in residuals. OLS coefficients are still unbiased as long as exogeneity holds, but with positively autocorrelated residuals (the typical case) the usual standard errors are understated, so you get falsely precise t-statistics. The fix is either Newey-West standard errors (robust to heteroscedasticity and autocorrelation) or a generalized least squares specification that models the error covariance structure explicitly.
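To make the Newey-West idea concrete, here is a hand-rolled NumPy sketch of the sandwich estimator with Bartlett weights $w_\ell = 1 - \ell/(L+1)$. The function name `newey_west_se`, the lag choice `max_lag=20`, the AR(1) simulation, and the absence of any small-sample correction are all assumptions made for illustration; an established statistics library would normally be used in practice.

```python
import numpy as np

def newey_west_se(X, resid, max_lag):
    """HAC standard errors with Bartlett weights; no small-sample correction."""
    bread = np.linalg.inv(X.T @ X)
    scores = X * resid[:, None]                  # row t is e_t * x_t
    meat = scores.T @ scores                     # lag-0 term (White / HC0)
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1)
        gamma = scores[lag:].T @ scores[:-lag]   # sum over t of e_t e_{t-lag} x_t x_{t-lag}^T
        meat += weight * (gamma + gamma.T)
    cov = bread @ meat @ bread                   # sandwich covariance of beta_hat
    return np.sqrt(np.diag(cov))

def ar1(rho, shocks):
    """Simulate an AR(1) path driven by the given shocks."""
    out = np.empty_like(shocks)
    out[0] = shocks[0]
    for t in range(1, len(shocks)):
        out[t] = rho * out[t - 1] + shocks[t]
    return out

# Synthetic example (hypothetical): persistent regressor and persistent errors
rng = np.random.default_rng(4)
n = 2000
x = ar1(0.9, rng.normal(size=n))
eps = ar1(0.9, rng.normal(size=n))
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Textbook standard errors, which assume Var(eps | X) = sigma^2 * I
sigma2 = resid @ resid / (n - X.shape[1])
naive_se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

hac_se = newey_west_se(X, resid, max_lag=20)
```

With both the regressor and the errors strongly persistent, `hac_se` should come out noticeably larger than `naive_se`, which is the understatement described above. Setting `max_lag=0` reduces the estimator to heteroscedasticity-robust (White) standard errors.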
