Linear Regression Interview Questions: OLS Assumptions, R² Traps & Regression to the Mean

Q: What are the OLS assumptions asked about in quant interviews?

The Gauss-Markov assumptions: linearity in parameters, exogeneity (errors uncorrelated with regressors), homoskedasticity, no autocorrelation of errors, and no perfect multicollinearity. Under these, OLS is the best linear unbiased estimator. Normality of errors is not required for unbiasedness — it only matters for exact finite-sample t and F tests, a distinction interviewers frequently probe.

Q: What is the most important regression formula for interviews?

The univariate OLS slope: beta = Cov(x,y)/Var(x) = rho times sigma_y over sigma_x. It immediately gives R-squared = rho-squared for simple regression and explains why regressing y on x and x on y produce slopes whose product is rho-squared, not 1. A large share of regression interview questions are this formula in disguise.

Q: Why doesn't reversing a regression give the reciprocal slope?

The slope of y on x is rho·sigma_y/sigma_x while the slope of x on y is rho·sigma_x/sigma_y, so their product is rho-squared, which is less than 1 unless the variables are perfectly correlated. Each regression shrinks its prediction toward the mean, which is exactly the regression-to-the-mean effect. Answering 1/beta is one of the most common ways candidates fail this question.

Q: What happens to OLS under heteroskedasticity or autocorrelated errors?

Coefficient estimates remain unbiased and consistent, but OLS is no longer efficient and the usual standard errors are wrong; on financial data this typically makes results look more significant than they are. The standard fixes are White (robust) standard errors for heteroskedasticity and Newey-West (HAC) standard errors, which handle autocorrelation and heteroskedasticity together. Naming both fixes, not just the problem, is what interviewers look for.
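
A minimal NumPy sketch of the first fix, on simulated data. The data-generating process, seed, and variable names are illustrative assumptions, not taken from any real dataset. It computes classical and White (HC0) standard errors by hand so the sandwich formula is visible; Newey-West adds weighted autocovariance terms to the same "meat" matrix and is not shown here.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
x = rng.standard_normal(n)
# Assumed error process: the error volatility grows with |x| (heteroskedastic).
eps = rng.standard_normal(n) * (0.5 + np.abs(x))
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])      # intercept + one regressor
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical standard errors: sigma^2 * (X'X)^{-1}, valid only under homoskedasticity.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) standard errors: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}.
meat = X.T @ (X * resid[:, None] ** 2)
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("slope estimate:", beta_hat[1])
print("classical SE:  ", se_classical[1])
print("White SE:      ", se_white[1])
```

In this setup the slope estimate stays close to the true value of 2, while the classical standard error on the slope comes out smaller than the robust one, which is the "looks more significant than it is" effect.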

The regression questions Two Sigma, DE Shaw, and Citadel actually ask — and the one formula that answers half of them.

Updated July 3, 2026 · QuantVault

Linear regression is the single most common statistics topic in quant researcher interviews, and it is not close. Alpha research at systematic funds is, at its core, regressing future returns on signals — so interviewers at Two Sigma, DE Shaw, and Citadel use regression questions to test whether you understand the tool you would use every day. The good news: the question pool is narrow. Almost everything reduces to one formula, five assumptions, and three or four named traps.

Why interviewers ask regression

A regression question does three things at once. It checks probability fundamentals (covariance, variance, conditional expectation), it checks whether you can push algebra under pressure, and — most importantly — it checks whether you know when the standard machinery lies to you. Financial data violates textbook assumptions constantly: heteroskedastic returns, autocorrelated residuals, overlapping observations, signals with correlation 0.02 to the target. A candidate who can recite the OLS estimator but cannot say why a t-statistic on daily-return data is overstated will not pass a QR loop.

The core toolkit

For simple (univariate) regression $y = \alpha + \beta x + \varepsilon$, OLS gives

$$\hat{\beta} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)} = \rho \, \frac{\sigma_y}{\sigma_x}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}.$$

Memorize $\hat\beta = \rho\,\sigma_y/\sigma_x$ cold — it answers a remarkable fraction of interview questions on its own. Two immediate consequences interviewers love:

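R² in one variable: for simple regression with an intercept, $R^2 = \rho^2$ exactly, where $\rho$ is the sample correlation. The identity fails only if you drop the intercept.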
Regressions don't invert: the slope of $y$ on $x$ is $\rho\sigma_y/\sigma_x$; the slope of $x$ on $y$ is $\rho\sigma_x/\sigma_y$. Their product is $\rho^2 \le 1$, which equals 1 only when $|\rho| = 1$. If someone asks “you regressed y on x and got 2; what do you get regressing x on y?” the answer is $\rho^2/2$, not 1/2, so you need the correlation (or both volatilities) before you can give a number.
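
Both consequences can be checked numerically. The sketch below uses simulated data with illustrative moments (the seed, sample size, and the helper name `ols_slope` are assumptions made for this example) and compares the two slopes and the R² against the sample correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
rho, sigma_x, sigma_y = 0.5, 1.0, 2.0   # illustrative population values

# Draw (x, y) pairs with the chosen volatilities and correlation.
x = sigma_x * rng.standard_normal(n)
noise = rng.standard_normal(n)
y = sigma_y * (rho * x / sigma_x + np.sqrt(1 - rho**2) * noise)

def ols_slope(u, v):
    """Slope from regressing v on u with an intercept: Cov(u, v) / Var(u)."""
    return np.cov(u, v, ddof=1)[0, 1] / np.var(u, ddof=1)

slope_y_on_x = ols_slope(x, y)
slope_x_on_y = ols_slope(y, x)
sample_rho = np.corrcoef(x, y)[0, 1]

# R^2 of y on x, computed from the residuals of the fitted line.
intercept = y.mean() - slope_y_on_x * x.mean()
resid = y - (intercept + slope_y_on_x * x)
r_squared = 1 - resid.var() / y.var()

print("slope y on x:     ", slope_y_on_x)
print("slope x on y:     ", slope_x_on_y)
print("product of slopes:", slope_y_on_x * slope_x_on_y)
print("rho squared:      ", sample_rho**2)
print("R squared:        ", r_squared)
```

The product of the two slopes and the R² both match the squared sample correlation to floating-point precision, because these are algebraic identities rather than large-sample approximations.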

The Gauss–Markov assumptions under which OLS is BLUE (best linear unbiased estimator): linearity in parameters, exogeneity ($E[\varepsilon \mid x] = 0$), homoskedasticity, no autocorrelation of errors, and no perfect multicollinearity. Note what is not on the list: normality of errors. Normality is only needed for exact finite-sample t and F distributions, not for unbiasedness or BLUE — a distinction interviewers probe directly. For the multivariate case, know $\hat\beta = (X^\top X)^{-1}X^\top y$ and its geometric reading as an orthogonal projection of $y$ onto the column space of $X$; that connects regression to the linear algebra questions in the same loop.
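
The projection reading can be made concrete with a short sketch. The design matrix, coefficients, and seed below are hypothetical; the point is only that the residual is orthogonal to every column of $X$, so $y$ splits into a fitted part inside the column space and a residual perpendicular to it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])  # intercept + 3 regressors
true_beta = np.array([0.5, 1.0, -2.0, 0.3])                     # hypothetical coefficients
y = X @ true_beta + rng.standard_normal(n)

# Solve the normal equations (X'X) b = X'y instead of forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

fitted = X @ beta_hat          # projection of y onto the column space of X
resid = y - fitted             # component of y orthogonal to that space

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("X' resid (should be ~0):", X.T @ resid)
print("||y||^2:                ", y @ y)
print("||fitted||^2 + ||resid||^2:", fitted @ fitted + resid @ resid)
```

Orthogonality of the residual to the columns of $X$ is exactly the first-order condition of least squares, and the Pythagorean split of $\lVert y \rVert^2$ is what the sums-of-squares decomposition behind R² relies on.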

Worked example: a signal, its beta, and regression to the mean

A representative interview chain, asked essentially verbatim at systematic funds:

“A stock's daily return has volatility 2%, the market's is 1%, and their correlation is 0.5. You regress the stock on the market. What are the beta and the R²? Then: the stock returned +4% today — what is your best guess for the market's move, and for the stock's return tomorrow if returns are IID?”

Step 1 — beta. $\hat\beta = \rho\,\sigma_y/\sigma_x = 0.5 \times \tfrac{2\%}{1\%} = 1.0.$ A beta of 1 despite the stock being twice as volatile — the extra volatility is idiosyncratic.

Step 2 — R². $R^2 = \rho^2 = 0.25$. Only a quarter of the stock's variance is explained by the market; the residual volatility is $\sigma_y\sqrt{1-\rho^2} = 2\%\times\sqrt{0.75} \approx 1.73\%$.

Step 3 — the reverse regression. Predicting the market from the stock uses slope $\rho\,\sigma_x/\sigma_y = 0.5 \times \tfrac{1}{2} = 0.25$, so the best guess for the market is $0.25 \times 4\% = 1\%$ — not $4\%/\beta = 4\%$. In standardized units this is regression to the mean: the stock moved $+2\sigma$, so the predicted market move is $\rho \times 2\sigma = +1\sigma$. Extreme observations predict less-extreme partners, always, whenever $|\rho| < 1$.

Step 4 — tomorrow. If daily returns are IID, today's +4% contains zero information about tomorrow; the best guess is the unconditional mean, roughly 0. Candidates who say “it should mean-revert after a big move” are importing an autocorrelation assumption they were explicitly told not to make.
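
The arithmetic in Steps 1 to 3 can be reproduced in a few lines. This is a sketch under the same inputs as the question, with the added assumption that daily mean returns are approximately zero so the intercepts can be ignored; the variable names are chosen for this example.

```python
import math

sigma_stock, sigma_mkt, rho = 0.02, 0.01, 0.5   # inputs from the question

# Step 1: beta of the stock on the market.
beta = rho * sigma_stock / sigma_mkt

# Step 2: R^2 and residual (idiosyncratic) volatility.
r_squared = rho**2
resid_vol = sigma_stock * math.sqrt(1 - rho**2)

# Step 3: reverse regression, predicting the market from the stock.
reverse_slope = rho * sigma_mkt / sigma_stock
stock_move = 0.04
market_guess = reverse_slope * stock_move
naive_guess = stock_move / beta            # the tempting but wrong inversion

# The same prediction in standardized units: z_market = rho * z_stock.
z_stock = stock_move / sigma_stock
z_market = rho * z_stock

print(f"beta = {beta:.2f}, R^2 = {r_squared:.2f}, residual vol = {resid_vol:.4%}")
print(f"market guess = {market_guess:.2%} vs naive inversion = {naive_guess:.2%}")
print(f"stock moved {z_stock:.1f} sigma, predicted market move {z_market:.1f} sigma")
```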

The traps that fail candidates

R² worship. Adding any regressor never decreases in-sample R², which is why researchers report adjusted R² or out-of-sample fit. Also know that in return-prediction work, a daily R² of 1% can be an excellent signal — calibrating what “good” looks like separates practitioners from textbook readers.
Omitted variable bias. If a true driver $z$ is omitted and correlated with $x$, then $\hat\beta$ absorbs a bias of $\beta_z \cdot \operatorname{Cov}(x,z)/\operatorname{Var}(x)$. Interviewers ask you to sign the bias, not just name it: it is positive when $\beta_z$ and $\operatorname{Cov}(x,z)$ share a sign and negative when they differ.
Heteroskedasticity and autocorrelation. Coefficients stay unbiased; the standard errors are wrong, typically overstating significance on financial data. The fixes to name are White/robust standard errors for heteroskedasticity and Newey–West (HAC) standard errors when the residuals are also autocorrelated, which leads naturally into time series questions about autocorrelated residuals.
Multicollinearity. Correlated regressors inflate coefficient variance and make individual betas unstable, even though joint predictions can be fine. The regularized fixes (ridge, lasso) are the bridge to the machine learning round.
Correlation is not causation, stated precisely. The sharp version: OLS recovers the slope of $E[y\mid x]$ when that conditional mean is linear and exogeneity holds. If $x$ is correlated with the error (through simultaneity, selection, or measurement error in $x$, the last of which biases $\hat\beta$ toward zero), the coefficient is not causal and is not even a consistent estimate of the structural slope.
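
The omitted-variable formula from the list above is easy to verify by simulation. The sketch below uses hypothetical structural coefficients and an assumed correlation between $x$ and $z$; it compares the bias of the short regression (which omits $z$) with $\beta_z \cdot \operatorname{Cov}(x,z)/\operatorname{Var}(x)$, then shows that including $z$ removes it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta_x, beta_z = 1.0, 2.0                      # hypothetical structural coefficients

z = rng.standard_normal(n)
x = 0.6 * z + 0.8 * rng.standard_normal(n)     # Cov(x, z) is about 0.6, Var(x) about 1
y = beta_x * x + beta_z * z + rng.standard_normal(n)

def slope(u, v):
    """Simple-regression slope of v on u: Cov(u, v) / Var(u)."""
    return np.cov(u, v, ddof=1)[0, 1] / np.var(u, ddof=1)

# Short regression: y on x only, with z omitted.
short_slope = slope(x, y)
observed_bias = short_slope - beta_x
predicted_bias = beta_z * slope(x, z)          # beta_z * Cov(x, z) / Var(x)

# Long regression: y on an intercept, x, and z.
X_long = np.column_stack([np.ones(n), x, z])
long_coefs, *_ = np.linalg.lstsq(X_long, y, rcond=None)

print("observed bias: ", observed_bias)
print("predicted bias:", predicted_bias)
print("coefficient on x with z included:", long_coefs[1])
```

Here $\beta_z$ and $\operatorname{Cov}(x,z)$ are both positive, so the short regression overstates the effect of $x$; flipping the sign of either one would flip the sign of the bias.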

How to prepare

Drill until $\hat\beta = \rho\,\sigma_y/\sigma_x$, $R^2 = \rho^2$, and the standardized-prediction rule $\hat z_y = \rho z_x$ are reflexes, then practice the assumption-violation questions out loud — they are graded on precision of language. Work through our regression question bank for the full range from slope algebra to Newey–West, the statistics bank for the hypothesis-testing questions that share the same loop, and the broader problem library to mix regression into timed sets the way a real superday does.
