Logistic Regression vs. Linear Regression vs. SVM
Compare and contrast three workhorse supervised learning models: Linear Regression, Logistic Regression, and Support Vector Machines (SVMs).
For each model, address the following:
- What type of task is it designed for (regression vs. classification)?
- What loss function does it minimize, and what does that choice imply about sensitivity to outliers?
- What form does its decision boundary take, and how can it handle non-linear relationships?
- Does it produce calibrated probabilities, or just class labels?
Finally, given a new dataset, explain how you would decide which of the three to use. What properties of the data and the business problem drive that choice?
Hints
- Start by comparing what each model optimizes -- the loss function tells you almost everything about how the model behaves.
- Think about what each model outputs: a continuous value, a probability, or a class label. This distinction drives when you would use each one in practice.
- Consider edge cases: What happens with outliers? With high-dimensional data? When you need calibrated probabilities for downstream decisions?
Worked Solution
How to Think About It: These three models sit on a spectrum. Linear regression is for continuous targets -- it fits a line through a cloud of points. Logistic regression and SVM are both classifiers, but they attack the problem differently: logistic regression maximizes the likelihood of the observed labels (it is a probabilistic model), while SVM maximizes the geometric margin between classes (it is a geometric model). That distinction drives most of the practical differences. When an interviewer asks this, they want to see that you understand the loss functions and what each one implies -- not just a feature comparison table.
Key Insight: The loss function is the heart of each model. MSE penalizes large errors quadratically (sensitive to outliers), log-loss is a proper scoring rule and so rewards calibrated probabilities, and hinge loss focuses only on points near or on the wrong side of the decision boundary (the support vectors).
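To make the comparison concrete, here is a minimal sketch of the three losses in plain Python. The function names are illustrative, not from any library. The two classification losses are written in terms of the margin $m = y \cdot s$ with labels in $\{-1, +1\}$ and raw score $s$; for log-loss this is an equivalent rewriting of the cross-entropy used with $\{0, 1\}$ labels.

```python
import math

def squared_loss(residual):
    """Squared error for a regression residual r = y - y_hat."""
    return residual ** 2

def log_loss(margin):
    """Logistic loss as a function of the margin m = y * score, y in {-1, +1}."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin):
    """Hinge loss as a function of the margin m = y * score, y in {-1, +1}."""
    return max(0.0, 1.0 - margin)

# Negative margins are misclassified points, positive margins are correct ones.
for m in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"margin={m:+.1f}  log={log_loss(m):.4f}  hinge={hinge_loss(m):.4f}")

# A residual 10x larger costs 100x more under squared loss.
for r in (1.0, 10.0):
    print(f"residual={r:.1f}  squared={squared_loss(r):.1f}")
```

For a confidently correct point (margin 5) the hinge loss is exactly zero while log-loss is small but still positive, which is why every point contributes to a logistic regression fit but only points near or past the margin contribute to an SVM.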
The Method:
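*Linear Regression:*
- Task: Regression (continuous target).
- Minimizes the sum of squared errors $\sum_i (y_i - x_i^T \beta)^2$ (equivalent to MSE, which only rescales by $1/n$). The quadratic penalty means outliers dominate the fit.
- Output is a continuous value $\hat{y} = x^T \beta$. No decision boundary -- it is not a classifier.
- Least squares coincides with maximum likelihood under Gaussian errors. The Gaussian assumption matters for exact inference (confidence intervals, t-tests), not for computing the fit itself.
- Gives interpretable coefficients: $\beta_j$ is the change in $y$ per unit change in $x_j$, holding others fixed.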
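The outlier sensitivity is easy to see on a toy example. The sketch below fits a one-feature least-squares line in closed form, once on clean data and once with a single corrupted point. The data and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys_clean = [0.0, 1.0, 2.0, 3.0, 4.0]      # exactly y = x
ys_outlier = [0.0, 1.0, 2.0, 3.0, 14.0]   # last point shifted up by 10

clean_slope, clean_intercept = fit_line(xs, ys_clean)
out_slope, out_intercept = fit_line(xs, ys_outlier)

print(f"clean:   slope={clean_slope:.2f}, intercept={clean_intercept:.2f}")
print(f"outlier: slope={out_slope:.2f}, intercept={out_intercept:.2f}")
# clean:   slope=1.00, intercept=0.00
# outlier: slope=3.00, intercept=-2.00
```

One corrupted point out of five triples the slope, because its squared residual dominates the objective.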
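*Logistic Regression:*
- Task: Classification (binary, or multiclass via the multinomial/softmax extension).
- Minimizes the negative log-likelihood (cross-entropy), with labels $y_i \in \{0, 1\}$: $-\sum_i [y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)]$.
- Outputs probabilities via the sigmoid: $P(y = 1 \mid x) = \sigma(x^T \beta)$. These are typically well calibrated when the linear log-odds model fits the data reasonably well; misspecification or heavy regularization can degrade calibration.
- Decision boundary is linear in feature space: $x^T \beta = 0$, i.e. the set where $\hat{p} = 0.5$. Non-linearity requires manual feature engineering (add $x^2$, interactions, etc.).
- Coefficients are interpretable as log-odds ratios: $\beta_j$ is the change in log-odds per unit change in $x_j$, and $e^{\beta_j}$ is the corresponding odds ratio.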
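As a rough illustration of the points above, the following sketch fits a one-feature logistic regression by plain gradient descent on the mean cross-entropy. The toy data, learning rate, and step count are arbitrary assumptions; in practice you would use a library solver.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Gradient descent on mean cross-entropy for p = sigmoid(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y   # gradient of the loss w.r.t. the logit
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Toy, deliberately non-separable data so the coefficients stay finite.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
probs = [sigmoid(b0 + b1 * x) for x in xs]

def log_odds(p):
    return math.log(p / (1.0 - p))

# Moving x by one unit shifts the log-odds by exactly b1.
delta = log_odds(sigmoid(b0 + b1 * 1.0)) - log_odds(sigmoid(b0 + b1 * 0.0))
print(f"b1={b1:.4f}  log-odds shift per unit x={delta:.4f}  odds ratio={math.exp(b1):.4f}")
print(f"sum of predicted probabilities={sum(probs):.4f}  sum of labels={sum(ys)}")
```

At the fitted optimum with an intercept, the predicted probabilities sum to the number of positive labels. That is one concrete sense in which the model is calibrated on its training data.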
*Support Vector Machine:*
- Task: Classification (natively binary; multiclass is handled with one-vs-rest or one-vs-one schemes).
- Minimizes hinge loss, with labels $y_i \in \{-1, +1\}$: $C \sum_i \max(0, 1 - y_i (x_i^T w))$ plus a regularization term $\frac{1}{2} \|w\|^2$, where $C > 0$ trades margin width against margin violations.
- Finds the maximum-margin separating hyperplane. Only points on the margin or violating it (the support vectors) affect the solution.
- Does NOT natively output probabilities. You need Platt scaling (fit a sigmoid on the SVM scores) to get probabilities, and they are typically less well-calibrated than logistic regression.
- The kernel trick ($K(x_i, x_j)$ replaces dot products) allows non-linear boundaries without explicitly computing high-dimensional features. Common kernels: RBF, polynomial.
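The kernel trick can be checked by hand for a small case. The sketch below uses the homogeneous degree-2 polynomial kernel $K(x, z) = (x^T z)^2$ on 2-D inputs and verifies that it equals an ordinary dot product in an explicit 3-D feature space. The feature map `phi` and the sample points are chosen for illustration; an RBF kernel is included for comparison, although its feature space is infinite-dimensional and cannot be written out this way.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map for poly2_kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = (1.0, 2.0)
z = (3.0, -1.0)

implicit = poly2_kernel(x, z)       # computed in the original 2-D space
explicit = dot(phi(x), phi(z))      # computed in the 3-D feature space

print(f"kernel value={implicit:.6f}  explicit feature-space dot product={explicit:.6f}")
print(f"rbf(x, x)={rbf_kernel(x, x):.1f}  rbf(x, z)={rbf_kernel(x, z):.3e}")
```

Because the SVM dual only ever touches the data through dot products, swapping `dot` for a kernel gives a non-linear boundary at the cost of one kernel evaluation per pair of points.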
Practical Considerations:
- *Need probabilities?* Use logistic regression. Trading desks care about calibrated probabilities (e.g., "what is the probability this signal fires correctly?"), not just labels.
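- *High-dimensional, sparse data (e.g., text)?* A linear-kernel SVM is a strong default -- the max-margin objective tends to generalize well when $d \gg n$. Regularized logistic regression is usually competitive here, so prefer it if you also need probabilities.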
- *Small sample, clear margin?* SVM. The margin-maximization objective is a strong inductive bias with limited data.
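- *Interpretability matters?* Logistic regression. Each coefficient is the change in log-odds per unit change in its feature; to compare magnitudes across features as a rough importance measure, standardize the features first.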
- *Non-linear boundaries needed?* Kernel SVM or add polynomial/interaction features to logistic regression. Or skip both and use a tree-based model.
- *Continuous target?* Linear regression (or its regularized variants: ridge, lasso).
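The checklist above can be written down as a small rule-of-thumb function. This is only a sketch of the reasoning, not a substitute for cross-validation. The function name, flag names, and the order in which the rules are checked are all assumptions made for illustration, as is the fallback to logistic regression when no flag applies.

```python
def suggest_model(target_is_continuous,
                  need_probabilities=False,
                  need_interpretability=False,
                  high_dim_sparse=False,
                  small_sample_clear_margin=False,
                  nonlinear_boundary=False):
    """Return a first model to try, following the checklist in the text."""
    if target_is_continuous:
        return "linear regression (or ridge/lasso)"
    if need_probabilities or need_interpretability:
        if nonlinear_boundary:
            return "logistic regression with polynomial/interaction features"
        return "logistic regression"
    if nonlinear_boundary:
        return "kernel SVM (or a tree-based model)"
    if high_dim_sparse or small_sample_clear_margin:
        return "linear SVM"
    # Assumed default when nothing else applies: simple and probabilistic.
    return "logistic regression"

print(suggest_model(target_is_continuous=True))
print(suggest_model(target_is_continuous=False, need_probabilities=True))
print(suggest_model(target_is_continuous=False, high_dim_sparse=True))
print(suggest_model(target_is_continuous=False, nonlinear_boundary=True))
```

The ordering encodes one design choice: requirements from the business problem (continuous target, need for probabilities or interpretability) are checked before properties of the data.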
Answer: Linear regression minimizes MSE for continuous targets. Logistic regression minimizes cross-entropy and outputs calibrated probabilities for classification. SVM minimizes hinge loss to find the maximum-margin classifier, excels in high dimensions, but does not natively produce probabilities. Choose logistic regression when you need probabilities and interpretability, SVM when you need strong generalization with limited data or high-dimensional features, and linear regression when the target is continuous.
Intuition
The deeper lesson here is that the loss function IS the model, in a very real sense. MSE, log-loss, and hinge loss are three different answers to the question "what does it mean to make a mistake?" MSE says all errors matter, and big ones matter a lot (quadratic penalty). Log-loss says the cost of being confidently wrong is enormous (the log blows up as the probability assigned to the true class approaches 0). Hinge loss says only misclassifications and near-misses matter -- once you are safely on the right side of the margin, the loss is zero. Each of these philosophies leads to a different kind of model, and understanding them lets you pick the right tool without memorizing comparison tables.
In quant finance, this matters because the choice of loss function directly affects how your model handles outliers, tail events, and noise. A model trained with MSE will chase outliers. A model with hinge loss will ignore them if they are on the correct side of the margin, though an outlier on the wrong side still pulls the boundary, since hinge loss grows linearly with the size of the violation. Logistic regression gives you probabilities you can feed into a Kelly criterion or position-sizing formula. The model you choose should match the decision you need to make downstream.
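As a sketch of that last step, the snippet below plugs a model probability into the Kelly formula for a simple binary bet, $f^* = p - (1 - p)/b$, where $b$ is the net payoff per unit staked. The function name and the clipping at zero (no bet when the edge is negative) are assumptions for illustration. A real sizing rule would also have to account for estimation error in $p$.

```python
def kelly_fraction(p_win, net_odds):
    """Kelly fraction for a binary bet: f* = p - (1 - p) / b, floored at 0."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    if net_odds <= 0.0:
        raise ValueError("net_odds must be positive")
    edge = p_win - (1.0 - p_win) / net_odds
    return max(0.0, edge)  # assumption: never size a negative-edge bet

# p_win would come from a calibrated classifier such as logistic regression.
for p in (0.40, 0.50, 0.55, 0.60):
    print(f"p={p:.2f}  even-money Kelly fraction={kelly_fraction(p, 1.0):.2f}")
```

The formula is linear in $p$, so a miscalibrated probability translates directly into a mis-sized position. That is the practical reason to care about calibration rather than just the predicted label.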