Precision and Recall in Classification
Explain precision and recall as evaluation metrics for a binary classifier. Your answer should cover:
- The definitions in terms of the confusion matrix (TP, FP, TN, FN).
- The intuitive interpretation of each -- what does high precision mean? High recall?
- The trade-off between them and when each matters more.
- How the F1-score synthesizes the two.
Hints
- Draw a 2x2 confusion matrix and label the four cells -- precision and recall are both ratios involving TP in the numerator.
- Think about what happens as you move the classification threshold: which errors increase and which decrease?
- Precision = 'of what I called positive, how much is actually positive'; recall = 'of what is actually positive, how much did I find.'
Worked Solution
How to Think About It: A classifier makes two types of errors: it can raise a false alarm (say positive when it is negative) or miss a real event (say negative when it is positive). Precision and recall are two ways of asking 'how often is it wrong?' from different perspectives. Neither is universally better -- which one you optimize depends on the asymmetry of your error costs.
Key Insight: Precision asks 'how often are my positive predictions actually correct?' Recall asks 'of all the real positives, how many did I catch?' They typically move in opposite directions as you adjust the classification threshold.
Definitions:
For a binary classifier with confusion matrix entries TP (true positives), FP (false positives), TN (true negatives), FN (false negatives):
$\text{Precision} = \frac{TP}{TP + FP}$
Of all observations the model labeled positive, what fraction truly is positive? High precision means few of the flagged cases are false alarms.
$\text{Recall} = \frac{TP}{TP + FN}$
Of all truly positive observations, what fraction did the model catch? High recall means few missed events.
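As a minimal sketch, the two ratios can be computed directly from confusion-matrix counts. The function name `precision_recall` and the counts below are hypothetical, chosen only for illustration; returning NaN when a denominator is zero is one possible convention, not the only one.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts.

    Returns NaN for a ratio whose denominator is zero (an assumed
    convention for this sketch; other conventions exist).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return precision, recall


# Hypothetical counts for illustration only.
tp, fp, tn, fn = 40, 10, 930, 20
p, r = precision_recall(tp, fp, fn)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.800 recall=0.667
```

Note that TN does not appear in either formula: both metrics ignore how many negatives were correctly left alone.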
The trade-off:
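Raising the classification threshold (requiring higher predicted probability before calling something positive) usually increases precision and can only decrease recall or leave it unchanged: the model only flags high-confidence positives, so most of its predictions are correct, but it misses borderline cases. Lowering the threshold does the opposite. Recall is monotone in the threshold; precision is not guaranteed to be, so small local reversals are possible on real data.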
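This trade-off is visualized in the precision-recall curve. In highly imbalanced problems (e.g., fraud detection where 0.1% of transactions are fraudulent), the PR curve is usually more informative than the ROC curve: the ROC curve's false positive rate $FP/(FP+TN)$ is dominated by the huge number of true negatives, so it can look excellent even when most flagged cases are false alarms.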
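The threshold effect can be sketched on a tiny hand-made dataset. The labels, scores, and thresholds below are invented for illustration, and the helper `confusion_counts` is a hypothetical name; in this particular example precision rises monotonically with the threshold, which is typical but not guaranteed in general.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical data: 4 true positives among 10 observations.
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]

results = {}
for threshold in [0.25, 0.50, 0.75, 0.90]:
    tp, fp, fn, tn = confusion_counts(labels, scores, threshold)
    precision = tp / (tp + fp)  # safe here: every threshold flags at least one case
    recall = tp / (tp + fn)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold:.2f} precision={precision:.3f} recall={recall:.3f}")
```

On this data, moving the threshold from 0.25 to 0.90 takes precision from 0.5 to 1.0 while recall falls from 1.0 to 0.25.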
When each matters:
- High precision is critical when false alarms are expensive. A trade signal generation system is the canonical example: every false positive is a costly trade. You would rather miss some real opportunities than execute bad trades.
- High recall is critical when missing true positives is expensive. Fraud detection and medical screening are typical examples: missing a fraud case or a disease has severe consequences. You would rather investigate some false alarms than miss real events.
F1-score:
The F1-score is the harmonic mean of precision and recall: $F_1 = \frac{2 \cdot P \cdot R}{P + R}$
The harmonic mean penalizes extreme imbalances: a model with precision 1.0 and recall 0.01 gets $F_1 = 0.020$, not the arithmetic mean of 0.505. This is appropriate -- a model that almost never fires can have perfect precision but is useless.
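The arithmetic above can be checked with a short sketch. The function name `f1_score` here is a local, hypothetical helper (not a library call), and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = 1.0, 0.01
f1 = f1_score(p, r)
arithmetic_mean = (p + r) / 2

print(f"F1={f1:.3f} arithmetic mean={arithmetic_mean:.3f}")  # F1=0.020 arithmetic mean=0.505
```

The harmonic mean is pulled toward the smaller of the two values, which is why a near-zero recall drags F1 to near zero regardless of precision.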
Answer: Precision $= TP/(TP+FP)$ measures how trustworthy positive predictions are (one minus the false discovery rate); recall $= TP/(TP+FN)$ measures detection rate (the true positive rate). They typically trade off against each other as the threshold is adjusted. In quant applications: optimize precision for signal generation, recall for risk/fraud detection. F1 balances both via the harmonic mean.
Intuition
Precision and recall are useful because they decompose the error structure of a classifier in a way that overall accuracy cannot. A model that labels everything as negative has zero recall, and its precision is undefined (0/0, since it makes no positive predictions; conventions differ on whether to report it as 1 or 0). Accuracy on a highly imbalanced dataset can be 99.9% while the model learns nothing. Precision and recall expose this.
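A minimal sketch of the imbalanced case, assuming a made-up dataset of 1,000 observations with a single positive (0.1%) and a degenerate classifier that always predicts negative. Reporting undefined precision as `None` is a choice made for this illustration.

```python
# Hypothetical dataset: 1 positive among 1,000 observations.
labels = [1] + [0] * 999
# Degenerate classifier: predicts negative for everything.
predictions = [0] * 1000

pairs = list(zip(labels, predictions))
tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
tn = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 0)

accuracy = (tp + tn) / len(labels)
recall = tp / (tp + fn)
# No positive predictions, so precision is 0/0; None marks it as undefined here.
precision = tp / (tp + fp) if (tp + fp) > 0 else None

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision}")
```

Accuracy comes out at 0.999 even though the classifier finds none of the positives.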
In practice, the choice of threshold is a business decision, not a modeling one. A quant building a trade signal decides: 'I would rather trade 10 real opportunities and miss 90 than trade 100 times with 50 false signals.' With 100 real opportunities available, that is a preference for precision 1.0 at recall 0.1 over precision 0.5 at recall 0.5. That cost structure maps directly to a target point on the precision-recall curve. The F1-score is a useful single-number summary when you have no strong prior on the cost ratio, but for production systems you almost always have a cost model and should optimize accordingly.
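One way to turn a cost model into a threshold is to pick the candidate that minimizes total error cost. The sketch below assumes a simple linear cost (a fixed cost per false positive and per false negative); the data, the cost values, and the helper names `error_counts` and `best_threshold` are all hypothetical.

```python
def error_counts(labels, scores, threshold):
    """Return (FP, FN) when predicting positive for score >= threshold."""
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    return fp, fn


def best_threshold(labels, scores, thresholds, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total cost (assumed linear)."""
    def total_cost(threshold):
        fp, fn = error_counts(labels, scores, threshold)
        return cost_fp * fp + cost_fn * fn
    return min(thresholds, key=total_cost)


# Hypothetical validation data and candidate thresholds.
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
thresholds = [0.25, 0.50, 0.75, 0.90]

# False alarms cost 5x a miss (e.g. a trade signal): favors high precision.
signal_threshold = best_threshold(labels, scores, thresholds, cost_fp=5, cost_fn=1)
# Misses cost 5x a false alarm (e.g. fraud screening): favors high recall.
screening_threshold = best_threshold(labels, scores, thresholds, cost_fp=1, cost_fn=5)

print(signal_threshold, screening_threshold)  # 0.9 0.25
```

The same scores yield opposite operating points once the cost ratio flips, which is the sense in which the threshold is a business decision.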