Designing a Real-Time Fraud Detection System

Machine Learning · Medium · Free problem

You are asked to design an online (real-time) algorithm to detect fraudulent transactions -- for example, flagging suspicious orders on an e-commerce platform.

Address the following:

What features would you engineer for the model? Consider transaction-level, behavioral, and aggregate features.
What model(s) would you use, and why? What are the latency constraints?
How would you handle the severe class imbalance (fraud is typically less than 1% of transactions)?
How would you deploy, monitor, and retrain the system in production?

Hints

Think in layers: rules for obvious patterns, ML for subtle ones, human review for uncertain cases. The system design matters as much as the model choice.
Behavioral features (velocity, geographic anomalies, spending pattern deviation) are often more predictive than transaction-level features alone. They require maintaining real-time user profiles.
Never optimize for accuracy with imbalanced classes. Use precision-recall AUC, set thresholds based on the cost ratio of false negatives to false positives, and complement supervised models with anomaly detection for novel fraud patterns.

Worked Solution

How to Think About It: Fraud detection is a classic real-time classification problem with three hard constraints: (1) you need sub-100ms latency because the decision must happen before the transaction completes, (2) the classes are massively imbalanced -- fraud is maybe 0.1-1% of transactions, (3) the adversary adapts, so your model degrades over time. The system design matters as much as the model. If an interviewer asks this, they want to see that you can think end-to-end, not just pick a classifier.

Key Insight: The right answer is not a single model -- it's a pipeline. Rules catch the obvious cases (known fraud patterns, velocity limits). An ML model catches the subtle cases. Human review handles the uncertain middle. The feedback loop from human review to model retraining is what keeps the system alive.

The Method:

1. Feature Engineering:

*Transaction-level features:* - Amount, merchant category code (MCC), payment method, card-present vs. card-not-present - Time of day, day of week (fraud patterns differ by time) - Geographic location, IP address, device fingerprint - Shipping address vs. billing address mismatch

*Behavioral features (require a user profile):* - Velocity: number of transactions in the last 1 hour, 24 hours, 7 days - Amount deviation: how far this transaction is from the user's rolling average - Geographic velocity: distance from last transaction divided by time elapsed ("impossible travel") - Category deviation: is this merchant type unusual for this user?

*Aggregate/network features:* - Merchant fraud rate (historical fraud incidence at this merchant) - BIN-level fraud rate (fraud rate for this card issuer/range) - Device/IP reputation (has this device been associated with prior fraud?) - Graph features: shared payment methods, addresses, or devices across accounts

2. Model Selection:

*Real-time scoring layer:* - Gradient boosted trees (XGBoost/LightGBM) are the standard choice. They handle mixed feature types naturally, are fast at inference (~microseconds per prediction), and provide feature importances for explainability. - Logistic regression as a fallback for extreme latency requirements or as a first-pass filter.

*Complementary models:* - Rule-based filters for hard cutoffs (e.g., block transactions >

0,000 from new accounts, flag impossible travel) - Anomaly detection (isolation forest, autoencoder) for detecting novel fraud patterns the supervised model hasn't seen - A neural network (if latency budget allows) for complex feature interactions

*Latency:* The entire scoring pipeline must complete in < 100ms. Feature computation from pre-aggregated stores (Redis/Memcached) takes ~10-30ms. Model inference takes ~1-5ms. The bottleneck is feature retrieval, not model scoring.

3. Handling Class Imbalance:

Cost-sensitive learning: Assign higher weight to fraud samples in the loss function. A false negative (missed fraud) costs orders of magnitude more than a false positive (declined legitimate transaction). Set class weights proportional to the cost ratio, not just the inverse frequency.
Stratified sampling: Ensure every training batch contains fraud examples. Undersample the majority class rather than oversampling the minority -- SMOTE can create unrealistic synthetic examples.
Evaluation metric: Never use accuracy. Use precision-recall AUC, or optimize for a specific recall target (e.g., catch 95% of fraud) and report the corresponding precision. The F1 score at the operating threshold is a useful summary.
Threshold tuning: The model outputs a probability. The decision threshold should be set based on the business cost tradeoff: $\text{threshold}^{*} = \text{argmin} \; [C_{FN} \cdot FNR(t) + C_{FP} \cdot FPR(t)]$.
Anomaly detection complement: Unsupervised methods (isolation forest, autoencoder reconstruction error) catch novel fraud patterns that the supervised model misses because they weren't in the training data.

4. Deployment and Monitoring:

*Deployment:* - Score every transaction in real-time. High-confidence fraud: auto-block. Low-confidence: auto-approve. Middle band: queue for human review. - Serve the model behind a low-latency API. Use a model registry (MLflow) for versioning.

*Monitoring:* - Track the score distribution over time. Distribution shift signals either changing fraud patterns or data pipeline issues. - Monitor precision and recall on a rolling basis using labeled data from human review. - Alert on sudden changes in fraud rate, score distribution, or feature distributions.

*Retraining:* - Retrain on a regular cadence (weekly or biweekly) incorporating newly labeled data. - Use the human review feedback loop: transactions flagged for review get labeled (fraud/not fraud) and feed back into training. - Watch for label delay: fraud may not be confirmed for days or weeks (chargebacks), so recent data has label noise. - A/B test new model versions: route a fraction of traffic to the new model and compare fraud catch rates.

Practical Considerations: - Cold start problem: new users have no behavioral history. Use population-level features and be more conservative (lower threshold) until sufficient history accumulates. - Adversarial adaptation: fraudsters learn the rules. Continuously add new features and retrain. Graph-based features (linking accounts by shared attributes) are harder to game than transaction-level features. - Explainability: regulations (and customer support) require explaining why a transaction was declined. Tree-based models with SHAP values provide this.

Answer: Build a layered pipeline: rule-based filters for obvious cases, a gradient-boosted tree for real-time ML scoring (< 100ms), and human review for the uncertain middle. Engineer behavioral and velocity features from pre-aggregated user profiles. Handle class imbalance with cost-sensitive learning and PR-AUC evaluation. Monitor score distributions and retrain weekly with feedback from human review labels.

Intuition

Fraud detection is the canonical example of a machine learning system where the model is the easy part and the system design is the hard part. The core ML task -- binary classification with imbalanced classes -- is well understood. What makes it difficult in practice is the real-time constraint (features must be pre-computed and cached), the feedback loop (labels arrive with delay via chargebacks), the adversarial dynamics (fraudsters adapt to your model), and the business constraints (you can't decline too many legitimate transactions).

The key takeaway for interviews: show that you think about the entire lifecycle, not just model training. Feature engineering from real-time aggregates, cost-sensitive threshold selection, monitoring for distribution shift, and the retraining cadence are what separate a production system from a Kaggle notebook. This is also a good place to demonstrate awareness of the precision-recall tradeoff: in fraud detection, a 1% false positive rate on millions of legitimate transactions means thousands of angry customers, so precision matters as much as recall.

Open the full interactive solver →