When “98% accuracy” turns into a bad model

A data science team launches a churn model and celebrates: 98% accuracy on a held-out spreadsheet. Two weeks later, retention campaigns miss the customers who actually cancel, and support complains that “the model only flags obvious cases.” Nothing is crashing; the model is simply not useful. The issue is a classic beginner trap: an offline score that doesn’t measure what you think it measures, and a model that learned patterns that don’t generalize to new customers.

This lesson gives you the basics you need to evaluate models like an engineer: generalization, train/validation/test splits, and a few core metrics and failure modes. You’ll also see how evaluation connects back to your dataset “contract” (example unit, features available at prediction time, and label definition) and to the training “knobs” you control (hyperparameters that can encourage overfitting or underfitting). If you can’t evaluate well, you can’t debug, compare, or ship models with confidence.

Generalization, overfitting, and what “good” actually means

Generalization means a model performs well on new, unseen examples drawn from the same real-world process you care about. In supervised learning, it’s easy to accidentally optimize for the wrong thing: your model can look impressive on data it has effectively “seen” (directly or indirectly) and still fail the moment it meets the messy variability of production. A useful mental model is that your training set is evidence, not truth: it contains signal, noise, and quirks of how the data was collected.

Overfitting happens when a model learns idiosyncrasies of the training set—noise, rare coincidences, or overly specific rules—so training performance is great but out-of-sample performance drops. Underfitting is the opposite: the model is too constrained (or under-trained) to capture real signal, so it performs poorly even on training data. These ideas connect directly to hyperparameters from the prior lesson: capacity controls (like tree depth or regularization strength) are among the biggest levers that push you toward overfitting or underfitting.
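
To see this with one concrete knob, here is a minimal sketch (scikit-learn on a synthetic dataset, so the exact scores are only illustrative) that sweeps decision-tree depth and compares training and validation accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for any tabular classification problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for depth in [1, 3, 5, 10, None]:  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Underfitting: both scores low. Overfitting: training score high, validation score drops.
    print(f"max_depth={depth}: train={model.score(X_train, y_train):.3f} "
          f"val={model.score(X_val, y_val):.3f}")
```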

A key connection back to the dataset contract is this: you only get meaningful generalization if your examples, features, and labels match the decision you’ll make in practice. If your feature set accidentally includes hints from the future (leakage), you’ll “generalize” in evaluation but fail in production. If your examples are correlated (e.g., many rows per customer) and you split randomly, you can leak customer identity across splits and inflate scores. Good evaluation is partly statistics and partly system design: you are trying to simulate the future decision context, not just shuffle rows.

The evaluation loop: split, train, tune, and test (without fooling yourself)

The basic evaluation workflow is simple: separate data into distinct roles, train on one part, tune decisions on another, and report final results on a third. The tricky part is being disciplined about what each split is allowed to influence. If you use the same data to both choose a model and judge it, you’ll tend to “discover” improvements that are actually just luck and overfitting to the evaluation set.

A standard setup uses train / validation / test sets. You fit parameters on the training set. You use the validation set for model selection—choosing hyperparameters, deciding which features to keep, selecting a classification threshold, or comparing algorithms. You use the test set once at the end as a reality check: it approximates how the chosen approach might perform on new data. The test set is valuable because it’s “clean”—as long as you haven’t looked at it repeatedly while making decisions.
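
One minimal way to set this up in code is two chained random splits. This sketch uses scikit-learn and a synthetic dataset; the 60/20/20 proportions are an assumption, not a rule:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in dataset

# Carve off the test set first, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42  # 0.25 of the remaining 80% ~ 20% overall
)
# Fit on (X_train, y_train), tune on (X_val, y_val), and score (X_test, y_test) once at the end.
```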

You also need to choose how to split. Random splits are common, but they can be invalid in time-dependent problems (fraud, churn, demand forecasting) where the future differs from the past. In those cases, a time-based split (train on earlier periods, validate on later periods, test on the most recent period) better mirrors deployment. Another common issue is grouped data: if multiple examples come from the same customer, user, device, or property, you often need a grouped split so all rows for an entity stay in one partition. Otherwise your model can “recognize” the entity across splits, and you will overestimate generalization.
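
Here is a sketch of both ideas with pandas and scikit-learn; the `customer_id` and `event_date` column names and the cutoff date are assumptions about your modeling table:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy modeling table: in practice this is one row per example.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "event_date": pd.to_datetime([
        "2025-01-05", "2025-02-10", "2025-01-20", "2025-03-01",
        "2025-02-15", "2025-03-20", "2025-01-30", "2025-03-25",
    ]),
    "label": [0, 1, 0, 0, 1, 1, 0, 1],
})

# Time-based split: everything before the cutoff trains, everything after evaluates.
cutoff = pd.Timestamp("2025-03-01")
train_time, eval_time = df[df["event_date"] < cutoff], df[df["event_date"] >= cutoff]

# Grouped split: all rows for a customer land in exactly one partition.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, eval_idx = next(splitter.split(df, groups=df["customer_id"]))
train_grouped, eval_grouped = df.iloc[train_idx], df.iloc[eval_idx]
print(len(train_time), len(eval_time), len(train_grouped), len(eval_grouped))
```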

Here’s the clean division of labor to keep in your head:

| Decision / artifact | Training set | Validation set | Test set |
|---|---|---|---|
| Fit model parameters (weights, splits, etc.) | Yes: parameters are learned here | No | No |
| Choose hyperparameters (depth, regularization, k, etc.) | No: don’t “pick” using training score alone | Yes: main purpose | No |
| Choose a threshold (e.g., fraud alert cutoff) | No | Yes: tuned to business tradeoffs | No |
| Report final performance | Not the headline number | Sometimes for iteration, but not final | Yes: final unbiased estimate (if used once) |
| Common pitfall | Believing high training performance means “good model” | Iterating until you “win” on validation (overfitting to validation) | Peeking repeatedly, then calling it “final” |


A frequent misconception is that evaluation is just “compute a metric.” In practice, evaluation is a process control problem: you are trying to prevent information from the future (or from your own decisions) from leaking into what you treat as an unbiased estimate. That’s why reproducible experiment tracking matters: if you can’t tell whether a result changed because data changed, hyperparameters changed, or you simply got lucky on a split, you don’t have a stable evaluation story.

Metrics that match the decision: accuracy is rarely enough

Beginner evaluations often default to accuracy for classification because it’s simple: “what fraction of predictions are correct.” Accuracy is valid when classes are balanced and mistakes have similar cost. But in many data science problems—fraud, churn, defect detection—the positive class is rare, and the costs are asymmetric. A model can be 98–99% accurate by predicting “no” for everyone and still be useless.

Instead, you want metrics that surface the tradeoff between catching positives and avoiding false alarms. The building block is the confusion matrix: true positives (TP), false positives (FP), true negatives (TN), false negatives (FN). From it, you derive:

  • Precision = TP / (TP + FP): among flagged cases, how many are truly positive?

  • Recall = TP / (TP + FN): among actual positives, how many did we catch?

  • F1 score: a single number balancing precision and recall (useful when you want a compromise).

  • ROC-AUC: measures ranking quality across thresholds; can look optimistic on highly imbalanced data.

  • PR-AUC: focuses on performance for the positive class; often more informative when positives are rare.
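
To make these definitions concrete, here is a minimal sketch using scikit-learn’s metric functions on made-up labels and scores:

```python
from sklearn.metrics import (average_precision_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]                          # actual labels
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]                          # hard predictions at some threshold
y_score = [0.1, 0.2, 0.15, 0.6, 0.05, 0.3, 0.8, 0.9, 0.4, 0.7]   # model scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print("precision:", precision_score(y_true, y_pred))          # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))              # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_score))            # threshold-free ranking quality
print("PR-AUC:   ", average_precision_score(y_true, y_score))  # average precision ~ area under the PR curve
```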

Metrics aren’t just math; they encode the operational truth of what happens after the model predicts. If a fraud team has limited investigators, high recall with terrible precision can overwhelm the queue. If a retention team can only intervene for a small cohort, precision matters because every false positive wastes budget and annoys customers. Conversely, if missing a positive is extremely costly (e.g., safety incidents), recall may dominate. The best practice is to define evaluation in the same units as the decision: “How many true cases do we catch per 100 reviews?” or “What is the expected cost per 10,000 customers scored?”

A second misconception is treating threshold choice as an afterthought. Many models output a score or probability, and the threshold is part of the model you ship in practice. You usually set it using validation data and business constraints (review capacity, budget, acceptable risk). If you pick a threshold on the test set, you’ve used test information to optimize the system, and your test metric is no longer a clean estimate.
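
One way to make threshold selection an explicit, validation-only step is sketched below; the 80% precision requirement is an invented business constraint, and the data and model are synthetic stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a fitted model and a validation split.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

val_scores = model.predict_proba(X_val)[:, 1]

# Illustrative business rule (an assumption): require >= 80% precision on validation,
# then pick the threshold that maximizes recall under that constraint.
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
ok = precision[:-1] >= 0.80        # precision/recall carry one extra trailing point
threshold = thresholds[ok][np.argmax(recall[:-1][ok])] if ok.any() else 0.5
print("chosen threshold:", round(float(threshold), 3))
# The threshold ships with the model; the test set is scored once with it fixed.
```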

Two kinds of “bad evaluation”: leakage and mismatch

Evaluation fails most often for two reasons: information leakage and mismatch between evaluation and deployment. Leakage includes obvious future features (like a “chargeback filed” flag used to predict fraud) but also subtler forms: using aggregate features computed over the full dataset (including the future), or splitting correlated examples so that near-duplicates appear in both train and test. The result is misleadingly high offline performance.
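
One subtle case is preprocessing statistics computed over all rows before splitting. A sketch of the safer pattern, with preprocessing kept inside a scikit-learn pipeline so it is fit only on the training portion of each split (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

# Leaky pattern (avoid): fitting the scaler on ALL rows before splitting lets
# evaluation-set statistics influence the training features.
# X_scaled = StandardScaler().fit_transform(X)

# Safer pattern: preprocessing lives inside the pipeline, so every fit/transform
# happens only on the training portion of each cross-validation split.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```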

Mismatch is trickier because it can happen even with honest splits. If your example unit differs from the decision unit, your metric can reward behavior that doesn’t translate to the real workflow. For instance, if you predict churn at a customer-month level but the business intervenes weekly at a customer level, you might overcount positives, mis-measure lift, or double-target the same customer. Similarly, if your labels are proxies (like “canceled within 30 days”), operational delays or policy changes can shift the relationship between features and label. A model can “generalize” to the proxy while failing the true intent.

A practical best practice is to write an evaluation checklist that mirrors the dataset contract and training knobs:

  • Example integrity: does each row correspond to the real decision moment?

  • Feature availability: is every feature known at prediction time, computed “as-of” the decision date?

  • Label alignment: does the label represent the outcome you truly care about, measured consistently?

  • Split validity: is the split time-based or grouped when needed to prevent leakage?

  • Selection discipline: are hyperparameters and thresholds chosen on validation only, with test used once?

When one of these breaks, you can get evaluation numbers that are not just slightly wrong—they can be directionally wrong, causing you to choose the wrong model family or the wrong capacity settings. That’s why evaluation is not a final step; it’s part of designing the learning system.
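
The feature-availability item on that checklist is often the easiest to get subtly wrong. Here is a minimal sketch of computing features “as of” the decision date with pandas; the tables and column names are invented for illustration:

```python
import pandas as pd

# Assumed schema: one decision row per customer with a decision date, and an event
# log from which features are aggregated. Only events at or before the decision
# date may contribute to features.
decisions = pd.DataFrame({
    "customer_id": [1, 2],
    "decision_date": pd.to_datetime(["2025-03-01", "2025-03-15"]),
})
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(["2025-01-10", "2025-02-20", "2025-03-05",
                                  "2025-02-01", "2025-03-10"]),
    "amount": [20.0, 35.0, 500.0, 15.0, 40.0],
})

# Keep only events known "as of" each decision date, then aggregate per customer.
joined = decisions.merge(events, on="customer_id", how="left")
known = joined[joined["event_date"] <= joined["decision_date"]]
features = known.groupby("customer_id")["amount"].agg(["count", "sum"]).reset_index()
print(features)  # the row for customer 1 excludes the 2025-03-05 event (it is in the future)
```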

Applied example 1: Fraud detection with imbalanced data and threshold decisions

Imagine a payments fraud model where each example is a transaction, features are known at authorization time, and the label is whether the transaction later becomes confirmed fraud. Fraud is typically rare, so accuracy can be deceptive. If only 1% of transactions are fraud, a model that predicts “not fraud” for every transaction is 99% accurate—and catches zero fraud. That “great” metric would translate into no protection in production.

A more useful evaluation starts with a time-aware split to reflect concept drift and changing attacker behavior: train on earlier weeks, validate on later weeks, test on the most recent weeks. Then you evaluate using precision/recall and operational quantities. For example, suppose reviewers can manually inspect 500 transactions per day. You can turn model scores into a ranked list and choose a threshold that yields roughly 500 daily alerts on the validation period. You then measure: among those 500, what fraction are truly fraudulent (precision), and what fraction of all fraud you catch (recall). This frames the model as a decision support system, not a pure classifier.
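
A sketch of that capacity-based evaluation follows; the data is invented and the scores are random, so the printed numbers only illustrate the computation, not a real model:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Invented validation data: one row per transaction, with a model score and a fraud label.
val = pd.DataFrame({
    "date": pd.to_datetime("2025-06-01") + pd.to_timedelta(rng.integers(0, 14, n), unit="D"),
    "score": rng.random(n),            # random scores here, purely to show the mechanics
    "is_fraud": rng.random(n) < 0.01,  # ~1% fraud rate
})

daily_capacity = 500  # reviewers can inspect this many transactions per day

# Alert on the highest-scoring transactions each day, up to capacity.
val["alert"] = (
    val.groupby("date")["score"].rank(method="first", ascending=False) <= daily_capacity
)

tp = (val["alert"] & val["is_fraud"]).sum()
precision = tp / val["alert"].sum()
recall = tp / val["is_fraud"].sum()
print(f"precision in the review queue: {precision:.3f}, recall over all fraud: {recall:.3f}")
```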

Limitations show up quickly in this framing. If precision is low, the review queue gets noisy and trust erodes. If recall is low, the business still loses money to chargebacks. You can tighten the threshold to raise precision, but recall will fall; loosen it to raise recall, but the queue may overflow. The point of evaluation is to make that tradeoff explicit and choose thresholds and models that match constraints. Hyperparameters matter here too: increasing capacity might raise training performance but can reduce stability and inflate false positives when attackers change tactics.

Applied example 2: Churn prediction with grouped entities and time splits

Consider churn prediction where each example is a customer-month, features summarize behavior up to an as-of date, and the label is whether the customer cancels in the next window. This setup is particularly vulnerable to evaluation inflation if you randomly split rows: the same customer can appear in both train and test months, allowing the model to learn customer-specific signatures (plan type, long-term usage level, region) and effectively “recognize” them later. Your test score looks great, but you’ve measured partial memorization, not generalization to new customers or new periods.

A stronger evaluation design uses either a grouped split by customer (so all months for a customer are in one partition) or a time-based split (train on earlier months, test on later months), depending on what “generalization” means for the business. If the goal is to predict future churn for the same customer base, time splits are often the closest mirror of deployment. If the goal is to generalize to new cohorts or acquisition channels, grouping by customer can be essential. Sometimes you need both: for example, train on months 1–9 for a set of customers, validate on month 10 for a disjoint set, and test on month 11 for another disjoint set.
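
Here is a sketch of that combined design on a toy customer-month table; the column names, month cutoffs, and 60/20/20 customer assignment are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented customer-month table: one row per customer per month.
customers = np.arange(300)
months = pd.date_range("2025-01-01", periods=11, freq="MS")   # months 1-11
df = pd.DataFrame([(c, m) for c in customers for m in months],
                  columns=["customer_id", "month"])

# Assign each customer to exactly one role, so no customer spans two partitions.
roles = pd.Series(rng.choice(["train", "val", "test"], size=len(customers), p=[0.6, 0.2, 0.2]),
                  index=customers)
df["role"] = df["customer_id"].map(roles)

# Combine grouping with time: months 1-9 for training customers, month 10 for a
# disjoint validation cohort, month 11 for a disjoint test cohort.
train = df[(df["role"] == "train") & (df["month"] <= "2025-09-01")]
val = df[(df["role"] == "val") & (df["month"] == "2025-10-01")]
test = df[(df["role"] == "test") & (df["month"] == "2025-11-01")]
print(len(train), len(val), len(test))
```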

Once splits are valid, metric choice follows the business action. If the retention team can only target 5% of customers each month, evaluate precision at top-K (how many true churners are in the top 5% risk scores) or PR-AUC. Then choose the action threshold on validation data to match targeting capacity. This evaluation approach also surfaces model stability: if small retrains produce huge swings in who is targeted, you may be overfitting or relying on fragile features. In that case, you might prefer simpler models or stronger regularization—not because they maximize a single metric, but because they deliver a more reliable system.
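
A minimal sketch of precision at top-K on a validation month; the 5% budget and the synthetic labels and scores are assumptions for illustration:

```python
import numpy as np

def precision_at_top_k(y_true, scores, fraction=0.05):
    """Precision among the `fraction` of customers with the highest risk scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(len(scores) * fraction))
    top_k = np.argsort(scores)[::-1][:k]          # indices of the k highest scores
    return y_true[top_k].mean()

# Invented validation-month data: ~8% of customers churn, scores weakly track the label.
rng = np.random.default_rng(0)
y_val = (rng.random(10_000) < 0.08).astype(int)
scores = 0.3 * y_val + rng.random(10_000)         # toy scores correlated with churn
print("precision@top-5%:", round(precision_at_top_k(y_val, scores, fraction=0.05), 3))
```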

A checklist you can trust for evaluation basics

Evaluation is your reality anchor: it tells you whether your model is learning general patterns or just exploiting quirks of your data and process. The simplest way to stay honest is to control information flow: decide what data is allowed to influence training, selection, and final reporting, and make sure that matches how the model will be used.

Key points to carry forward:

  • Generalization is performance on genuinely unseen data that reflects the real decision context.

  • Use train/validation/test with discipline: train fits parameters, validation selects hyperparameters and thresholds, test is the one-time final check.

  • Choose splits that prevent leakage: time-based for temporal problems, grouped for repeated entities.

  • Choose metrics that match the decision, especially under class imbalance and asymmetric costs.

A solid foundation for real ML work

  • You can now define a “good model” as one that generalizes under the right split and the right metric, not just one with “high accuracy.”

  • You have a practical process for keeping evaluation honest: separate training, tuning, and final testing to avoid self-deception.

  • You can spot the most common evaluation failures early: leakage, invalid splits, and mismatched labels/example units.

  • You can connect evaluation outcomes back to controllable causes in ML systems: data definitions and hyperparameters that govern capacity and stability.

When you can evaluate models cleanly, you can iterate with confidence: improvements become real, regressions become explainable, and shipping a model feels like engineering—not guesswork.
