Why “great accuracy” can still fail in production

A retention team asks for a churn model “as soon as possible.” You pull historical data, train a model, and it reports 95% accuracy on a test split—everyone relaxes. Then the model goes live and the outreach list looks random, the retention team wastes calls, and trust evaporates. What happened?

Most of the time, it’s not the algorithm. It’s how you split data and how you measure success. A sloppy split can leak future information into training, and a sloppy metric can reward the wrong behavior (like always predicting “no churn” when churn is rare). Offline evaluation becomes a confidence trick—unintentional, but expensive.

This lesson makes evaluation honest and decision-aligned: split your data in a way that matches deployment, and choose metrics that reflect the real cost of errors and the operational constraints (like “top 500 accounts” outreach or “review top 0.5% of transactions”).

The evaluation contract: splits, metrics, and what “good” means

To evaluate an ML model, you need two things to be true at the same time:

  • Your split simulates reality: training uses only what would have been known before prediction time, and testing reflects the future the model will face.

  • Your metric matches the decision: it measures what you actually care about—ranking the right people, catching enough fraud, keeping false alarms manageable, or producing well-calibrated probabilities for thresholding.

Key terms you’ll use throughout:

  • Train / validation / test: datasets used to learn patterns (train), tune choices (validation), and estimate final performance (test). The test set is your “do we trust this?” check and should stay untouched until the end.

  • Random split vs. time-based split: random splits shuffle rows; time-based splits preserve chronology (past → future). In ML for business decisions, time usually matters.

  • Stratification: ensuring class proportions (like churn rate) are similar across splits so comparisons are stable.

  • Target leakage: including information in training that wouldn’t exist at prediction time. Leakage often shows up as suspiciously strong metrics.

  • Class imbalance: when one outcome is far more common (e.g., 98% non-fraud). Some metrics become misleading under imbalance.

An analogy that holds up in practice: problem framing is the contract, data readiness is the audit, and evaluation is the courtroom test. If you pick the wrong “jury” (split) or the wrong “verdict” (metric), you can’t trust the outcome—even if the model looks mathematically impressive.

Splits that prevent leakage and tell the truth

Random splits: fast, common, and often too optimistic

A random split (say 80/20) assumes each row is independent and identically distributed. That’s occasionally fine—especially for static, non-temporal problems where each example truly stands alone. But most data science use cases quietly violate that assumption: customers appear multiple times, behavior evolves over time, and the business changes (pricing, product, fraud patterns). Random splits can accidentally let near-duplicates or future states of an entity land in both train and test.

The biggest risk is time integrity. If you build features like “total tickets ever” or an aggregate computed over the full dataset, a random split won’t catch that you used future information. Even if every feature has a timestamp, joins can pull in “last known status” that updates after the outcome, or aggregates can be computed without an “as of” boundary. Random splits can then produce inflated AUC/accuracy and give you a false sense of security.

If you do use random splitting, the best practice is to make the assumptions explicit. Ask: are we okay with training on patterns that might reflect the future? Are there repeated entities? Are the features point-in-time correct? If the answer to any of these is “not sure,” the safe default is to stop treating the dataset as a bag of rows and start treating it as snapshots through time.
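
If you do go ahead with a random split, scikit-learn’s train_test_split with stratification is a reasonable starting point. A minimal sketch on toy data (column names like account_id and churned are illustrative, not from any specific dataset); note that stratification keeps class proportions stable but does nothing about time or entity leakage:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for a churn table (column names are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "account_id": rng.integers(0, 1_000, size=5_000),
    "tenure_days": rng.integers(0, 720, size=5_000),
    "churned": rng.binomial(1, 0.05, size=5_000),   # ~5% positive class
})

# Stratify on the label so the churn rate is similar in both splits.
# Note: this does nothing about time leakage or repeated account_ids.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["churned"], random_state=42
)
print(f"train churn rate: {train_df['churned'].mean():.3f}")
print(f"test churn rate:  {test_df['churned'].mean():.3f}")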

Time-based splits: the default for most business ML

A time-based split mimics deployment: you train on the past and test on the future. This directly supports the feature availability boundary you established during data readiness—features before the prediction point, labels after. It also reveals drift: if behavior changes, performance drops on later windows, which is exactly what you want to see before shipping.

A typical pattern is:

  1. Choose a prediction cadence (e.g., weekly Monday scoring).
  2. Build training snapshots over an earlier period (e.g., Jan–Sep).
  3. Use a later period for validation (e.g., Oct–Nov).
  4. Reserve the latest mature-labeled period for the final test (e.g., Dec), excluding windows where labels aren’t complete (label maturity).

This split design directly addresses two earlier risks: leakage (by enforcing “as of” computation) and label maturity (by avoiding right-censoring where recent outcomes haven’t arrived yet). A common misconception is that time-based splits are only for “time series forecasting.” In practical ML, they’re for any system where predictions are made repeatedly over changing reality—which is most production ML.
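
A minimal sketch of the snapshot-and-split pattern above, in pandas, using toy rows and illustrative dates and column names (snapshot_date, churned_30d). The important parts are the label-maturity filter and the past-to-future boundaries:

import pandas as pd

# Toy snapshot table: one row per account per weekly snapshot (names illustrative).
snapshots = pd.DataFrame({
    "snapshot_date": pd.to_datetime(
        ["2025-01-06", "2025-05-05", "2025-10-06", "2025-11-03", "2025-12-01", "2025-12-29"]
    ),
    "account_id": [101, 102, 103, 104, 105, 106],
    "churned_30d": [0, 1, 0, 0, 1, None],  # None = outcome not yet observable
})

# Enforce label maturity: a 30-day label is only trustworthy if at least
# 30 days have passed since the snapshot. Drop right-censored rows.
as_of = pd.Timestamp("2026-01-15")
mature = snapshots[snapshots["snapshot_date"] <= as_of - pd.Timedelta(days=30)]

# Past -> future split: train on Jan-Sep, validate on Oct-Nov, test on Dec.
train = mature[mature["snapshot_date"] < "2025-10-01"]
valid = mature[(mature["snapshot_date"] >= "2025-10-01") & (mature["snapshot_date"] < "2025-12-01")]
test  = mature[mature["snapshot_date"] >= "2025-12-01"]
print(len(train), len(valid), len(test))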

Group-based splits: when the same entity appears multiple times

Sometimes time isn’t the only leakage path. If you have multiple rows per customer/account/device/merchant, a random split can put the same entity in train and test. The model may learn entity-specific quirks (like a customer ID’s historical behavior) and look great offline, then generalize poorly to truly new entities.

A group split ensures that all rows for an entity go into a single split. In churn, the group might be account_id; in fraud, it might be card_id, device_id, or merchant_id (depending on your deployment and what “new” means). This is especially important when features include historical aggregates for the entity: if you learn an entity’s “signature” in training and test on the same entity, you’re not measuring generalization—you’re measuring memory.

In real systems, you often combine ideas: time-based + group constraints. For example: group by account within each time window, or ensure the test period is later in time and also contains entities that weren’t seen in training if that matches your use case. This is where evaluation becomes a design choice, not a checkbox: your split should match the question “what kind of generalization do we need?”
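
One way to enforce the group constraint is scikit-learn’s GroupShuffleSplit. A sketch on toy data, assuming account_id is the grouping key (an illustrative choice; pick whatever defines “the same entity” in your deployment):

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: several rows per account (the grouping key is illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "account_id": rng.integers(0, 200, size=2_000),   # ~10 rows per account
    "feature": rng.normal(size=2_000),
    "churned": rng.binomial(1, 0.05, size=2_000),
})

# All rows for a given account land on exactly one side of the split,
# so the model cannot "memorize" an account seen in training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["account_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

overlap = set(train_df["account_id"]) & set(test_df["account_id"])
print("accounts in both splits:", len(overlap))  # should be 0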

Split strategies compared (and when each lies to you)

Random split
  • Best for: quick baselines when rows are truly independent and the world is stable.
  • Biggest pitfall: leakage and near-duplicate contamination inflate metrics.
  • What it can falsely “prove”: “We’re great at predicting,” when you’re actually using future information or memorizing entities.
  • Best practices: stratify; audit for leakage; avoid post-outcome and look-ahead aggregates.

Time-based split
  • Best for: most business ML, where deployment predicts on the future.
  • Biggest pitfall: if labels are delayed, recent windows become mislabeled (right-censoring).
  • What it can falsely “prove”: “We’re worse than we are,” if you accidentally include immature labels.
  • Best practices: enforce point-in-time features; exclude immature label periods; validate on multiple future windows.

Group-based split
  • Best for: datasets with repeated entities (customers, devices, accounts).
  • Biggest pitfall: can reduce training size and increase variance if groups are few or large.
  • What it can falsely “prove”: “We generalize to new entities,” when the real deployment heavily reuses known entities (a mismatch).
  • Best practices: choose the right grouping key; check the group distribution; combine with time-based splitting when needed.

Metrics that match the decision (not your ego)

Accuracy is rarely the right headline metric

Accuracy asks: “What fraction did we get right?” That sounds reasonable until you hit class imbalance. If only 2% of transactions are fraud, a model that predicts “not fraud” for everything gets 98% accuracy—and is useless. The same issue appears in churn when churn is relatively rare at the chosen horizon.
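
A few lines of code make the trap concrete; the numbers below are toy values chosen to mirror the 2% fraud example:

import numpy as np
from sklearn.metrics import accuracy_score

# 10,000 transactions with a 2% fraud rate (toy numbers).
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.02, size=10_000)

# A "model" that always predicts non-fraud.
y_pred = np.zeros_like(y_true)

caught = int(((y_pred == 1) & (y_true == 1)).sum())
print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")        # ~0.98, looks impressive
print(f"fraud caught: {caught} of {int(y_true.sum())}")         # 0 caught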

Accuracy can still be useful as a secondary statistic in balanced problems, but it shouldn’t drive decisions in most applied ML settings. The deeper issue is that accuracy weights all errors equally, while business costs are almost never equal. A false decline in fraud can be far more expensive than a missed fraud review, and a false churn alarm wastes outreach capacity while a missed churn risk might lose revenue.

A better habit: decide what “bad” looks like operationally (wasted calls, angry customers, chargebacks) and then pick metrics that measure that.

Confusion-matrix metrics: precision, recall, and why thresholds matter

For binary classification, the confusion matrix breaks outcomes into true positives, false positives, true negatives, and false negatives. From that you get:

  • Precision: of the cases you flagged, how many were truly positive?

  • Recall: of all true positives, how many did you catch?

  • F1 score: a balance of precision and recall (useful when you need both, but still hides business costs).

These metrics depend on a threshold (e.g., classify as churn risk if probability > 0.7). That threshold is not a “model detail”—it’s a policy decision. If you have capacity to contact only the top 500 accounts weekly, you may choose a threshold that yields about 500 predicted positives, or—more robustly—rank by score and take the top 500 without a fixed threshold.

Common misconception: “A high AUC means we can just pick a threshold later.” Sometimes true, but thresholds can behave differently across time and across slices if calibration drifts. You should examine precision/recall at the operating point you actually plan to use (top-k or specific review rate), not only at an abstract global metric.
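
A small sketch of that habit: sweep a few candidate thresholds on validation scores and look at precision, recall, and the number of flagged cases at each operating point. The scores below are synthetic stand-ins for a validation set:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic validation labels and scores (imperfect but informative).
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=20_000)
y_score = np.clip(0.05 + 0.4 * y_true + rng.normal(0, 0.15, size=20_000), 0, 1)

# Evaluate at the operating points you actually plan to use, not only globally.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f}  flagged={y_pred.sum():5d}  precision={p:.2f}  recall={r:.2f}")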

Ranking metrics: when you act on the top of the list

Many business decisions are ranking problems in disguise. If you can only intervene on a limited number of cases, you care most about whether the highest-scored cases are truly the ones you want. Two useful perspectives:

  • Precision@K: among the top K cases, how many are truly positive?

  • Recall@K (or capture rate): what fraction of all positives did you capture by looking at the top K?

These are directly aligned with scenarios like “top 500 outreach” or “review top 0.5%.” They also expose a failure mode that aggregate metrics can hide: a model might have decent overall AUC but still put too many false positives in the top of the list, which is where your team spends money.
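
Both metrics are only a few lines of NumPy to compute directly. A sketch with synthetic scores; the helper names precision_at_k and recall_at_k are ours, not from any particular library:

import numpy as np

def precision_at_k(y_true, y_score, k):
    """Among the top-k scored cases, what fraction are truly positive?"""
    top_k = np.argsort(y_score)[::-1][:k]
    return y_true[top_k].mean()

def recall_at_k(y_true, y_score, k):
    """What fraction of all positives fall inside the top-k list?"""
    top_k = np.argsort(y_score)[::-1][:k]
    return y_true[top_k].sum() / max(y_true.sum(), 1)

# Toy example: 10,000 accounts, ~5% churners, imperfect but informative scores.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=10_000)
y_score = np.clip(0.1 + 0.5 * y_true + rng.normal(0, 0.2, size=10_000), 0, 1)

print("precision@500:", precision_at_k(y_true, y_score, 500))
print("recall@500:   ", recall_at_k(y_true, y_score, 500))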

Ranking metrics are only as honest as your split. If your split leaks future info, the top of the list will look magical offline and collapse in production. That’s why evaluation is a system: splits and metrics must be designed together.

Probability metrics: when you need trustworthy scores, not just ordering

Sometimes the decision depends on the score meaning what it claims. If a model says “0.8 churn probability,” does that group actually churn about 80% of the time? That’s calibration. Good calibration supports:

  • Setting thresholds that achieve stable review rates.

  • Comparing risk across cohorts and time.

  • Estimating expected cost/benefit (e.g., outreach ROI).

A key metric family here is log loss (and related proper scoring rules). These reward models for assigning high probability to events that occur and low probability to events that don’t, while penalizing overconfidence. They can be more sensitive than AUC to probability quality.

A subtle pitfall: a model can rank correctly (high AUC) but be poorly calibrated (probabilities too extreme or too timid). If your operation uses hard policies (auto-decline above 0.99, manual review above 0.90), calibration errors can create sudden spikes in false declines or review volume.
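
A sketch of that pitfall: two score sets with identical ranking (so identical AUC) but very different probability quality, compared with log loss and scikit-learn’s calibration_curve. The data is synthetic and the overconfidence distortion is deliberately exaggerated:

import numpy as np
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.calibration import calibration_curve

# Sample labels from known probabilities, so p_calibrated is calibrated by construction.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.99, size=50_000)
y_true = rng.binomial(1, p_true)
p_calibrated = p_true
# Monotone distortion: same ordering (same AUC), probabilities pushed to the extremes.
p_overconfident = p_true ** 3 / (p_true ** 3 + (1 - p_true) ** 3)

for name, p in [("calibrated", p_calibrated), ("overconfident", p_overconfident)]:
    frac_pos, mean_pred = calibration_curve(y_true, p, n_bins=10)
    print(f"{name:13s}  AUC={roc_auc_score(y_true, p):.3f}  "
          f"log_loss={log_loss(y_true, p):.3f}  "
          f"max calibration gap={np.abs(frac_pos - mean_pred).max():.3f}")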

Metrics compared (so you don’t optimize the wrong thing)

Threshold metrics (Precision/Recall/F1)
  • Best for: when you will classify using a threshold and costs are asymmetric.
  • What they reward: making the chosen “positive” set clean (precision) or comprehensive (recall).
  • Common pitfall: picking a threshold using the test set and then reusing it (overfitting the evaluation).
  • Best practices: choose thresholds on validation; report multiple operating points.

Ranking metrics (Precision@K/Recall@K)
  • Best for: when capacity is fixed (top-K outreach, top-% review).
  • What they reward: getting the top of the list right.
  • Common pitfall: celebrating top-K performance that comes from leakage or entity overlap.
  • Best practices: report the K that matches operations (500, 0.5%); use time-based splits.

Probability metrics (Log loss/Calibration)
  • Best for: when you need scores to reflect real-world probabilities.
  • What they reward: being right and appropriately confident.
  • Common pitfall: ignoring drift: calibration can degrade over time even if AUC stays similar.
  • Best practices: check calibration by time and slice; recalibrate if needed.

Two end-to-end evaluation walk-throughs (churn + fraud)

Example 1: Weekly churn scoring for top 500 outreach (30-day horizon)

You score every Monday morning and the retention team can contact only 500 accounts. Evaluation should therefore answer: “If we pick the top 500 each week, are they meaningfully enriched for future churn compared to random?” That is a ranking question with a time boundary.

Step-by-step, you design splits to simulate deployment. Build weekly snapshots: one row per eligible active paying account (e.g., tenure ≥ 60 days), features computed as of Sunday night, and a label “canceled within the next 30 days.” Then use a time-based split: earlier months for training, later months for validation, and the most recent label-mature months held out for the final test. If cancellations are recorded late, you exclude the most recent weeks where the outcome is not fully observed; otherwise you accidentally reward models for predicting “no churn” in the censored window.

Now align metrics with operations. You report Precision@500 and Recall@500 per week (or averaged across test weeks). You also track how stable these are across time: one strong week and three weak weeks might average out, but it’s operationally painful. Limitations matter: if event logging is missing for a key cohort (say enterprise accounts), performance might look good overall but fail on the revenue-critical slice, so you also report metrics by segment to detect the “coverage and missingness” risk showing up as evaluation instability.
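
A sketch of the per-week reporting, on synthetic test-period data with illustrative column names (snapshot_date, churned_30d, score); the point is to show each week’s operating point next to the baseline churn rate, so one strong week can’t hide three weak ones:

import numpy as np
import pandas as pd

def precision_at_k(y_true, y_score, k):
    top_k = np.argsort(y_score)[::-1][:k]
    return y_true[top_k].mean()

# Synthetic test period: four weekly snapshots of 20,000 accounts each.
rng = np.random.default_rng(0)
weeks = pd.to_datetime(["2025-12-01", "2025-12-08", "2025-12-15", "2025-12-22"])
frames = []
for week in weeks:
    y = rng.binomial(1, 0.04, size=20_000)
    score = np.clip(0.1 + 0.4 * y + rng.normal(0, 0.2, size=20_000), 0, 1)
    frames.append(pd.DataFrame({"snapshot_date": week, "churned_30d": y, "score": score}))
test = pd.concat(frames, ignore_index=True)

# Report the operating point you will actually use (top 500), week by week.
for week, grp in test.groupby("snapshot_date"):
    p500 = precision_at_k(grp["churned_30d"].to_numpy(), grp["score"].to_numpy(), 500)
    base = grp["churned_30d"].mean()
    print(f"{week.date()}  precision@500={p500:.2f}  baseline churn rate={base:.2f}")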

Example 2: Real-time fraud scoring with review top 0.5% and strict false-decline constraints

At authorization time, you score each transaction. You send the top 0.5% to manual review, and you may also auto-decline at a higher threshold—but false declines are extremely costly. Evaluation must therefore address two operating points: the review queue (ranking) and the decline rule (threshold + probability quality).

First, the split: use a time-based split that respects label maturity. If fraud is confirmed via chargeback within 90 days, then the most recent 90 days are partially unlabeled (right-censored). Train on older windows with mature labels, validate on later mature windows, and test on the latest window whose labels are fully mature. This prevents the model from being “rewarded” for predicting non-fraud simply because the label hasn’t arrived yet.

Then choose metrics that reflect the queue and policy. For the review process, report Precision@0.5% (or Precision@K where K is 0.5% of transactions) and capture rate of fraud in that top slice. For auto-decline, you examine recall at a very low false positive rate (or explicitly track false-decline rate) and you pay attention to calibration: if the probability scale drifts, a fixed threshold can suddenly decline too many good customers. You also evaluate by issuer/region slices where device signals might be missing; otherwise, you risk a model that is “great” on well-instrumented traffic and chaotic on the segments that generate the most disputes.
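
A sketch of evaluating both operating points on synthetic scores. The 0.5% review rate comes from the scenario above; the 0.1% false-decline budget is an illustrative policy number, not a recommendation:

import numpy as np
from sklearn.metrics import roc_curve

# Synthetic transaction scores: 200,000 transactions, ~0.3% fraud (numbers illustrative).
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.003, size=200_000)
y_score = np.clip(0.01 + 0.6 * y_true + rng.normal(0, 0.15, size=200_000), 0, 1)

# Review queue: top 0.5% of transactions by score.
k = int(0.005 * len(y_score))
top_k = np.argsort(y_score)[::-1][:k]
precision_at_review = y_true[top_k].mean()
capture_rate = y_true[top_k].sum() / max(y_true.sum(), 1)
print(f"precision@0.5%={precision_at_review:.2f}  fraud capture in queue={capture_rate:.2f}")

# Auto-decline policy: recall at a strict false positive rate (false-decline budget).
fpr, tpr, thresholds = roc_curve(y_true, y_score)
budget = 0.001  # at most 0.1% of good transactions declined (assumed budget)
ok = fpr <= budget
print(f"recall at FPR<=0.1%: {tpr[ok].max():.2f}  threshold: {thresholds[ok][tpr[ok].argmax()]:.2f}")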

A checklist you can trust before you quote a metric

A clean evaluation setup usually means you can answer these questions without hand-waving:

  • Does the split match deployment? Past → future, point-in-time features, and entity leakage addressed when relevant.

  • Are labels mature in the evaluation window? No right-censoring hiding in “recent data.”

  • Is the metric aligned to action? Top-K metrics for capacity-limited workflows; probability quality if thresholds/policies depend on it.

  • Are you measuring stability? Performance across time windows and key slices, not just one overall number.

  • Did you keep the test set sacred? Thresholds, feature decisions, and tuning happen on training/validation—not on the final test.

From framing to trustworthy evaluation

  • Strong ML projects start with clear framing: target, unit, horizon, and the decision you’ll make with predictions.

  • They stay honest through data readiness: point-in-time boundaries, leakage audits, label maturity, and coverage checks.

  • They earn trust with evaluation that matches reality: time-aware (and sometimes group-aware) splits plus metrics tied to operational constraints.

When you can explain your split and metric choices in plain language to a stakeholder—and they still agree it’s fair—you’re no longer just training models. You’re building decision systems people can rely on.
