Mini Scenario Review: Task to Metric
A Monday-morning ML request: “What metric should we use?”
Imagine you’re in a churn stand-up. A stakeholder says: “We trained a model—what metric should we report?” The question sounds like it belongs at the end, but it’s actually a scope question: the right metric depends on the task, the target definition, and the decision the business will make with the predictions.
This matters now because beginner ML projects often “work” in notebooks yet fail to create value. The failure is rarely the algorithm. It’s usually a mismatch—like using accuracy for rare events, or measuring ROC-AUC when the real operation is “call the top 5,000 customers.” This lesson is a mini scenario review that trains one habit: move from task → target → evaluation plan → metric → decision rule without skipping steps.
By the end, you should be able to look at a problem statement and quickly propose a metric that’s defensible—not just technically valid.
The compact vocabulary that keeps you honest
A few terms do most of the work in this lesson:
- Task type: What form the prediction takes—most commonly classification (yes/no), regression (number), or ranking/prioritization (ordered list).
- Target (label) definition: The precise rule that turns the real world into y (for example, “canceled within 30 days after prediction date”).
- Metric: How you score performance on held-out data (for example, precision, recall, PR-AUC, MAE).
- Decision rule: How predictions become actions (for example, “contact top K,” or “flag if probability > threshold”).
- Generalization: Performance on unseen data; the reason we split into train/validation/test.
- Data leakage: When training/evaluation uses information that wouldn’t exist at prediction time (an “answer key” sneaks in).
A helpful analogy still holds: training data = homework, test set = exam, metric = grading rubric. If the exam contains leaked answers (leakage) or the rubric rewards the wrong behavior (misaligned metric), the score becomes meaningless. The scenario review below uses that mental model to keep each choice consistent with the real decision.
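To make “decision rule” concrete, here is a minimal sketch (Python with NumPy assumed; the scores are invented for illustration) of the two most common ways a probability becomes an action:

```python
import numpy as np

# Hypothetical churn probabilities for 10 customers (illustrative only)
scores = np.array([0.91, 0.12, 0.55, 0.78, 0.05, 0.33, 0.88, 0.41, 0.67, 0.20])

# Decision rule A: flag if probability > threshold
threshold = 0.6
flagged = scores > threshold          # boolean mask of who gets an action
print("Threshold rule flags customers:", np.where(flagged)[0])

# Decision rule B: contact the top K, regardless of absolute probability
K = 3
top_k = np.argsort(scores)[::-1][:K]  # indices of the K highest scores
print("Top-K rule contacts customers:", top_k)
```

The threshold rule flags however many cases clear the bar; the top-K rule always acts on exactly K cases, which is why it pairs naturally with capacity constraints.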
From task to metric: the chain you can’t break
Task type decides what “good” even means
The most common beginner move is to treat every problem as classification and then ask which metric is “best.” A more reliable sequence starts earlier: what decision is being supported? If the decision is “approve/deny,” a classifier with a threshold might fit. If the decision is “who do we contact first with limited capacity,” the core output is an ordering, and ranking-style evaluation becomes central.
A key misconception is that “predict churn” automatically means “binary classification.” Often, churn work is really a prioritization job: the business can’t intervene for everyone, so it needs the top slice where action is possible. In that world, overall accuracy or even overall AUC can hide what you actually care about: quality at the top of the list.
Best practice is to write the task in operational language:
- “Each week, produce a list of 5,000 customers to contact” (ranking).
- “For each transaction, decide whether to hold for review” (classification with threshold).
- “Forecast next month’s demand in units” (regression in real units).
Once you do that, you’ve already narrowed the metric family. Ranking tends to push you toward precision@K / recall@K (or similar top-K metrics). Threshold decisions push you toward precision/recall tradeoffs at a chosen operating point. Regression pushes you toward MAE/RMSE in meaningful units, often compared to a baseline.
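For the regression case, here is a small sketch (scikit-learn assumed; the demand numbers are invented) of what “MAE in meaningful units, compared to a baseline” looks like in practice:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical monthly demand (units) and two forecasts (illustrative only)
actual = np.array([120, 135, 150, 110, 160, 145])
model_forecast = np.array([125, 130, 140, 118, 150, 150])
naive_forecast = np.roll(actual, 1)   # "next month = last month" baseline
naive_forecast[0] = actual[0]         # no prior month for the first point

print("Model MAE:", mean_absolute_error(actual, model_forecast), "units")
print("Naive MAE:", mean_absolute_error(actual, naive_forecast), "units")
# A model is only worth deploying if it clearly beats the naive baseline
# in the units the business actually plans with.
```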
Target definition sets the playing field (and can quietly sabotage metrics)
Metrics don’t evaluate “the real world.” They evaluate your target definition—the rule you used to label examples. That’s why target definition is not paperwork. It determines class balance, what features are valid, and whether the model can be used at the moment of decision.
Consider churn. “Canceled within 30 days” is different from “no purchase in 60 days.” Those labels create different positives, different levels of ambiguity, and different opportunities for action. They also change the base rate, which directly affects how metrics behave. A model can show “great accuracy” simply because the positive class is rare; that doesn’t mean it’s good at finding churners.
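A tiny sketch (scikit-learn assumed, synthetic labels) shows how a rare positive class inflates accuracy while recall exposes the problem:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 2% churn rate among 1,000 customers (illustrative only)
y_true = np.zeros(1000, dtype=int)
y_true[:20] = 1

# A "model" that never predicts churn
y_pred = np.zeros(1000, dtype=int)

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.98 -- looks great
print("Recall:  ", recall_score(y_true, y_pred))    # 0.0  -- finds no churners
```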
Leakage often enters through target definition and timing. If your target is defined using an event that happens after investigation (like “fraud confirmed”), you must ensure your features are computed strictly from information available before confirmation. If you accidentally include “chargeback filed,” you’ve added a near-direct proxy for the label—your metric will look stunning, and deployment will collapse.
Best practice is to explicitly state two timestamps:
- Prediction time: when the model would run in production.
- Outcome window: how far into the future you define the label (30 days, 7 days, etc.).
That simple discipline clarifies which features are legal and helps you select a split strategy that matches the future you care about.
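A minimal sketch of that discipline (pandas assumed; the column names are hypothetical) makes both timestamps explicit when building the label:

```python
import pandas as pd

# Hypothetical table: one row per customer, cancellation date is NaT if none
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "cancel_date": pd.to_datetime(["2024-03-10", None, "2024-05-02"]),
})

prediction_time = pd.Timestamp("2024-03-01")   # when the model would run
outcome_window = pd.Timedelta(days=30)         # how far ahead the label looks

# Label: cancels within 30 days AFTER the prediction date
customers["churn_30d"] = (
    (customers["cancel_date"] > prediction_time)
    & (customers["cancel_date"] <= prediction_time + outcome_window)
).astype(int)

# Features must be computed from data strictly BEFORE prediction_time;
# anything dated after it would not exist at the moment of decision.
print(customers)
```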
Evaluation design protects generalization—and changes which metric is trustworthy
Even a perfectly chosen metric is only meaningful if your evaluation setup matches reality. Random splits can be fine when observations are independent, but many real-world datasets violate that assumption through time trends, repeated customers, and policy changes.
Two common failure modes show up in scenario reviews:
- Time mismatch: You train on mixed months and test on mixed months, but the deployed model will always predict on future months. Your metric becomes optimistic because it doesn’t face real drift and seasonality.
- Entity leakage: You have multiple rows per customer. If you split rows randomly, the same customer can appear in both train and test, letting the model “recognize” identity patterns rather than learn general churn signals.
Best practice is to treat splitting and preprocessing as one integrity system:
- Split first (often time-based or group-based when appropriate).
- Fit preprocessing on training only.
- Validate choices on validation only.
- Touch the test set last.
Misconception to avoid: “AUC is immune to threshold choices, so it’s safe.” AUC can still be misleading if your split leaks or if the operational goal is performance at a constrained contact/review rate. A metric is only “safe” when it matches the decision and your evaluation mirrors deployment.
A practical comparison: metric choices are contracts about error costs
Use this table as a quick “scenario translator.” It doesn’t tell you one perfect metric; it tells you which questions you must answer to pick a metric that matches the decision.
| Decision reality | Classification (threshold) | Ranking / Prioritization (top-K) | Regression (numeric forecast) |
|---|---|---|---|
| What the business actually does | Takes an action if a case is predicted positive (flag, approve/deny, escalate). You must pick an operating point. | Has limited capacity and can only act on a fixed number (or fraction) of cases. Ordering matters more than a single threshold. | Uses predicted numbers to plan inventory, staffing, or revenue targets. Error magnitude matters in real units. |
| Targets that usually fit | Binary label within a time window (fraud in 24h, churn in 30d). Must be definable at training time without future features. | Often the same binary label, but treated as a ranking signal (who is most at risk / most valuable). Sometimes combined with value to rank by impact. | A continuous outcome (next-month units, spend, time-to-event proxy). Must match the unit that decisions use. |
| Metrics that align | Precision/Recall, F1, PR-AUC, plus precision/recall at a chosen threshold. Track false positives/negatives explicitly. | Precision@K, Recall@K, sometimes PR-AUC as a supporting view. Always define K from capacity (call center, review team). | MAE (in units), RMSE (penalizes large errors). Compare to baselines and consider tolerance bands. |
| Typical pitfall | Reporting accuracy on imbalanced data and declaring success while missing positives. | Celebrating a strong overall AUC while the top-K list is noisy and operationally useless. | Reporting a low error without translating it into business impact (is 50 units “good” or disastrous?). |
[[flowchart-placeholder]]
Two mini scenario walk-throughs (task → target → metric)
Scenario 1: Churn “prediction” with a weekly contact limit
A subscription company asks for churn prediction, but the retention team can contact 5,000 customers per week. That constraint immediately shapes the task: you don’t just want “churn vs. not churn” for everyone; you want the best possible ordered list so the team spends effort where it counts.
Start by defining the target with time clarity: “Customer cancels within 30 days after the prediction date.” Then sanity-check feature availability: at prediction time, you know current plan, usage counts up to yesterday, and existing support tickets—but you don’t know future billing outcomes. With that, choose an evaluation split that matches reality: if churn behavior changes over time, a time-based split (train on earlier months, validate on later months) is more honest than a random split.
Now pick metrics that match the operational decision. You can track PR-AUC because churn is often rare and PR curves focus on positive-class quality. But the headline metric should reflect the actual workflow: precision@K, where K = 5,000. This answers the real question: “Among the 5,000 people we contact, how many truly churn without intervention?” The benefit is clarity: the metric ties directly to staffing capacity and expected impact.
The limitation is also important: precision@K can look decent even if the model ignores customers who would respond best to intervention. In many organizations, the “best” ranking is not just “most likely to churn,” but “most likely to churn and most likely to be saved.” You may still start with churn likelihood, but you should be aware that the metric is measuring a proxy for value, not value itself.
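A minimal sketch of the headline metric (NumPy assumed; the customers and scores below are synthetic stand-ins, not real churn data):

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scored examples."""
    top_k_idx = np.argsort(scores)[::-1][:k]
    return y_true[top_k_idx].mean()

# Synthetic stand-in: 100,000 customers, ~3% churn, noisy risk scores
rng = np.random.default_rng(42)
y_true = (rng.random(100_000) < 0.03).astype(int)
scores = 0.7 * y_true + rng.random(100_000)   # churners tend to score higher

K = 5_000  # the retention team's weekly contact capacity
print(f"precision@{K}: {precision_at_k(y_true, scores, K):.3f}")
# Reads as: of the 5,000 customers we would contact, this fraction truly churn
```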
Scenario 2: Fraud detection where “great metrics” might be a warning sign
A payments team wants to flag potentially fraudulent transactions. Fraud is typically far below 1%, so accuracy becomes almost meaningless: predicting “not fraud” for everything can yield 99%+ accuracy. Here, the task is usually classification with a tight review capacity (which makes it behave like ranking at the top). The decision might be: “Send the top 0.5% riskiest transactions to manual review,” or “Hold a transaction if risk > threshold.”
Define the target carefully: “Fraudulent as determined by investigation outcome,” but recognize that investigation completes later. That’s where leakage risk spikes. A feature like “chargeback filed” or “fraud confirmed” is effectively the answer key. If it slips into training—even indirectly through aggregated features computed using the full dataset—your validation PR-AUC can look unrealistically high, and you’ll ship a model that fails the moment it faces truly live data.
A robust evaluation makes the metric trustworthy. Use a time-aware split so the model is tested on later periods, and compute features using only information available up to each transaction’s timestamp. Then choose metrics aligned to operations: precision and recall at the alert rate you can handle, and often PR-AUC as a global summary because the positive class is rare. The benefit is that strong performance under these constraints is believable.
The limitation is that honest evaluation may look “worse” than leaky evaluation—and that’s good. It reveals the true difficulty and forces you to improve legal features, adjust thresholds, and coordinate with operations, instead of trusting a metric that was inflated by future information.
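Once the time-aware split and legal features are in place, evaluation at the operational alert rate can look like this minimal sketch (scikit-learn assumed; the transactions and risk scores are synthetic stand-ins):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, average_precision_score

# Synthetic stand-in: 200,000 transactions with ~0.3% fraud (illustrative only)
rng = np.random.default_rng(7)
y_true = (rng.random(200_000) < 0.003).astype(int)
risk = 0.5 * y_true + rng.random(200_000)      # fraud tends to score higher

# Operational constraint: the review team can handle the riskiest 0.5%
alert_rate = 0.005
cutoff = np.quantile(risk, 1 - alert_rate)     # threshold implied by capacity
alerts = (risk >= cutoff).astype(int)

print("Precision at 0.5% alert rate:", precision_score(y_true, alerts))
print("Recall at 0.5% alert rate:   ", recall_score(y_true, alerts))
print("PR-AUC (average precision):  ", average_precision_score(y_true, risk))
```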
The five-question checklist that turns scenarios into metrics
When someone hands you a dataset and asks for a metric, run this quick checklist:
- What decision follows the prediction? That tells you whether you need a threshold, a top-K list, or a numeric forecast.
- What is the exact target and outcome window? Labels define what your metric can legitimately claim.
- What information exists at prediction time? This is your simplest leakage test.
- Does your split match deployment (time, groups, repeated entities)? A perfect metric on a flawed split is still untrustworthy.
- What errors are expensive? This determines whether you prioritize precision, recall, top-K quality, or error magnitude.
If you can answer those, “Which metric should we use?” stops being a guessing game. It becomes a disciplined translation from a real operational constraint into a score you can defend.
In the next lesson, you’ll take this further with Next Steps & Learning Roadmap [15 minutes].