Why vocabulary is your first ML “debugging tool”

You’re in a product analytics meeting and someone says: “Let’s use ML to predict churn. We’ll train on user activity, then deploy to production.” A few minutes later, another person asks: “What’s the label? What features are you using? How will you validate it? Is this supervised or unsupervised?” If those words feel fuzzy, it’s hard to even ask the right questions—let alone build something reliable.

ML projects fail surprisingly often for reasons that have nothing to do with fancy algorithms. They fail because teams miscommunicate: they mean different things by “prediction,” they confuse a feature with a label, or they report “accuracy” when the business cares about false positives. Vocabulary isn’t academic here—it’s operational. It’s how you prevent building the wrong thing and confidently explain what you built.

This lesson gives you a compact set of core ML terms you’ll hear constantly in data science work, plus the practical intuition for how they connect.

The smallest set of terms that unlocks most ML discussions

At a high level, most ML conversations boil down to one repeatable pattern:

  • What are we trying to predict or decide? (the target/label)

  • What information do we have at decision time? (the features)

  • What mapping is learned from the data? (the model)

  • How do we know it works? (the evaluation)

  • What happens when the world changes? (drift and monitoring)

If you keep that mental checklist, the vocabulary becomes a map rather than a list of definitions.

Here are the “must-know” definitions you’ll use throughout the course:

  • Example / row / instance: One unit the model learns from (one customer, one transaction, one email).

  • Feature (X): An input variable used to make a prediction (e.g., days since last purchase).

  • Target / label (y): The outcome you want to predict (e.g., churned in next 30 days: yes/no).

  • Model: The learned function that maps features to predictions (e.g., logistic regression, tree model).

  • Training: Fitting the model to historical examples so it learns patterns.

  • Inference / prediction / scoring: Using the trained model to generate outputs for new examples.

  • Evaluation metric: A number that summarizes performance (e.g., precision, recall, AUC).

  • Generalization: How well performance transfers from training data to new, unseen data.

You’ll notice that none of these terms says “deep learning” or “neural networks.” That’s intentional: you need the workflow language first, because it applies to almost every ML approach.

Features, labels, and models: the core triangle

A clean way to understand ML vocabulary is to treat features, labels, and the model as a dependency chain.

A label is the thing you’re trying to learn to predict, but labels almost never appear “ready-made.” In a churn setting, “churn” sounds obvious until you have to define it: canceled subscription, no purchase in 60 days, stopped opening the app, or stopped being profitable? The label definition is not a technical footnote—it is the problem definition. If the label is inconsistent, delayed, or influenced by your own interventions, the model can look great in evaluation and still disappoint in production.

A feature is any input available at prediction time that you believe helps predict the label. Features can be raw fields (country, plan type) or engineered aggregates (purchases last 30 days, trend in session time). A common misconception is that “more features is always better.” In practice, features can leak information (accidentally include something that would not be known at decision time), encode bias, or destabilize the model when the data pipeline changes. Feature work is where a lot of real ML effort lives: representing the world in a way the model can use, without smuggling in future knowledge.

A model is the learned mapping from features to a prediction. It’s deterministic once trained, but unlike hand-written rules, you usually can’t read it as a short list of if-then statements. That’s why evaluation and governance matter so much: you validate behavior with metrics and slice analysis rather than by “eyeballing the logic.” A practical way to say it: features and labels define the game; the model tries to win the game you defined, using the data you gave it.
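To make the triangle concrete, here is a minimal sketch in Python with scikit-learn. The column names and toy data are hypothetical; the point is only to show where features, label, and model live in code:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Hypothetical history: one row per customer (one "example").
    df = pd.DataFrame({
        "days_since_last_purchase": [3, 45, 12, 90, 7],   # feature
        "purchases_last_30_days":   [5, 0, 2, 0, 8],      # feature
        "churned_next_30_days":     [0, 1, 0, 1, 0],      # label
    })

    X = df[["days_since_last_purchase", "purchases_last_30_days"]]  # features
    y = df["churned_next_30_days"]                                  # label

    model = LogisticRegression()
    model.fit(X, y)                         # training: learn the mapping X -> y
    scores = model.predict_proba(X)[:, 1]   # the learned mapping, applied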

Training vs inference: when “learning” stops

People often use “the model learns” as if it’s continuously updating itself in production. That’s not how most ML systems work by default.

Training is the offline process where you give the algorithm historical examples (features + labels) and it finds parameters that reduce error on that data. During training you make choices like: which time window counts as history, how you handle missing values, and what objective you optimize. Training is where you can accidentally bake in mistakes at scale—like using a label that’s easy to compute but misaligned with what the business actually wants.

Inference (also called scoring or prediction) is what happens after training: you freeze the trained model and use it to score new examples. At inference time, you typically do not have the label yet—otherwise you wouldn’t need a prediction. That’s why the “available at decision time” idea is so important: anything that would only be known later cannot be a feature, even if it exists somewhere in your database.
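A minimal sketch of that separation, assuming scikit-learn plus joblib for persistence (the synthetic data and file name are illustrative):

    import joblib
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # --- Training time (offline) ---
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 3))           # historical features
    y_train = (X_train[:, 0] > 0).astype(int)     # historical labels
    model = LogisticRegression().fit(X_train, y_train)
    joblib.dump(model, "churn_model_v1.joblib")   # freeze the trained model

    # --- Inference time (later, often a different process) ---
    model = joblib.load("churn_model_v1.joblib")
    X_new = rng.normal(size=(5, 3))               # new examples: no labels yet
    scores = model.predict_proba(X_new)[:, 1]     # scoring / prediction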

A frequent pitfall is mixing training-time and inference-time realities, especially in churn or fraud. For instance, including “number of chargebacks next week” as a feature would be nonsense (it’s the future), but more subtle leakage happens all the time—like features derived from post-event support tickets that only occur after a customer is already unhappy. Leakage produces models that look unusually accurate in testing and then fall apart when deployed.

Datasets that mimic the real world: train, validation, test

To talk about “performance,” you need shared language for how data is split. The goal is simple: measure how well the model generalizes to unseen data.

  • Training set: The data used to fit the model’s parameters.

  • Validation set: Data used to tune choices (features, hyperparameters, thresholds) without “peeking” at the test set.

  • Test set: A final holdout used for an unbiased estimate of performance after decisions are made.

The key principle is that the validation and test sets act like a rehearsal for production. If the way you split data doesn’t reflect how the model will be used, your metrics can lie. A classic example: random splitting in a time-dependent problem (like churn) can accidentally put “future behaviors” into the training set that wouldn’t have existed at prediction time for earlier users. Time-based splits often better reflect reality: train on earlier months, validate on a later month, test on an even later month.
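In pandas, a time-based split is just timestamp filtering. A sketch, assuming a hypothetical file with a snapshot_date column:

    import pandas as pd

    df = pd.read_csv("churn_observations.csv", parse_dates=["snapshot_date"])

    # Train on earlier months, validate on a later month, test on a later one still.
    train = df[df["snapshot_date"] < "2025-06-01"]
    valid = df[(df["snapshot_date"] >= "2025-06-01") & (df["snapshot_date"] < "2025-07-01")]
    test  = df[df["snapshot_date"] >= "2025-07-01"]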

A misconception here is “we have a test set, so we’re safe.” You can still overfit at the project level, especially if you repeatedly evaluate and tweak until results look good. This is why teams treat the test set as sacred: once you’ve used it too many times to guide decisions, it stops being an honest mirror.

What the model outputs: scores, probabilities, and thresholds

Many beginner ML conversations get stuck on a false binary: “Will the customer churn, yes or no?” In practice, ML often produces a score that you convert into an action with a threshold.

A score is a continuous output (often between 0 and 1) that ranks examples by predicted risk or likelihood. In churn, you might score each user and then target the top 5% most at-risk users because budget is limited. This is one reason ML is attractive in messy decision spaces: it supports prioritization better than hard rules that create clumps of “flagged” users.
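A sketch of that budgeted targeting, with synthetic scores standing in for model output:

    import numpy as np

    rng = np.random.default_rng(42)
    scores = rng.uniform(size=10_000)    # hypothetical churn scores, one per user
    user_ids = np.arange(10_000)

    # Budget allows contacting only the top 5% most at-risk users.
    k = int(0.05 * len(scores))
    top_risk = user_ids[np.argsort(scores)[::-1][:k]]   # highest scores first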

When people say a model outputs a probability, they may mean one of two things: a true calibrated probability (0.30 meaning “about 30% chance”) or simply a score that behaves like a probability but isn’t perfectly calibrated. Calibration is its own discipline, and in many applied settings, ranking quality matters more than perfect probability estimates. Still, you should be careful with language: if stakeholders interpret scores as literal probabilities, they will make incorrect cost and ROI calculations.

A threshold is the cutoff where you turn scores into actions: “if score > 0.7, intervene.” Thresholds are not purely technical; they reflect business trade-offs. Tightening a fraud threshold reduces false positives but may increase missed fraud. Loosening it catches more fraud but may block legitimate customers. Vocabulary helps here because it forces clarity: are you optimizing for fewer false alarms, fewer misses, or the lowest total cost?
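One way to make the trade-off explicit is to sweep candidate thresholds and compare total cost under assumed per-error costs. The sketch below uses made-up costs and synthetic validation data; in practice you would plug in your own numbers:

    import numpy as np

    def total_cost(y_true, scores, threshold,
                   cost_false_positive=5.0, cost_false_negative=100.0):
        """Cost of acting at a given threshold (illustrative costs)."""
        flagged = scores > threshold
        fp = np.sum(flagged & (y_true == 0))    # false alarms
        fn = np.sum(~flagged & (y_true == 1))   # missed positives
        return fp * cost_false_positive + fn * cost_false_negative

    # Synthetic validation labels and scores, for illustration only.
    rng = np.random.default_rng(7)
    y_valid = rng.integers(0, 2, size=1_000)
    valid_scores = np.clip(0.3 * y_valid + 0.7 * rng.uniform(size=1_000), 0, 1)

    # Pick the threshold with the lowest total cost on validation data.
    thresholds = np.linspace(0.05, 0.95, 19)
    best = min(thresholds, key=lambda t: total_cost(y_valid, valid_scores, t))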

How we talk about “good”: metrics in plain terms

Evaluation metrics are just formal language for trade-offs. The most useful beginner move is to tie each metric to an operational question.

Here’s a compact comparison of common metrics you’ll hear early and often:

  • Accuracy
    What it measures: the fraction of predictions that match the label.
    When it’s useful: when classes are balanced and the costs of errors are similar.
    Common mistake: using it for rare events (fraud, churn in a stable product) where “predict no” can be highly accurate but useless.

  • Precision
    What it measures: of the cases you flagged, how many were truly positive.
    When it’s useful: when interventions are expensive (call center outreach, manual review).
    Common mistake: improving precision by flagging very few cases and missing many true positives.

  • Recall
    What it measures: of the true positives, how many you caught.
    When it’s useful: when missing positives is costly (fraud loss, safety issues).
    Common mistake: maximizing recall without tracking the false positive burden on customers and ops teams.

  • ROC-AUC (AUC)
    What it measures: ranking quality across thresholds (how well positives score higher than negatives).
    When it’s useful: when you want robust ranking and will choose a threshold later.
    Common mistake: treating a higher AUC as automatically better business impact without checking costs and calibration.

  • Calibration
    What it measures: whether predicted probabilities match real frequencies.
    When it’s useful: when decisions depend on expected value (budgeting, risk pricing).
    Common mistake: assuming the model’s “0.8” means 80% without testing calibration on fresh data.

A best practice is to pick metrics that match the action. If you’re sending costly retention offers, precision matters. If the cost of missed fraud is extreme, recall matters. And if you’re allocating a fixed budget to the “top-risk” segment, AUC-like ranking metrics often align better than accuracy.
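In scikit-learn, those questions map directly onto standard metric functions. This sketch reuses the synthetic y_valid and valid_scores from the threshold example above:

    from sklearn.metrics import precision_score, recall_score, roc_auc_score

    preds = (valid_scores > 0.5).astype(int)   # binarize at a chosen threshold

    precision = precision_score(y_valid, preds)    # of flagged, how many were real?
    recall = recall_score(y_valid, preds)          # of real positives, how many caught?
    auc = roc_auc_score(y_valid, valid_scores)     # ranking quality, threshold-free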

Another common pitfall is reporting one global metric and stopping there. Real systems need slice checks: performance by region, device, new vs returning users, plan tier, and other meaningful segments. This isn’t just fairness theater—it’s basic risk control. A model that works “on average” can still fail badly for a high-value or high-risk subgroup.
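A slice check can be as simple as computing the metric per segment. A sketch with a hypothetical evaluation frame:

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    # Hypothetical evaluation data: one row per user.
    eval_df = pd.DataFrame({
        "label":  [1, 0, 1, 0, 1, 0, 1, 0],
        "score":  [0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1],
        "region": ["EU", "EU", "EU", "EU", "US", "US", "US", "US"],
    })

    # AUC per region: a model that is fine "on average" can still fail a slice.
    auc_by_region = eval_df.groupby("region").apply(
        lambda g: roc_auc_score(g["label"], g["score"])
    )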

Overfitting, leakage, and drift: three ways ML can fool you

These three terms show up constantly because they describe the most common ways an ML result looks good on paper and fails in reality.

Overfitting happens when the model captures patterns that don’t generalize. With enough flexibility, a model can “memorize” quirks of the training data—noise, coincidences, or one-off events—and appear strong in training while being weak on new data. Overfitting isn’t just about complex models; it also happens when you tune endlessly on the validation set or select features based on what “worked last time” without a stable rationale. The remedy is disciplined evaluation, simpler baselines, and skepticism about surprisingly strong results.

Leakage is a specific, high-impact form of cheating: using information that would not be available at prediction time, often because it’s correlated with the label. Leakage can be subtle in business systems where timestamps and workflows are tangled. For example, a “support ticket created” feature might occur after churn intent begins, and a “refund processed” feature might directly reflect churn itself. Leakage is dangerous because it produces models that appear magical—until deployment, where the leaked signal disappears.

Drift is what happens when the world changes: user behavior shifts, product flows change, seasons impact demand, fraud tactics adapt, or data pipelines evolve. Your model’s training data stops resembling current reality, so performance degrades. Drift isn’t a one-time event; it’s a lifecycle fact. The best practice is to monitor both model outcomes (metric drops) and feature distributions (inputs changing), then retrain or recalibrate when needed.
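A simple input-side monitor compares a feature’s recent distribution with its training-time distribution, for example with a two-sample Kolmogorov-Smirnov test (a sketch, with synthetic data standing in for real pipelines):

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # at training time
    live_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)    # this week's data

    result = ks_2samp(train_feature, live_feature)
    if result.pvalue < 0.01:
        print("Feature distribution shifted: investigate, and consider retraining.")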

A useful mindset is to treat these as three different questions:

  • Overfitting: “Did we learn real patterns or noise?”

  • Leakage: “Did we accidentally use the future?”

  • Drift: “Did reality move after we trained?”


Applied example 1: Defining churn carefully (and why vocabulary prevents bad labels)

Imagine a subscription app team wants “a churn model.” The first task is to define the label: what counts as churn, and over what horizon? A common operational definition is something like “canceled within 30 days,” because it’s measurable and directly tied to revenue. But suppose many users simply stop using the product without formally canceling. If you label only explicit cancellations, you train a model that misses “silent churn,” and your retention efforts may arrive too late.
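The two definitions lead to visibly different labeling code. A sketch of the “silent churn” definition, with hypothetical tables and column names (the explicit-cancellation definition would filter a cancel_date column instead):

    import pandas as pd

    users = pd.read_csv("users.csv")                              # one row per user
    purchases = pd.read_csv("purchases.csv", parse_dates=["ts"])  # one row per purchase
    as_of = pd.Timestamp("2025-06-01")                            # snapshot date
    horizon = as_of + pd.Timedelta(days=60)

    # "Silent churn" definition: no purchase in the 60 days after the snapshot.
    active_after = purchases[(purchases["ts"] >= as_of) & (purchases["ts"] < horizon)]
    users["churned"] = ~users["user_id"].isin(active_after["user_id"])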

Next comes feature design with the “decision-time” rule. If you want to intervene today, features must reflect what you know today: recency and frequency of sessions, trend in usage duration, plan changes, payment failures, and support interactions up to now. This is where leakage traps appear. For example, “refund issued” might be computed after cancellation is initiated; using it will inflate metrics in evaluation but won’t exist early enough to drive prevention.
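The decision-time rule translates directly into a timestamp filter: every feature aggregate is computed only from events at or before the snapshot. A sketch reusing the hypothetical purchases table and as_of from above:

    # Only events up to the snapshot may feed features: nothing after as_of.
    history = purchases[purchases["ts"] < as_of]

    features = history.groupby("user_id").agg(
        purchases_last_30_days=("ts", lambda s: (s >= as_of - pd.Timedelta(days=30)).sum()),
        days_since_last_purchase=("ts", lambda s: (as_of - s.max()).days),
    )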

Then you choose how to evaluate: maybe you care about precision because retention offers cost money, and you don’t want to annoy healthy customers. Or maybe you care about recall because churn is painful and you’d rather over-contact than miss true churners. If the business goal is “prioritize who to contact,” you might focus on AUC-like ranking metrics and operationalize by contacting the top-risk 5%. Notice how the vocabulary forces alignment: the label defines the outcome, the features respect timing, and the metrics reflect intervention cost.

Limitations remain even with good vocabulary. Your churn label may be influenced by interventions (if you contact users, some won’t churn), which changes the data generating process. That doesn’t mean “don’t do ML.” It means be explicit about what the label measures and treat results as part of an evolving system, not a one-time prediction.

Applied example 2: Fraud screening as ranking plus policy (and why thresholds matter)

Consider transaction fraud screening. A rules system might already exist for hard constraints (e.g., block transactions that violate compliance or match known bad patterns). ML often enters as a way to rank the ambiguous middle: transactions that are not obviously fraud but collectively look suspicious.

Start with the examples: each transaction is a row. The label might be “chargeback within 60 days” or “confirmed fraud,” but those labels are delayed and sometimes noisy. That delay matters: you won’t know the true label at inference time, and you may need to evaluate on older periods to allow labels to mature. This is a vocabulary-driven planning step: you can’t evaluate what you can’t label reliably.

Now features: device fingerprints, velocity patterns (how many transactions in 10 minutes), merchant history, time-of-day anomalies, and account age. The best practice is to ensure features are stable in production and not dependent on post-transaction investigations. Leakage is common here too—for instance, a feature that reflects “manual review result” is essentially the label in disguise.
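Velocity features are typically rolling counts over short time windows. A pandas sketch with hypothetical columns; note the sort, since rolling windows assume time order:

    import pandas as pd

    tx = pd.read_csv("transactions.csv", parse_dates=["ts"]).sort_values("ts")

    # Velocity: transactions on the same card in the last 10 minutes
    # (the count includes the current transaction).
    tx["tx_last_10min"] = (
        tx.groupby("card_id")
          .rolling("10min", on="ts")["amount"]
          .count()
          .reset_index(level=0, drop=True)
    )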

Finally, you choose how to act on scores. Fraud decisions often require thresholds with clear trade-offs. A high threshold yields higher precision (fewer legitimate customers blocked) but lower recall (more fraud slips through). A lower threshold catches more fraud but increases false positives and customer friction. Many organizations therefore use scores to route actions: auto-approve low risk, auto-decline high risk, and send the middle band to step-up auth or manual review. Vocabulary clarity here prevents a common failure: treating a score as a single “truth” rather than a tool for allocating different interventions.
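A routing policy is often just banded thresholds. A minimal sketch; the cutoffs are illustrative, not recommendations:

    def route(score, low=0.2, high=0.8):
        """Map a fraud score to an action band (illustrative cutoffs)."""
        if score < low:
            return "auto_approve"
        if score >= high:
            return "auto_decline"
        return "manual_review"  # or step-up authentication

    actions = [route(s) for s in (0.05, 0.50, 0.93)]
    # -> ["auto_approve", "manual_review", "auto_decline"]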

The limitation is drift: fraudsters adapt. Even a strong model will degrade if tactics change, so monitoring and retraining become part of the system’s operating cost—not an optional extra.

The vocabulary checklist you’ll reuse in every ML project

Most early ML competence is being able to answer these questions clearly:

  • What’s the label, exactly, and when do we observe it?

  • What features are available at decision time (no leakage)?

  • What does the model output (score vs probability), and how will we threshold it into actions?

  • What metric matches the business cost of mistakes (precision, recall, ranking, calibration)?

  • How will we detect overfitting, leakage, and drift before they cause damage?

If you can speak in these terms, you can collaborate effectively with engineers, analysts, and stakeholders—and you can spot project risk long before a dashboard looks “off.”

In the next lesson, you’ll take this further with Learning Types & Data Science Use Cases [20 minutes].
