Why ML projects fail before modeling even starts

Imagine you’re a data scientist at a subscription business. Leadership asks: “Can we use machine learning to reduce churn by 10% this quarter?” You have customer event logs, billing history, support tickets, and marketing emails—but no one agrees on what “churn” means, which customers “count,” or how success should be measured. One team wants a shiny dashboard, another wants a list of “at-risk customers,” and finance wants to see ROI. If you rush into training a model, you can end up optimizing the wrong outcome, creating a system nobody trusts, or shipping something that looks accurate but doesn’t change decisions.

Problem framing is the set of decisions that turns a vague business request into a precise ML task with clear boundaries, measurable outcomes, and an evaluation plan. It matters now because most ML risk and wasted effort comes from upstream ambiguity—not from picking the wrong algorithm. A good frame makes the rest of the ML workflow faster, safer, and easier to communicate.

Turning a business question into an ML task

A useful way to think about framing is: ML is not the goal; it’s a method. The goal is a decision that improves outcomes—ML only helps if it produces a prediction (or ranking) that fits that decision and can be evaluated honestly.

Key terms you’ll use throughout framing:

  • Target (label): The outcome you want to predict (e.g., “customer churned within 30 days”).

  • Prediction unit: What you’re predicting for (e.g., customer, account, transaction).

  • Prediction horizon: The time window you predict into (e.g., “within the next 30 days”).

  • Decision/action: What someone will do with the prediction (e.g., “offer retention discount,” “route to support”).

  • Success metric: How you judge helpfulness (often not the same as accuracy).

A practical analogy: framing is like writing a contract. If the contract is vague (“improve churn”), you can’t enforce it and you can’t tell if it worked. If it’s specific (“each Monday, flag the top 2% of accounts most likely to churn within 30 days, evaluated by lift and net revenue after discounts”), everyone can align on what to build and how to judge it.

Problem framing also forces you to confront trade-offs early. Do you want probabilities (for risk-based decisions), a ranking (for limited capacity outreach), or a yes/no classification (for strict operational triggers)? These are different products, even if they use similar models.

The framing checklist that prevents expensive surprises

Define the outcome, not the proxy

The most common early mistake is quietly switching from a business outcome to an easier-to-measure proxy. For churn, a proxy might be “no login in 14 days,” but that could punish seasonal users or customers who use the product via API. For fraud, “chargeback” is measurable but delayed and sometimes noisy. Proxies can be valid, but only if you explicitly accept the gap between proxy and truth.

A strong framing statement nails down three things: who, when, and what counts. “Predict churn” becomes: “For active paying accounts with at least 60 days tenure, predict whether they will cancel in the next 30 days, using only information available up to the prediction date.” That single sentence reduces later arguments about eligibility, leakage, and reporting.

Misconception to watch for: “We’ll figure out the label once we see the data.” In practice, the label defines the dataset you will build, the evaluation you can trust, and whether your model can be used operationally. You can iterate, but you need a first concrete definition to avoid building an inconsistent pipeline.

Best practices that keep this honest:

  • Write the target definition as a rule a SQL query could implement.

  • Specify inclusion/exclusion conditions (e.g., free trials, paused subscriptions, enterprise contracts).

  • Clarify how you handle ambiguous cases (refunds, chargebacks, reactivations).
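The bullets above can be made concrete. Here is a minimal Python sketch of the churn label as an implementable rule; the field names (`status`, `plan`, `signup_date`, `cancel_date`) are illustrative assumptions, not a real schema:

```python
from datetime import date, timedelta

def churn_label(account, prediction_date):
    """Label rule mirroring the framing statement:
    eligible = active, paying, >= 60 days tenure at the prediction date;
    positive = cancelled within the 30 days after the prediction date.
    Field names are hypothetical placeholders."""
    tenure_days = (prediction_date - account["signup_date"]).days
    eligible = (
        account["status"] == "active"
        and account["plan"] != "free_trial"   # exclusion condition
        and tenure_days >= 60                 # inclusion condition
    )
    if not eligible:
        return None  # excluded from the dataset entirely, not labeled 0

    cancel = account.get("cancel_date")
    horizon_end = prediction_date + timedelta(days=30)
    return cancel is not None and prediction_date < cancel <= horizon_end
```

Note the three-way outcome: ineligible accounts are excluded rather than labeled negative, which is exactly the kind of decision the checklist forces you to make explicitly.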

Match the ML task to the decision you’re supporting

ML framing is not only “what to predict,” but also “how it will be consumed.” Many teams default to binary classification (“will churn: yes/no”) because it feels straightforward. But a binary output is often the least useful operationally, because real decisions have constraints: budgets, contact capacity, risk tolerance, fairness concerns, and customer experience.

A ranking is often better when capacity is limited (“call the top 500 accounts”). A probability is better when actions have different costs (“offer 20% discount only if expected value is positive”). A forecast is better when planning resources (“support ticket volume next week”). When the decision is unclear, teams end up optimizing the wrong thing: for example, chasing AUC improvements even though the business needs high precision in the top few percent.

Here’s a comparison that helps you choose the right formulation:

Binary classification

  • What it outputs: a yes/no prediction based on a threshold. Useful when actions are triggered automatically and must be consistent.

  • What you optimize for: metrics like accuracy, precision/recall, and F1, usually at a chosen threshold. Threshold choice becomes a policy decision.

  • Common pitfall: treating the threshold as “technical,” when it actually encodes business trade-offs. A model can look good overall but fail at the operating point.

  • When it fits best: clear rules, stable processes, and low ambiguity about what action follows.

Probability / risk score

  • What it outputs: a calibrated risk estimate between 0 and 1. Useful when different actions have different costs and benefits.

  • What you optimize for: metrics that reward probability quality, like log loss or calibration-aware evaluation. Thresholds can still exist but are more flexible.

  • Common pitfall: confusing “0.8 risk” with “80% will churn” if calibration isn’t addressed. Scores can drift over time if the population changes.

  • When it fits best: cost-sensitive decisions, multiple interventions, or when you need to reason about expected value.

Ranking / prioritization

  • What it outputs: an ordered list from most to least likely (or most valuable) cases. Useful when humans or systems have limited capacity.

  • What you optimize for: metrics focused on the top of the list, like precision@k, recall@k, or lift curves. Great for triage and outreach.

  • Common pitfall: celebrating better ranking while ignoring whether the top-k cases are actionable or whether interventions actually help.

  • When it fits best: limited budget/time, human review queues, and “find the needles first” situations.
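To make the three formulations tangible, here is a small sketch of how the same risk scores might be consumed in each mode. The functions and parameters are illustrative, not a prescribed API:

```python
def binary_decision(score, threshold):
    # Yes/No trigger: the threshold is a business policy, not a model detail.
    return score >= threshold

def expected_value(score, gain_if_saved, cost_of_offer):
    # Probability consumption: act only when expected value is positive.
    # Assumes the score is a calibrated churn probability.
    return score * gain_if_saved - cost_of_offer

def top_k(scored_accounts, k):
    # Ranking consumption: fill a fixed-capacity queue with the riskiest cases.
    # scored_accounts is a list of (account_id, score) pairs.
    return sorted(scored_accounts, key=lambda a: a[1], reverse=True)[:k]
```

The same underlying model can feed all three, but each implies different evaluation metrics and different failure modes, which is why the choice belongs in the frame.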

A helpful framing habit: write the action sentence explicitly—“If the model says X, we will do Y within Z days.” If you can’t write that, you’re not framed yet.

Lock down time: prediction point, horizon, and leakage boundaries

Time is where framing becomes technical, fast. ML models are extremely good at “cheating” if the dataset accidentally includes information that wouldn’t exist at prediction time. This is called data leakage, and it often enters through subtle paths: timestamps mishandled, post-outcome customer support notes, cancellation reasons, or aggregated features computed over the whole dataset rather than “as of” a date.

A robust frame defines:

  • Prediction point: When you generate the prediction (e.g., every Monday at 9am).

  • Feature availability: What data is available by then (e.g., events up to Sunday 11:59pm, billing status as of snapshot).

  • Outcome window: How far ahead you judge success (e.g., churn within 30 days after prediction).

Cause-and-effect matters here. If you include a feature like “account marked for cancellation” that is set by a downstream system after the customer initiates cancellation, your model will look brilliant in offline evaluation—because it’s effectively reading the answer key. In production, that feature may not exist at the prediction point, so performance collapses.

Typical misconception: a random train/test split is enough. For many business problems, a random split mixes past and future and can inflate performance. A time-aware framing usually implies a time-based evaluation design (“train on earlier months, test on later months”) so you simulate real deployment conditions.
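Both guards described here, time-based splitting and “as of” feature filtering, are mechanical once the frame is written down. A minimal sketch, assuming rows of the form (prediction_date, features, label) and events with a timestamp field:

```python
from datetime import date

def time_based_split(rows, cutoff):
    """Split labeled examples so training strictly precedes evaluation.
    Rows at or after the cutoff go to the test set, simulating
    deployment on the future rather than interpolating the past."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test

def features_as_of(events, prediction_date):
    """Leakage guard: only events strictly before the prediction point
    may contribute to features. Anything timestamped later (support
    notes, cancellation reasons) is effectively the answer key."""
    return [e for e in events if e["timestamp"] < prediction_date]
```

In a real pipeline the same cutoff logic applies to aggregations (counts, averages) as well: each must be computed “as of” the prediction date, not over the whole dataset.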

[[flowchart-placeholder]]

Choose success metrics that reflect business value (not just model fit)

A framing decision that saves months: define success in terms that match the decision, not in terms that are easy to compute. AUC can be useful for comparing models, but it doesn’t tell you if the top 200 flagged accounts are the right 200 when you have a call-center constraint. Similarly, accuracy can be meaningless when the positive class is rare (e.g., fraud or churn in certain segments).

Better framing aligns metrics with:

  • Operating constraints: e.g., “we can contact 1,000 customers per week.”

  • Costs of mistakes: false positives can waste budget or annoy customers; false negatives miss opportunities or allow harm.

  • Business impact: retention revenue, prevented loss, time saved, improved service levels.

A common best practice is to define at least one decision-aligned metric (like precision@k or net value) and one model-quality metric (like log loss or calibration error). This keeps the project honest: you’re not only building something that predicts well, but something that produces useful decisions.
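As an illustration, a decision-aligned metric like precision@k, plus a simple net-value calculation, could be sketched as follows. The value, cost, and save-rate parameters are hypothetical:

```python
def precision_at_k(scored, labels, k):
    """Fraction of true positives among the k highest-scored cases.
    scored: list of (id, score); labels: dict mapping id -> 1 or 0."""
    top = sorted(scored, key=lambda x: x[1], reverse=True)[:k]
    return sum(labels[i] for i, _ in top) / k

def net_value(scored, labels, k, value_per_save, cost_per_contact, save_rate):
    """Decision-aligned metric: expected retained value minus outreach
    cost for contacting the top-k accounts. save_rate is an assumed
    fraction of true churners that outreach actually retains."""
    top = sorted(scored, key=lambda x: x[1], reverse=True)[:k]
    true_churners = sum(labels[i] for i, _ in top)
    return true_churners * save_rate * value_per_save - k * cost_per_contact
```

Note how net_value can go negative even when precision@k looks respectable: if contacts are expensive or saves are rare, a “good” ranking can still be a bad decision policy.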

Pitfall: picking a single metric too early and optimizing it blindly. For example, maximizing recall can flood operations with low-quality leads; maximizing precision can miss too many true cases. The right point depends on what actions are available and what they cost.

Describe the “shape” of the data problem before touching data

Even without opening a dataset, you can frame assumptions that determine feasibility. Your frame should state:

  • What the rows represent (customers, accounts, sessions).

  • How many positives you expect (roughly; even order-of-magnitude helps).

  • How feedback is generated (do you observe the outcome reliably, or only for some cases?).

This is where many real-world traps live. If outcomes are only observed when you intervene (e.g., fraud confirmed only after investigation), your labels are biased. If churn is defined by cancellation, accounts that silently stop paying might be missed. If your intervention changes behavior, the data you collect after deployment will differ from training.

A good framing document makes these uncertainties explicit and treats them as part of the problem, not as “data cleanup later.” Even a lightweight note like “Outcome may be delayed by 60 days due to billing cycles; labels will be right-censored for recent accounts” can prevent misleading evaluation and unrealistic timelines.
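The right-censoring note above can be enforced mechanically when building evaluation sets. A small sketch, assuming a 30-day outcome horizon plus a 60-day billing-cycle reporting lag:

```python
from datetime import date, timedelta

def labelable(prediction_date, as_of_today, horizon_days=30, lag_days=60):
    """Can we trust the label for a prediction made on prediction_date?
    Only if the full outcome window (horizon plus reporting lag) has
    already elapsed; otherwise the label is right-censored and the
    example should be excluded from evaluation, not treated as negative."""
    window_end = prediction_date + timedelta(days=horizon_days + lag_days)
    return window_end <= as_of_today
```

Excluding censored examples instead of labeling them 0 is the important part: counting not-yet-observable churn as “retained” systematically inflates apparent performance.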

Two real-world framing walkthroughs

Example 1: Churn outreach for a SaaS product

Start with the raw request: “Predict churn so we can retain customers.” That sounds reasonable, but “churn” can mean cancellation, non-renewal, downgrade, or inactivity. The first framing move is to define the target in a way that matches an operational action.

A well-framed version might be: “Each Monday, score active paying accounts with at least 60 days tenure for probability of cancellation within 30 days, using only behavioral and account data available as of Sunday night. The retention team will contact the top 500 accounts with the highest risk, and success will be measured by incremental retention and precision@500.” This forces alignment on unit (account), timing (weekly snapshot), and capacity (top 500).

Now the trade-offs become visible. A weekly cadence implies you need stable, snapshot-friendly features and clear “as of” logic. The 30-day horizon matches how quickly outreach can happen and how long it takes to see cancellations. Metrics separate model quality from impact: precision@500 tells you whether the list is concentrated with true churners, while incremental retention tells you whether interventions help rather than simply targeting people who would have churned anyway.

Limitations are also explicit. If you only measure “did they churn,” you might optimize for predicting inevitable churners rather than identifying customers you can actually save. If discounts are offered, you may reduce revenue even when you retain accounts. That’s not a modeling problem—it’s a framing problem—so the frame should include how value is computed (e.g., retained revenue minus incentive cost).

Example 2: Fraud detection for card-not-present transactions

The raw request: “Use ML to detect fraud in real time.” Here, the decision is immediate and high-stakes: approve, decline, or route for step-up verification. Framing begins by tightening the operational definition: “For each transaction at authorization time, predict likelihood of being fraudulent as defined by confirmed chargeback within 90 days, using only data available at authorization. Use the score to route the top 0.5% highest-risk transactions to manual review and to decline above a higher threshold, subject to a maximum false-decline rate.”

This framing makes the evaluation constraints obvious. The label is delayed: chargebacks can arrive weeks later, so recent transactions have unknown outcomes. That means you’ll need to define how you handle incomplete labels in evaluation and reporting. It also surfaces asymmetric costs: false negatives lose money and trust; false positives block legitimate customers and harm conversion.

Step-by-step, the framed ML task is likely not a single binary decision. It’s a risk score plus policy thresholds where business tolerances are encoded. Ranking metrics matter because review is capacity-bound; calibration matters because thresholds and cost computations assume probabilities are meaningful. Leakage risks are severe: anything created after the transaction—investigation notes, customer complaints, disputes—must be excluded from features at authorization time.
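One way to picture the “risk score plus policy thresholds” idea is a thin routing layer on top of the score. The thresholds below are placeholders for values a business would set subject to its review capacity and false-decline cap, not recommendations:

```python
def route_transaction(score, decline_threshold=0.95, review_threshold=0.80):
    """Policy layer over a calibrated fraud risk score. In the framing
    above, review_threshold would be tuned so roughly 0.5% of traffic
    reaches manual review, and decline_threshold would be constrained
    by the maximum tolerated false-decline rate. Both values here are
    illustrative assumptions."""
    if score >= decline_threshold:
        return "decline"
    if score >= review_threshold:
        return "manual_review"
    return "approve"
```

Separating the score from the routing policy is the design choice that lets the business adjust tolerances (e.g., tighten declines during an attack) without retraining the model.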

Limitations should be called out up front. Fraud patterns drift, adversaries adapt, and data distribution shifts quickly. Even a well-framed model needs monitoring, but framing decides what “good” means: perhaps keeping fraud losses below a target while maintaining approval rate and holding false declines under a cap. With that frame, stakeholders can accept trade-offs intentionally rather than arguing after deployment.

The framing recap that keeps projects aligned

Problem framing is the discipline of making ML projects specific, testable, and decision-shaped before modeling begins. A strong frame defines the target precisely, chooses the ML output that fits the decision (binary vs probability vs ranking), locks down time boundaries to prevent leakage, and picks success metrics that reflect real constraints and costs. When this is done well, modeling becomes an implementation detail rather than an expensive guessing game.

Now that the foundation is in place, we’ll move into Data Readiness & Risk Checks [20 minutes].

Last modified: Thursday, 19 February 2026, 8:46 AM