From Features to Representations
Why “good features” suddenly stop being enough
A credit-risk team ships a solid logistic regression model and it performs well—until a new product launch changes customer behavior. The raw inputs are still there (income, utilization, payment history), but the model starts making confident mistakes in pockets of the population it used to handle. The uncomfortable truth is that the predictive information didn’t disappear; the way it’s expressed in the data shifted.
In classical ML, this is where feature engineering often turns into an endless cycle: add ratios, bin variables, log-transform, re-encode categories, and hope the model “sees” what you see. That can work, but it scales poorly and can be brittle under distribution changes. A more durable approach is to think in terms of representations: not just which columns you feed a model, but what space the data lives in after you transform it.
This lesson reframes the workflow from “collect features” to “learn or construct representations”—and shows why that shift matters even when you still plan to use classical models.
Features, representations, and the real job of a model
A feature is an input variable you provide to a model—often a column in a design matrix X. Features can be raw (age) or engineered (log(age), debt-to-income, interaction terms). In practice, “feature engineering” means deciding which measurements to include and how to encode them.
A representation is the transformed description of an example that the model actually uses to separate, rank, or predict outcomes. It might still be a vector of numbers, but it’s better to think of it as a geometry: distances, angles, neighborhoods, and directions that define what “similar” means. The same raw features can yield very different representations depending on scaling, encoding, basis expansion, or learned embeddings.
A useful analogy: features are ingredients; representations are the cooked dish. Two chefs can start with the same ingredients and produce meals with different texture and flavor. In ML terms, a model’s success often depends less on the algorithm and more on whether the representation makes the target function simple (often close to linear or smoothly varying). Classical ML lives and dies by this: the “right” representation can make a linear model competitive, while a poor one makes even complex models struggle.
How representations reshape the problem (and why linear models care)
The core idea is simple: learning is easier when the relationship between inputs and target is simple in the chosen space. Many classical models—linear regression, logistic regression, linear SVM—assume that in the representation space, the decision boundary is a hyperplane or close to one. If the true boundary is curved or depends on interactions, raw features may force the model to approximate it poorly.
Representation building changes the effective hypothesis class without changing the downstream model. A polynomial feature map, spline basis, or interaction expansion can turn a nonlinear relationship into something a linear model can fit. Similarly, kernels implicitly map data into high-dimensional spaces where linear separators correspond to nonlinear decision boundaries in the original space. Even “mundane” steps like standardization change optimization geometry and the meaning of regularization, which directly affects what the model can express and prioritize.
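To make this concrete, here is a minimal sketch (using scikit-learn, which this lesson does not prescribe; the dataset and settings are illustrative) that feeds one linear classifier two representations of the same concentric-circles problem:

```python
# A minimal sketch, assuming scikit-learn: the same linear classifier
# on two representations of data whose true boundary is a circle.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)

raw = LogisticRegression()  # no hyperplane separates concentric circles
expanded = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())

print("raw accuracy:     ", cross_val_score(raw, X, y).mean())       # near chance
print("expanded accuracy:", cross_val_score(expanded, X, y).mean())  # near perfect
```

The degree-2 expansion adds squared and cross terms, and in that space the circular boundary becomes a hyperplane; the classifier itself never changed.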
One way to keep this grounded is to explicitly separate:
- Information: what predictive signal exists in the raw measurements.
- Accessibility: whether the model class can access that signal given the representation.
- Stability: whether the representation keeps meaning under drift (e.g., seasonality, product changes, measurement noise).
A strong representation increases accessibility and stability. A weak one forces you to compensate with complexity, data volume, or fragile heuristics.
Three classical routes from features to representations
Classical ML typically gets representations in three broad ways: manual transformations, learned linear transforms, and implicit high-dimensional mappings. They differ in what they assume, how they fail, and how interpretable they remain.
| Dimension | Manual feature maps (scaling, logs, interactions, bins) | Learned linear representations (PCA, LDA, ICA) | Kernel/implicit mappings (RBF SVM, kernel ridge) |
|---|---|---|---|
| What it does | Applies chosen transforms to make relationships more linear, comparable, or monotonic. You explicitly define the basis the model sees. | Learns a new coordinate system from data: directions of variance (PCA) or class separability (LDA). The model then works in that reduced/rotated space. | Defines similarity through a kernel; the model acts as if data lives in a richer feature space without constructing it explicitly. |
| When it shines | When domain structure is clear (ratios, rates, saturation effects) and you need interpretability and control. Works well with limited data and strict governance. | When features are collinear, noisy, or high-dimensional (e.g., correlated sensors, text TF-IDF). Can reduce noise and improve conditioning. | When decision boundaries are highly nonlinear and you have enough data to define local neighborhoods. Useful when manual feature design is hard. |
| Best practices | Use consistent scaling for regularized models; encode categories carefully; add interactions only when justified; monitor leakage. | Standardize before PCA; pick components via validation, not variance alone; check stability of components across time slices. | Tune kernel width and regularization jointly; use proper cross-validation; consider approximate kernels for scalability. |
| Common pitfalls | Brittleness under drift; exploding dimensionality with interactions; hidden leakage in target-derived aggregates. | Optimizing the wrong objective (variance ≠ predictiveness); losing interpretability; components shifting with data drift. | Overfitting via overly flexible kernels; compute/memory blowups; misleading confidence far from training support. |
| Typical misconception | “More engineered features always help.” More features can dilute signal and worsen generalization under regularization. | “PCA always improves models.” It can remove predictive low-variance directions or destroy sparse semantics. | “Kernels automatically solve feature engineering.” They still require careful scaling, tuning, and validation to avoid poor inductive bias. |
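To keep the three routes concrete, here is one hedged sketch of each as a pipeline (scikit-learn assumed; every hyperparameter below is a placeholder you would tune, not a recommendation):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

# Route 1: manual feature map -- you choose the basis explicitly.
manual = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True),  # chosen interactions
    LogisticRegression(C=1.0),
)

# Route 2: learned linear representation -- the data chooses the coordinates.
learned = make_pipeline(
    StandardScaler(),        # standardize before PCA, per the table above
    PCA(n_components=10),    # pick components via validation, not variance alone
    LogisticRegression(C=1.0),
)

# Route 3: implicit mapping -- the kernel defines similarity directly.
implicit = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", gamma="scale", C=1.0),  # tune gamma and C jointly
)
```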
Representation fundamentals that advanced practitioners rely on
Inductive bias lives in the representation as much as in the model
Inductive bias is the set of assumptions that lets a model generalize beyond training data. In classical ML, it’s tempting to attribute inductive bias to the algorithm (“linear models assume linearity,” “trees assume axis-aligned splits”). But the representation often encodes stronger assumptions than the model itself.
For instance, turning a continuous variable into bins assumes piecewise-constant effects and throws away within-bin ordering. Adding interaction terms assumes specific multiplicative relationships between variables, but also asserts that those interactions matter globally, not just in certain regimes. Standardizing features doesn’t change expressiveness, but it changes the effective strength of L1/L2 penalties across variables—so it changes which patterns the model prefers.
A powerful mental model is: your pipeline is the hypothesis class. The combination of preprocessing + representation + model + regularization defines what functions you can fit and which ones you prefer. Two teams can both say “we used logistic regression,” but if one uses target encoding with leakage-safe folds, splines for continuous variables, and calibrated probabilities, it is effectively running a different learning system than the team feeding raw columns.
The practical implication is governance-level: when you think “model choice,” include representation decisions in the same review. Many production failures come from silent representation changes (a new category appears, a scaler is refit on different data, a PCA component order flips) rather than from the classifier itself.
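A sketch of what “the pipeline is the hypothesis class” looks like as a single reviewable object, assuming scikit-learn and hypothetical column names:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, SplineTransformer, StandardScaler

numeric_cols = ["income", "utilization"]  # hypothetical column names
categorical_cols = ["product_type"]       # hypothetical column name

representation = ColumnTransformer([
    # Splines assume smooth nonlinear effects for continuous variables.
    ("num", Pipeline([
        ("scale", StandardScaler()),
        ("spline", SplineTransformer(degree=3, n_knots=5)),
    ]), numeric_cols),
    # One-hot assumes categories act as discrete, unordered levels.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("representation", representation),
    ("classifier", LogisticRegression(C=1.0, max_iter=1000)),
])
# Review `model` as a whole: changing any step changes what it can learn.
```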
Capacity, dimensionality, and the hidden cost of “making it linear”
Representation improvements often look like “expand the feature space.” Polynomial expansions, one-hot encodings with many categories, and interaction terms can balloon dimensionality. That increases capacity, which can help fit complex structure—but it also increases variance and can worsen generalization if not matched with regularization and data.
Regularization becomes the control knob that decides which directions in representation space are allowed to matter. In L2-regularized logistic regression, weights are penalized by magnitude, so feature scaling directly affects which variables are expensive. In L1 regularization, sparsity depends on how features are scaled and correlated; correlated groups can lead to unstable selection, where small data shifts swap which feature “wins.”
There’s also a computational angle: as dimension grows, optimization can become ill-conditioned, and cross-validation becomes more expensive. High-dimensional sparse representations (like one-hot + interactions) can still be efficient, but only if your tooling and data structures handle sparsity correctly. Otherwise, you pay with memory and training time, which pressures teams into skipping validation steps—creating fragile models with overconfident metrics.
A useful discipline is to treat representation complexity like any other system complexity: you justify it with measurable gains, you constrain it with regularization and validation, and you track it over time to detect drift-induced changes in what the model is actually using.
Similarity is a design choice: representations define neighborhoods
Many classical ML methods implicitly depend on neighborhoods: k-NN explicitly; RBF kernels implicitly; even linear models in a standardized space rely on distances in parameter updates and penalties. The representation defines what “close” means, and that can encode or erase meaningful structure.
Consider mixing numeric and categorical features. If you one-hot encode categories and standardize numerics, the effective distance between two points might be dominated by a single categorical mismatch, even if the numeric profiles are nearly identical. Alternatively, if you scale numerics too aggressively, categories become irrelevant. Either way, your representation is dictating which variations the model treats as “small perturbations” versus “different worlds.”
This matters under drift. If a new category appears, a one-hot encoder either rejects it outright or, when configured to ignore unknowns, maps it to an all-zero block, collapsing every unseen category into the same representation. If price distributions shift, log transforms can stabilize scale but also compress meaningful extremes. If seasonality changes, time encodings can turn a smooth seasonal pattern into an aliasing artifact.
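A quick sketch of the unknown-category collapse, assuming a recent scikit-learn (the `sparse_output` argument exists from version 1.2 on):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the categories seen during training...
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit(np.array([["gold"], ["silver"], ["bronze"]]))

# ...then transform two categories that appeared after launch.
print(enc.transform(np.array([["platinum"], ["trial"]])))
# [[0. 0. 0.]
#  [0. 0. 0.]]  <- distinct new categories, one shared representation
```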
A best practice at this level is to make similarity assumptions explicit: write down what you want to be invariant to (units, monotonic transforms, rare categories) and what you want to remain sensitive to. Then choose representation steps that enforce those invariances deliberately, rather than discovering them accidentally through model surprises.
Two applied examples that show the representation shift in action
Example 1: Credit default prediction with monotonic effects and interactions
A lender predicts probability of default using classical tabular data. The raw features include annual income, credit utilization, number of late payments, and recent inquiries. A baseline logistic regression on raw values underperforms and produces unintuitive coefficients—income appears weak, utilization swings wildly, and the model saturates on extreme values.
A representation-focused fix starts by asking what functional forms are plausible. Income and balance-related quantities often have diminishing returns and heavy tails, so log transforms or winsorization (capping extremes) make the relationship smoother. Utilization behaves nonlinearly: risk increases sharply past certain thresholds, so piecewise linear splines or carefully chosen bins can create a representation where a linear model approximates the risk curve. Late payments and inquiries are counts with sparse high values; using log(1 + x) or capping can prevent rare extremes from dominating.
Next come interaction terms that match domain logic. “High utilization” is much more concerning when “income is low,” so an interaction term like utilization × log(income) (or a spline-by-income segment) can expose a boundary that is not linearly separable in raw space. The important part is restraint: you add interactions tied to causal or operational hypotheses, then regularize and validate with time-split evaluation to avoid learning spurious correlations from a single period.
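A minimal sketch of these transforms as one representation function; the threshold (0.8) and cap (10) are illustrative assumptions, not values from any real scorecard:

```python
import numpy as np

def build_representation(income, utilization, late, inquiries):
    """Map raw credit columns into a space where linear risk is plausible."""
    income = np.asarray(income, dtype=float)
    utilization = np.asarray(utilization, dtype=float)
    return np.column_stack([
        np.log1p(income),                     # heavy tail -> diminishing returns
        utilization,                          # keep the level itself
        np.maximum(utilization - 0.8, 0.0),   # hinge: extra slope past a threshold
        np.log1p(np.minimum(late, 10)),       # cap + compress sparse counts
        np.log1p(np.minimum(inquiries, 10)),
        utilization * np.log1p(income),       # domain-motivated interaction
    ])

# A logistic regression fit on build_representation(...) now estimates
# slopes over meaningful, bounded regions instead of raw extremes.
```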
Impact-wise, the model often becomes both more stable and more interpretable: coefficients correspond to slopes in meaningful regions, and calibration improves because the representation aligns with how risk actually changes. The limitation is that this depends on correct assumptions—if product changes alter how utilization reflects risk, the representation can become stale. Monitoring must include feature distributions and partial dependence stability, not just AUC.
Example 2: Text classification where “features” are too literal to generalize
A support organization wants to route tickets into categories (billing, outage, account access). A bag-of-words logistic regression using raw term counts works on frequent phrases but fails on paraphrases and synonyms. Tickets that say “can’t sign in” and “login failure” look different in raw count space; the representation makes them distant even though they mean the same thing.
The representation shift begins with weighting and normalization: TF-IDF often improves the geometry by downweighting ubiquitous terms (“the,” “please,” “help”) and emphasizing discriminative ones. But even TF-IDF stays literal—synonyms still don’t align. Classical ML can still move toward representations by using linear latent spaces such as truncated SVD (LSA) on the TF-IDF matrix, producing dense components that capture co-occurrence structure. In that space, “sign in,” “login,” and “authentication” may project similarly because they appear in related contexts.
Step-by-step, you typically see: build a vocabulary, compute TF-IDF, apply SVD to obtain k components, and train a linear classifier on the component scores. The gain is often better generalization to wording variation and reduced dimensionality, which improves conditioning and training speed. It also helps with cold-start terms: rare words can still influence the representation through their co-occurrence patterns rather than needing strong direct weights.
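As a hedged sketch (scikit-learn assumed; k and the other settings are placeholders to tune), those steps compose into one pipeline:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# vocabulary -> TF-IDF -> k latent components -> linear classifier
lsa_router = make_pipeline(
    TfidfVectorizer(min_df=2, stop_words="english"),
    TruncatedSVD(n_components=100),  # k: pick via validation
    Normalizer(copy=False),          # common LSA step: near-cosine geometry
    LogisticRegression(max_iter=1000),
)
# lsa_router.fit(tickets, labels)  # needs a corpus with more than k TF-IDF features
```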
The limitation is interpretability and drift sensitivity. SVD components are mixtures of terms, and their meaning can shift as new ticket types appear or language changes. Governance needs to treat the learned representation as a first-class artifact: version it, fit it on a defined corpus window, and revalidate when the underlying text distribution changes.
Pulling it together: what to watch, what to trust
The practical takeaway is that “features” are not the end of the story; the representation is where learning becomes easy or hard. When models fail unexpectedly, it’s often because the representation makes critical distinctions invisible (no interaction terms, wrong scaling) or makes incidental distinctions too visible (leakage, brittle encodings, overexpanded feature spaces).
Keep a tight checklist in mind when you design or audit a classical ML pipeline:
- Does the representation match plausible relationships? (monotonicity, saturation, thresholds, interactions)
- Does preprocessing change the meaning of regularization? (scaling effects, collinearity, sparsity stability)
- Does the representation define similarity in the way you intend? (categories vs. numerics, drift behavior, unknown handling)
- Is the representation an artifact you can version and monitor? (encoders, scalers, PCA/SVD bases, schema evolution)
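For the last item, one common pattern is to persist the fitted pipeline as a single versioned artifact, for example with joblib, scikit-learn's usual persistence tool (the filename convention here is just illustrative):

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), LogisticRegression())
# ... pipeline.fit(X_train, y_train) ...

# Dump encoders, scalers, and any PCA/SVD bases together with the model,
# so the exact representation travels to serving and can be diffed later.
joblib.dump(pipeline, "credit_model_v1.joblib")
restored = joblib.load("credit_model_v1.joblib")
```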
This sets you up perfectly for the next lesson, When Classical Models Still Win.