Optimization: Schedules, Regularization, Initialization
When “it trains” but won’t converge again tomorrow
You’re fitting a regularized logistic regression churn model on a rolling, time-stamped event table. On Monday’s training window, it reaches a solid validation AUC in minutes. On Tuesday, with the same code and almost the same data volume, it oscillates, converges slowly, or lands in a different solution with noticeably worse calibration. Nothing “mystical” happened—your optimizer simply walked a different path through the same loss landscape.
In classical ML, most production pain around optimization comes from three levers that quietly interact: learning-rate schedules, regularization, and initialization. These aren’t afterthoughts; they determine whether the algorithm finds a stable, generalizing solution—or a brittle one that depends on accidents of scaling, data ordering, and numeric conditioning.
This lesson makes those levers explicit. You’ll see how schedules control how you search, how regularization changes what you prefer, and how initialization determines where you start—all with concrete guidance for convex (and nearly convex) classical ML settings and for the practical realities of iterative solvers.
The optimization triad: schedules, regularization, initialization
When we say “optimization” here, we mean iteratively minimizing an objective like:
[ \min_w\;\; \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(y_i, f_w(x_i))}_{\text{data fit}} \;+\; \underbrace{\lambda \,\Omega(w)}_{\text{regularization}} ]
Three terms in that story deserve crisp definitions.
Learning-rate schedule: the rule for choosing the step size (\eta_t) over iterations (t) (and sometimes per-parameter). In gradient descent / SGD, it decides how aggressively you move and how quickly you “cool” from exploration into convergence. In coordinate descent and quasi-Newton methods, “schedule” often shows up as line search or damping, but the same principle holds: you’re controlling how far each update is allowed to go.
Regularization: any mechanism that biases solutions toward desirable properties—small norms (L2), sparsity (L1), smoothness, or stability—to improve generalization. This is an explicit form of inductive bias: you restrict the effective hypothesis class, not by architecture but by objective geometry.
Initialization: the starting point (w_0). In convex problems, initialization doesn’t change the global optimum, but it can change time-to-solution a lot. In non-convex or poorly conditioned settings (even “classical” ones, like matrix factorization or some boosting setups), initialization can change which local region you settle into.
A useful unifying analogy: regularization is the map, the schedule is your driving style, and initialization is where you enter the map. If the map is narrow (strong regularization) you can drive faster without flying off; if the road is icy (ill-conditioned features) you need smaller steps or better tires (preconditioning / scaling). These levers interact directly with the inductive-bias idea you've already seen: constraints and preferences don't just "generalize better"—they often optimize more easily as well.
How schedules, regularization, and initialization actually change the search
Learning-rate schedules are a stability tool, not a decoration
A step size is a promise: “my local gradient is informative enough that moving (\eta_t) units won’t overshoot.” When that promise fails—because features are poorly scaled, the curvature varies wildly, or gradients are noisy—you see oscillation, divergence, or painfully slow progress. The schedule is how you keep that promise true over time.
In classical ML, you commonly face two regimes. With full-batch gradient descent on a smooth convex loss (e.g., ridge regression, logistic regression with L2), a constant step sized to the gradient's Lipschitz constant converges predictably, but that safe constant may be too small to be practical if the problem is ill-conditioned. With SGD / mini-batch methods (common in large-scale linear models), gradients are noisy; a constant LR leaves the iterates hovering around the optimum, while a decaying LR tightens their distribution and actually converges.
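As a concrete check on the first regime, here is a minimal sketch, assuming a synthetic standardized design matrix (all names and sizes are illustrative), of deriving the safe constant step for ridge regression from the gradient's Lipschitz constant:

```python
import numpy as np

# For the ridge objective (1/n)||Xw - y||^2 + lam * ||w||^2, the gradient is
# Lipschitz with constant L = 2 * (sigma_max(X)^2 / n + lam); a constant step
# eta = 1/L cannot overshoot on this problem.
rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))       # stand-in for standardized features
lam = 1e-2

sigma_max = np.linalg.svd(X, compute_uv=False)[0]   # largest singular value
L = 2 * (sigma_max**2 / n + lam)                    # smoothness constant
print(f"safe constant step size: {1.0 / L:.4f}")
```

If that safe step comes out tiny, the culprit is conditioning (a large (\sigma_{\max}) relative to the rest of the spectrum), not the schedule.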
A practical comparison is the “explore then settle” pattern:
- Early iterations benefit from larger steps to move quickly from a bad start.
- Later iterations benefit from smaller steps to avoid bouncing in narrow valleys and to reduce sensitivity to gradient noise.
Common schedules in classical ML include (sketched in code below):
- Step decay (drop by a factor at fixed milestones): easy to reason about and often sufficient.
- Exponential decay: a smoother version of step decay; can decay too fast if not tuned.
- Inverse-time decay ((\eta_t = \eta_0/(1 + kt))): a classic SGD choice when you want eventual convergence.
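All three schedules fit in a few lines; in this sketch the hyperparameter values (eta0, drop, gamma, k) are illustrative placeholders, not recommended defaults:

```python
import numpy as np

def step_decay(t, eta0=0.1, drop=0.5, drop_every=100):
    """Multiply the LR by `drop` at fixed milestones."""
    return eta0 * drop ** (t // drop_every)

def exponential_decay(t, eta0=0.1, gamma=0.01):
    """Smooth shrinkage; decays too fast if gamma is mistuned."""
    return eta0 * np.exp(-gamma * t)

def inverse_time_decay(t, eta0=0.1, k=0.01):
    """Classic SGD choice: eta_t = eta0 / (1 + k*t)."""
    return eta0 / (1.0 + k * t)

for t in (0, 100, 1000):
    print(t, step_decay(t), exponential_decay(t), inverse_time_decay(t))
```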
Best practices:
- Scale features (standardization) before blaming the schedule. Many "LR problems" are really conditioning problems from heterogeneous feature magnitudes.
- Track loss decrease per iteration and update norms (see the diagnostic sketch after this list); if updates explode or alternate in sign, you're too aggressive.
- If you use SGD, treat LR and batch size as coupled: larger batches reduce noise and often tolerate larger LRs, but can reduce the beneficial stochasticity that helps escape flat regions.
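As a sketch of that second practice, here is a logistic-regression SGD loop that logs batch loss and update norms; the data is synthetic and the hyperparameters are illustrative:

```python
import numpy as np

def sgd_with_diagnostics(X, y, eta=0.1, batch=64, steps=500, seed=0):
    """Mini-batch SGD for logistic regression, logging loss and update norms."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(steps):
        idx = rng.choice(len(X), size=batch, replace=False)
        p = 1.0 / (1.0 + np.exp(-X[idx] @ w))       # sigmoid predictions
        grad = X[idx].T @ (p - y[idx]) / batch      # logistic-loss gradient
        update = eta * grad
        w -= update
        if t % 100 == 0:
            loss = -np.mean(y[idx] * np.log(p + 1e-12)
                            + (1 - y[idx]) * np.log(1 - p + 1e-12))
            # Exploding or sign-alternating update norms mean eta is too aggressive.
            print(f"step {t}: batch loss {loss:.4f}, "
                  f"update norm {np.linalg.norm(update):.4f}")
    return w

# Illustrative usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 10))
y = (X @ np.ones(10) + rng.normal(size=5000) > 0).astype(int)
sgd_with_diagnostics(X, y)
```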
Typical misconceptions:
- "Higher LR always trains faster." It can reduce iteration counts while dramatically increasing wall-clock time due to divergence retries or tiny line-search steps in second-order methods.
- "Decay is only for deep nets." In large-scale convex optimization, decay is often the difference between a stable convergent process and a perpetual random walk near the optimum.
Common pitfalls:
- Decaying too early: you freeze into a mediocre region before reaching the basin of good solutions.
- Never decaying in noisy SGD: you get a model whose parameters keep jittering, harming calibration and reproducibility.
- Ignoring curvature: one global LR can be too big for sharp directions and too small for flat ones—this is why feature scaling and (when available) preconditioning matter.
Regularization shapes both generalization and the geometry you optimize
Regularization is usually described as “prevent overfitting,” but under the hood it also modifies the curvature and identifiability of the objective. In classical ML, that geometric effect is often just as important operationally as the statistical one.
With L2 (ridge) regularization, you add (\lambda \|w\|_2^2). Statistically, this shrinks coefficients, reduces variance, and stabilizes estimates under multicollinearity. Geometrically, it makes the problem more strongly convex (for many losses), improving conditioning—meaning gradient-based methods can take more reliable steps, and solutions become less sensitive to small data shifts. This is why ridge is often the "boring but safe" choice in high-dimensional tabular settings: it's not only a bias toward small weights; it's also a bias toward stable optimization.
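A quick numerical illustration of that conditioning claim, on a synthetic design with a near-duplicate column (names and values are illustrative): the L2 term lifts the smallest Hessian eigenvalue, shrinking the condition number the optimizer has to fight.

```python
import numpy as np

# For squared loss, the Hessian is (2/n) X^T X; ridge adds 2*lam*I.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
X[:, 9] = X[:, 0] + 1e-3 * rng.normal(size=500)   # nearly collinear column

H = 2 * X.T @ X / len(X)
for lam in (0.0, 0.1, 1.0):
    eigs = np.linalg.eigvalsh(H + 2 * lam * np.eye(10))   # ascending order
    print(f"lam={lam}: condition number {eigs[-1] / eigs[0]:.3g}")
```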
With L1 (lasso) regularization, (\lambda \|w\|_1) induces sparsity and can perform feature selection. The optimization geometry is different: the objective becomes non-smooth, and the optimum often lies on corners where many coefficients are exactly zero. Methods like coordinate descent and proximal gradient thrive here, but naive gradient descent struggles. Practically, L1 can improve interpretability and reduce deployment cost, but it can also be unstable when features are highly correlated—small data changes can swap which correlated feature gets the nonzero weight, giving you a model that looks "sparse" yet is not robust.
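To see why proximal methods handle L1 cleanly while plain gradient descent does not, here is a minimal sketch of the soft-thresholding operator they rely on: a proximal-gradient step is an ordinary gradient step on the smooth loss followed by this shrinkage, which sets small coordinates exactly to zero.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau*||.||_1: shrink toward zero, zeroing |w_j| <= tau."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def prox_gradient_step(w, grad, eta, lam):
    """Gradient step on the smooth part, then exact shrinkage from the L1 term."""
    return soft_threshold(w - eta * grad, eta * lam)

print(soft_threshold(np.array([0.3, -0.05, 1.2]), 0.1))  # -> [ 0.2 -0.   1.1]
```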
Regularization choices also interact with inductive bias in a very direct way: you’re encoding what you think “simple” means. For churn, “simple” might mean small overall influence (L2), or few active signals (L1), or a compromise (elastic net). Getting that bias wrong can be worse than underfitting: you can optimize perfectly toward the wrong notion of simplicity and then wonder why production shifts break you.
Best practices:
- Use L2 when you care about stability, correlated features are common, and you want predictable shrinkage.
- Use L1 when sparsity is genuinely valuable (latency, interpretability), but validate robustness under resampling or time splits.
- Consider elastic net to reduce L1's instability with correlated groups while keeping sparsity pressure.
Typical misconceptions:
- "More regularization always means worse training fit, better test." True in spirit, but in practice mild regularization can improve optimization enough that you get a better training fit at fixed compute, because you can take larger steps and avoid numerical pathologies.
- "L1 selects the 'true' features." It selects a sparse solution under the chosen penalty; in correlated feature sets, "which one" is often arbitrary.
Common pitfalls:
- Regularizing without standardizing features: penalties become scale-dependent, effectively biasing the model toward using large-scale features.
- Tuning (\lambda) on random splits for time-dependent data: you overestimate generalization because leakage and regime mixing hide instability.
Initialization affects reproducibility, speed, and sometimes the solution
In convex classical ML, you might think initialization is irrelevant: the global optimum is unique (or at least the set of optima is well behaved). That’s partly true—but “doesn’t change the optimum” is not the same as “doesn’t matter.” Initialization can dominate how quickly you reach the optimum, whether you hit a stopping criterion prematurely, and how stable your solution looks under finite precision and early stopping.
For smooth convex problems solved with iterative methods, a good initialization can cut iterations dramatically. Examples include starting logistic regression at zeros vs. starting from the previous day’s solution (a warm start) in a rolling training setup. Warm starts are extremely effective when yesterday’s optimum is close to today’s: you spend compute refining rather than rediscovering. For coordinate descent in L1 problems, warm starts across a (\lambda)-path are a standard trick: solve for a large (\lambda) (very sparse), then reuse the solution as initialization for a slightly smaller (\lambda).
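A minimal warm-start sketch using scikit-learn's warm_start flag; load_window here fabricates drifting synthetic windows so the sketch runs, and stands in for your real data plumbing:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def load_window(day, n=2000, d=20):
    """Stand-in data loader: slowly drifting synthetic classification windows."""
    X = rng.normal(size=(n, d))
    w_true = np.ones(d) + 0.05 * day
    y = (X @ w_true + rng.normal(size=n) > 0).astype(int)
    return X, y

# warm_start=True makes each fit() resume from the previous day's coefficients.
clf = LogisticRegression(penalty="l2", C=1.0, warm_start=True, max_iter=1000)
for day in range(5):
    X_t, y_t = load_window(day)
    clf.fit(X_t, y_t)
    print(f"day {day}: solver iterations {clf.n_iter_[0]}")
```

Expect the iteration counts after day 0 to drop sharply when consecutive optima are close; if they don't, that itself is a drift signal.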
Initialization also interacts with regularization and schedules through implicit bias and stopping. If you stop early (common in large-scale settings), then initialization meaningfully shapes what function you end up with—because you are not at the true optimum. Even in convex landscapes, early stopping acts like a form of regularization, and different starts can yield different effective solutions at the same compute budget.
Best practices:
- Prefer warm starts in rolling windows, hyperparameter sweeps, or (\lambda)-paths.
- Use zero initialization as a baseline for reproducibility, but don't confuse "reproducible start" with "fast start."
- Make stopping criteria explicit: gradient norm, objective tolerance, and max iterations determine how much initialization influences your final model.
Typical misconceptions:
- "Initialization only matters in deep learning." In industry-scale linear models trained with SGD and early stopping, it can change outcomes significantly.
- "Warm starts are always safe." They can drag stale bias forward if the data distribution shifts; you may converge quickly to yesterday's worldview.
Common pitfalls:
- Silent early stopping: you think you "trained to convergence," but you actually stopped at a compute cap; the initialization then becomes part of the model design.
- Changing data order with SGD without controlling seeds: two "same" runs are now two different optimization processes.
Choosing the right lever: what to change first
When training is unstable or performance is inconsistent, it’s tempting to tune everything at once. A tighter approach is to ask: is the problem (a) too noisy, (b) poorly conditioned, or (c) overly flexible relative to data?
Here’s a practical comparison of the three levers and what they’re best at addressing.
| Dimension | Schedules (LR / damping / line search) | Regularization (L1/L2/EN, early stopping) | Initialization (zero, warm start, heuristic) |
|---|---|---|---|
| Primary job | Control stability and convergence speed of iterative updates. | Encode preference for simpler solutions and reduce variance; often improves conditioning. | Set the starting point; affects time-to-solution and early-stopped outcomes. |
| Most helpful when | You see oscillation/divergence, noisy SGD, or slow convergence late in training. | You see overfitting, coefficient explosions, multicollinearity, or unstable solutions across splits. | You retrain frequently, run (\lambda)-paths, or see large iteration counts from a cold start. |
| Typical failure mode | Decay too fast (premature freezing) or too slow (never settles); one LR for all directions fails under bad scaling. | L1 instability under correlated features; penalties misbehave without feature scaling; wrong (\lambda) chosen under leakage. | Warm start locks in yesterday’s bias under shift; early stopping makes start “part of the model.” |
| What to check first | Feature scaling; gradient/update norms; batch size vs LR; whether you’re effectively at a noise floor. | Standardization; time-aware validation; coefficient path as (\lambda) changes; sensitivity to resampling. | Whether you’re compute-limited; whether stopping criteria are binding; distribution shift across windows. |
A useful rule of thumb:
- Fix feature scaling and the validation protocol first (they underlie all three levers).
- If you're unstable, adjust the schedule next.
- If you're stable but not generalizing, adjust regularization.
- If you're wasting compute, adjust initialization / warm starts.
Applied example 1: Rolling churn model with SGD that “forgets how to train”
You maintain a churn classifier trained nightly on the last 90 days of clickstream aggregates. The dataset is large enough that you use SGD or a mini-batch method for logistic regression with L2. One week, training starts taking longer and produces a slightly different calibration curve each day, even when the validation AUC is similar. Support tickets start appearing because the risk scores drift.
Step-by-step, the optimization triad explains what’s happening:
- Initialization: if you cold-start at (w_0 = 0) daily, you spend early iterations relearning yesterday’s coefficients. Warm-starting from yesterday often cuts iterations and reduces variance in the final parameters at fixed compute.
- Schedule: a constant LR with mini-batches can keep parameters jittering near the optimum. That jitter may not move AUC much but can meaningfully move score calibration, especially at decision thresholds. Introducing a gentle decay or switching to a smaller LR after a few epochs often stabilizes the end state.
- Regularization: if feature correlations grow over time (new product instrumentation often causes this), weak L2 may allow large cancelling weights that fit training but are sensitive to tiny distribution changes. Increasing L2 can improve both generalization and the numeric conditioning that the optimizer sees, making each step more reliable (a sketch combining all three fixes follows this list).
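Here is that sketch, using scikit-learn's SGDClassifier; load_window is the same kind of hypothetical stand-in as before, and every hyperparameter value is a placeholder to tune, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def load_window(day, n=2000, d=20):
    """Stand-in for the nightly 90-day window; fabricates drifting data."""
    X = rng.normal(size=(n, d))
    y = (X @ (np.ones(d) + 0.05 * day) + rng.normal(size=n) > 0).astype(int)
    return X, y

clf = SGDClassifier(
    loss="log_loss",                 # logistic regression
    penalty="l2", alpha=1e-4,        # raise alpha for more shrinkage/stability
    learning_rate="invscaling",      # eta_t = eta0 / t^power_t (gentle decay)
    eta0=0.1, power_t=0.5,
    warm_start=True,                 # nightly refits start from yesterday's weights
    random_state=0,                  # control data-order stochasticity
)
for day in range(5):
    X_t, y_t = load_window(day)
    clf.fit(X_t, y_t)
```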
Impact, benefits, limitations:
- Warm starts and modest LR decay usually improve retraining reliability and reduce compute, but they can also hide distribution shift by converging quickly to a familiar solution. You still need time-aware evaluation to catch when "fast convergence" is actually "fast overconfidence."
- Stronger L2 tends to improve stability, but if it's too strong you can wash out rare-but-real signals (e.g., infrequent support events that legitimately predict churn). The trade-off is not abstract—it shows up as fewer false alarms versus missed churners depending on your operating threshold.
Applied example 2: Sparse model for policy decisions that breaks under correlated features
A risk team wants an interpretable, sparse model: "no more than 30 nonzero coefficients." You try L1-regularized logistic regression on tabular features that include multiple variants of the same behavior (counts, rates, rolling windows), which are highly correlated. The model hits the sparsity target but flips which features are selected whenever you retrain, and stakeholders lose trust.
Step-by-step, you can diagnose and fix this with the same three levers:
- Regularization choice: pure L1 encourages sparsity but is unstable with correlated groups—any one of several near-duplicates can explain the same signal, so the “winner” changes with small data shifts. Elastic net (L1 + L2) often keeps sparsity while reducing this feature-swapping behavior by adding group-stabilizing shrinkage.
- Feature scaling: if you forgot to standardize, L1 penalizes coefficients in a scale-dependent way, effectively selecting based on units rather than usefulness. Standardization doesn’t guarantee stability, but without it your penalty is miscalibrated from the start.
- Initialization and paths: rather than jumping to a single (\lambda), compute a (\lambda)-path with warm starts (see the sketch after this list). Watching when coefficients enter and leave as (\lambda) decreases gives you a stability picture: if the top signals are unstable near your chosen sparsity level, that's a warning that the "30 features" constraint is fighting the data geometry.
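A minimal sketch of that warm-started path in scikit-learn, where C is the inverse of (\lambda), so sweeping C upward sweeps the penalty downward; the correlated synthetic data stands in for your feature table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic correlated features so the sketch runs standalone.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=1000)   # near-duplicate pair
y = (X[:, 0] + X[:, 5] + rng.normal(size=1000) > 0).astype(int)

# Warm-started lambda-path: each fit reuses the previous solution as its start.
clf = LogisticRegression(penalty="l1", solver="saga", warm_start=True,
                         max_iter=5000)
for C in np.logspace(-2, 1, 8):                    # strong -> weak penalty
    clf.set_params(C=C)
    clf.fit(X, y)
    print(f"C={C:.3f}: {np.sum(clf.coef_ != 0)} nonzero coefficients")
```

Watching which coefficients appear first as the penalty relaxes, and whether the near-duplicate pair trades places across seeds, is the stability picture described above.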
Impact, benefits, limitations:
- Elastic net and path-based selection typically improve interpretability durability (the story stays the same across retrains), but you may end up with slightly less sparsity than a pure L1 solution. In governance-heavy environments, "stable and explainable" often beats "maximally sparse."
- If the business insists on a hard cap on features, you may need to accept that the optimization objective alone can't guarantee semantic stability; stability becomes part of the model requirement, not a happy accident.
The point: make optimization a design choice
Optimization is not just “how you minimize”—it’s part of what model you end up deploying. Schedules determine whether the search is stable and whether you truly converge or just hover. Regularization expresses your inductive bias and reshapes the optimization landscape, often improving conditioning and robustness. Initialization controls compute efficiency and, under early stopping or drift, can influence the effective solution.
If you treat these as first-class levers—rather than last-minute tuning—you get models that retrain predictably, behave consistently under small shifts, and generalize for the reasons you can explain.
Next, we’ll build on this by exploring Generalization & Scaling Basics [20 minutes].