Sculpting Simplicity from the Infinite

I am a Ph.D. student exploring Embodied Intelligence. I write about Math, CS, AI and my views on them.
"The road is better than the inn." — Cervantes
We are living through a crisis of explanation in Artificial Intelligence (AI).
For decades, statistical learning theory offered us a comfortable, rigid view of the world. We had the bias-variance tradeoff. We had the VC dimension. We believed, with religious fervor, that model complexity must be strictly controlled. If you have $N$ data points, and you try to fit a model with \(P \gg N\) parameters, you will overfit. You will memorize the noise. Your test error will explode.
And yet, here we are. We routinely train Transformers with hundreds of billions of parameters on datasets that are, by comparison, microscopic. We push training error to absolute zero. We turn off weight decay. We strip away Dropout.
By all classical logic, these models should be catastrophic failures. But they are not. They generalize with an almost unreasonable effectiveness.
When a theory fails to explain the empirical reality, we must not discard the reality; we must discard the theory. The resolution to this paradox lies in the hidden mechanism, a “ghost“ within our optimization algorithms. We have spent so long analyzing the landscape (the loss function) that we forgot to analyze the walker (the optimizer).
In this article, we will explore Implicit Bias of Gradient Descent. We will rigorously derive why the standard gradient descent acts as a maximum-margin machine, perform three experiments to visualize the phenomenon, and discuss “how“ of learning is just as important as the “what“.
I. The Geometry of Infinity
Let us begin with the simplest setting to build our intuition: Linear Classification on Separable Data.
Consider a dataset \(S = \{ (x_i, y_i) \}_{i=1}^n\) where \(x_i \in \mathbb{R}^d\) and \(y_i \in \{ -1, +1\}\). We assume the data is linearly separable. We train a linear model \(f(x) = w^\top x\) using the exponential loss (a smooth upper bound on the 0-1 error, similar to Cross-Entropy):
\[\mathcal{L}(w) = \sum_{i=1}^n \exp(-y_i w^\top x)\]
We minimize this using Gradient Descent (GD) with a small step size \(\eta\):
\[w_{t+1} = w_t - \eta \nabla \mathcal{L}(w_t)\]
The Paradox of Zero Loss
Since the data is separable, there exists a \(w^*\) such that \(y_i {w^*}^\top x_i \gt 0\) for all $i$. To minimize \(\exp(-u)\), $u$ must go to infinity. Therefore, to minimize the loss, the magnitude of the weight vector \(\| w_t \|\) must grow to infinity.
\[\lim_{t \to \infty} \|w_t\| = \infty\]
In classical convex optimization, diverging iterates are a nightmare. But here, they are a feature. In classification, we only care about the sign of the prediction. The magnitude scales the confidence, but the direction determines the decision boundary.
The central question of Implicit Bias is: As \(\|w_t\| \to \infty\), to which direction does \(\frac{w_t}{\|w_t\|}\) converge?
There are infinite directions that separate the data (the “version space“). Does GD sample from them uniformly? No.
The Derivation
Let us look at the gradient.
\[\nabla \mathcal{L}(w) = -\sum_{i=1}^n \exp(-y_i w^\top x_i)y_i x_i\]
Let \(\alpha_i(w) = \exp(-y_i w^\top x_i)\). The scalar represents the “difficulty“ of point $i$. If the point $i$ is well classified (large margin), \(\alpha_i \approx 0\). If the point $i$ is near the boundary, \(\alpha_i\) is large.
The gradient update is a linear combination of data points:
\[w_{t+1} = w_t + \eta \sum_{i=1}^n\alpha_i(w_t)y_i x_i\]
As \(t \to \infty\), the weights \(w_t\) grow large. Let us decompose \(w_t\) into a magnitude \(\rho_t\) and a unit direction \(\hat{w}_t\):
\[w_t = \rho_t \hat{w}_t\]
Since \(\rho_t \to \infty\), the term \(\exp(-\rho_t y_i \hat{w}_t^\top x_i)\) decays exponentially fast. The rate of decay depends on the margin \(y_i \hat{w}_t^\top x_i\).
The datapoints with the smallest margin (the support vectors) will have the slowest decay. They will dominate the sum. For any data point $j$ that is not a support vector (i.e., strictly easier to classify than the hardest point), the ratio of its influence to the support vector’s influence goes to zero:
\[\frac{\alpha_j(w_t)}{\alpha_\text{SV}(w_t)} = \exp(-\rho_t(y_j \hat{w}_t^\top x_j - \gamma)) \to 0\]
where \(\gamma\) is the minimum margin.
This means that asymptotically, the gradient updates lie strictly in the subspace spanned by the support vectors.
\[w_\infty \propto \sum_{i \in \text{SV}}c_i y_i x_i\]
This is precisely the form of solution to the Hard Margin Support Vector Machine (SVM) dual problem.
Even without an \(L_2\) regularization term (Weight Decay) in the loss, the dynamics of Gradient Descent on exponential tails minimize the \(L_2\) norm relative to the margin. It solves:
\[\arg \max_{\hat{w}, \|\hat{w}\|=1} \min_i y_i \hat{w}^\top x_i\]
This is the “Geometric Ghost“. The optimizer seeks the most robust solution.
Experiment 1: The Convergence of Trajectories
We can prove this theorem, but seeing is believing. We will generate a linearly separable 2D dataset. We will initialize the weights in two completely different basins of attraction. If GD were merely finding a solution, the final directions would be different (dependent on initialization). If GD has an implicit bias, they should magnetically snap to the same angle, which is the max-margin angle.
The Setup
Data: Two Gaussian clouds, linearly separable.
Model: Linear layer (2 parameters), no bias term for simpler visualization.
Optimization: Vanilla GD, constant learning rate, no weight decay.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
# 1. Data Generation
np.random.seed(42)
N = 10
# Create two separable clusters
X_pos = np.random.randn(N, 2) * 0.5 + [1.5, 1.5]
X_neg = np.random.randn(N, 2) * 0.5 + [-1.5, -1.5]
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(N), -np.ones(N)])
# Max Margin Calculation (Using sklearn for Ground Truth comparison)
from sklearn.svm import SVC
clf = SVC(kernel='linear', C=1000)
clf.fit(X, y)
svm_w = clf.coef_[0]
svm_dir = svm_w / np.linalg.norm(svm_w)
# 2. The Gradient Descent Engine
def run_gd_dynamics(init_w, iterations=50000, lr=0.01):
w = np.array(init_w, dtype=float)
path = []
for t in range(iterations):
# Record normalized direction
w_norm = np.linalg.norm(w)
if w_norm > 1e-9:
path.append(w / w_norm)
else:
path.append(w)
# Exponential Loss Gradient: - sum( y * x * exp(-y * w^T x) )
scores = X @ w
margins = y * scores
# We use a stable exp calculation
coeffs = np.exp(-margins)
grad = - (coeffs[:, None] * y[:, None] * X).sum(axis=0)
w = w - lr * grad
return np.array(path)
# 3. Execution from Diverse Initializations
inits = [[-2.0, 4.0], [4.0, -1.0], [-3.0, -3.0], [1.0, 5.0]]
paths = [run_gd_dynamics(i) for i in inits]
# 4. Visualization
plt.figure(figsize=(10, 10))
ax = plt.gca()
# Plot Unit Circle
theta = np.linspace(0, 2*np.pi, 200)
plt.plot(np.cos(theta), np.sin(theta), color='gray', linestyle='--', alpha=0.3)
# Plot Paths
colors = cm.viridis(np.linspace(0, 1, len(inits)))
for i, path in enumerate(paths):
plt.plot(path[:, 0], path[:, 1], color=colors[i], linewidth=2, label=f'Init {inits[i]}')
# Mark start and end
plt.scatter(path[0, 0], path[0, 1], color=colors[i], marker='o')
plt.arrow(path[-2, 0], path[-2, 1], path[-1, 0]-path[-2, 0], path[-1, 1]-path[-2, 1],
head_width=0.05, color=colors[i])
# Plot Ground Truth SVM Direction
plt.arrow(0, 0, svm_dir[0], svm_dir[1], color='red', width=0.01, head_width=0.1, label='SVM (Max Margin)')
plt.title("Implicit Bias: Convergence to Max-Margin Direction", fontsize=15)
plt.xlabel("Normalized w_1")
plt.ylabel("Normalized w_2")
plt.legend()
plt.axis('equal')
plt.grid(True, alpha=0.2)
plt.show()
Observation

The plot reveals a striking behavior. The trajectories begin chaotically, dictated by the local gradients of their starting positions. However, as the norm grows and the loss drops, they all converge to a single point on the unit circle. The red arrow (the SVM solution calculated via quadratic programming) aligns perfectly with the asymptote of the Gradient Descent paths. The optimizer has "chosen" the simplest, most robust solution without being told to.
Experiment 2: The Low Rank Compulsion
Linear models are one thing, but Deep Learning is built on matrix multiplications. What happens when we have overparameterized matrices?
Consider Matrix Completion. We have a ground truth matrix \(M^*\), but we only observe a subset of its entries. We want to recover the full matrix.
We parameterize our candidate matrix as a product of two matrices: \(M = UV^\top\).
If $U$ and $V$ are \(N \times N\), the resulting matrix $M$ can be full rank (Rank $N$). We have enough parameters to fit the observed entries in infinite ways, many of which would result in full-rank, noisy matrix.
However, a famous result by Gunasekar et al. suggests that Gradient Descent on the factorization \(UV^\top\) minimizes the Nuclear Norm (the sum of singular values). It implicitly seeks Low Rank solutions.
The Setup
Task: Recover a Rank-1 matrix from 60% of its entries.
Model: \(M = UV^\top\) where inner dimension is equal to $N$ (allowing for full rank).
Initialization: Very small random numbers (crucial for the bias to manifest).
import numpy as np
import matplotlib.pyplot as plt
N = 20
Rank_True = 1
# 1. Create Ground Truth Low-Rank Matrix
U_true = np.random.randn(N, Rank_True)
V_true = np.random.randn(N, Rank_True)
M_true = U_true @ V_true.T
# 2. Create Observation Mask (60% visible)
mask = np.random.rand(N, N) < 0.6
# 3. Initialize Full-Rank Capacity Model
# Inner dimension = N, so it CAN represent any rank up to N
U = np.random.randn(N, N) * 1e-3 # Small initialization is key
V = np.random.randn(N, N) * 1e-3
# 4. Optimization Loop
lr = 0.01
steps = 20000
singular_val_evolution = []
for t in range(steps):
# Forward
M_pred = U @ V.T
diff = (M_pred - M_true) * mask # Loss only on observed entries
loss = 0.5 * np.sum(diff**2)
# Gradients for U and V (Chain rule)
grad_U = diff @ V
grad_V = diff.T @ U
U -= lr * grad_U
V -= lr * grad_V
if t % 100 == 0:
# Record singular values to track rank
_, s, _ = np.linalg.svd(M_pred)
singular_val_evolution.append(s)
# 5. Analysis
final_s = singular_val_evolution[-1]
final_s = final_s / final_s.max() # Normalize
plt.figure(figsize=(10, 5))
# Plot Evolution of Singular Values
s_hist = np.array(singular_val_evolution)
for i in range(5): # Plot top 5 singular values
plt.plot(s_hist[:, i], label=f'Singular Value {i+1}')
plt.title("Implicit Regularization toward Low Rank Solutions")
plt.ylabel("Magnitude")
plt.xlabel("Step / 100")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print("Final Normalized Singular Values (First 5):", final_s[:5])
Observation

Even though the model had the capacity to be Rank 20 (and overfit the observed entries with a high-rank mess), the singular values show a "winner-take-all" dynamic. The first singular value grows rapidly to capture the signal. The subsequent singular values remain suppressed near zero. The optimizer spontaneously performed model compression. It "discovered" that the data was Rank-1.
Experiment 3: The Geometry of Algorithms (SGD vs Adam)
If the bias comes from the dynamics, then changing the update rule should change the solution. SGD updates based on the Euclidean gradients (\(\ell_2\) geometry). Adam normalizes the variance, effectively taking steps that look more like the corners of a hypercube (\(\ell_\infty\) geometry).
For a linear model, SGD converges to the max \(\ell_2\) margin. Adam, interestingly, tends to converge closer to max \(\ell_\infty\) margin solution (or feature selection solution), favoring sparse weights where only the most informative features are used.
Let us test these competition of geometries.
import torch
import torch.nn as nn
import torch.optim as optim
# 1. Data Setup (Same as Exp 1 but strictly separable tensors)
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)
def trace_optimizer(opt_class, name, color, start_w):
w = torch.tensor(start_w, dtype=torch.float32, requires_grad=True)
if name == 'Adam':
opt = opt_class([w], lr=0.1)
else:
opt = opt_class([w], lr=0.1)
path = []
for _ in range(2000):
# Normalize for plotting
with torch.no_grad():
w_norm = w / torch.norm(w)
path.append(w_norm.numpy().copy())
# Exponential Loss
loss = torch.sum(torch.exp(-y_t * (X_t @ w)))
opt.zero_grad()
loss.backward()
opt.step()
return np.array(path), name, color
# 2. Run Comparison
start_pos = [-2.0, -0.5]
path_sgd, _, _ = trace_optimizer(optim.SGD, "SGD", "blue", start_pos)
path_adam, _, _ = trace_optimizer(optim.Adam, "Adam", "orange", start_pos)
# 3. Visualization
plt.figure(figsize=(8, 8))
theta = np.linspace(0, 2*np.pi, 200)
plt.plot(np.cos(theta), np.sin(theta), 'k--', alpha=0.2)
plt.plot(path_sgd[:, 0], path_sgd[:, 1], label='SGD', color='blue', linewidth=2)
plt.plot(path_adam[:, 0], path_adam[:, 1], label='Adam', color='orange', linewidth=2)
# Start point
plt.scatter(path_sgd[0,0], path_sgd[0,1], c='black', label='Start')
plt.title("Algorithmic Bias: SGD vs Adam Trajectories")
plt.axis('equal')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Observation

As you can see, the trajectories diverge. SGD takes a smoother more direct curve toward the max-margin solution. Adam, due to its component-wise scaling, takes a more rigid, sometimes “Manhattan-like“ path. They both achieve zero training error, but they land on different spots on the sphere. This explains why, in ImageNet training, SGD outperforms Adam in generalization, the \(\ell_2\) geometry of SGD aligns better with the natural symmetries of image data than the geometry of Adam.
Philosophical Implications
We often perceive optimization as a means to end. We define a loss function \(L(\theta)\) as the “substance“ of our problem, and the optimizer is just the tool to slide down the hill.
These experiments force us to adopt a Process Philosophy. The solution is not defined solely by the landscape \(L(\theta)\). It is defined by the trajectory \(\theta(t)\).
In the overparameterized regime (where \(L(\theta) = 0\) forms a vast manifold), the “what“ (the solution) is determined by the “how“ (the algorithm).
Initialization matters: It determines the basin of attraction.
Step size matters: Large step size prevent sharp minima selection.
Curvation matters: The geometry of the update rule shapes the inductive bias.
This is reminiscent of the concept of Autopoiesis, a system capable of maintaining and creating itself. The neural network, during training, is not just learning data; it is self-organizing into a low-complexity state driven by the thermodynamic-like laws of the gradient flow.
Real-World Impact
Why does this matter outside of a blog post?
Adversarial Robustness: The “max margin“ bias of SGD is a primary reason deep networks are somewhat robust to noise. If we used an optimizer with a different bias, we might fit the data perfectly but the decision boundary graze the data points, making the model incredibly brittle.
Grokking: Recently, researchers observed “grokking“—where validation accuracy stays flat for thousands of epochs and then suddenly jumps. This is the implicit bias at work. The model “memorizes“ first (fast optimization), but the slow, steady pull of the implicit regularization eventually reorganizes the weights into a generalizing solution.
Pruning and Compression: The low-rank bias (Exp-2) explains why we can prune 90% of a trained network without losing accuracy. The training process naturally pushed the “information“ into a low-dimensional subspace, leaving rest of the parameters as redundant scaffolding.
Conclusion
The optimizer is not a natural explorer. It is a biased architect.
As we build larger models, we must stop thinking of "Generalization" and "Optimization" as separate fields. They are two sides of the same coin. The geometry of the optimization path is the generalization guarantee. We are not just searching for a needle in a haystack; the magnet we use to search constructs the needle.
If you enjoyed this deep dive, try running the provided code with different initializations or random seeds. The consistency of the convergence is truly one of the most beautiful phenomena in high-dimensional mathematics.



