This post started with a supervision question (Cambridge Part II, Math for ML, Examples Sheet 1).
Working through the variance of Ridge predictions, I came across the standard result that
$$ \text{Var}(\hat{h}^\lambda(x) \mid X_{1:n}) = \sigma^2 \sum_{j=1}^{d} \frac{d_j \, v_j^2}{(d_j + \lambda)^2} $$where the $d_j$ are eigenvalues of $\Phi^\top \Phi$ and $v = U^\top \varphi(x)$ rotates your feature vector into the eigenbasis.
The exercise asks you to show this variance is decreasing in $\lambda$, which follows because each summand is strictly decreasing: differentiating gives $-2d_j v_j^2/(d_j+\lambda)^3 < 0$.
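As a sanity check (a quick numpy sketch on synthetic data; `Phi`, `phi_x`, and `sigma2` are made up for illustration), the spectral formula matches the direct matrix expression $\sigma^2 \varphi(x)^\top (\Phi^\top\Phi + \lambda I)^{-1} \Phi^\top\Phi \, (\Phi^\top\Phi + \lambda I)^{-1} \varphi(x)$, and the variance is indeed monotone in $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
Phi = rng.standard_normal((n, d))   # synthetic feature matrix
phi_x = rng.standard_normal(d)      # feature vector of a test point
sigma2 = 1.0                        # noise variance

G = Phi.T @ Phi
d_j, U = np.linalg.eigh(G)          # eigen-decomposition of Phi^T Phi
v = U.T @ phi_x                     # rotate phi(x) into the eigenbasis

def var_direct(lam):
    A = np.linalg.inv(G + lam * np.eye(d))
    return sigma2 * phi_x @ A @ G @ A @ phi_x

def var_spectral(lam):
    return sigma2 * np.sum(d_j * v**2 / (d_j + lam)**2)

for lam in (0.1, 1.0, 10.0):
    assert np.isclose(var_direct(lam), var_spectral(lam))

# monotonicity: the variance strictly decreases along a grid of lambdas
vals = [var_spectral(lam) for lam in np.linspace(0.01, 100, 200)]
assert all(a > b for a, b in zip(vals, vals[1:]))
```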
But thinking about this expression afterwards, I thought something about it seemed familiar.
Those eigenvalues $d_j$ are wearing three hats. They're the eigenvalues of $\Phi^\top \Phi$, which means they're also the squared singular values of $\Phi$, which means they're (up to scaling by the sample size) the principal component variances of the design matrix.
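This chain of identities is easy to verify numerically (a minimal numpy sketch on random data; the centring and the $1/(n-1)$ convention are my assumptions, matching `np.cov`):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
Phi = rng.standard_normal((n, p))
Phi -= Phi.mean(axis=0)                         # centre the features

eigvals = np.sort(np.linalg.eigvalsh(Phi.T @ Phi))[::-1]
svals = np.linalg.svd(Phi, compute_uv=False)    # returned in descending order

# eigenvalues of Phi^T Phi are the squared singular values of Phi
assert np.allclose(eigvals, svals**2)

# ...and, after dividing by n-1, the principal component variances
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(Phi, rowvar=False)))[::-1]
assert np.allclose(eigvals / (n - 1), cov_eigvals)
```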
The Ridge penalty shrinks coefficients along the principal component directions (with shrinkage inversely proportional to how much variance each component explains).
This is the connection between Ridge and PCA, and it's the part I found unintuitive.
So why does a supervised method end up organising its work around a coordinate system defined by an unsupervised decomposition?
It comes down to the fact that Ridge's penalty is isotropic: $\lambda \lVert \beta \rVert^2$ treats every direction in feature space identically, so the only anisotropy left in the problem comes from the features themselves.
The labels $y$ still matter, of course; they determine the coefficients.
One would think a method that uses labels and a method that ignores them would have nothing structural in common.
But the penalty forces the estimator into a coordinate system that the labels never defined, one that comes purely from the geometry of the features.
I think that division of labour is underappreciated.
We can see this by writing the design matrix $X$ in its singular value decomposition:
$$ X = U D V^\top $$where $U$ is $n \times p$ with orthonormal columns, $D = \text{diag}(d_1, \ldots, d_p)$ contains the singular values, and $V$ is $p \times p$ orthogonal. (A notational warning: from here on $d_j$ denotes a singular value of $X$, so the eigenvalues in the variance formula above correspond to $d_j^2$ here.)
The columns of $V$ are the principal component directions, and the singular values $d_j$ tell you how much variance each component captures.
The OLS estimator is:
$$ \hat{\beta}^{\text{OLS}} = (X^\top X)^{-1} X^\top y = V D^{-1} U^\top y $$Now consider what Ridge does. The Ridge estimator at penalty $\lambda$ is:
$$ \hat{\beta}^{\text{Ridge}} = (X^\top X + \lambda I)^{-1} X^\top y = V (D^2 + \lambda I)^{-1} D \, U^\top y $$To see the structure, rewrite both estimators in the principal component basis.
Define $\theta = V^\top \beta$, the coefficients in the rotated coordinate system aligned with the principal components.
Then:
$$ \hat{\theta}_j^{\text{OLS}} = \frac{d_j \, u_j^\top y}{d_j^2} = \frac{u_j^\top y}{d_j} $$ $$ \hat{\theta}_j^{\text{Ridge}} = \frac{d_j}{d_j^2 + \lambda} \, u_j^\top y $$The ratio between the Ridge and OLS estimates for the $j$-th principal component is:
$$ \frac{\hat{\theta}_j^{\text{Ridge}}}{\hat{\theta}_j^{\text{OLS}}} = \frac{d_j^2}{d_j^2 + \lambda} $$Ridge regression shrinks each principal component's contribution by a factor of $d_j^2 / (d_j^2 + \lambda)$.
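A quick check that the SVD form of the Ridge estimator and the per-component shrinkage factor both hold (synthetic data; numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 6
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 3.0

U, D, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(D) V^T

# Ridge by direct solve vs. the SVD formula V (D^2 + lam I)^{-1} D U^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_svd = Vt.T @ (D / (D**2 + lam) * (U.T @ y))
assert np.allclose(beta_ridge, beta_svd)

# per-component shrinkage relative to OLS is d_j^2 / (d_j^2 + lam)
theta_ols = (U.T @ y) / D
theta_ridge = Vt @ beta_ridge
assert np.allclose(theta_ridge / theta_ols, D**2 / (D**2 + lam))
```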
Now consider what happens if, instead of Ridge, you do Principal Component Regression (PCR): keep only the top $k$ principal components and run OLS in that subspace.
The shrinkage factor for PCR is:
$$ \frac{\hat{\theta}_j^{\text{PCR}}}{\hat{\theta}_j^{\text{OLS}}} = \begin{cases} 1 & \text{if } j \leq k \\ 0 & \text{if } j > k \end{cases} $$This is a hard threshold. You either keep a component completely or kill it entirely.
Ridge, by contrast, is a continuous relaxation of PCR: instead of a hard threshold, it applies smooth shrinkage to every component.
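The hard-threshold behaviour is easy to confirm numerically. In this sketch PCR is implemented the obvious way, regressing $y$ on the top-$k$ scores $X V_k$ (my construction, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 80, 6, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

U, D, Vt = np.linalg.svd(X, full_matrices=False)
theta_ols = (U.T @ y) / D

# PCR: ordinary least squares in the span of the top-k principal components
Z = X @ Vt[:k].T                                 # scores on the top-k components
gamma = np.linalg.lstsq(Z, y, rcond=None)[0]     # OLS in that subspace
beta_pcr = Vt[:k].T @ gamma
theta_pcr = Vt @ beta_pcr                        # back to the PC basis

# shrinkage factors: exactly 1 for kept components, 0 for dropped ones
ratio = theta_pcr / theta_ols
assert np.allclose(ratio[:k], 1.0)
assert np.allclose(ratio[k:], 0.0)
```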
A few practical considerations follow from this: Ridge never discards a direction outright, so even the lowest-variance components retain a small contribution; PCR's hard cutoff makes the fitted model jump as $k$ changes, while Ridge varies smoothly in $\lambda$; and in both cases the shrinkage pattern is determined entirely by $X$, never by $y$.
There's also a satisfying Bayesian interpretation.
Ridge corresponds to a Gaussian prior $\beta \sim \mathcal{N}(0, \tau^2 I)$ on the coefficients, where $\lambda = \sigma^2 / \tau^2$.
Because an isotropic Gaussian is rotation-invariant, the prior looks the same in the principal component basis: $\theta = V^\top \beta \sim \mathcal{N}(0, \tau^2 I)$. But the likelihood is not isotropic there: the data constrain $\theta_j$ with precision proportional to $d_j^2$.
The posterior combines an isotropic prior with an anisotropic likelihood, and the result is exactly the differential shrinkage we derived: components where the data speak loudly ($d_j^2 \gg \lambda$) are barely regularised; components where the prior dominates ($d_j^2 \ll \lambda$) get pulled to zero.
If you instead put a prior that's aligned with the principal components (say $\theta_j \sim \mathcal{N}(0, d_j^2 \tau^2)$, giving more prior variance to high-variance directions) you'd get different shrinkage.
The point is that Ridge's implicit inductive bias is to shrink along principal components, weighted by inverse variance.
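As a numerical check, the posterior mean under this Gaussian model coincides with the Ridge estimator at $\lambda = \sigma^2/\tau^2$ (synthetic data; numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
sigma2, tau2 = 0.5, 2.0
lam = sigma2 / tau2

# posterior mean under beta ~ N(0, tau^2 I), y | beta ~ N(X beta, sigma^2 I):
# (X^T X / sigma^2 + I / tau^2)^{-1} X^T y / sigma^2
post_mean = np.linalg.solve(X.T @ X / sigma2 + np.eye(p) / tau2, X.T @ y / sigma2)

# Ridge with lambda = sigma^2 / tau^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(post_mean, beta_ridge)
```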
This result might be completely trivial/obvious to many of you, but it has changed how I think about both methods.
In choosing between Ridge and PCR, you aren't choosing between fundamentally different philosophies. You're choosing between a smooth curve and a step function over the same set of principal components.