An Identity For Kernel Ridge Regression
This paper describes an approach to proving theoretical guarantees for online kernelized Ridge Regression in the form of an equality. It approaches the results described in Zhdanov and Vovk (2009) from a probabilistic perspective.
The method of kernel Ridge Regression with a kernel $\mathcal{K}$ and a real regularisation parameter $a>0$ suggests the function $f_\mathrm{RR}(x)=Y'(K+aI)^{-1}k(x)$, where $Y=(y_1,y_2,\ldots,y_T)'$ is the column vector of outcomes,
$$
K= \begin{pmatrix}
\mathcal{K}(x_1,x_1) & \mathcal{K}(x_1,x_2) & \ldots & \mathcal{K}(x_1,x_T)\\
\mathcal{K}(x_2,x_1) & \mathcal{K}(x_2,x_2) & \ldots & \mathcal{K}(x_2,x_T)\\
\vdots & \vdots & \ddots & \vdots\\
\mathcal{K}(x_T,x_1) & \mathcal{K}(x_T,x_2) & \ldots & \mathcal{K}(x_T,x_T)
\end{pmatrix}
$$
is the kernel matrix and
$\displaystyle{ k(x)= \begin{pmatrix} \mathcal{K}(x_1,x)\\ \mathcal{K}(x_2,x)\\ \vdots \\ \mathcal{K}(x_T,x) \end{pmatrix}\enspace.}$
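As a concrete illustration of this prediction formula, here is a minimal sketch (not code from the paper); the Gaussian (RBF) kernel, the function names `rbf_kernel` and `krr_predict`, and the toy data are assumptions made purely for the example.

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    # an assumed Gaussian (RBF) kernel; any other kernel K(x, x') would do
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def krr_predict(X, y, x_new, a=1.0, kernel=rbf_kernel):
    """Kernel Ridge Regression prediction f_RR(x) = Y'(K + aI)^{-1} k(x)."""
    T = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(T)] for i in range(T)])
    k = np.array([kernel(X[i], x_new) for i in range(T)])
    return y @ np.linalg.solve(K + a * np.eye(T), k)

# toy usage on made-up data
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.1, 0.9, 2.1])
print(krr_predict(X, y, np.array([1.5]), a=0.5))
```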
Theorem. Take a kernel $\mathcal{K}$ on a domain $X$ and a parameter $a>0$. Let $\mathcal{F}$ be the RKHS for the kernel $\mathcal{K}$. For a sample $(x_1,y_1),(x_2,y_2),\ldots,(x_T,y_T)$ let $\gamma_1^\mathrm{RR},\gamma_2^\mathrm{RR},\ldots,\gamma_T^\mathrm{RR}$ be the predictions output by Ridge Regression with the kernel $\mathcal{K}$ and the parameter $a$ in the on-line mode. Then
$\displaystyle{ \sum_{t=1}^T\frac{(\gamma^\mathrm{RR}_t-y_t)^2}{1+d_t/a}=\min_{f\in\mathcal{F}}\left( \sum_{t=1}^T(f(x_t)-y_t)^2+a\|f\|^2_\mathcal{F}\right) =aY'(K_{T+1}+aI)^{-1}Y\enspace,}$
where $d_t=\mathcal{K}(x_t,x_t)-k'_t(x_t)(K_t+aI)^{-1}k_t(x_t)>0$; here $K_t$ and $k_t(x)$ denote the kernel matrix and the vector $k(x)$ built from the first $t-1$ examples $x_1,x_2,\ldots,x_{t-1}$, so that $K_{T+1}$ coincides with the matrix $K$ above.
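The identity can be checked numerically. The sketch below is an illustration, not code from the paper: it runs Ridge Regression in the on-line mode on random data and compares the left-hand sum with $aY'(K_{T+1}+aI)^{-1}Y$; the RBF kernel and the random data are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
a, T = 0.7, 8
X = rng.normal(size=(T, 2))
y = rng.normal(size=T)
kern = lambda u, v: np.exp(-np.sum((u - v) ** 2))  # assumed RBF kernel

# kernel matrix on all T examples (K_{T+1} in the theorem's notation)
K = np.array([[kern(X[i], X[j]) for j in range(T)] for i in range(T)])

# left-hand side: online Ridge Regression predictions from the first t-1 examples
lhs = 0.0
for t in range(T):
    K_t, k_t = K[:t, :t], K[:t, t]
    if t > 0:
        sol = np.linalg.solve(K_t + a * np.eye(t), k_t)
        gamma, d_t = y[:t] @ sol, K[t, t] - k_t @ sol
    else:
        gamma, d_t = 0.0, K[0, 0]
    lhs += (gamma - y[t]) ** 2 / (1 + d_t / a)

# right-hand side: a Y'(K_{T+1} + aI)^{-1} Y, which also equals the regularised-loss minimum
rhs = a * y @ np.linalg.solve(K + a * np.eye(T), y)
print(lhs, rhs)  # the two values agree up to rounding error
```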
The idea of the proof is elegant: the density of the joint distribution of the variables $(y_{x_1},y_{x_2},\ldots,y_{x_T})$ at the point $(y_1,y_2,\ldots,y_T)$ is calculated in three different ways: by decomposing the density into a chain of conditional densities, by marginalisation, and, finally, by direct calculation. Each method yields a different expression corresponding to one term of the identity. Since all three expressions represent the same density, they must be equal.
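To illustrate the direct-calculation step, here is a sketch under a standard probabilistic reading of Ridge Regression (an assumption made for this note: the outcomes are zero-mean jointly Gaussian with covariance $\sigma^2\left(I+\frac1a K_{T+1}\right)$ for some noise variance $\sigma^2$; the paper's exact scaling may differ). The density of such a vector at the point $Y=(y_1,\ldots,y_T)'$ is
$$
p(Y)=\frac{1}{(2\pi\sigma^2)^{T/2}\det^{1/2}\!\left(I+\frac1a K_{T+1}\right)}\exp\!\left(-\frac{a}{2\sigma^2}\,Y'(K_{T+1}+aI)^{-1}Y\right)\enspace,
$$
since $\left(I+\frac1a K_{T+1}\right)^{-1}=a(K_{T+1}+aI)^{-1}$. The exponent already contains the term $aY'(K_{T+1}+aI)^{-1}Y$ of the identity, and the determinant in the normalising constant is the quantity that reappears in the corollary below; the chain-of-conditionals and marginalisation calculations produce the other two terms analogously.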
The following corollary, bounding the cumulative square loss of clipped kernel Ridge Regression, is also proven.
Corollary. Take a kernel $\mathcal{K}$ on a domain $X$ and a parameter $a>0$. Let $\mathcal{F}$ be the RKHS for the kernel $\mathcal{K}$. For a sample $(x_1,y_1),(x_2,y_2),\ldots,(x_T,y_T)$ such that $y_t\in [-Y,Y]$ for all $t=1,2,\ldots,T$, let $\gamma_1^{\mathrm{RR},Y},\gamma_2^{\mathrm{RR},Y},\ldots,\gamma_T^{\mathrm{RR},Y}$ be the predictions output by clipped Ridge Regression (Ridge Regression with its predictions truncated to the interval $[-Y,Y]$) with the kernel $\mathcal{K}$ and the parameter $a$ in the on-line mode. Then
$\displaystyle{\sum_{t=1}^T(\gamma^{\mathrm{RR},Y}_t-y_t)^2\le \min_{f\in\mathcal{F}}\left( \sum_{t=1}^T(f(x_t)-y_t)^2+a\|f\|^2_\mathcal{F}\right)+4Y^2\ln\det\left(I+\frac1a K_{T+1}\right)\enspace,}$
where $K_{T+1}$ is as above.
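The corollary's bound can likewise be checked numerically. The sketch below is an illustration under the same assumptions as before (RBF kernel, random data with outcomes in $[-Y,Y]$), with the regularised-loss minimum computed as $aY'(K_{T+1}+aI)^{-1}Y$ using the theorem above.

```python
import numpy as np

rng = np.random.default_rng(1)
a, T, Y_bound = 0.5, 10, 1.0
X = rng.normal(size=(T, 2))
y = np.clip(rng.normal(size=T), -Y_bound, Y_bound)  # outcomes lie in [-Y, Y]
kern = lambda u, v: np.exp(-np.sum((u - v) ** 2))    # assumed RBF kernel
K = np.array([[kern(X[i], X[j]) for j in range(T)] for i in range(T)])

# cumulative square loss of clipped online Ridge Regression
loss = 0.0
for t in range(T):
    K_t, k_t = K[:t, :t], K[:t, t]
    gamma = y[:t] @ np.linalg.solve(K_t + a * np.eye(t), k_t) if t > 0 else 0.0
    loss += (np.clip(gamma, -Y_bound, Y_bound) - y[t]) ** 2

# right-hand side of the corollary: regularised-loss minimum plus the log-determinant term
reg_min = a * y @ np.linalg.solve(K + a * np.eye(T), y)
log_det = np.linalg.slogdet(np.eye(T) + K / a)[1]
bound = reg_min + 4 * Y_bound ** 2 * log_det
print(loss, "<=", bound)
```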
- Fedor Zhdanov and Vladimir Vovk. Competing with Gaussian linear experts. Technical report, arXiv:0910.4683 [cs.LG], arXiv.org e-Print archive, 2009.
- Fedor Zhdanov and Yuri Kalnishkan. An identity for kernel ridge regression. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, 2010.