
VC bound

Preface: while reading Léon Bottou's February 2018 article Optimization Methods for Large-Scale Machine Learning, I found Equation (2.7) a bit odd, so I decided to collect the proofs of the VC bound that I have seen, organized into an intuitive proof and a rigorous proof:

  • The intuitive proof comes from Hsuan-Tien Lin's Machine Learning Foundations course at National Taiwan University.
  • The rigorous proof comes from Shai Shalev-Shwartz's book UNDERSTANDING MACHINE LEARNING.

p.s. This post assumes the reader at least knows what the VC dimension and the VC bound mean.


Motivation

Equation (2.7) in Léon Bottou's February 2018 article: \[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\le \mathcal{O}\left(\sqrt{\frac{1}{2m}\log\left(\frac{2}{\delta}\right)+\frac{d_{\mathcal{H}}}{m}\log\left(\frac{m}{d_{\mathcal{H}}}\right)}\right) \]where \(L_{S}(h)\) is the error on the training set \(S\), \(L_{\mathcal{D}}(h)\) is the generalization error, \(m\) is the size of the training set \(S\), \(\delta\) is the confidence parameter (i.e. the inequality holds with probability at least \(1-\delta\)), and \(d_{\mathcal{H}}\) is the VC dimension of the hypothesis class \(\mathcal{H}\). The inequality says:

  1. the supremum of the gap between generalization error and training error decreases as the VC dimension \(d_{\mathcal{H}}\) of the hypothesis class \(\mathcal{H}\) grows,
  2. and to keep the VC bound unchanged, the sample size \(m\) must grow in step with \(d_{\mathcal{H}}\).
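A quick numerical look at the \(d_{\mathcal{H}}\)-dependent term makes reading 1 concrete: for fixed \(m\), \(\frac{d_{\mathcal{H}}}{m}\log\frac{m}{d_{\mathcal{H}}}\) grows up to \(d_{\mathcal{H}}=m/e\) and then shrinks toward 0 as \(d_{\mathcal{H}}\) approaches \(m\). A rough sketch (the constant hidden in \(\mathcal{O}(\cdot)\) is taken to be 1, and the values \(m=1000\), \(\delta=0.05\) are arbitrary):

```python
import math

m, delta = 1000, 0.05

def term(d):
    """The d_H-dependent part under the square root in Bottou's (2.7)."""
    return (d / m) * math.log(m / d)

def bound(d):
    """The full right-hand side of (2.7), with the O(.) constant taken as 1."""
    return math.sqrt(math.log(2 / delta) / (2 * m) + term(d))

# non-monotone in d: rises until d = m/e (about 368 here), then falls toward 0
print(term(10), term(368), term(900))
print(bound(368), bound(900))
```

So in the regime \(d_{\mathcal{H}}>m/e\) the right-hand side really does shrink as \(d_{\mathcal{H}}\) grows, which is part of what makes (2.7) look odd at first sight.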

This conclusion baffled me at first sight, because it differs from the VC bounds I had seen before:


VC bound 1

In his Machine Learning Foundations course, Hsuan-Tien Lin gives a VC bound (Lecture 7, page 21): \[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\le \sqrt{\frac{8}{m}\log\frac{4\left(2m\right)^{d_{\mathcal{H}}}}{\delta}}\]which rearranges to:\[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\le \sqrt{\frac{8}{m}\log\frac{4}{\delta}+\frac{8d_{\mathcal{H}}}{m}\log\left(2m\right)} \]Both statements hold with probability at least \(1-\delta\). A closer look reveals some differences from Bottou's expression; below is the course's intuitive proof, in five steps:

  1. \(L_{\mathcal{D}}\) is intractable, so we seek to replace it with \(L_{S'}\): \[ \frac{1}{2}\mathbb{P}\left[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|>\varepsilon\right] \le \mathbb{P}\left[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{S'}(h)\right|>\frac{\varepsilon}{2}\right] \]This step is the hard part; Lemma 2 on page 42 of that article gives a proof. Here is my own intuitive reading: clearly \(\mathbb{E}_{S\sim \mathcal{D}}L_{S}(h)=L_{\mathcal{D}}(h)\), so crudely treat \(L_{S}(h)\) as normally distributed with mean \(L_{\mathcal{D}}(h)\). If there exists an \(h\) with \[\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|>\varepsilon\]then \(L_{S'}(h)\) will, with high probability, land close to \(L_{\mathcal{D}}(h)\), so with high probability:\[\left|L_{S}(h)-L_{S'}(h)\right|>\frac{\varepsilon}{2}\]
  2. We hope to find a polynomially growing function \(\tau_{\mathcal{H}}(m)\) (in fact \(\tau_{\mathcal{H}}(m)=\max_{|C|=m}\left|\mathcal{H}_{C}\right|\), where \(\mathcal{H}_{C}\) is \(\mathcal{H}\) restricted to \(C\)) satisfying \[ \begin{aligned}\mathbb{P}\left[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|>\varepsilon\right] & \le 2\mathbb{P}\left[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{S'}}(h)\right|>\frac{\varepsilon}{2}\right]\\\ & = 2\mathbb{P}\left[\exists h \in\mathcal{H}\ {\rm s.t.}\ \left|L_{S}(h)-L_{\mathcal{S'}}(h)\right|>\frac{\varepsilon}{2}\right]\\\ &\le 2\tau_{\mathcal{H}}(m)\mathbb{P}\left[{\rm fixed} \ h \ {\rm s.t.}\ \left|L_{S}(h)-L_{\mathcal{S'}}(h)\right|>\frac{\varepsilon}{2}\right] \end{aligned} \]
  3. Apply Hoeffding's inequality \(\mathbb{P}\left\{\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|>\varepsilon\right\}\le2\exp\left(-2\varepsilon^{2}m\right)\): \[ \mathbb{P}\left[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|>\varepsilon\right] \le 4\tau_{\mathcal{H}}(m)\exp\left(-2\left(\frac{\varepsilon}{2}\right)^{2}m\right) \]Note that \(\mathcal{D}\) has now been replaced by \(S'\); since \(S\) and \(S'\) each contain \(m\) samples, \(\tau_{\mathcal{H}}(m)\) is corrected to \(\tau_{\mathcal{H}}(2m)\). Moreover \[ \left|L_{S}(h)-L_{\mathcal{S'}}(h)\right|>\frac{\varepsilon}{2}\Leftrightarrow\left|L_{S}(h)-\left(L_{\mathcal{S'}}(h)+L_{S}(h)\right)/2\right|>\frac{\varepsilon}{4} \]so \(\frac{\varepsilon}{2}\) is corrected to \(\frac{\varepsilon}{4}\), giving: \[ \mathbb{P}\left[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|>\varepsilon\right] \le 4\tau_{\mathcal{H}}(2m)\exp\left(-2\left(\frac{\varepsilon}{4}\right)^{2}m\right) \]This inequality can also be found as Theorem 2 on page 42 of the article mentioned above.
  4. By induction one can show \(\tau_{\mathcal{H}}(m)\le\sum_{i=0}^{d_{\mathcal{H}}}\binom{m}{i}\); Lin coarsely enlarges this to \(\sum_{i=0}^{d_{\mathcal{H}}}\binom{m}{i}\le m^{d_{\mathcal{H}}}\) (valid for \(m\ge2\) and \(d_{\mathcal{H}}\ge2\)), which gives: \[ \mathbb{P}\left\{\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|>\varepsilon\right\}\le4 (2m)^{d_{\mathcal{H}}}\exp\left(-\frac{\varepsilon^{2}}{8}m\right) \]
  5. Setting \(\delta=4 (2m)^{d_{\mathcal{H}}}\exp\left(-\frac{\varepsilon^{2}}{8}m\right)\), i.e. \(\varepsilon=\sqrt{\frac{8}{m}\log\frac{4\left(2m\right)^{d_{\mathcal{H}}}}{\delta}}\), yields: \[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\le \sqrt{\frac{8}{m}\log\frac{4\left(2m\right)^{d_{\mathcal{H}}}}{\delta}} \]
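As a concrete illustration of the growth function in step 4, consider the toy class of 1D threshold hypotheses \(h_{\theta}(x)=\mathbb{1}[x>\theta]\) (VC dimension 1; this class is my own example, not from the course): on any \(m\) distinct points it realizes exactly \(m+1\) dichotomies, matching \(\sum_{i=0}^{1}\binom{m}{i}=m+1\). A small enumeration check (the point set is arbitrary):

```python
import math

def threshold_dichotomies(points):
    """Enumerate the distinct labelings of `points` by h(x) = 1 if x > theta."""
    xs = sorted(points)
    labelings = set()
    # one threshold below all points, one between each adjacent pair, one above all
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    for theta in cuts:
        labelings.add(tuple(int(x > theta) for x in xs))
    return labelings

m = 7
points = [0.5, 1.2, 2.0, 3.3, 4.1, 5.8, 6.6]
tau = len(threshold_dichotomies(points))           # m + 1 = 8 dichotomies
sauer = sum(math.comb(m, i) for i in range(0, 2))  # d_H = 1: C(m,0) + C(m,1)
print(tau, sauer)
```

Here the Sauer bound is tight: the class realizes exactly \(\binom{m}{0}+\binom{m}{1}\) dichotomies.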

That completes the argument. On closer inspection, the enlargement in step 4 is too loose; in fact, for \(m>d_{\mathcal{H}}+1\) one has \(\sum_{i=0}^{d_{\mathcal{H}}}\binom{m}{i}\le \left(\frac{em}{d_{\mathcal{H}}}\right)^{d_{\mathcal{H}}}\) (see Appendix A.5 of UNDERSTANDING MACHINE LEARNING for a proof), and the VC bound becomes: \[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\le \sqrt{\frac{8}{m}\log\frac{4(\frac{2em}{d_{\mathcal{H}}})^{d_{\mathcal{H}}}}{\delta}} \]which rearranges to: \[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\le \sqrt{\frac{8}{m}\log\left(\frac{4}{\delta}\right)+\frac{8d_{\mathcal{H}}}{m}\log\left(\frac{2em}{d_\mathcal{H}}\right)} \] p.s. Since obviously \(\mathbb{P}\left[\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|>\varepsilon\right]\le\mathbb{P}\left[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|>\varepsilon\right]\), Theorems 12.2 and 12.3 of the watermelon book (Zhou Zhihua's Machine Learning) follow easily.
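To see how much the tighter Sauer-based estimate buys, one can evaluate both forms of the bound numerically (the values \(m=10^{4}\), \(d_{\mathcal{H}}=10\), \(\delta=0.05\) are arbitrary):

```python
import math

def eps_loose(m, d, delta=0.05):
    # VC bound via the coarse estimate tau(2m) <= (2m)^d
    return math.sqrt(8 / m * math.log(4 * (2 * m) ** d / delta))

def eps_sauer(m, d, delta=0.05):
    # VC bound via tau(2m) <= (2em/d)^d, valid for 2m > d + 1
    return math.sqrt(8 / m * math.log(4 * (2 * math.e * m / d) ** d / delta))

m, d = 10000, 10
print(eps_loose(m, d), eps_sauer(m, d))
```

The improvement enters only inside the logarithm, so it is modest at these values; the point of the Sauer form is the \(\log\frac{m}{d_{\mathcal{H}}}\) shape rather than the constant.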


VC bound 2

Theorem 6.11 of Shai Shalev-Shwartz's UNDERSTANDING MACHINE LEARNING also gives a VC bound: \[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\le\frac{2+\sqrt{\log\left( \tau_{\mathcal{H}}(2m) \right)}}{\delta\sqrt{2m}} \]Again, the inequality holds with probability at least \(1-\delta\). The proof proceeds in eight steps:
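Before walking through the proof, note how this bound differs from VC bound 1: \(\delta\) now enters as \(1/\delta\) outside the square root rather than as \(\log(1/\delta)\) inside it (a consequence of using Markov's inequality rather than an exponential tail bound), so for small \(\delta\) this bound is typically much looser. A quick numerical comparison, taking \(\tau_{\mathcal{H}}(2m)\le(2m)^{d_{\mathcal{H}}}\) and the arbitrary values \(m=10^{4}\), \(d_{\mathcal{H}}=10\), \(\delta=0.05\):

```python
import math

m, d, delta = 10000, 10, 0.05
log_tau = d * math.log(2 * m)  # log tau(2m), using tau(2m) <= (2m)^d

# VC bound 1 (Lin's form) vs VC bound 2 (the Theorem 6.11 form above)
bound1 = math.sqrt(8 / m * (math.log(4 / delta) + log_tau))
bound2 = (2 + math.sqrt(log_tau)) / (delta * math.sqrt(2 * m))
print(bound1, bound2)
```

At these values bound 2 is several times larger, driven entirely by the \(1/\delta\) factor.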

  1. Lemma 1: \[ \mathop{\mathbb{E}}_{S\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\right]\le\frac{2+\sqrt{\log\left( \tau_{\mathcal{H}}(2m) \right)}}{\sqrt{2m}} \]Assume Lemma 1 for now. Since the random variable \(\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\) is clearly nonnegative, Markov's inequality \(\mathbb{P}[X\ge a]\le\frac{\mathbb{E}[X]}{a}\) directly yields the conclusion: \[ \mathbb{P}\left[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\ge\frac{2+\sqrt{\log\left( \tau_{\mathcal{H}}(2m) \right)}}{\delta\sqrt{2m}}\right] \le \frac{\mathop{\mathbb{E}}_{S\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\right]}{\frac{2+\sqrt{\log\left( \tau_{\mathcal{H}}(2m) \right)}}{\delta\sqrt{2m}}}\le \delta \]\[ \mathbb{P}\left[\sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\le\frac{2+\sqrt{\log\left( \tau_{\mathcal{H}}(2m) \right)}}{\delta\sqrt{2m}}\right] \ge 1-\delta \]

  2. The problem reduces to proving Lemma 1. Consider its left-hand side: \[ \begin{aligned} \mathop{\mathbb{E}}_{S\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\right] & = \mathop{\mathbb{E}}_{S\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}}\left|\mathop{\mathbb{E}}_{S'\sim\mathcal{D}^{m}}L_{S'}(h)-L_{S}(h)\right|\right] \\\ & \le \mathop{\mathbb{E}}_{S\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}}\mathop{\mathbb{E}}_{S'\sim\mathcal{D}^{m}}\left|L_{S'}(h)-L_{S}(h)\right|\right]\\\ & \le \mathop{\mathbb{E}}_{S\sim\mathcal{D}^{m}}\left[ \mathop{\mathbb{E}}_{S'\sim\mathcal{D}^{m}}\sup_{h\in\mathcal{H}}\left|L_{S'}(h)-L_{S}(h)\right|\right]\\\ & = \mathop{\mathbb{E}}_{S,S'\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}}\left|L_{S'}(h)-L_{S}(h)\right|\right]\\\ & = \mathop{\mathbb{E}}_{S,S'\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|\right] \end{aligned} \] Here the first equality uses \(L_{\mathcal{D}}(h)=\mathbb{E}_{S'\sim\mathcal{D}^{m}}L_{S'}(h)\); the second step (an inequality) moves the absolute value inside the inner expectation (Jensen's inequality); the third step bounds the supremum of an expectation by the expectation of the supremum; the fourth merges the two expectations; and the fifth writes \(L_{S}(h)\) out as an average, where \(l(h,x_{i})\) is the 0-1 loss of \(h\) on \(x_{i}\).

  3. Introduce random signs \(\sigma\in\{\pm 1\}^{m}\); the right-hand side of step 2 equals: \[ \begin{aligned} \mathop{\mathbb{E}}_{S,S'\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|\right] & = \mathop{\mathbb{E}}_{\sigma\sim U_{\pm}^{m}}\mathop{\mathbb{E}}_{S,S'\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|\right]\\\ & = \mathop{\mathbb{E}}_{S,S'\sim\mathcal{D}^{m}} \mathop{\mathbb{E}}_{\sigma\sim U_{\pm}^{m}}\left[ \sup_{h\in\mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|\right] \end{aligned} \]Here the first equality holds because, with \(S,S'\) i.i.d., multiplying the \(i\)-th difference by \(\sigma_{i}=-1\) amounts to swapping \(x_{i}\leftrightarrow x_{i}'\), which does not change the distribution; the second equality exchanges the two expectations.

  4. Consider the right-hand side of step 3 with \(S,S'\) fixed, and let \(C=S\cup S'\). Only the behavior of \(\mathcal{H}\) on \(C\) matters, so \(\mathcal{H}\) can be restricted to \(C\): \[ \mathop{\mathbb{E}}_{\sigma\sim U_{\pm}^{m}}\left[ \sup_{h\in\mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|\right]= \mathop{\mathbb{E}}_{\sigma\sim U_{\pm}^{m}}\left[ \max_{h\in\mathcal{H}_{C}} \frac{1}{m}\left|\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|\right] \]

  5. Fix \(h\in\mathcal{H}_{C}\) and write \(\theta_{h} \triangleq \frac{1}{m}\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\). Then \[ \begin{aligned}\mathbb{E}\theta_{h} & =\mathbb{E}\frac{1}{m}\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right) \\\ & = \frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\\\ & = 0 \end{aligned} \]and the \(\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\) are independent random variables taking values in \([-1,1]\) (in fact in \(\{-1,0,1\}\)), so Hoeffding's inequality applies, giving for all \(\rho>0\): \[ \mathbb{P}\left[\left|\theta_{h}\right|>\rho\right]\le2\exp(-2m\rho^{2}) \]\[ \mathbb{P}\left[\frac{1}{m}\left|\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|>\rho\right]\le2\exp(-2m\rho^{2}) \]

  6. Note that \(\mathcal{H}_{C}\) now contains only finitely many hypotheses, so the union bound (in short, \(\mathbb{P}(A\cup B)\le\mathbb{P}(A)+\mathbb{P}(B)\)) gives: \[ \mathbb{P}\left[\max_{h\in\mathcal{H}_{C}}\left|\theta_{h}\right|>\rho\right]\le2\left|\mathcal{H}_{C}\right|\exp(-2m\rho^{2}) \]\[ \mathbb{P}\left[\max_{h\in\mathcal{H}_{C}}\frac{1}{m}\left|\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|>\rho\right]\le2\left|\mathcal{H}_{C}\right|\exp(-2m\rho^{2}) \] Step 7 uses the following Lemma 2: > Let \(X\) be a random variable and \(x\in\mathbb{R}\) a scalar. Suppose there exist \(a>0\) and \(b>e\) such that \(\mathbb{P}\left[\left|X-x\right|>t\right]\le2be^{-t^2/a^2}\) for all \(t>0\). Then \[ \mathbb{E}\left[\left|X-x\right|\right]\le a(2+\sqrt{\log{b}}) \]

  7. The proof of Lemma 2 can be found in Appendix A.4 of UNDERSTANDING MACHINE LEARNING. Substituting \(X=\max_{h\in\mathcal{H}_{C}}\left|\theta_{h}\right|\), \(x=0\), \(b=\left|\mathcal{H}_{C}\right|\), \(a=1/\sqrt{2m}\) yields: \[ \mathbb{E}\left[\max_{h\in\mathcal{H}_{C}}\left|\theta_{h}\right|\right]\le\frac{2+\sqrt{\log\left(\left|\mathcal{H}_{C}\right|\right)}}{\sqrt{2m}} \]i.e. the right-hand side of step 4 satisfies \[ \mathop{\mathbb{E}}_{\sigma\sim U_{\pm}^{m}}\left[ \max_{h\in\mathcal{H}_{C}} \frac{1}{m}\left|\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|\right]\le\frac{2+\sqrt{\log\left(\left|\mathcal{H}_{C}\right|\right)}}{\sqrt{2m}} \]Then, by the definition \(\tau_{\mathcal{H}}(m)=\max_{|C|=m}\left|\mathcal{H}_{C}\right|\) and the fact that \(C\) contains at most \(2m\) samples, the right-hand side of step 3 satisfies: \[ \mathop{\mathbb{E}}_{S,S'\sim\mathcal{D}^{m}} \mathop{\mathbb{E}}_{\sigma\sim U_{\pm}^{m}}\left[ \sup_{h\in\mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^{m}\sigma_{i}\left(l(h,x_{i}')-l(h,x_{i})\right)\right|\right] \le\frac{2+\sqrt{\log\left(\tau_{\mathcal{H}}(2m)\right)}}{\sqrt{2m}} \]and by step 2, \[ \mathop{\mathbb{E}}_{S\sim\mathcal{D}^{m}}\left[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\right]\le\frac{2+\sqrt{\log\left(\tau_{\mathcal{H}}(2m)\right)}}{\sqrt{2m}} \]which proves Lemma 1.
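The symmetrization identity used in steps 2-3 can be sanity-checked by Monte Carlo for a single fixed \(h\): model its 0-1 losses on \(S\) and \(S'\) as i.i.d. Bernoulli draws (a toy model of my own, not from the book) and compare the expectation with and without the random signs \(\sigma\):

```python
import random

random.seed(0)
m, trials, p = 5, 100000, 0.3

def avg_abs_diff(use_sigma):
    """Monte Carlo estimate of E |(1/m) sum_i (l(h,x_i') - l(h,x_i))|,
    optionally with random signs sigma_i inserted."""
    total = 0.0
    for _ in range(trials):
        # 0-1 losses of a fixed h on fresh i.i.d. samples S and S'
        d = [(random.random() < p) - (random.random() < p) for _ in range(m)]
        if use_sigma:
            d = [x if random.random() < 0.5 else -x for x in d]
        total += abs(sum(d)) / m
    return total / trials

lhs = avg_abs_diff(False)  # without sigma
rhs = avg_abs_diff(True)   # with sigma
print(lhs, rhs)
```

Both estimates agree to Monte Carlo precision, as the identity predicts: inserting \(\sigma_{i}\) only swaps the roles of \(x_{i}\) and \(x_{i}'\), which leaves the joint distribution unchanged.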

On careful scrutiny, the proof of Lemma 2 used in step 7 seems flawed; here is a corrected statement, followed by its proof. Lemma 2 (corrected): > Let \(X\) be a random variable and \(x\in\mathbb{R}\) a scalar. Suppose there exist \(a>0\) and \(b>e^{4}\) such that \(\mathbb{P}\left[\left|X-x\right|>t\right]\le2be^{-t^2/a^2}\) for all \(t>0\). Then \[ \mathbb{E}\left[\left|X-x\right|\right]\le a(2e^{2\sqrt{\log(b)}-1}+\sqrt{\log{b}}) \]

Proof: let \(t_{i}=a\left(i+\sqrt{\log(b)}\right),i=0,1,\ldots\); then: \[ \begin{aligned}\mathbb{E}\left[\left|X-x\right|\right]& \le a\sqrt{\log(b)}\mathbb{P}\left[\left|X-x\right|\le a\sqrt{\log(b)}\right]+\sum_{i=1}^{+\infty}t_{i}\mathbb{P}\left[t_{i-1}<\left|X-x\right|\le t_{i}\right]\\\ & \le a\sqrt{\log(b)}+\sum_{i=1}^{+\infty}t_{i}\mathbb{P}\left[\left|X-x\right|> t_{i-1}\right]\\\ & \le a\sqrt{\log(b)}+\sum_{i=1}^{+\infty}a\left(i+\sqrt{\log(b)}\right)2be^{-\left(i-1+\sqrt{\log(b)}\right)^{2}}\\\ & = a\sqrt{\log(b)}+2ab\sum_{i=1}^{+\infty}\left(i+\sqrt{\log(b)}\right)e^{-\left(i-1+\sqrt{\log(b)}\right)^{2}}\\\ & \le a\sqrt{\log(b)}+2ab\int_{\sqrt{\log(b)}}^{+\infty} xe^{-\left(x-1\right)^{2}} dx\\\ & = a\sqrt{\log(b)}+2ab\int_{\sqrt{\log(b)}-1}^{+\infty} (y+1)e^{-y^{2}} dy\\\ & \le a\sqrt{\log(b)}+4ab\int_{\sqrt{\log(b)}-1}^{+\infty} ye^{-y^{2}} dy\\\ & = a\sqrt{\log(b)}+2ab\left[-e^{-y^2}\right]_{\sqrt{\log(b)}-1}^{+\infty}\\\ & = a\sqrt{\log(b)}+2abe^{-\log(b)+2\sqrt{\log(b)}-1}\\\ & = a\sqrt{\log(b)}+2ae^{2\sqrt{\log(b)}-1}\\\ & = a(2e^{2\sqrt{\log(b)}-1}+\sqrt{\log{b}}) \end{aligned} \] Here the first inequality splits the expectation over the intervals \((t_{i-1},t_{i}]\) and bounds \(\left|X-x\right|\) by each right endpoint; the second enlarges each probability to a tail probability; the third substitutes the assumed tail bound; the fourth pulls the constant out of the sum; the fifth bounds the sum by an integral, using that \(xe^{-\left(x-1\right)^{2}}\) is decreasing on the relevant range (the original proof instead wrongly enlarges the lower limit of this integral to \(1+\sqrt{\log(b)}\)); the sixth is the change of variables \(y=x-1\); the seventh uses \(b>e^{4}\), so the lower limit exceeds 1 and \(y+1\) can be enlarged to \(2y\); the eighth evaluates the antiderivative of \(ye^{-y^{2}}\); and the remaining equalities are simplification. Step 7 above should therefore be amended accordingly, and the VC bound becomes: \[ \sup_{h\in\mathcal{H}}\left|L_{S}(h)-L_{\mathcal{D}}(h)\right|\le\frac{2e^{2\sqrt{\log\left(\tau_{\mathcal{H}}(2m)\right)}-1}+\sqrt{\log\left( \tau_{\mathcal{H}}(2m) \right)}}{\delta\sqrt{2m}} \]p.s. The bound above is admittedly very ugly; I was unable to reduce the extra factor from the lemma to anything simpler, and suggestions from readers are most welcome. Finally, note that this VC bound requires the hypothesis class \(\mathcal{H}\) to satisfy \(b=\left|\mathcal{H}_{C}\right|>e^{4}\).
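The fifth through eighth steps of the proof collapse into the single estimate \(\int_{\sqrt{\log(b)}}^{+\infty}xe^{-\left(x-1\right)^{2}}dx\le e^{-\left(\sqrt{\log(b)}-1\right)^{2}}\), which can be verified numerically for a sample value of \(b>e^{4}\) (\(b=e^{5}\) below is arbitrary):

```python
import math

b = math.exp(5.0)            # any b > e**4 satisfies the lemma's hypothesis
L = math.sqrt(math.log(b))   # lower integration limit sqrt(log b), about 2.236

# midpoint-rule integration of x*exp(-(x-1)^2) over [L, L+12];
# the tail beyond L+12 is smaller than exp(-170) and can be ignored
step = 1e-4
integral = 0.0
x = L + step / 2
while x < L + 12:
    integral += x * math.exp(-(x - 1) ** 2) * step
    x += step

closed_form = math.exp(-(L - 1) ** 2)  # the proof's upper bound
print(integral, closed_form)
```

The numerical integral comes out strictly below the closed-form bound, with the slack coming from the \(y+1\le 2y\) enlargement.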