Suppose there are $n$ samples and $p$ predictor variables.
Consider the linear regression model:
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n)$$
where $y$ is the response (dependent) variable, $X$ is the $n\times p$ sample (design) matrix, and $\varepsilon$ is a normally distributed error term with variance $\sigma^2$.
Ordinary least squares (OLS):
- Loss function:
$$L(\beta) = ||y-X\beta||^2$$
- Coefficient estimate:
$$\hat{\beta}_{OLS} = \left(X^T X \right)^{-1} X^T y$$
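As a quick numerical illustration (a minimal sketch with simulated data; all names are ours), the OLS estimate can be computed directly from the normal equations:

```r
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)            # n x p design matrix
beta_true <- c(2, -1, 0.5, 0, 0)
y <- drop(X %*% beta_true) + rnorm(n)      # y = X beta + noise
beta_ols <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T y
```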
The $L_p$-norm:
$$L_p(\vec{x}) = \parallel \vec{x} \parallel_p = \left( \sum_{i=1}^n \mid x_i \mid^p \right)^{1/p}, \qquad p\geq 1$$
where $\vec{x} = ( x_1, x_2, \cdots, x_n )$. Common special cases:
- $p = -\infty$:
$$\parallel \vec{x} \parallel_{-\infty} = \lim_{p \rightarrow -\infty} \left( \sum_{i=1}^n \mid x_i \mid^p \right)^{1/p} = \min_i \mid x_i \mid$$
- $p = 0$ (strictly speaking, not a norm):
$$\parallel \vec{x} \parallel_0 = \sharp \lbrace i : x_i\neq 0 \rbrace$$
i.e., the number of nonzero elements of $\vec{x}$
- $p = 1$ (also called the Manhattan distance):
$$\parallel \vec{x} \parallel_1 = \sum_{i=1}^n \mid x_i \mid$$
- $p = 2$ (also called the Euclidean distance):
$$\parallel \vec{x} \parallel_2 = \sqrt{\sum_{i=1}^n \mid x_i \mid^2}$$
- $p = +\infty$ (the infinity/maximum norm):
$$\parallel \vec{x} \parallel_{+\infty} = \lim_{p\rightarrow +\infty} \left( \sum_{i=1}^n \mid x_i \mid^p \right)^{1/p} = \max_i \mid x_i \mid$$
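In R these special cases are one-liners (an arbitrary example vector):

```r
x <- c(3, -4, 0, 1)
sum(abs(x))      # L1 norm: 8
sqrt(sum(x^2))   # L2 norm: sqrt(26)
max(abs(x))      # L-infinity norm: 4
sum(x != 0)      # "L0": number of nonzero elements, 3
```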
- When $p > n$ (more predictors than samples), OLS is not applicable (it overfits easily)
Penalized Least Squares
Fan, J., & Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96(456), 1348-1360.
Standardize the design matrix so that $X^TX = I_p$ (an orthonormal design).
Objective function:
$$
\begin{aligned}
L( \beta ) &= \parallel y - X \beta \parallel_2^2 + \lambda J( \beta ) \\
&= \parallel y-X \beta \parallel_2^2 + \lambda \sum_{j=1}^p p_j ( \mid \beta_j \mid ) \\
&= \sum_{i=1}^n \left( y_i - \sum_{j=1}^p x_{ ij } \beta_j \right)^2 + \lambda \sum_{j=1}^p p_j ( \mid \beta_j \mid ) \\
&= \sum_{i=1}^n \left( y_i - \sum_{j=1}^p x_{ ij } \beta_j \right)^2 + \sum_{j=1}^p p_\lambda( \mid \beta_j \mid ) \\
&= \parallel y-X \beta \parallel^2_2 + \sum_{j=1}^p p_\lambda ( \mid \beta_j \mid ) \\
&= \parallel y-X \hat{\beta}_{OLS} \parallel^2_2 + \parallel \hat{\beta}_{OLS} - \beta \parallel^2_2 + \sum_{j=1}^p p_\lambda ( \mid \beta_j \mid ) \\
&= \parallel y - X \hat{\beta}_{OLS} \parallel^2_2 + \sum_{j=1}^p \left\lbrace ( \hat{\beta}_{oj} - \beta_j )^2 + p_\lambda ( \mid \beta_j \mid ) \right\rbrace \\
\end{aligned}
$$
- Assume the penalty function is the same for all coefficients, i.e. $p(\mid \cdot \mid)$
- Further, write $\lambda p( \mid \cdot \mid )$ as $p_\lambda( \mid \cdot \mid )$
- $\hat{\beta}_{OLS}$ is the ordinary least squares estimate of the regression coefficients, and $\hat{\beta}_{oj}$ is the $j$-th element of $\hat{\beta}_{OLS}$
$$\hat{\beta}_{OLS} = \left( X^T X \right)^{-1} X^T y = X^T y$$
Since $\parallel y-X \hat{\beta}_{OLS} \parallel^2_2$ is constant with respect to $\beta$, we have
$$\arg \min_{\beta_j} \parallel y-X\beta \parallel^2_2 + p_\lambda ( \mid \beta_j \mid ) \Leftrightarrow \arg \min_{\beta_j} \left\lbrace ( \hat{\beta}_{oj} - \beta_j )^2 + p_\lambda( \mid \beta_j \mid ) \right\rbrace$$
$$\hat{\beta}_j = \arg \min_{\beta_j} \left\lbrace ( \hat{\beta}_{oj} - \beta_j )^2 + p_\lambda( \mid \beta_j \mid ) \right\rbrace$$
- For ridge regression,
$$p_\lambda ( \mid \beta_j \mid ) = \lambda \mid \beta_j \mid^2, \qquad \hat{\beta}^{Ridge}_{j} = \frac{ 1 }{ 1 + \lambda } \hat{\beta}_{oj}$$
- For the LASSO,
$$p_\lambda(\mid \beta_j \mid) = \lambda \mid \beta_j \mid, \qquad \hat{\beta}^{LASSO}_{j} = \mathrm{sgn}( \hat{\beta}_{oj} ) \left( \mid \hat{\beta}_{oj} \mid - \frac{ \lambda }{ 2 } \right)_{+}$$
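These componentwise solutions are easy to check numerically (a minimal sketch; `beta_ols` denotes the OLS coefficient vector under the orthonormal design assumed above):

```r
# Componentwise solutions when X^T X = I_p
ridge_shrink   <- function(beta_ols, lambda) beta_ols / (1 + lambda)
soft_threshold <- function(beta_ols, lambda)
  sign(beta_ols) * pmax(abs(beta_ols) - lambda / 2, 0)

ridge_shrink(c(1.5, -0.3, 0.8), lambda = 1)    # all coefficients halved
soft_threshold(c(1.5, -0.3, 0.8), lambda = 1)  # 1.0, 0.0, 0.3 (exact zeros appear)
```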
Ridge Regression
When the number of predictors exceeds the sample size, or the predictors exhibit multicollinearity, ridge regression still yields a stable model.
Hoerl, A., & Kennard, R. (1970). Ridge Regression: Applications to Nonorthogonal Problems. Technometrics, 12(1), 69-82. doi:10.2307/1267352
- The ridge estimator is a type of shrinkage estimator: shrinkage estimators theoretically produce new estimators that are shrunk closer to the 'true' population parameters.
- Ridge regression is a special case of Tikhonov regularization.
- Ridge regression uses L2 regularization.
Objective function:
\begin{aligned}
L_{Ridge}(\beta) &= ||y-X\beta ||^2_2 + \lambda ||\beta||^2_2 \\
&= \sum_{i=1}^n \left( y_i - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \\
\end{aligned}
- Coefficient estimate:
\begin{aligned}
\hat{\beta}_{Ridge} &= \arg \min_{\beta} \sum_{i=1}^n \left(y_i - \sum_{j=1}^p x_{ij} \beta_j \right)^2 \quad \mathrm{s.t.} \quad \sum_{j=1}^p \beta_j^2 \leq s \\
&= \left( X^T X+\lambda I \right)^{-1} X^T y \\
\end{aligned}
where $I$ is the identity matrix and $\lambda$ is the regularization (penalty) parameter.
- As $\lambda \rightarrow 0$, $\hat{\beta}_{Ridge} \rightarrow \hat{\beta}_{OLS}$
- As $\lambda \rightarrow \infty$, $\hat{\beta}_{Ridge} \rightarrow 0$
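The closed form translates directly into R (a sketch; `X` and `y` are the simulated data from the OLS example above):

```r
# Ridge estimate: (X^T X + lambda I)^{-1} X^T y
ridge_est <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
ridge_est(X, y, lambda = 0)     # lambda -> 0: recovers the OLS estimate
ridge_est(X, y, lambda = 1e6)   # lambda -> Inf: coefficients shrink toward 0
```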
Bias-Variance Trade-off
\begin{aligned}
\mathrm{Bias}( \hat{\beta}_{Ridge} ) &= -\lambda \left( X^T X+\lambda I \right)^{-1} \beta \\
\mathrm{Var}( \hat{\beta}_{Ridge} ) &= \sigma^2 \left( X^T X + \lambda I \right)^{-1} X^T X \left( X^T X+\lambda I \right)^{-1} \\
\end{aligned}
How to choose $\lambda$
Methods:
- Minimizing an information criterion
- Cross-validation
R implementation
The R package glmnet can be used.
- Note: ridge regression assumes the predictors are standardized and the response is centered.
For standardization and centering, see the post 《算法-特征归一化》.
```r
library(glmnet)  # for ridge regression
## use the lambda that minimizes the mean cross-validated prediction error
cv_fit <- cv.glmnet(X, y, alpha = 0)
lambda_best <- cv_fit$lambda.min
```
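A fuller, self-contained ridge workflow with glmnet might look like this (a sketch with simulated data; variable names are ours; `glmnet` standardizes the predictors by default, consistent with the note above):

```r
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- drop(X %*% c(2, -1, rep(0, 8))) + rnorm(100)
cv_fit <- cv.glmnet(X, y, alpha = 0)   # alpha = 0: ridge; 10-fold CV by default
fit <- glmnet(X, y, alpha = 0, lambda = cv_fit$lambda.min)
coef(fit)                              # shrunken, but not exactly zero
```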
Python implementation
LASSO
LASSO(Least Absolute Shrinkage and Selection Operator)
- Tibshirani, Robert. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, 1996, pp. 267–288. JSTOR, www.jstor.org/stable/2346178. Accessed 11 Feb. 2021.
- Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348-1360.
- The LASSO adds an L1 penalty to the loss function.
Objective function:
\begin{aligned}
L_{LASSO}(\beta) &= || y-X \beta ||^2_2 + \lambda ||\beta||_1 \\
&= \sum_{i=1}^n \left(y_i - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p \mid \beta_j \mid \\
\end{aligned}
$$\hat{\beta}_{LASSO} = \arg \min_{\beta} \left\lbrace \sum_{i=1}^n\left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2 \right\rbrace \qquad \mbox{subject to } \sum_{j=1}^p |\beta_j| \leq s$$
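In glmnet, this objective corresponds to `alpha = 1` (a minimal sketch; `X` and `y` are the simulated data from the ridge example above):

```r
library(glmnet)
cv_lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1: LASSO penalty
coef(cv_lasso, s = "lambda.min")        # sparse: many coefficients exactly 0
```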
- The standard LASSO over-shrinks large coefficients, a consequence of the nature of the $l_1$ penalty.
- Fan and Li (2001): in single-dataset variable selection, the LASSO tends to select too many variables and, in theory, lacks the oracle property.
Drawbacks:
- The number of variables $c$ ultimately selected by the LASSO cannot exceed the sample size $n$ ($c \leq \min\lbrace n, p \rbrace$)
- Among highly correlated variables, the LASSO selects only one of them, with no control over which one
Fused LASSO
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005), Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67: 91-108.
$$\hat{\beta}_{fLASSO} = \arg \min_{\beta} \left\lbrace \sum_{i=1}^n\left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2 \right\rbrace \qquad \mbox{subject to } \sum_{j=1}^p |\beta_j| \leq s_1 \mbox{ and } \sum_{j=2}^p |\beta_j - \beta_{j-1}| \leq s_2$$
- $\sum_{j=1}^p |\beta_j| \leq s_1$ encourages sparsity in the coefficients
- $\sum_{j=2}^p |\beta_j - \beta_{j-1}| \leq s_2$ encourages sparsity in the differences of successive coefficients
Adaptive LASSO
Hui Zou (2006). The Adaptive Lasso and Its Oracle Properties, Journal of the American Statistical Association, 101:476, 1418-1429.
Consider the weighted LASSO:
$$\arg \min_{\beta} \parallel y - \sum_{j=1}^p x_j\beta_j \parallel^2_2 + \lambda \sum_{j=1}^p w_j|\beta_j|$$
- If the weights $w_j$ are data-driven and properly chosen, the weighted LASSO enjoys the oracle property; it is then called the adaptive LASSO.
Define $\hat{w}=\frac{1}{|\hat{\beta}|^{\gamma} }$. The adaptive LASSO estimates $\hat{\beta}^{(n)}$ by
$$\hat{\beta}^{(n)} = \arg \min_{\beta} \parallel y-\sum_{j=1}^p x_j\beta_j \parallel^2_2 + \lambda_n \sum_{j=1}^p \hat{w}_j |\beta_j|$$
- $\hat{\beta}$ can be taken to be $\hat{\beta}_{OLS}$
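One common implementation route (a sketch, not the cited paper's code): glmnet's `penalty.factor` argument applies per-coefficient weights, so the adaptive LASSO can be run in two stages, with $\hat{w}_j = 1/|\hat{\beta}_{oj}|^{\gamma}$ built from an initial OLS fit (`X`, `y` as above; $\gamma = 1$ here):

```r
library(glmnet)
beta_init <- drop(coef(lm(y ~ X - 1)))    # initial OLS estimate
w <- 1 / abs(beta_init)                   # weights w_j = 1 / |beta_oj|
fit_ada <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
coef(fit_ada, s = "lambda.min")
```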
post-LASSO
Belloni, Alexandre; Chernozhukov, Victor. Least squares after model selection in high-dimensional sparse models. Bernoulli 19 (2013), no. 2, 521–547. doi:10.3150/11-BEJ410.
- Belloni and Chernozhukov (2013): the post-LASSO converges at least as fast as the LASSO and has smaller bias
Bridge
Frank, I. E., & Friedman, J. H. (1993). A Statistical View of Some Chemometrics Regression Tools. Technometrics, 35(2), 109-135.
Fu, W. J. (1998). Penalized Regressions: The Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7(3), 397.
- Objective function:
$$L_{Bridge}(\beta) = \parallel y-X\beta \parallel_2^2 + \lambda \sum_{j=1}^p \mid \beta_j \mid^\gamma, \qquad \gamma > 0 $$
$$
\begin{aligned}
\hat{\beta}_{Bridge} &= \arg \min_{\beta} \parallel y - X\beta \parallel_2^2 + \lambda J(\beta) \\
&= \arg \min_{\beta} \parallel y - X\beta \parallel_2^2 + \lambda \parallel \beta \parallel_{\gamma}^{\gamma} \\
&= \arg \min_{\beta} \parallel y - X\beta \parallel_2^2 + \lambda \sum_{j=1}^p \mid \beta_j\mid^{\gamma} \\
\end{aligned}
$$
Zou and Hastie (2005): the Bridge estimator can be viewed as the Bayes posterior mode under the prior
$$p_{\lambda, \gamma}(\beta) = C(\lambda, \gamma)\exp{(-\lambda |\beta|_{\gamma}^{\gamma})} $$
SCAD
SCAD(Smoothly Clipped Absolute Deviation)
Fan, J., & Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96(456), 1348-1360.
Penalty function:
$$ P_{\lambda}(|\beta_j|) = \left\lbrace
\begin{array}{ll}
{\lambda} |\beta_j|, & 0\leq |\beta_j| \leq \lambda \\
-\frac{\left( |\beta_j|^2 - 2a\lambda |\beta_j| + \lambda^2 \right)}{2(a-1)}, & \lambda < |\beta_j| \leq a\lambda \\
\frac{(a+1)\lambda^2}{2}, & |\beta_j| > a\lambda \\
\end{array}
\right.
$$
$$ P_{\lambda}^{\prime} (|\beta_j|) = \lambda \left\lbrace I(|\beta_j| \leq \lambda) + \frac{(a\lambda - |\beta_j|)_{+} }{(a-1)\lambda} I(|\beta_j| > \lambda) \right\rbrace$$
where $a>2$.
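The piecewise definition transcribes directly into R ($a = 3.7$ is the value Fan and Li recommend); for actual model fitting, the ncvreg package supports `penalty = "SCAD"`:

```r
# SCAD penalty, transcribed from the piecewise definition above (a > 2)
scad_penalty <- function(beta, lambda, a = 3.7) {
  b <- abs(beta)
  ifelse(b <= lambda,
         lambda * b,
         ifelse(b <= a * lambda,
                -(b^2 - 2 * a * lambda * b + lambda^2) / (2 * (a - 1)),
                (a + 1) * lambda^2 / 2))
}
curve(scad_penalty(x, lambda = 1), -6, 6)  # flat beyond |beta| > a * lambda
```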
Elastic Net(Enet)
- The elastic net (EN) was the earliest method capable of group variable selection (Wang et al., 2015)
- Its penalty function is a linear combination of the LASSO and ridge penalties
- A group variable selection method for highly correlated data
- Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.
- 王小燕,谢邦昌,马双鸽,方匡南.高维数据下群组变量选择的惩罚方法综述[J].数理统计与管理,2015,34(06):978-988.
Objective function:
\begin{aligned}
L(\lambda, \beta) &= \parallel y-X\beta \parallel_2^2 + \lambda_2 \parallel \beta \parallel_2^2 + \lambda_1 \parallel \beta \parallel_1 \\
&= \parallel y-X\beta \parallel_2^2 + \lambda_2 \sum_{j=1}^p \beta_j^2 + \lambda_1 \sum_{j=1}^p |\beta_j| \\
\end{aligned}
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta} \lbrace L(\lambda, \beta) \rbrace \\
&= \arg\min_{\beta} \parallel y-X\beta \parallel_2^2, \qquad \mbox{subject to } (1-\alpha)\mid\beta\mid_1+\alpha\parallel \beta\parallel^2_2 \leq t \mbox{ for some } t.\\
\end{aligned}
- When $\alpha = 1$, the naive elastic net reduces to simple ridge regression
- When $\alpha = 0$, the naive elastic net reduces to the LASSO
The elastic net estimator can be viewed as the Bayes posterior mode under the following prior (intermediate between a Gaussian prior and a Laplacian prior):
$$p_{\lambda, \alpha} = C(\lambda, \alpha) \exp\lbrace -\lambda \left[ \alpha |\beta|^2 + (1-\alpha)|\beta|_1 \right] \rbrace$$
- Hui Zou (2015): the elastic net does not enjoy the oracle property
- Wang et al. (2015): one drawback of the elastic net is that it tends to select too many variable groups
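In glmnet the elastic net is obtained by setting the mixing parameter strictly between 0 and 1 (a sketch; `X`, `y` as in the ridge example above; note that glmnet's `alpha` convention is the reverse of the $\alpha$ in the constraint above: `alpha = 1` is the LASSO and `alpha = 0` is ridge):

```r
library(glmnet)
cv_enet <- cv.glmnet(X, y, alpha = 0.5)  # an equal mix of L1 and L2 penalties
coef(cv_enet, s = "lambda.min")
```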
EN vs Bridge
- Bridge regression with $1< \gamma < 2$ shares many similarities with the elastic net
- The elastic net can produce sparse solutions, whereas bridge regression cannot
Fan and Li (2001): within the $L_q\ (q\geq 1)$ penalty family, only the LASSO penalty ($q=1$) produces sparse solutions
Adaptive Elastic Net
Combines the adaptive LASSO with an $L_2$ penalty
MCP
MCP (Minimax Concave Penalty)
- Used for the analysis of a single dataset
- Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894-942.
- Liu, J., Huang, J., & Ma, S. (2014). Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization. Scandinavian journal of statistics, theory and applications, 41(1), 87–103.
- Mazumder, R., Friedman, J. H., & Hastie, T. (2011). SparseNet: Coordinate Descent With Nonconvex Penalties. Journal of the American Statistical Association, 106(495), 1125–1138.
The MCP is defined as:
$$
\begin{aligned}
\rho(t;\lambda,\gamma) &= \lambda \int_{0}^{|t|} \left( 1-\frac{x}{\gamma \lambda} \right)_{+} \mathrm{d}x \\
&= \lambda \left( |t| - \frac{t^2}{2\gamma\lambda} \right)I(|t| < \gamma\lambda) + \frac{\gamma\lambda^2}{2}I(|t| \geq \gamma\lambda) \\
\end{aligned}
$$
i.e.,
$$ P_{MCP}(t;\lambda, \gamma) =
\left\lbrace
\begin{array}{ll}
\lambda|t| - \frac{t^2}{2\gamma}, & |t| < \gamma\lambda \\
\frac{\gamma\lambda^2}{2}, & |t| \geq \gamma \lambda \\
\end{array}
\right.
$$
- The regularization parameter $\gamma > 0$ controls the concavity of $\rho(\cdot)$
- $\lambda$ is the penalty parameter
- $x_{+} = xI(x\geq 0)$
The derivative is:
$$\rho^\prime(t;\lambda,\gamma) = \lambda\left( 1-\frac{|t|}{\gamma \lambda} \right)_{+}\mathrm{sgn}(t)$$
i.e.,
$$
P_{MCP}^{\prime}(t;\lambda,\gamma) = \left\lbrace
\begin{array}{ll}
\lambda - \frac{|t|}{\gamma}, & |t| < \gamma\lambda \\
0, & |t| \geq \gamma \lambda \\
\end{array}
\right.
$$
where
$$\mathrm{sgn}(t) = \left\lbrace
\begin{array}{ll}
-1, & t < 0 \\
0, & t = 0 \\
1, & t > 0
\end{array}
\right.
$$
- As $\gamma \rightarrow +\infty$, the MCP penalty $\rightarrow$ the LASSO penalty
- As $\gamma \rightarrow 1^{+}$, the MCP penalty $\rightarrow$ the hard-thresholding penalty
- Mazumder et al. (2011): the MCP is computationally simple
- See also Liu et al. (2014)
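Transcribing the piecewise definition into R makes the shape easy to inspect (the penalty flattens once $|t| \geq \gamma\lambda$, where the LASSO penalty would keep growing); for model fitting, the ncvreg package supports `penalty = "MCP"`:

```r
# MCP penalty, transcribed from the piecewise definition above
mcp_penalty <- function(t, lambda, gamma) {
  ifelse(abs(t) < gamma * lambda,
         lambda * abs(t) - t^2 / (2 * gamma),
         gamma * lambda^2 / 2)
}
curve(mcp_penalty(x, lambda = 1, gamma = 3), -6, 6)
```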
Group Variable Selection
MNET
- A group variable selection method for highly correlated data
$L_2$ SCAD
- Combines the SCAD penalty with ridge regression
- A group variable selection method for highly correlated data
Group LASSO
- Selects variables only at the group level
- Yuan, M. and Lin, Y. (2006) Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society: Series B, 68, 49-67.
- Huang, J., Breheny, P., & Ma, S. (2012). A Selective Review of Group Selection in High-Dimensional Models. Statistical Science, 27(4), 481-499.
- Wei, F., & Huang, J. (2010). Consistent group selection in high-dimensional linear regression. Bernoulli : official journal of the Bernoulli Society for Mathematical Statistics and Probability, 16(4), 1369–1384.
Suppose the design matrix $X$ is partitioned into $J$ groups $X_1, X_2, \cdots, X_J$, and let $d_j$ denote the size of the $j$-th group, so that $\sum_{j=1}^J d_j = p$.
Objective function:
\begin{equation}
\begin{aligned}
Q(\beta | X,y) &= L(\beta|X,y) + \sum_{j=1}^J \lambda_j \parallel \beta_j \parallel_{K_j} \qquad (\lambda \geq 0) \\
&= \parallel y - \sum_{j=1}^J X_j\beta_j \parallel_2^2 + \lambda \sum_{j=1}^J \sqrt{d_j} \parallel \beta_j \parallel_{K_j} \\
\end{aligned}
\end{equation}
- $\parallel z \parallel_{K_j} = \left( z^T K_j z \right)^{1/2}$
- To penalize large and small groups comparably, one may take $\lambda_j = \lambda \sqrt{d_j}$
- When $d_j=1$ $(1\leq j\leq J)$, the group LASSO reduces to the standard LASSO, and $R_j=\frac{1}{n}\parallel X_j\parallel^2$ is proportional to the sample variance of $X_j$.
- Kim et al. (2006): applied the group LASSO to logistic models
- Meier et al. (2008): applied the group LASSO to logistic models
- Huang et al. (2009): the group LASSO behaves like adaptively weighted ridge regression
- Wei and Huang (2010): group LASSO selection is not consistent and tends to select unimportant groups
How to choose $K_j$
For orthonormal $X_j$,
$$\frac{1}{n}X_j^TX_j = I_{d_j}, \qquad j=1, 2, \cdots, J.$$
- Yuan and Lin (2006) suggest taking $K_j=I_{d_j}$
Computation
- Yuan and Lin (2006): compute the group LASSO solution via a group coordinate descent algorithm
Let $z=\frac{1}{n}X_j^T y$ be the least squares solution of $y=X\beta+\varepsilon$; then
$$\hat{\beta}_{LASSO}(z;\lambda) = S(z, \lambda) = \left( 1-\frac{\lambda}{\parallel z \parallel_2} \right)_{+} z$$
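For computation in practice, the grpreg R package implements group coordinate descent (a minimal sketch; the data and the three-group structure are simulated assumptions of ours):

```r
library(grpreg)
set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
y <- X[, 1] - X[, 2] + rnorm(100)
group <- c(1, 1, 2, 2, 3, 3)           # columns 1-2, 3-4, 5-6 form three groups
cv_fit <- cv.grpreg(X, y, group, penalty = "grLasso")
coef(cv_fit)                           # groups enter or leave the model together
```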
Homogeneous integrative analysis
The group LASSO can be applied to homogeneous integrative analysis.
Zhang, Q., Zhang, S., Liu, J., Huang, J., & Ma, S. (2015). Penalized integrative analysis under the accelerated failure time model. Statistica Sinica, 26(2).
Objective function:
\begin{aligned}
L(\beta) &= \frac{1}{2n}\parallel y-X\beta \parallel_2^2 + \lambda \sum_{j=1}^p\parallel \beta_j \parallel_2 \\
&= \frac{1}{2n}\parallel y-X\beta \parallel_2^2 + \lambda \sum_{j=1}^p \left[ \sum_{k=1}^M (\beta_j^k)^2 \right]^{1/2} \\
\end{aligned}
- $\parallel \beta_j \parallel_2 = \left[ \sum_{k=1}^M (\beta_j^k)^2 \right]^{1/2}$
- Ma et al. (2015):
  - variable selection in homogeneous models is whole-group selection
  - penalized integrative analysis of homogeneous data is similar in spirit to group variable selection on a single dataset
- Zhang et al. (2015): under certain conditions, the group LASSO, $L_2$ group SCAD, and $L_2$ group MCP achieve selection consistency
Adaptive Group LASSO
Wang, H., & Leng, C. (2008). A note on adaptive group lasso. Computational Statistics & Data Analysis, 52(12), 5277-5286.
Group SCAD
Lifeng Wang, Guang Chen, Hongzhe Li, Group SCAD regression analysis for microarray time course gene expression data, Bioinformatics, Volume 23, Issue 12, 15 June 2007, Pages 1486–1494
$L_2$ Group Bridge
- Ma, S., Huang, J., & Song, X. (2011). Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics (Oxford, England), 12(4), 763–775.
- 马双鸽, 王小燕, 方匡南. 大数据的整合分析方法[J]. 统计研究, 2015, 32(011):3-11.
- Used for integrative analysis of homogeneous data
\begin{aligned}
\mbox{within-group selection} &: \mbox{Ridge penalty} \\
\mbox{group selection} &: \mbox{Bridge penalty} \\
\end{aligned}
Consider sample data from $M$ studies, each containing the same $p$ explanatory variables. The penalty on $\beta_j$, the coefficients of the $j$-th variable across studies, is:
\begin{aligned}
J(\beta_j) &= \lambda \parallel \beta_j \parallel_2^{\gamma} \\
&= \lambda \left[ \sum_{k=1}^M (\beta_j^k)^2 \right]^{\gamma/2} \\
\end{aligned}
- $\parallel \beta_j \parallel_2 = \left[ (\beta_j^1 )^2 + (\beta_j^2 )^2 + \cdots + (\beta_j^M )^2 \right]^{1/2}$
- $0 < \gamma < 1$ is a fixed bridge index
- The coefficients of the same variable across different studies are treated as one "group"
- When $\gamma = 1$, the $L_2$ group bridge reduces to the group LASSO
- The $L_2$ group bridge achieves selection consistency
Zhang et al. (2015): under certain conditions, the group LASSO, $L_2$ group SCAD, and $L_2$ group MCP achieve selection consistency
$L_2$ Group MCP
- Can be used for group variable selection
- Can be used for integrative analysis of homogeneous data
- Huang, J., Breheny, P., & Ma, S. (2012). A selective review of group selection in high-dimensional models. Statistical Science, 27(4), 481-499.
- Ma, S., Huang, J., Wei, F., Xie, Y., & Fang, K. (2011). Integrative analysis of multiple cancer prognosis studies with gene expression measurements. Statistics in medicine, 30(28), 3361–3371.
- 马双鸽, 王小燕, 方匡南. 大数据的整合分析方法[J]. 统计研究, 2015, 32(011):3-11.
- Liu, J., Huang, J., & Ma, S. (2014). Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization. Scandinavian journal of statistics, theory and applications, 41(1), 87–103.
\begin{aligned}
\mbox{within-group selection} &: \mbox{Ridge penalty} \\
\mbox{group selection} &: \mbox{MCP} \\
\end{aligned}
Ma et al. (2011): first applied the $L_2$ group MCP to integrative analysis
CAP
- Selects variables only at the group level
Bi-Level Variable Selection
Composite Penalization
Composite penalization applied to heterogeneous integrative analysis:
$$J(\beta) = \sum_{j=1}^p p_{O, \lambda_{O} }\left( \sum_{k=1}^M p_{I, \lambda_{I} }(|\beta_j^k|) \right)$$
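As an illustration (our own sketch, with the inner tuning parameter absorbed into the outer scale for simplicity), taking the inner penalty to be the LASSO and the outer penalty to be the MCP, i.e. the $L_1$ Group MCP discussed below, the composite penalty can be evaluated as:

```r
# Composite penalty J(beta): outer MCP over inner per-variable LASSO sums
mcp <- function(t, lambda, gamma)
  ifelse(abs(t) < gamma * lambda,
         lambda * abs(t) - t^2 / (2 * gamma),
         gamma * lambda^2 / 2)
composite_penalty <- function(beta_mat, lambda, gamma) {
  inner <- rowSums(abs(beta_mat))   # inner LASSO sum over datasets k = 1..M
  sum(mcp(inner, lambda, gamma))    # outer MCP over variables j = 1..p
}
composite_penalty(matrix(rnorm(12), nrow = 4), lambda = 1, gamma = 3)
```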
$L_1$ Group Bridge
- The earliest bi-level variable selection method
- Huang, J., Ma, S., Xie, H., & Zhang, C. H. (2009). A group bridge approach for variable selection. Biometrika, 96(2), 339–355.
- 王小燕,谢邦昌,马双鸽,方匡南.高维数据下群组变量选择的惩罚方法综述[J].数理统计与管理,2015,34(06):978-988.
\begin{aligned}
\mbox{within-group selection} &: \mbox{LASSO penalty} \\
\mbox{group selection} &: \mbox{Bridge penalty} \\
\end{aligned}
Objective function:
$$Q(\beta|X, y) = \parallel y-\sum_{k=1}^p x_k \beta_k \parallel_2^2 + \lambda_n \sum_{j=1}^J c_j\parallel \beta_{A_j} \parallel_1^{\gamma}, \qquad \lambda_n > 0$$
- The $A_j\ (j=1,2,\cdots,J)$ are arbitrary subsets of $\lbrace1, 2, \cdots, p\rbrace$
- The $A_j\ (j=1,2,\cdots,J)$ may overlap
- $\cup_{j=1}^J A_j$ is allowed to be a proper subset of $\lbrace1, 2, \cdots, p\rbrace$; variables outside $\cup_{j=1}^J A_j$ are not penalized
- The bridge penalty is applied to the $L_1$ norm of each group's coefficients
- When $\mid A_j\mid = 1\ (j=1,2,\cdots,J)$, the group bridge reduces to the standard bridge
- When $\gamma=1$, the group bridge reduces to the standard LASSO and performs only individual variable selection
- When $0 < \gamma < 1$, the group bridge performs group-level and individual-level variable selection simultaneously
- The objective function is nonconvex and is not differentiable at $\beta_j=0$
- Huang et al. (2009):
  - first proposed the $L_1$ group bridge
  - when $p\rightarrow \infty$ and $n\rightarrow \infty$ with $p < n$, under certain regularity conditions the $L_1$ group bridge ($0 < \gamma < 1$) enjoys the group-level oracle property
$L_1$ Group MCP
Liu, J., Huang, J., & Ma, S. (2014). Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization. Scandinavian journal of statistics, theory and applications, 41(1), 87–103.
\begin{aligned}
\mbox{inner penalty} &: \mbox{LASSO} \\
\mbox{outer penalty} &: \mbox{MCP} \\
\end{aligned}
Fan and Li (2001): in single-dataset variable selection, the LASSO tends to select too many variables and, in theory, lacks the oracle property
Composite MCP
- Liu, J., Huang, J., & Ma, S. (2014). Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization. Scandinavian journal of statistics, theory and applications, 41(1), 87–103.
- Zhang, Q., Zhang, S., Liu, J., Huang, J., & Ma, S. (2015). Penalized integrative analysis under the accelerated failure time model. Statistica Sinica, 26(2).
\begin{aligned}
\mbox{inner penalty} &: \mbox{MCP} \\
\mbox{outer penalty} &: \mbox{MCP} \\
\end{aligned}
- Zhang et al. (2015): under certain conditions, composite MCP achieves selection consistency both within and between groups, whereas the $L_1$ group MCP achieves only group-level selection consistency
Sparse Group Penalization
A sparse group penalty is a linear combination of two penalty functions: one performs group selection, the other individual variable selection (Ma et al., 2015).
- Applicable to heterogeneous integrative analysis
General form of the penalty:
$$P(\beta;\lambda_1,\lambda_2) = \lambda_1 \sum_{j=1}^p P_1(\parallel\beta_j\parallel) + \lambda_2\sum_{j=1}^{p}\sum_{k=1}^M P_2(|\beta_j^k|)$$
- Zhang et al. (2015): proved the selection consistency of sparse group penalization
SGL-Sparse Group LASSO
SGL (Sparse Group LASSO) is a linear combination of the LASSO and group LASSO penalties.
- Friedman, J., Hastie, T., & Tibshirani, R. (2010). A note on the group lasso and a sparse group lasso. arXiv e-prints.
- Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational & Graphical Statistics, 22(2), 231-245.
- Vincent, M., & Hansen, N. R. (2014). Sparse group lasso and high dimensional multinomial classification. Computational Statistics & Data Analysis, 71(1), 771-786.
Penalty function:
$$P_{SGL}(\beta;\lambda_1,\lambda_2) = \lambda_1\sum_{j=1}^J \parallel\beta_j\parallel + \lambda_2 \parallel\beta\parallel_1$$
$$\hat{\beta} = \arg \min_{\beta} \left\lbrace \parallel y - \sum_{j=1}^J \mathbf{X_j} \beta_j \parallel^2_2 + \lambda_1 \sum_{j=1}^J \sqrt{p_j} \parallel \beta_j \parallel_2 + \lambda_2 \parallel \beta \parallel_1 \right\rbrace$$
- $\mathbf{X_j}$ is the $n\times p_j$ sample matrix of the variables in group $j$
- $\sum_{j=1}^J p_j = p$
- When $\lambda_2 = 0$, the sparse group LASSO reduces to the group LASSO
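The penalty itself is easy to evaluate directly (our own sketch; for actual fitting, the SGL package accompanying Simon et al. (2013) is one implementation):

```r
# Sparse group LASSO penalty: group L2 norms plus an overall L1 term
sgl_penalty <- function(beta, group, lambda1, lambda2) {
  group_l2 <- tapply(beta, group, function(b) sqrt(sum(b^2)))
  lambda1 * sum(group_l2) + lambda2 * sum(abs(beta))
}
sgl_penalty(c(1, -2, 0, 0.5), group = c(1, 1, 2, 2), lambda1 = 1, lambda2 = 0.5)
```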
adSGL-Adaptive Sparse Group LASSO
- Combines the adaptive LASSO and the adaptive group LASSO
- Can be viewed as an improved version of SGL
- Uses data-driven weights to improve prediction
- Fang, K., Wang, X., Zhang, S., Zhu, J., & Ma, S. (2015). Bi-level variable selection via adaptive sparse group lasso. Journal of Statistical Computation & Simulation, 85(13-15), 2750-2760.
Penalty function:
$$P_{adSGL}(\beta;\lambda_1,\lambda_2) = \lambda_1 \sum_{j=1}^J w_j \parallel \beta_j \parallel_2 + \lambda_2\, \xi^T \mid \beta \mid$$
where $w$ and $\xi$ are data-driven group and individual weight vectors, and $\mid\beta\mid = (|\beta_1|, \cdots, |\beta_p|)^T$.
Sparse Group MCP
- Liu, J., Huang, J., Xie, Y., & Ma, S. (2013). Sparse group penalized integrative analysis of multiple cancer prognosis datasets. Genetics research, 95(2-3), 68–77.
- Zhang, Q., Zhang, S., Liu, J., Huang, J., & Ma, S. (2015). Penalized integrative analysis under the accelerated failure time model. Statistica Sinica, 26(2).
Network-based penalization
- Caiyan Li, Hongzhe Li, Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics, Volume 24, Issue 9, 1 May 2008, Pages 1175–1182.
- Li, C., & Li, H. (2010). Variable Selection and Regression Analysis for Graph-Stuctured Covariates with an Application to Genomics. The annals of applied statistics, 4(3), 1498–1516. https://doi.org/10.1214/10-AOAS332
- Kim, S., Pan, W., & Shen, X. (2013). Network-based penalized regression with application to genomic data. Biometrics, 69(3), 582–593.
SGLS-Sparse Group Laplacian Shrinkage
- Used for integrative analysis of multi-source data
Liu, J., Huang, J., & Ma, S. (2013). Incorporating network structure in integrative analysis of cancer prognosis data. Genetic epidemiology, 37(2), 173–183.
$$\hat{\beta} = \arg \min_{\beta} \left\lbrace \frac{1}{n}L(\beta) + P_{\lambda, \gamma}(\beta) \right\rbrace$$
where
$$P_{\lambda, \gamma}(\beta) = \sum_{j=1}^p \rho(\parallel\beta_j\parallel_2; \sqrt{M_j}\lambda_1, \gamma) + \frac{1}{2}\lambda_2 d\sum_{1\leq j < k \leq p}a_{jk}\left( \frac{\parallel\beta_j\parallel_2}{\sqrt{M_j} } - \frac{\parallel\beta_k\parallel_2}{\sqrt{M_k} } \right)^2$$
- $\lambda = (\lambda_1, \lambda_2)$
- $\lambda_1 \geq 0$ and $\lambda_2 \geq 0$ are tuning parameters
- $\gamma$ is the regularization parameter
- $\rho(\cdot)$ is the MCP penalty function
- $M_j$ is the size (length) of $\beta_j$
Summary
Individual variable selection methods

Method | Penalty | Parameters | Advantages | Disadvantages
---|---|---|---|---
LASSO | $L_1$ | $\lambda \geq 0$ | Continuous and stable; can reduce the dimension of high-dimensional data | No group selection; lacks the oracle property
SCAD | SCAD (behaves like $L_1$ near 0) | $a > 2$, $\lambda > 0$ | Inherits the advantages of the LASSO; has the oracle property | Cannot handle $p \gg n$ data
Bridge | | | |
MCP | | | |