
Variable Selection | Penalized Variable Selection Methods

Suppose we have $n$ samples and $p$ predictor variables.
Consider the simple linear regression model:
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$
where $y$ is the response (dependent) variable, $X$ is the $n\times p$ sample (design) matrix, and $\varepsilon$ is a normally distributed error term with variance $\sigma^2$.

Ordinary least squares (OLS):

  • Loss function:
    $$L(\beta) = ||y-X\beta||^2$$
  • Coefficient estimate (checked numerically in the sketch below):
    $$\hat{\beta}_{OLS} = \left(X^T X \right)^{-1} X^T y$$
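
As a quick numerical check, the closed form can be compared against R's lm() on simulated data (a minimal sketch; the data and variable names are made up for illustration):

set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, -2, 0.5) + rnorm(n)

## closed form: (X^T X)^{-1} X^T y
beta_ols <- solve(t(X) %*% X, t(X) %*% y)

## should match lm() fitted without an intercept
coef(lm(y ~ X - 1))
drop(beta_ols)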

$L_p$ norm:
$$L_p(\vec{x}) = \parallel \vec{x} \parallel_p = \left( \sum_{i=1}^n \mid x_i \mid^p \right)^{1/p}, \qquad p\geq 1$$
where $\vec{x} = ( x_1, x_2, \cdots, x_n )$.

  • $p = -\infty$:
    $$\parallel \vec{x} \parallel_{-\infty} = \lim_{p \rightarrow -\infty} \left( \sum_{i=1}^n \mid x_i \mid^p \right)^{1/p} = \min_i \mid x_i \mid $$
  • $p = 0$ (strictly speaking, not a norm):
    $$\parallel \vec{x} \parallel_0 = \sharp(i) \quad \mathrm{with} \quad x_i\neq 0 $$
    i.e., the number of nonzero elements of $\vec{x}$
  • $p = 1$ (also called the Manhattan distance):
    $$\parallel \vec{x} \parallel_1 = \sum_{i=1}^n \mid x_i \mid$$
  • $p = 2$ (also called the Euclidean distance):
    $$\parallel \vec{x} \parallel_2 = \sqrt{\sum_{i=1}^n \mid x_i \mid^2}$$
  • $p = +\infty$ (the infinity/maximum norm; these special cases are verified numerically in the sketch after this list):
    $$\parallel \vec{x} \parallel_{+\infty} = \lim_{p\rightarrow +\infty} \left( \sum_{i=1}^n \mid x_i \mid^p \right)^{1/p} = \max_i \mid x_i \mid$$
  • Note: when the number of predictors $p$ exceeds the sample size $n$, OLS is not applicable (prone to overfitting)
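
A minimal sketch in base R on a toy vector:

x <- c(3, -4, 0, 1)

sum(x != 0)          ## L0 "norm": number of nonzero elements -> 3
sum(abs(x))          ## L1 norm (Manhattan)                   -> 8
sqrt(sum(abs(x)^2))  ## L2 norm (Euclidean)                   -> sqrt(26)
max(abs(x))          ## L_{+infinity} norm (maximum)          -> 4
min(abs(x))          ## L_{-infinity} limit (minimum)         -> 0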

Penalized Least Squares

Fan, J., & Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96(456), 1348-1360.

Standardize the design matrix so that $X^TX = I_p$.

Objective function:
$$
\begin{aligned}
L( \beta ) &= \parallel y - X \beta \parallel_2^2 + \lambda J( \beta ) \\
&= \parallel y-X \beta \parallel_2^2 + \lambda \sum_{j=1}^p p_j ( \mid \beta_j \mid ) \\
&= \sum_{i=1}^n \left( y_i - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p p_j ( \mid \beta_j \mid ) \\
&= \sum_{i=1}^n \left( y_i - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \sum_{j=1}^p p_\lambda ( \mid \beta_j \mid ) \\
&= \parallel y-X \beta \parallel^2_2 + \sum_{j=1}^p p_\lambda ( \mid \beta_j \mid ) \\
&= \parallel y-X \hat{\beta}_{OLS} \parallel^2_2 + \parallel \hat{\beta}_{OLS} - \beta \parallel^2_2 + \sum_{j=1}^p p_\lambda ( \mid \beta_j \mid ) \\
&= \parallel y - X \hat{\beta}_{OLS} \parallel^2_2 + \sum_{j=1}^p \left\lbrace ( \hat{\beta}_{oj} - \beta_j )^2 + p_\lambda ( \mid \beta_j \mid ) \right\rbrace \\
\end{aligned}
$$

  • Assume the penalty function is the same for every coefficient, i.e., $p(\mid \cdot \mid)$
  • Further, write $\lambda p( \mid \cdot \mid )$ as $p_\lambda( \mid \cdot \mid )$
  • $\hat{\beta}_{OLS}$ is the ordinary least squares estimate of the regression equation, and $\hat{\beta}_{oj}$ is the $j$-th element of $\hat{\beta}_{OLS}$

$$\hat{ \beta }_{ OLS } = \left( X^T X \right)^{-1} X^T y = X^T y$$

Since $\parallel y-X \hat{\beta}_{OLS} \parallel^2_2$ is constant with respect to $\beta$, we have

$$\arg \min_{\beta_j} \parallel y-X\beta \parallel^2_2 + p_\lambda ( \mid \beta_j \mid ) \Leftrightarrow \arg \min_{ \beta_j } \left\lbrace ( \hat{\beta}_{oj} - \beta_j )^2 + p_\lambda( \mid \beta_j \mid ) \right\rbrace $$

$$\hat{\beta}_j = \arg \min_{ \beta_j } \left\lbrace ( \hat{\beta}_{oj} - \beta_j )^2 + p_\lambda( \mid \beta_j \mid ) \right\rbrace$$

  • For ridge regression,

$$p_\lambda ( \mid \beta_j \mid ) = \lambda \mid \beta_j \mid^2, \qquad \hat{\beta}^{Ridge}_{j} = \frac{ 1 }{ 1 + \lambda } \hat{\beta}_{oj}$$

  • For the LASSO (both closed forms are illustrated in the sketch below),

$$p_\lambda(\mid \beta_j \mid) = \lambda \mid \beta_j \mid, \qquad \hat{\beta}^{LASSO}_{j} = \mathrm{sgn}( \hat{\beta}_{oj} ) \left( \mid \hat{\beta}_{oj} \mid - \frac{ \lambda }{ 2 } \right)_{+}
$$
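
The two closed forms behave quite differently: ridge rescales every coefficient, while the LASSO sets small ones exactly to zero. A minimal sketch with illustrative values:

## componentwise solutions under an orthonormal design
ridge_shrink <- function(beta_oj, lambda) beta_oj / (1 + lambda)
soft_thresh  <- function(beta_oj, lambda) {
  sign(beta_oj) * pmax(abs(beta_oj) - lambda / 2, 0)  ## soft thresholding
}

beta_oj <- c(-3, -0.2, 0.1, 2)
ridge_shrink(beta_oj, lambda = 1)  ## all coefficients shrunk, none exactly 0
soft_thresh(beta_oj, lambda = 1)   ## small coefficients set exactly to 0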

Ridge Regression

When the number of predictors exceeds the number of samples, or the data exhibit multicollinearity, ridge regression can still produce a stable, well-conditioned model.

Hoerl, A., & Kennard, R. (1970). Ridge Regression: Applications to Nonorthogonal Problems. Technometrics, 12(1), 69-82. doi:10.2307/1267352

  • The ridge estimator is a type of shrinkage estimator

Shrinkage estimators theoretically produce new estimators that are shrunk closer to the ‘true’ population parameters.

  • Ridge regression is a special case of Tikhonov regularization

  • Ridge regression uses L2 regularization

  • Objective function:

\begin{aligned}
L_{Ridge}(\beta) &= ||y-X\beta ||^2_2 + \lambda ||\beta||^2_2 \\
&= \sum_{i=1}^n \left( y_i - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \\
\end{aligned}

  • Coefficient estimate:

\begin{aligned}
\hat{\beta}_{Ridge} &= \arg \min_{\beta} \sum_{i=1}^n \left(y_i - \sum_{j=1}^p x_{ij} \beta_j \right)^2 \quad \mathrm{s.t.} \quad \sum_{j=1}^p \beta_j^2 \leq s \\
&= \left( X^T X+\lambda I \right)^{-1} X^T y \\
\end{aligned}
where $I$ is the identity matrix and $\lambda$ is the regularization (penalty) parameter.

  • As $\lambda \rightarrow 0$, $\hat{\beta}_{Ridge} \rightarrow \hat{\beta}_{OLS}$
  • As $\lambda \rightarrow \infty$, $\hat{\beta}_{Ridge} \rightarrow 0$

Bias-Variance Trade-off

\begin{aligned}
\mathrm{Bias}( \hat{\beta}_{Ridge} ) &= -\lambda \left( X^T X+\lambda I \right)^{-1} \beta \\
\mathrm{Var}( \hat{\beta}_{Ridge} ) &= \sigma^2 \left( X^T X + \lambda I \right)^{-1} X^T X \left( X^T X+\lambda I \right)^{-1} \\
\end{aligned}

How to choose $\lambda$

Methods:

  1. Minimizing an information criterion
  2. Cross-validation

R implementation

The R package glmnet can be used.

  • Note: ridge regression assumes the predictors are standardized and the response is centered

Standardization and centering are discussed in the separate post 《算法-特征归一化》.

library(glmnet)  ## for ridge regression
library(dplyr)   ## for data wrangling
library(psych)   ## tr() computes the trace of a matrix

data('mtcars')   ## load the mtcars data
## center y
y <- mtcars %>% select(mpg) %>%
  scale(center = TRUE, scale = FALSE) %>%
  as.matrix()
## center and standardize X
X <- mtcars %>% select(-mpg) %>%
  scale(center = TRUE, scale = TRUE) %>%
  as.matrix()
## Standardizing X here can be skipped: set standardize = TRUE in cv.glmnet()/glmnet() instead

## choose the optimal lambda by 10-fold cross-validation
lambdas_seq <- 10^seq(-3, 5, length.out = 100)
## alpha = 0 gives ridge regression
ridge_cv <- cv.glmnet(X, y, alpha = 0, lambda = lambdas_seq,
                      nfolds = 10)
plot(ridge_cv)
## refit with the lambda that minimizes the mean cross-validated error
rr <- glmnet(X, y, alpha = 0, lambda = ridge_cv$lambda.min)
y_hat <- predict(rr, X)              ## fitted values of y
ssr <- t(y - y_hat) %*% (y - y_hat)  ## residual sum of squares

Python implementation

LASSO

LASSO (Least Absolute Shrinkage and Selection Operator)

  • The LASSO adds an L1 penalty to the loss function

  • Objective function:
    \begin{aligned}
    L_{LASSO}(\beta) &= || y-X \beta ||^2_2 + \lambda ||\beta||_1 \\
    &= \sum_{i=1}^n \left(y_i - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^p \mid \beta_j \mid \\
    \end{aligned}

$$\hat{\beta}_{LASSO} = \arg \min_{\beta} \left\lbrace \sum_{i=1}^n\left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2 \right\rbrace \qquad \mbox{subject to } \sum_{j=1}^p |\beta_j| \leq s$$

  • The standard LASSO over-shrinks large coefficients, due to the nature of the $l_1$ penalty
  • Fan and Li (2001): in single-dataset variable selection, the LASSO tends to select too many variables and, in theory, does not enjoy the oracle property

Drawbacks

  1. The number of variables selected by the LASSO, $c$, cannot exceed the sample size $n$ ($c \leq \min\lbrace n, p \rbrace$)
  2. When choosing among highly correlated variables, the LASSO selects only one of them, without regard to which one

Fused LASSO

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005), Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67: 91-108.

$$\hat{\beta}_{fLASSO} = \arg \min_{\beta} \left\lbrace \sum_{i=1}^n\left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2 \right\rbrace \qquad \mbox{subject to } \sum_{j=1}^p |\beta_j| \leq s_1 \mbox{ and } \sum_{j=2}^p |\beta_j - \beta_{j-1}| \leq s_2$$

  • $\sum_{j=1}^p |\beta_j| \leq s_1$ encourages sparsity in the regression coefficients
  • $\sum_{j=2}^p |\beta_j - \beta_{j-1}| \leq s_2$ encourages sparsity in the differences of successive coefficients

Adaptive LASSO

Hui Zou (2006). The Adaptive Lasso and Its Oracle Properties, Journal of the American Statistical Association, 101:476, 1418-1429.

Consider the weighted LASSO:
$$\arg \min_{\beta} \parallel y - \sum_{j=1}^p x_j\beta_j \parallel^2_2 + \lambda \sum_{j=1}^p w_j|\beta_j|$$

  • If the weights $w_j$ are data-driven and appropriately chosen, the weighted LASSO enjoys the oracle property and is called the adaptive LASSO

Define $\hat{w}_j=\frac{1}{|\hat{\beta}_j|^{\gamma} }$. The adaptive LASSO estimates $\hat{\beta}^{(n)}$ by
$$\hat{\beta}^{(n)} = \arg \min_{\beta} \parallel y-\sum_{j=1}^p x_j\beta_j \parallel^2_2 + \lambda_n \sum_{j=1}^p \hat{w}_j |\beta_j|$$

  • $\hat{\beta}$ can be taken to be $\hat{\beta}_{OLS}$ (a glmnet sketch follows below)
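
In glmnet, the adaptive LASSO can be sketched via the penalty.factor argument, which rescales the per-coefficient L1 penalty; the simulated data and the choice $\gamma = 1$ below are illustrative:

library(glmnet)

set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1.5, rep(0, p - 2)) + rnorm(n)

beta_init <- coef(lm(y ~ X - 1))  ## initial estimates (OLS, since p < n)
w <- 1 / abs(beta_init)^1         ## adaptive weights with gamma = 1
alasso_cv <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
coef(alasso_cv, s = "lambda.min")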

post-LASSO

Belloni, Alexandre; Chernozhukov, Victor. Least squares after model selection in high-dimensional sparse models. Bernoulli 19 (2013), no. 2, 521–547. doi:10.3150/11-BEJ410.

Bridge

Frank, I. E., & Friedman, J. H. (1993). A Statistical View of Some Chemometrics Regression Tools. Technometrics, 35(2), 109-135.
Fu, Wenjiang J. (1998). Penalized Regressions: The Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7(3), 397-416.

  • Objective function:
    $$L_{Bridge}(\beta) = \parallel y-X\beta \parallel_2^2 + \lambda \sum_{j=1}^p \mid \beta_j \mid^\gamma, \qquad \gamma > 0 $$

$$
\begin{aligned}
\hat{\beta}_{Bridge} &= \arg \min_{\beta} \parallel y - X\beta \parallel_2^2 + \lambda J(\beta) \\
&= \arg \min_{\beta} \parallel y - X\beta \parallel_2^2 + \lambda \parallel \beta \parallel_{\gamma}^{\gamma} \\
&= \arg \min_{\beta} \parallel y - X\beta \parallel_2^2 + \lambda \sum_{j=1}^p \mid \beta_j\mid^{\gamma} \\
\end{aligned}
$$

Zou and Hastie (2005): the Bridge estimator can be viewed as the Bayes posterior mode under the prior
$$p_{\lambda, \gamma}(\beta) = C(\lambda, \gamma)\exp\left(-\lambda \parallel\beta\parallel_{\gamma}^{\gamma}\right) $$

SCAD

SCAD(Smoothly Clipped Absolute Deviation)

Fan, J., & Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96(456), 1348-1360.

Penalty function:
$$ P_{\lambda}(|\beta_j|) = \left\lbrace
\begin{array}{ll}
{\lambda} |\beta_j|, & 0\leq |\beta_j| \leq \lambda \\
-\frac{\left( |\beta_j|^2 - 2a\lambda |\beta_j| + \lambda^2 \right)}{2(a-1)}, & \lambda < |\beta_j| \leq a\lambda \\
\frac{(a+1)\lambda^2}{2}, & |\beta_j| > a\lambda \\
\end{array}
\right.
$$

$$ P_{\lambda}^{\prime} (|\beta_j|) = \lambda \left\lbrace I(|\beta_j| \leq \lambda) + \frac{(a\lambda - |\beta_j|)_{+} }{(a-1)\lambda} I(|\beta_j| > \lambda) \right\rbrace$$
where $a > 2$.
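
The piecewise definition transcribes directly into an R function; a minimal sketch (the name scad_penalty is ours, and a = 3.7 is the value suggested by Fan and Li):

scad_penalty <- function(beta, lambda, a = 3.7) {
  b <- abs(beta)
  ifelse(b <= lambda,
         lambda * b,                                ## L1 part near zero
         ifelse(b <= a * lambda,
                -(b^2 - 2 * a * lambda * b + lambda^2) /
                  (2 * (a - 1)),                    ## quadratic transition
                (a + 1) * lambda^2 / 2))            ## constant tail
}

## the penalty is flat in the tails, so large coefficients are not over-shrunk
curve(scad_penalty(x, lambda = 1), from = -5, to = 5, ylab = "SCAD penalty")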

Elastic Net (Enet)

  • EN was the earliest method capable of group variable selection (Wang et al., 2015)
  • Its penalty function is a linear combination of the LASSO and ridge penalties
  • A group variable selection method for highly correlated data
  • Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.
  • 王小燕,谢邦昌,马双鸽,方匡南.高维数据下群组变量选择的惩罚方法综述[J].数理统计与管理,2015,34(06):978-988.

Objective function:
\begin{aligned}
L(\lambda, \beta) &= \parallel y-X\beta \parallel_2^2 + \lambda_2 \parallel \beta\parallel_2^2 + \lambda_1 \parallel \beta \parallel_1 \\
&= \parallel y-X\beta \parallel_2^2 + \lambda_2 \sum_{j=1}^p \beta_j^2 + \lambda_1 \sum_{j=1}^p |\beta_j| \\
\end{aligned}

\begin{aligned}
\hat{\beta} &= \arg\min_{\beta} \lbrace L(\lambda, \beta) \rbrace \\
&= \arg\min_{\beta} \parallel y-X\beta \parallel_2^2, \qquad \mbox{subject to } (1-\alpha)\parallel\beta\parallel_1+\alpha\parallel \beta\parallel^2_2 \leq t \mbox{ for some } t.\\
\end{aligned}

  • When $\alpha = 1$, the naive elastic net reduces to simple ridge regression
  • When $\alpha = 0$, the naive elastic net reduces to the LASSO

The elastic net estimator can be viewed as the Bayes posterior mode under the following prior (intermediate between a Gaussian prior and a Laplacian prior):
$$p_{\lambda, \alpha}(\beta) = C(\lambda, \alpha) \exp\lbrace -\lambda \left[ \alpha \parallel\beta\parallel_2^2 + (1-\alpha)\parallel\beta\parallel_1 \right] \rbrace$$
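
In glmnet the elastic net is fitted by choosing a mixing parameter strictly between 0 and 1; note that glmnet's alpha convention (alpha = 1 is the LASSO, alpha = 0 is ridge) is the reverse of the naive-elastic-net constraint written above. A minimal sketch, reusing X and y from the ridge example:

library(glmnet)
## alpha = 0.5 mixes the L1 and L2 penalties equally (illustrative choice)
enet_cv <- cv.glmnet(X, y, alpha = 0.5, nfolds = 10)
coef(enet_cv, s = "lambda.min")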

EN vs Bridge

  • Bridge regression with $1< \gamma < 2$ shares many similarities with the elastic net
  • The elastic net can produce sparse solutions, whereas such bridge regression cannot

Fan and Li (2001): within the $L_q(q\geq 1)$ penalty family, only the LASSO penalty ($q=1$) produces sparse solutions

Adaptive Elastic Net

Combines the adaptive LASSO with an $L_2$ penalty.

MCP

MCP (minimax concave penalty)

  • Used for the analysis of a single dataset

The MCP is defined as:
$$
\begin{aligned}
\rho(t;\lambda,\gamma) &= \lambda \int_{0}^{|t|} \left( 1-\frac{x}{\gamma \lambda} \right)_{+} \mathrm{d}x \\
&= \lambda \left( |t| - \frac{t^2}{2\gamma\lambda} \right)I(|t| < \gamma\lambda) + \frac{\gamma\lambda^2}{2}I(|t| \geq \gamma\lambda) \\
\end{aligned}
$$

$$ P_{MCP}(t;\lambda, \gamma) =
\left\lbrace
\begin{array}{ll}
\lambda|t| - \frac{t^2}{2\gamma}, & |t| < \gamma\lambda \\
\frac{\gamma\lambda^2}{2}, & |t| \geq \gamma \lambda \\
\end{array}
\right.
$$

  • The regularization parameter $\gamma > 0$ controls the concavity of $\rho(\cdot)$
  • $\lambda$ is the penalty parameter
  • $x_{+} = xI(x\geq 0)$

Its derivative is:
$$\rho^\prime(t;\lambda,\gamma) = \lambda\left( 1-\frac{|t|}{\gamma \lambda} \right)_{+}\mathrm{sgn}(t)$$

$$
P_{MCP}^{\prime}(t;\lambda,\gamma) = \left\lbrace
\begin{array}{ll}
\lambda - \frac{|t|}{\gamma}, & |t| < \gamma\lambda \\
0, & |t| \geq \gamma \lambda \\
\end{array}
\right.
$$

where
$$\mathrm{sgn}(t) = \left\lbrace
\begin{array}{ll}
-1, & t < 0 \\
0, & t = 0 \\
1, & t > 0
\end{array}
\right.
$$

  • As $\gamma \rightarrow +\infty$, the MCP penalty approaches the LASSO penalty
  • As $\gamma \rightarrow 1^{+}$, the MCP penalty approaches the hard-thresholding penalty (a fitting sketch with ncvreg follows below)
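
MCP-penalized regression is available in the ncvreg package; a minimal sketch, reusing X and y from the ridge example (gamma is the concavity parameter discussed above; ncvreg's default for MCP is 3):

library(ncvreg)
yv <- drop(y)  ## ncvreg expects y as a numeric vector
mcp_cv <- cv.ncvreg(X, yv, penalty = "MCP", gamma = 3)
coef(mcp_cv)   ## coefficients at the CV-selected lambda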

Group Variable Selection

MNET

  • A group variable selection method for highly correlated data

$L_2$ SCAD

  • Combines the SCAD penalty with ridge regression
  • A group variable selection method for highly correlated data

Group LASSO

  • Selects variables only at the group level

Suppose the design matrix $X$ is partitioned into $J$ groups $X_1, X_2, \cdots, X_J$, and let $d_j$ denote the size of the $j$-th group, so that $\sum_{j=1}^Jd_j = p$.

Objective function:
\begin{equation}
\begin{aligned}
Q(\beta | X,y) &= L(\beta|X,y) + \sum_{j=1}^J \lambda_j \parallel \beta_j \parallel_{K_j} \qquad (\lambda \geq 0) \\
&= \parallel y - \sum_{j=1}^JX_j\beta_j \parallel_2^2 + \lambda \sum_{j=1}^J \sqrt{d_j} \parallel \beta_j \parallel_{K_j} \\
\end{aligned}
\end{equation}

  • $\parallel z \parallel_{K_j} = \left( z^TK_jz \right)^{1/2}$
  • To penalize large and small groups comparably, one can take $\lambda_j = \lambda \sqrt{d_j}$
  • When $d_j=1 (1\leq j\leq J)$, the group LASSO reduces to the standard LASSO, and $R_j=\frac{1}{n}\parallel X_j\parallel^2$ is proportional to the sample variance of $X_j$
  • Kim et al. (2006): applied the group LASSO to logistic models
  • Meier et al. (2008): applied the group LASSO to logistic models
  • Huang et al. (2009): the group LASSO behaves like adaptively weighted ridge regression
  • Wei and Huang (2010): group LASSO selection is not consistent and tends to select unimportant groups

How to choose $K_j$

For orthonormal $X_j$, we have
$$\frac{1}{n}X_j^TX_j = I_{d_j}, \qquad j=1, 2, \cdots, J.$$

Computation

  • Yuan and Lin (2006): compute the group LASSO solution via the group coordinate descent algorithm (a grpreg sketch follows after the formula below)

Let $z=\frac{1}{n}X_j^T y$ be the least squares solution of $y=X\beta+\varepsilon$. Then
$$\hat{\beta}_{LASSO}(z;\lambda) = S(z, \lambda) = \left( 1-\frac{\lambda}{\parallel z \parallel_2} \right)_{+} z$$
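
In practice the group LASSO can be fitted with the grpreg package; a minimal sketch on simulated data (the grouping of 9 predictors into 3 groups is illustrative):

library(grpreg)
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 9), n, 9)
group <- rep(1:3, each = 3)                      ## 9 predictors in 3 groups
y <- drop(X[, 1:3] %*% c(1, 1, -1)) + rnorm(n)   ## only group 1 is active
glasso_cv <- cv.grpreg(X, y, group, penalty = "grLasso")
coef(glasso_cv)  ## coefficients enter or leave in whole groups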

Homogeneous Integrative Analysis

The group LASSO can be applied to homogeneous integrative analysis.

Zhang, Q. , Zhang, S. , Liu, J. , Huang, J. , & Ma, S. . (2015). Penalized integrative analysis under the accelerated failure time model. Statistica Sinica, 26(2).

Objective function:
\begin{aligned}
L(\beta) &= \frac{1}{2n}\parallel y-X\beta \parallel_2^2 + \lambda \sum_{j=1}^p\parallel \beta_j \parallel_2 \\
&= \frac{1}{2n}\parallel y-X\beta \parallel_2^2 + \lambda \sum_{j=1}^p \left[ \sum_{k=1}^M (\beta_j^k)^2 \right]^{1/2} \\
\end{aligned}

  • $\parallel \beta_j \parallel_2 = \left[ \sum_{k=1}^M (\beta_j^k)^2 \right]^{1/2}$
  • Ma et al. (2015):
    • variable selection in the homogeneous model is whole-group selection
    • penalized integrative analysis of homogeneous data follows the same idea as group variable selection in a single dataset
  • Zhang et al. (2015): under certain conditions, the group LASSO, $L_2$ group SCAD, and $L_2$ group MCP achieve selection consistency

Adaptive Group LASSO

Wang, H. , & Leng, C. . (2008). A note on adaptive group lasso. Computational Statistics & Data Analysis, 52(12), 5277-5286.

Group SCAD

Lifeng Wang, Guang Chen, Hongzhe Li, Group SCAD regression analysis for microarray time course gene expression data, Bioinformatics, Volume 23, Issue 12, 15 June 2007, Pages 1486–1494

$L_2$ Group Bridge

  • Used for integrative analysis of homogeneous data

\begin{aligned}
\mbox{within-group selection} &: \mbox{Ridge penalty} \\
\mbox{group selection} &: \mbox{Bridge penalty} \\
\end{aligned}

Consider sample data from $M$ studies (each containing the same $p$ explanatory variables). The penalty on the coefficients $\beta_j$ of the $j$-th variable is:
\begin{aligned}
J(\beta) &= \lambda \sum_{j=1}^p \parallel \beta_j \parallel_2^{\gamma} \\
&= \lambda \sum_{j=1}^p \left[ \sum_{k=1}^M (\beta_j^k)^2 \right]^{\gamma/2} \\
\end{aligned}

  • $\parallel \beta_j \parallel_2 = \left[ (\beta_j^1 )^2 + (\beta_j^2 )^2 + \cdots + (\beta_j^M )^2 \right]^{1/2}$
  • $0 < \gamma < 1$ is a fixed bridge index
  • All regression coefficients of the same variable across the different studies are treated as one group
  • When $\gamma = 1$, the $L_2$ group bridge reduces to the group LASSO
  • The $L_2$ group bridge achieves selection consistency

Zhang et al. (2015): under certain conditions, the group LASSO, $L_2$ group SCAD, and $L_2$ group MCP achieve selection consistency

$L_2$ Group MCP

  • Can be used for group variable selection
  • Can be used for integrative analysis of homogeneous data

\begin{aligned}
\mbox{within-group selection} &: \mbox{Ridge penalty} \\
\mbox{group selection} &: \mbox{MCP} \\
\end{aligned}

Ma et al. (2011): first applied the $L_2$ group MCP to integrative analysis

CAP

  • Selects variables only at the group level

Bi-Level Variable Selection

Composite Penalization

Composite penalization applied to heterogeneous integrative analysis:
$$J(\beta) = \sum_{j=1}^p p_{O, \lambda_{O} }\left( \sum_{k=1}^M p_{I, \lambda_{I} }(|\beta_j^k|) \right)$$

$L_1$ Group Bridge

  • The earliest bi-level variable selection method
  • Huang, J., Ma, S., Xie, H., & Zhang, C. H. (2009). A group bridge approach for variable selection. Biometrika, 96(2), 339–355.
  • 王小燕,谢邦昌,马双鸽,方匡南.高维数据下群组变量选择的惩罚方法综述[J].数理统计与管理,2015,34(06):978-988.

\begin{aligned}
\mbox{within-group selection} &: \mbox{LASSO penalty} \\
\mbox{group selection} &: \mbox{Bridge penalty} \\
\end{aligned}

Objective function:
$$Q(\beta|X, y) = \parallel y-\sum_{k=1}^p x_k \beta_k \parallel_2^2 + \lambda_n \sum_{j=1}^J c_j\parallel \beta_{A_j} \parallel_1^{\gamma}, \qquad \lambda_n > 0$$

  • $A_j(j=1,2,\cdots,J)$ is an arbitrary subset of $\lbrace1, 2, \cdots, p\rbrace$
  • The $A_j(j=1,2,\cdots,J)$ may overlap
  • $\cup_{j=1}^J A_j$ is allowed to be a proper subset of $\lbrace1, 2, \cdots, p\rbrace$; variables not in $\cup_{j=1}^J A_j$ are unpenalized
  • The bridge penalty is applied to the $L_1$ norm of each group's coefficients
  • When $\mid A_j\mid = 1(j=1,2,\cdots,J)$, the group bridge reduces to the standard bridge
  • When $\gamma=1$, the group bridge reduces to the standard LASSO and performs only individual variable selection
  • When $0 < \gamma < 1$, the group bridge can perform group-level and individual-level selection simultaneously (see the sketch after this list)
  • The objective function is nonconvex and is non-differentiable at $\beta_j=0$
  • Huang et al. (2009)
    • first proposed the $L_1$ group bridge
    • showed that when $p\rightarrow \infty, n\rightarrow \infty$ with $p < n$, under certain regularity conditions the $L_1$ group bridge ($0 < \gamma < 1$) enjoys the group oracle property
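
The grpreg package also implements the group bridge via its gBridge() function; a minimal sketch, reusing X, y, and group from the group LASSO sketch above (gamma = 0.5 and the lambda value are illustrative):

library(grpreg)
gb <- gBridge(X, y, group, gamma = 0.5)
## bi-level selection: zeros appear both for whole groups and for
## individual coefficients inside groups that are kept
coef(gb, lambda = 0.05)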

$L_1$ Group MCP

Liu, J., Huang, J., & Ma, S. (2014). Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization. Scandinavian journal of statistics, theory and applications, 41(1), 87–103.

\begin{aligned}
\mbox{inner penalty} &: \mbox{LASSO} \\
\mbox{outer penalty} &: \mbox{MCP} \\
\end{aligned}

Fan and Li (2001): in single-dataset variable selection, the LASSO tends to select too many variables and, in theory, does not enjoy the oracle property

Composite MCP

\begin{aligned}
\mbox{inner penalty} &: \mbox{MCP} \\
\mbox{outer penalty} &: \mbox{MCP} \\
\end{aligned}

  • Zhang et al. (2015): under certain conditions, composite MCP achieves selection consistency both within and between groups, whereas the $L_1$ group MCP achieves only group-level selection consistency

Sparse Group Penalization

The sparse group penalty is a linear combination of two penalty functions: one performing group selection, the other individual variable selection (Ma et al., 2015).

  • Applicable to heterogeneous integrative analysis

General form of the penalty function:
$$P(\beta;\lambda_1,\lambda_2) = \lambda_1 \sum_{j=1}^p P_1(\parallel\beta_j\parallel) + \lambda_2\sum_{j=1}^{p}\sum_{k=1}^{M} P_2(|\beta_j^k|)$$

SGL-Sparse Group LASSO

SGL (Sparse Group LASSO) is a linear combination of the LASSO and the group LASSO.

Penalty function:
$$P_{SGL}(\beta;\lambda_1,\lambda_2) = \lambda_1\sum_{j=1}^J \parallel\beta_j\parallel + \lambda_2 \parallel\beta\parallel_1$$

$$\hat{\beta} = \arg \min_{\beta} \left\lbrace \parallel y - \sum_{j=1}^J \mathbf{X_j} \beta_j \parallel^2_2 + \lambda_1 \sum_{j=1}^J \sqrt{p_j} \parallel \beta_j \parallel_2 + \lambda_2 \parallel \beta \parallel_1 \right\rbrace$$

  • $\mathbf{X_j}$ is the $n\times p_j$ sample matrix formed by the $j$-th group of variables
  • $\sum_{j=1}^J p_j = p$
  • When $\lambda_2 = 0$, the sparse group LASSO reduces to the group LASSO (see the sketch below)
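
The SGL package fits the sparse group LASSO; a minimal sketch assuming its standard interface, reusing X, y, and group from the group LASSO sketch (alpha balances the LASSO and group LASSO terms):

library(SGL)
sgl_data <- list(x = X, y = y)
fit <- SGL(sgl_data, index = group, type = "linear", alpha = 0.5)
fit$beta  ## coefficient estimates across the lambda grid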

adSGL-Adaptive Sparse Group LASSO

  • Combines the adaptive LASSO and the adaptive group LASSO
  • Can be viewed as an improved version of SGL (Sparse Group LASSO)
  • Uses data-driven weights to improve prediction

Penalty function:
$$P_{adSGL}(\beta;\lambda_1,\lambda_2) = \lambda_1 \sum_{j=1}^J w_j \parallel \beta_j \parallel_2 + \lambda_2 \sum_{j=1}^p \xi_j \mid \beta_j \mid $$

Sparse Group MCP

Network-based Penalization

SGLS-Sparse Group Laplacian Shrinkage

  • Used for integrative analysis of multi-source data

Liu, J., Huang, J., & Ma, S. (2013). Incorporating network structure in integrative analysis of cancer prognosis data. Genetic epidemiology, 37(2), 173–183.

$$\hat{\beta} = \arg \min_{\beta} \left\lbrace \frac{1}{n}L(\beta) + P_{\lambda, \gamma}(\beta) \right\rbrace$$
where
$$P_{\lambda, \gamma}(\beta) = \sum_{j=1}^p \rho(\parallel\beta_j\parallel_2; \sqrt{M_j}\lambda_1, \gamma) + \frac{1}{2}\lambda_2 d\sum_{1\leq j < k \leq p}a_{jk}\left( \frac{\parallel\beta_j\parallel_2}{\sqrt{M_j} } - \frac{\parallel\beta_k\parallel_2}{\sqrt{M_k} } \right)^2$$

  • $\lambda = (\lambda_1, \lambda_2)$
  • $\lambda_1 \geq 0$ and $\lambda_2 \geq 0$ are tuning parameters
  • $\gamma$ is the regularization parameter
  • $\rho(\cdot)$ is the MCP penalty function
  • $M_j$ is the length (size) of $\beta_j$

Summary

Individual variable selection methods

| Method | Penalty function | Parameters | Advantages | Drawbacks |
| --- | --- | --- | --- | --- |
| LASSO | $L_1$ | $\lambda \geq 0$ | Continuous and stable; can reduce the dimension of high-dimensional data | Cannot perform group selection; lacks the oracle property |
| SCAD | SCAD | $a > 2, \lambda > 0$ | Inherits the advantages of the LASSO; has the oracle property | Cannot handle $p\gg n$ data |
| Bridge | | | | |
| MCP | | | | |