Linear Regression
There is a linear relationship between the independent variable $x$ and the dependent variable $y$.
Variable aliases
Independent variable | Dependent variable |
---|---|
input variable | outcome variable |
explanatory variable | |
regressor | regressand |
independent variable | dependent variable |
feature | target |
predictor variable | |
exogenous variable | endogenous variable |
| | criterion variable |
Uses
- To identify the strength of the effect that the independent variable(s) have on a dependent variable
- To forecast effects or the impact of changes
- To predict trends and future values
Types
The types of linear regression include:
- Simple linear regression
| | dependent variable | independent variable |
---|---|---|
amount | 1 | 1 |
type | interval / ratio | interval / ratio / dichotomous[^1] |
- Multiple linear regression
| | dependent variable | independent variable |
---|---|---|
amount | 1 | 2+ |
type | interval / ratio | interval / ratio / dichotomous |
- Logistic regression
| | dependent variable | independent variable |
---|---|---|
amount | 1 | 2+ |
type | dichotomous | interval / ratio / dichotomous |
- Ordinal regression
| | dependent variable | independent variable |
---|---|---|
amount | 1 | 1+ |
type | ordinal[^2] | nominal[^3] or dichotomous |
[^2]: ordinal: ordered categories (ordinal scale)
[^3]: nominal: named categories with no inherent order (nominal scale)
- Multinomial regression
| | dependent variable | independent variable |
---|---|---|
amount | 1 | 1+ |
type | nominal | interval / ratio / dichotomous |
- Discriminant analysis
| | dependent variable | independent variable |
---|---|---|
amount | 1 | 1+ |
type | nominal | interval / ratio |
Basic model
$$y=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p+\varepsilon$$
- Dependent variable (regressand): $y$
- Independent variables (regressors): $x_1,\cdots,x_p$
  - treated as deterministic variables, not random variables
- Random error term: $\varepsilon$
- Intercept: $\beta_0$
- Regression coefficients: $\beta_1,\cdots,\beta_p$
Consider $n$ samples and $p$ independent variables $(\mathbf{X}_1,\mathbf{X}_2,\cdots, \mathbf{X}_p)$:
\begin{equation}
\textbf{y}=\textbf{X} \mathbf{\beta} + \mathbf{\varepsilon}, \qquad \mathbf{\varepsilon} \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})
\end{equation}
where
\begin{equation}
\mathbf{X} = (\mathbf{1}, \mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_p) =
\left(
\begin{array}{cccc}
1 & X_{11} & \cdots & X_{1p} \\
1 & X_{21} & \cdots & X_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & \cdots & X_{np}
\end{array}
\right)
\end{equation}
$$
\mathbf{\beta} =
\left(
\begin{array}{c}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_p
\end{array}
\right)
,\qquad \mathbf{\varepsilon}=
\left(
\begin{array}{c}
\varepsilon_1 \\
\varepsilon_2 \\
\vdots \\
\varepsilon_n
\end{array}
\right)
$$
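As a concrete illustration of the matrix form above, here is a minimal NumPy sketch that builds a design matrix with a leading column of ones and simulates $\mathbf{y}=\mathbf{X}\mathbf{\beta}+\mathbf{\varepsilon}$; the values of $n$, $p$, $\beta$, and $\sigma$ are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                  # sample size and number of regressors (illustrative)

X_cols = rng.normal(size=(n, p))               # the regressor columns X_1, ..., X_p
X = np.column_stack([np.ones(n), X_cols])      # design matrix with the intercept column, shape (n, p+1)

beta = np.array([1.0, 2.0, -0.5, 0.3])         # (beta_0, beta_1, ..., beta_p), chosen arbitrarily
eps = rng.normal(scale=1.0, size=n)            # epsilon ~ N(0, sigma^2 I) with sigma = 1
y = X @ beta + eps                             # y = X beta + epsilon
```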
Basic assumptions
- The independent variables and the dependent variable are linearly related
- The independent variables are mutually independent
  - There is no (perfect) linear relationship among the explanatory variables, i.e. the design matrix defined above has full column rank:
    $$\mathrm{rank}(\mathbf{X})=p+1$$
  - If this does not hold, the model suffers from multicollinearity
- The random error terms are mutually independent
  - If this does not hold, the model exhibits autocorrelation
- $\varepsilon_i \sim \text{i.i.d. } N(0,\sigma^2)$
  - The random errors follow a normal distribution with zero mean and constant variance
  - The random errors are independent and identically distributed (i.i.d.)
  - The random errors have equal variance (homoskedasticity); if this does not hold, heteroskedasticity is present
- The independent variables and the error term are independent of each other
Parameter estimation
Linear regression:
$$h_\beta(x)=\sum_{j=0}^p\beta_jx_j$$
where $x_0=1$.
Cost function:
$$ J( \beta ) = \frac{1}{2n} \sum_{i=1}^n \left( h_\beta(x^{(i)})- y^{(i)} \right)^2$$
Objective function:
$$\min_\beta J(\beta)$$
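As a sketch, the hypothesis $h_\beta(x)$ and cost $J(\beta)$ translate directly into NumPy; the design matrix is assumed to already contain the $x_0=1$ column.

```python
import numpy as np

def hypothesis(X, beta):
    """h_beta(x) = sum_j beta_j * x_j; X already includes the column of ones (x_0 = 1)."""
    return X @ beta

def cost(X, y, beta):
    """J(beta) = 1/(2n) * sum_i (h_beta(x_i) - y_i)^2."""
    n = len(y)
    r = hypothesis(X, beta) - y
    return r @ r / (2 * n)
```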
(Batch) Gradient Descent
Linear Regression with (Batch) Gradient Descent:
repeat until convergence $\lbrace$
$$ \beta_j := \beta_j - \alpha \frac{1}{n} \sum_{i=1}^n \left( h_\beta(x^{(i)} ) - y^{(i)} \right) x_j^{(i)} $$
for every $j=0,1,\cdots,p$ (updating all $\beta_j$ simultaneously).
$\rbrace$
where
$\alpha$ is the learning rate / step size:
- it determines, in each gradient descent iteration,
  - how far each step moves along the negative gradient direction (when minimizing the objective function)
  - how far each step moves along the positive gradient direction (when maximizing the objective function)
- if the learning rate is too small, many iterations are needed and learning takes a long time
- if the learning rate is too large, the iterations move too fast and the iterate can jump back and forth around the optimum (and may miss it)
- several trials are usually needed to pick a good learning rate / step size

learning rate, $\alpha$, basically controls how big step we take downhill with gradient descent. If $\alpha$ is large, then that corresponds to a very aggressive gradient descent procedure, where we're trying to take huge steps downhill. And if $\alpha$ is very small, then we're taking little, little baby steps downhill. — Andrew Ng, Machine Learning
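A minimal sketch of batch gradient descent for this cost function, assuming a NumPy design matrix `X` that already includes the intercept column; the default `alpha` and iteration count are illustrative only.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Minimize J(beta) with (batch) gradient descent.

    Every beta_j is updated simultaneously in each iteration, using the full sample.
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y) / n    # (1/n) * sum_i (h_beta(x_i) - y_i) * x_i, for all j at once
        beta = beta - alpha * grad         # step of length alpha along the negative gradient
    return beta
```

In practice the usable range of $\alpha$ depends on the scale of the features, which is one reason features are often normalized first (cf. the feature-normalization post recommended at the end of this article).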
Regularization
If the trained model is too complex, overfitting can easily occur.
Solutions:
- reduce the number of features
- shrink the feature weights, i.e. regularization
Regularization:
The original objective function is
$$\hat{\beta}=\arg\min_\beta \left\lbrace \sum_{i=1}^n( y_i - \sum_{j=0}^p \beta_j x_{ij})^2 \right\rbrace$$
Add a penalty term to the original objective function:
$$\hat{\beta}=\arg\min_\beta \left\lbrace \sum_{i=1}^n( y_i - \sum_{j=0}^p \beta_j x_{ij})^2 + \lambda g(\beta) \right\rbrace$$
Ridge regression
$$\hat{\beta}^{ridge} = \arg\min_\beta \left\lbrace \sum_{i=1}^n (y_i-\sum_{j=0}^p \beta_j x_{ij})^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\rbrace$$
- When the independent variables suffer from severe collinearity, ordinary least squares is no longer appropriate and ridge regression can be used instead.
Lasso
$$\hat{\beta}^{lasso}=\arg\min_\beta\left\lbrace \sum_{i=1}^n(y_i-\sum_{j=0}^p\beta_j x_{ij})^2+\lambda \sum_{j=1}^p|\beta_j|\right\rbrace$$
Elastic Net
$$\hat{\beta}^{EN}= \arg\min_\beta \left\lbrace \sum_{i=1}^n (y_i - \sum_{j=0}^p \beta_j x_{ij})^2 + \lambda_2 \sum_{j=1}^p \beta_j^2+\lambda_1 \sum_{j=1}^p | \beta_j | \right\rbrace$$
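For reference, the three penalized estimators above are available in scikit-learn; the sketch below is one possible way to fit them on synthetic data. Note that sklearn's `ElasticNet` is parameterized by an overall strength `alpha` and a mixing ratio `l1_ratio` rather than separate $\lambda_1$ and $\lambda_2$, and the data here is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# sklearn fits the intercept separately, so the penalty applies only to beta_1..beta_p,
# matching the formulas above; alpha plays the role of lambda.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(ridge.coef_)
print(lasso.coef_)   # lasso tends to drive some coefficients exactly to zero
print(enet.coef_)
```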
Residuals
$$e_i=y_i-\hat{y}_i$$
Correlation tests
DW test
Used to test whether the residuals are autocorrelated; a value near 2 suggests no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.
Durbin-Watson statistic:
$$DW=\frac{\sum_{i=2}^n (e_i-e_{i-1})^2 }{\sum_{i=1}^n e_i^2}$$
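The statistic is easy to compute directly from the residuals, and statsmodels also ships an implementation (`statsmodels.stats.stattools.durbin_watson`); the residuals below are synthetic, for illustration only.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

e = np.random.default_rng(0).normal(size=100)          # residuals e_i from a fitted model (synthetic here)

dw_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # the formula above: sum (e_i - e_{i-1})^2 / sum e_i^2
dw = durbin_watson(e)                                  # same statistic via statsmodels

print(dw_manual, dw)
```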
Residual tests
Sums of squares
1. Total sum of squares
   - also called the total deviation sum of squares
   - SST (Sum of Squares Total)
   - TSS (Total Sum of Squares)
$$SST=\sum_{i=1}^n (y_i-\bar{y})^2$$
2. Regression sum of squares
   - also called the explained sum of squares
   - SSR (Sum of Squares Regression)
   - ESS (Explained Sum of Squares)
$$SSR=\sum_{i=1}^n(\hat{y}_i-\bar{y})^2$$
3. Residual sum of squares
   - SSE (Sum of Squared Errors / Sum of Squares Error)
   - RSS (Residual Sum of Squares)
   - SSR (Sum of Squared Residuals); note that different texts use the abbreviation SSR for either the regression or the residual sum of squares — in this post SSR denotes the regression sum of squares and SSE the residual sum of squares
$$SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
SST=SSR+SSE
$$SST=SSR+SSE$$
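Before the algebraic proof below, a quick numerical sanity check of the identity on a simple least-squares fit; the data is synthetic, and `np.polyfit` is used only to obtain $\hat{\beta}_0,\hat{\beta}_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=50)

b1, b0 = np.polyfit(x, y, deg=1)   # least-squares slope and intercept
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
print(np.isclose(sst, ssr + sse))  # True: SST = SSR + SSE
```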
The simple linear regression case
$$y_i=\beta_0 +\beta_1 x_i+\varepsilon_i$$
\begin{aligned}
SST&=\sum_{i=1}^n(y_i-\bar{y})^2\\
&= \sum_{i=1}^n(y_i-\hat{y}_i+\hat{y}_i-\bar{y})^2\\
&= \sum_{i=1}^n(y_i-\hat{y}_i)^2 + 2\sum_{i=1}^n (y_i-\hat{y}_i)(\hat{y}_i-\bar{y})+ \sum_{i=1}^n(\hat{y}_i-\bar{y})^2\\
&= \sum_{i=1}^n(y_i-\hat{y}_i)^2 + \sum_{i=1}^n(\hat{y}_i-\bar{y})^2\\
&= SSE + SSR
\end{aligned}
We now show that
$$\sum_{i=1}^n (y_i-\hat{y}_i)(\hat{y}_i-\bar{y})=0$$
In the least squares method, $SSE$ is minimized:
$$(\hat{\beta}_0,\hat{\beta}_1)=\arg\min_{\beta_0,\beta_1}\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)^2$$
Taking the partial derivatives of $SSE$ with respect to $\beta_0$ and $\beta_1$ and setting them to zero:
\begin{aligned}
\frac{\partial{SSE} }{\partial{\beta_0} }&=\sum_{i=1}^n2(y_i-\beta_0-\beta_1x_i)(-1)=0\\
\frac{\partial{SSE} }{\partial{\beta_1} }&=\sum_{i=1}^n2(y_i-\beta_0-\beta_1x_i)(-x_i)=0
\end{aligned}
Hence
$$\sum_{i=1}^n(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)=0$$
$$\sum_{i=1}^n(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)x_i=0$$
Therefore
\begin{aligned}
\sum_{i=1}^n (y_i-\hat{y}_i)(\hat{y}_i-\bar{y})&=\sum_{i=1}^n(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)(\hat{\beta}_0+\hat{\beta}_1x_i-\bar{y})\\
&=(\hat{\beta}_0-\bar{y})\underline{\sum_{i=1}^n(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)}+\hat{\beta}_1\underline{\sum_{i=1}^n(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)x_i}\\
&=0
\end{aligned}

The multiple linear regression case
Consider $p$ independent variables $(\mathbf{X}_1,\mathbf{X}_2,\cdots, \mathbf{X}_p)$:
$$
\mathbf{y}=\mathbf{X}{\beta} + {\varepsilon}, \qquad {\varepsilon} \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})
$$
The least-squares estimate $\hat{\mathbf{\beta} }$ satisfies
$$\hat{\mathbf{\beta} }=\arg \min_{\mathbf{\beta} } \hat{\mathbf{\varepsilon} }^\prime \hat{\mathbf{\varepsilon} }=\arg \min_{\mathbf{\beta} } (\mathbf{y}-\mathbf{X}\mathbf{\beta})^\prime(\mathbf{y}-\mathbf{X}\mathbf{\beta}) $$
$$\frac{\partial{\hat{\varepsilon}^\prime \hat{\varepsilon} }}{\partial{\beta} }=-2\mathbf{X}^\prime \mathbf{y}+2\mathbf{X}^\prime \mathbf{X}\mathbf{\beta}=\mathbf{0}$$
$\Longrightarrow$
$$\hat{\beta}=(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime \mathbf{y}$$
Therefore
$$\hat{\mathbf{y} }=\mathbf{X}\hat{\mathbf{\beta} }=\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime \mathbf{y}\triangleq \mathbf{H}\mathbf{y}$$
where
$\mathbf{H}=\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X}^\prime$ is symmetric and idempotent, and $\mathbf{I}-\mathbf{H}$ is also symmetric and idempotent.
\begin{aligned}
SST&=(\textbf{y}-\bar{\textbf{y} })^\prime(\textbf{y}-\bar{\textbf{y} })\\
SSR&=(\hat{\textbf{y} }-\bar{\textbf{y} })^\prime(\hat{\textbf{y} }-\bar{\textbf{y} })\\
SSE&=(\textbf{y}-\hat{\textbf{y} })^\prime(\textbf{y}-\hat{\textbf{y} })
\end{aligned}
Define $\textbf{1}_n=(1,1,\cdots,1)^\prime$ as the $n$-dimensional column vector whose entries are all 1. Then the averaging operator is $\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}$ (the $n\times n$ matrix whose entries are all $\frac{1}{n}$), $\textbf{1}_n^\prime\textbf{1}_n=n$, and
$$\left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)^\prime = \left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)$$
$$\left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)^\prime\left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)=\textbf{I}-\frac{\textbf{1}_n\underline{\textbf{1}_n^\prime\textbf{1}_n}\textbf{1}_n^\prime}{n^2}=\left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)$$
That is, $\left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)$ is a symmetric idempotent matrix. It follows that
\begin{aligned}
\textbf{y}-\bar{\textbf{y} }&=\left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)\textbf{y}\\
\hat{\mathbf{y} }-\bar{\mathbf{y} }&=\left(\mathbf{H}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)\mathbf{y}\\
\textbf{y}-\hat{\textbf{y} }&=(\mathbf{I}-\mathbf{H})\mathbf{y}\\
SST&=\textbf{y}^\prime\left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)^\prime \left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right) \textbf{y}=\textbf{y}^\prime \left(\textbf{I}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right) \textbf{y}\\
SSR&=\mathbf{y}^\prime\left(\mathbf{H}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)^\prime\left(\mathbf{H}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)\mathbf{y}=\mathbf{y}^\prime\left(\mathbf{H}-\frac{\textbf{1}_n\textbf{1}_n^\prime}{n}\right)\mathbf{y}\\
SSE&=\mathbf{y}^\prime (\mathbf{I}-\mathbf{H})^\prime (\mathbf{I}-\mathbf{H}) \mathbf{y}=\mathbf{y}^\prime (\mathbf{I}-\mathbf{H}) \mathbf{y}
\end{aligned}
Therefore $$SST=SSR+SSE$$
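The matrix identities above can also be checked numerically. The sketch below uses synthetic data and computes $\hat{\beta}$, the hat matrix $\mathbf{H}$, and the three sums of squares; the explicit inverse is kept only to mirror the formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept column
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y, solved without forming the inverse
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix H (explicit inverse, for illustration only)

print(np.allclose(H, H.T), np.allclose(H @ H, H))   # H is symmetric and idempotent

y_hat = H @ y
sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
print(np.isclose(sst, ssr + sse))                   # SST = SSR + SSE
```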
Comparison with other methods
Linear regression vs. logistic regression
Implementation code
Python
```python
import statsmodels.api as sm
```
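A minimal end-to-end sketch with statsmodels on synthetic data; `sm.add_constant` adds the intercept column and `sm.OLS(...).fit()` performs the least-squares fit described above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

X_const = sm.add_constant(X)        # prepend the column of ones (intercept)
results = sm.OLS(y, X_const).fit()  # ordinary least squares
print(results.summary())            # coefficients, R-squared, Durbin-Watson statistic, etc.
```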
Recommended reading: 《算法-特征归一化》 (feature normalization)
Interview questions
- State the assumptions in the linear regression model
  - the independent variables and the dependent variable are linearly related
  - the independent variables are mutually independent
  - the error terms are mutually independent
  - the error terms follow a normal distribution with zero mean and constant variance
  - the independent variables and the error term are independent of each other
- How to avoid overfitting in linear regression?
  - reduce the number of features
  - shrink the weights of the independent variables to control model complexity, i.e. apply regularization
- Explain gradient descent with respect to linear regression
- How to choose the value of the learning rate $\alpha$?
- How to choose the regularization parameter?
(To be continued)
References
[^1]: dichotomous: binary (taking one of two categories)