It is worth noting that logistic regression solves a supervised classification problem, not a regression problem.
The difference between classification and regression lies in the output: a classification problem outputs a discrete variable, e.g. predicting whether a person is ill, with only two possible outcomes (ill or not ill), whereas a regression problem outputs a continuous variable, e.g. predicting a person's salary in five years, which can be any value within a real interval.
In fact, apart from the form of the output, logistic regression is very similar to multiple linear regression; both belong to the same family, the generalized linear models. Their model forms are therefore essentially the same and can be written simply as $y=\Theta^{T}X$. The difference is that logistic regression passes this linear score through a sigmoid activation function, turning the regression problem into a classification problem.
Sigmoid activation function:
$$g(z)=\frac{1}{1+e^{-z}}$$
The curve of the sigmoid activation function is shown below:
As the curve shows, $z$ can take any real value while $g(z)$ always lies between 0 and 1, so the output can be interpreted as a probability, which is what turns the regression problem into a classification problem.
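As a concrete illustration (a minimal NumPy sketch, not part of the original derivation; the function name `sigmoid` and the sample inputs are made up):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The output can be read as a probability, which is what turns the
# linear score theta^T x into a classifier.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```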
First derivative of the sigmoid function:
$$g'(z)=\frac{e^{-z}}{(1+e^{-z})^{2}}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=\frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right)=g(z)\bigl(1-g(z)\bigr)$$
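To sanity-check the identity $g'(z)=g(z)(1-g(z))$, it can be compared against a finite-difference approximation (a small sketch; the evaluation point $z=0.7$ and step size are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Compare the analytic derivative g(z)*(1 - g(z)) with a central
# finite-difference approximation at an arbitrary point.
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)  # both ~0.2217
```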
Model:
$$h_\theta(x)=g(\Theta^{T}x)=\frac{1}{1+e^{-\Theta^{T}x}}$$
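A minimal sketch of this hypothesis, assuming made-up values for $\theta$ and $x$, with the intercept folded into the feature vector as a leading 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """h_theta(x) = g(theta^T x): the model's estimate of P(y=1 | x)."""
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0, 2.0])  # illustrative parameters (made up)
x = np.array([1.0, 0.3, 0.8])       # illustrative features, leading 1 = intercept
print(h(theta, x))                  # ~0.858, a probability in (0, 1)
```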
Therefore:
$$\begin{cases} P(y=1\mid x,\Theta)=h_\theta(x) \\ P(y=0\mid x,\Theta)=1-h_\theta(x) \end{cases}$$
The two expressions above can be combined into one:
$$P(y\mid x,\Theta)=h_\theta(x)^{y}\bigl(1-h_\theta(x)\bigr)^{1-y}$$
Assuming the $m$ training samples are independent, the likelihood is:
$$L(\Theta)=P(\vec{y}\mid X,\Theta)=\prod_{i=1}^{m}P(y^{(i)}\mid x^{(i)},\Theta)=\prod_{i=1}^{m}\bigl(h_\theta(x^{(i)})\bigr)^{y^{(i)}}\bigl(1-h_\theta(x^{(i)})\bigr)^{1-y^{(i)}}$$
Taking the logarithm of both sides gives:
$$\log L(\Theta)=\sum_{i=1}^{m}\Bigl[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr]$$
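The sum form can be checked against the product form $L(\Theta)$ directly; below is a small sketch on made-up toy data ($X$, $y$, and $\theta$ are illustrative values only, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7]])  # made-up toy data
y = np.array([0, 1, 0])
theta = np.array([-0.3, 1.2])

h = sigmoid(X @ theta)                                 # h_theta(x^(i)) for every sample
likelihood = np.prod(h**y * (1 - h)**(1 - y))          # product over the m samples
log_likelihood = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
print(np.isclose(log_likelihood, np.log(likelihood)))  # True
```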
Differentiating with respect to $\theta_j$ (written for a single training sample, dropping the superscript $(i)$ for brevity):
$$\frac{\partial \log L(\Theta)}{\partial \theta_j}=\left(y\,\frac{1}{h_\theta(x)}-(1-y)\,\frac{1}{1-h_\theta(x)}\right)\frac{\partial}{\partial \theta_j}h_\theta(x)$$
Substituting $h_\theta(x)=g(\Theta^{T}x)$ into the expression above gives:
$$\frac{\partial \log L(\Theta)}{\partial \theta_j}=\left(y\,\frac{1}{g(\theta^{T}x)}-(1-y)\,\frac{1}{1-g(\theta^{T}x)}\right)\frac{\partial}{\partial \theta_j}g(\theta^{T}x)$$
Since $g'(z)=g(z)\bigl(1-g(z)\bigr)$, the chain rule simplifies this to:
$$\frac{\partial \log L(\Theta)}{\partial \theta_j}=\left(y\,\frac{1}{g(\theta^{T}x)}-(1-y)\,\frac{1}{1-g(\theta^{T}x)}\right)g(\theta^{T}x)\bigl(1-g(\theta^{T}x)\bigr)\frac{\partial}{\partial \theta_j}\theta^{T}x$$
$$\frac{\partial}{\partial \theta_j}\theta^{T}x=x_j$$
Therefore:
$$\frac{\partial \log L(\Theta)}{\partial \theta_j}=\bigl[y\bigl(1-g(\theta^{T}x)\bigr)-(1-y)g(\theta^{T}x)\bigr]x_j=\bigl[y-y\,g(\theta^{T}x)-g(\theta^{T}x)+y\,g(\theta^{T}x)\bigr]x_j$$
$$\frac{\partial \log L(\Theta)}{\partial \theta_j}=\bigl(y-g(\theta^{T}x)\bigr)x_j=\bigl(y-h_\theta(x)\bigr)x_j$$
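The per-sample gradient $(y-h_\theta(x))\,x_j$ is easy to compute in vector form; a small sketch with made-up values for $\theta$, $x$, and $y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of log L(theta) contributed by one sample (x, y):
# the j-th component is (y - h_theta(x)) * x_j, so the full vector is (y - h) * x.
theta = np.array([-0.3, 1.2])  # made-up values
x = np.array([1.0, 1.5])
y = 1
h = sigmoid(theta @ x)
grad = (y - h) * x
print(grad)  # one partial derivative per theta_j
```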
The generic gradient descent update is $\theta=\theta-\lambda\Delta$, where $\lambda$ is the learning rate and $\Delta$ is the gradient. Since we want to maximize $\log L(\Theta)$, we descend on the negative log-likelihood, whose partial derivative is $\bigl(h_\theta(x)-y\bigr)x_j$, i.e. the negative of the expression derived above.
Therefore the stochastic gradient descent update for $\theta_j$, using a single sample $(x^{(i)},y^{(i)})$, is:
$$\theta_j := \theta_j-\lambda\bigl(g(\theta^{T}x^{(i)})-y^{(i)}\bigr)x_{j}^{(i)}=\theta_j-\lambda\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_{j}^{(i)}$$
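Putting the update rule into a training loop gives a minimal stochastic gradient descent sketch (the helper name `sgd_logistic`, the toy data, and the hyperparameters are all assumptions for illustration, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lam=0.1, epochs=100, seed=0):
    """Stochastic gradient descent for logistic regression.

    Each step applies theta_j <- theta_j - lam * (h_theta(x) - y) * x_j
    using a single randomly chosen sample.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            h = sigmoid(theta @ X[i])
            theta -= lam * (h - y[i]) * X[i]
    return theta

# Toy, linearly separable data (made up): the first column is the intercept term.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = sgd_logistic(X, y)
print(theta, sigmoid(X @ theta))  # predicted probabilities approach the labels
```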
This completes the full derivation of the gradient descent algorithm for logistic regression.