In the previous post, Single-Layer Perceptron: Principles and MATLAB Implementation (单层感知机原理及Matlab实现), we introduced the single-layer perceptron and saw that it can only handle linearly separable data. For data like that in the figure below (left: two concentric circles; right: the XOR problem), it keeps iterating and never converges.
For these two kinds of data, we can map the features from a low-dimensional space into a higher-dimensional one (for example with kernel functions), so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in the high-dimensional one and the positive and negative samples can be split apart. See the figure below (the second picture is from the watermelon book, Zhou Zhihua's Machine Learning):
We can also solve this problem with a BP (backpropagation) neural network. To make it easy to follow, the rest of the post walks through it step by step.
Suppose the input is a set of four samples in two-dimensional space:
$$[x] = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} \\ x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} \end{bmatrix}_{2\times4} = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \end{bmatrix}_{2\times4}$$

with the corresponding class labels
$$[y] = \begin{bmatrix} 0 & 1 & 1 & 0 \end{bmatrix}_{1\times4}$$
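For reference, a minimal MATLAB sketch of this toy dataset (the variable names here are only for illustration; the full script at the end of the post sets up the same data):

% XOR toy dataset: one sample per column, one feature per row
X = [0 0 1 1;    % x1
     0 1 0 1];   % x2
y = [0 1 1 0];   % class labels (XOR of x1 and x2)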
We know the single-layer perceptron cannot solve this classic XOR problem, so we stack multiple layers, as shown in the figure below:
(The network is drawn this way only to make the derivation below easier to follow.)
Why does the activation function have to be nonlinear? Imagine it were linear: no matter how many layers we stacked, the network would still only output a linear combination of its input:
$$\hat y = w_n^T(\cdots(w_2^T(w_1^T x + b_1) + b_2) + \cdots) + b_n = w^T x + b$$

This is equivalent to a single linear function, so the whole network collapses into a linear regression model and the hidden layers contribute nothing.
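As a quick numeric sanity check of this collapse (a standalone sketch with made-up random weights, not part of the script below):

% Composing two purely linear layers gives another affine map:
% w2'*(w1'*x + b1) + b2 == (w1*w2)'*x + (w2'*b1 + b2)
rng(0);
x  = rand(2,1);
w1 = rand(2,5);  b1 = rand(5,1);   % "hidden" layer with a linear activation
w2 = rand(5,3);  b2 = rand(3,1);   % "output" layer with a linear activation
two_layers = w2'*(w1'*x + b1) + b2;
w = w1*w2;  b = w2'*b1 + b2;       % the equivalent single layer
one_layer  = w'*x + b;
max(abs(two_layers - one_layer))   % ~0 up to floating-point error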
The effect of a nonlinear activation function, in contrast, can be pictured like this:
How should we choose the number of neurons and the number of hidden layers?
From Andrew Ng's course:
Next we compute the predictions with the forward propagation algorithm.
First compute the hidden layer: $h = w^T x + b$. We can treat the bias as a neuron whose value is fixed to 1 and which is connected through a weight $w_b$ (not drawn in the figure above). This amounts to augmenting the matrix $[x]$:
$$[x] = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} \\ x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} \end{bmatrix}_{3\times4} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \end{bmatrix}_{3\times4}$$

The weight matrix $w_1$ acting on $[x]$ can then be written as:
$$[w] = \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} & w_{1,5} \\ w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} & w_{2,5} \\ w_{3,1} & w_{3,2} & w_{3,3} & w_{3,4} & w_{3,5} \end{bmatrix}_{3\times5}$$

so the hidden layer's pre-activation input $[\bar h]$ is:
$$[\bar h] = w^T x = \begin{bmatrix} w_{1,1}\cdot 1 + w_{2,1}x_{1,1} + w_{3,1}x_{2,1} & \cdots & w_{1,1}\cdot 1 + w_{2,1}x_{1,4} + w_{3,1}x_{2,4} \\ \vdots & \ddots & \vdots \\ w_{1,5}\cdot 1 + w_{2,5}x_{1,1} + w_{3,5}x_{2,1} & \cdots & w_{1,5}\cdot 1 + w_{2,5}x_{1,4} + w_{3,5}x_{2,4} \end{bmatrix}_{5\times4} = \begin{bmatrix} \bar h_{1,1} & \bar h_{1,2} & \bar h_{1,3} & \bar h_{1,4} \\ \bar h_{2,1} & \bar h_{2,2} & \bar h_{2,3} & \bar h_{2,4} \\ \bar h_{3,1} & \bar h_{3,2} & \bar h_{3,3} & \bar h_{3,4} \\ \bar h_{4,1} & \bar h_{4,2} & \bar h_{4,3} & \bar h_{4,4} \\ \bar h_{5,1} & \bar h_{5,2} & \bar h_{5,3} & \bar h_{5,4} \end{bmatrix}_{5\times4}$$

with general entry $\bar h_{j,i} = w_{1,j}\cdot 1 + w_{2,j}x_{1,i} + w_{3,j}x_{2,i}$. Passing this through the activation function (for now the sigmoid, $\sigma(x) = \frac{1}{1+e^{-x}}$, despite its well-known drawbacks) gives $h$, the hidden layer's output:
$$h = \sigma(\bar h) = \begin{bmatrix} \sigma(\bar h_1) \\ \sigma(\bar h_2) \\ \sigma(\bar h_3) \\ \sigma(\bar h_4) \\ \sigma(\bar h_5) \end{bmatrix} = \begin{bmatrix} \sigma(\bar h_{1,1}) & \sigma(\bar h_{1,2}) & \sigma(\bar h_{1,3}) & \sigma(\bar h_{1,4}) \\ \sigma(\bar h_{2,1}) & \sigma(\bar h_{2,2}) & \sigma(\bar h_{2,3}) & \sigma(\bar h_{2,4}) \\ \sigma(\bar h_{3,1}) & \sigma(\bar h_{3,2}) & \sigma(\bar h_{3,3}) & \sigma(\bar h_{3,4}) \\ \sigma(\bar h_{4,1}) & \sigma(\bar h_{4,2}) & \sigma(\bar h_{4,3}) & \sigma(\bar h_{4,4}) \\ \sigma(\bar h_{5,1}) & \sigma(\bar h_{5,2}) & \sigma(\bar h_{5,3}) & \sigma(\bar h_{5,4}) \end{bmatrix}_{5\times4}$$
As in the first step, we augment the matrix $[h]$ with a bias row, giving:
$$[h] = \begin{bmatrix} 1 \\ h_1 \\ h_2 \\ h_3 \\ h_4 \\ h_5 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ \sigma(\bar h_{1,1}) & \sigma(\bar h_{1,2}) & \sigma(\bar h_{1,3}) & \sigma(\bar h_{1,4}) \\ \sigma(\bar h_{2,1}) & \sigma(\bar h_{2,2}) & \sigma(\bar h_{2,3}) & \sigma(\bar h_{2,4}) \\ \sigma(\bar h_{3,1}) & \sigma(\bar h_{3,2}) & \sigma(\bar h_{3,3}) & \sigma(\bar h_{3,4}) \\ \sigma(\bar h_{4,1}) & \sigma(\bar h_{4,2}) & \sigma(\bar h_{4,3}) & \sigma(\bar h_{4,4}) \\ \sigma(\bar h_{5,1}) & \sigma(\bar h_{5,2}) & \sigma(\bar h_{5,3}) & \sigma(\bar h_{5,4}) \end{bmatrix}_{6\times4}$$
Continuing the forward pass: in general the number of output nodes equals the number of classes, and since this is a binary classification problem we use two output nodes. The new weight matrix $w$ is:
$$[w] = \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \\ w_{3,1} & w_{3,2} \\ w_{4,1} & w_{4,2} \\ w_{5,1} & w_{5,2} \\ w_{6,1} & w_{6,2} \end{bmatrix}_{6\times2}$$
which produces the predicted labels $\hat y$ (we will not expand this one):
$$[\hat y] = \sigma(\bar y) = \sigma\!\left(w^T[h]\right) = \begin{bmatrix} \hat y_{1,1} & \hat y_{1,2} & \hat y_{1,3} & \hat y_{1,4} \\ \hat y_{2,1} & \hat y_{2,2} & \hat y_{2,3} & \hat y_{2,4} \end{bmatrix}_{2\times4}$$
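Putting the forward pass together, here is a compact sketch for the 2-5-2 network above with randomly initialized weights (sigmoid is a local helper defined in the sketch; the shapes follow the derivation, and the full script at the end implements the same idea for arbitrary layer sizes):

% Forward pass for the 2-5-2 network (sigmoid activations)
sigmoid = @(z) 1./(1 + exp(-z));
X  = [0 0 1 1; 0 1 0 1];     % 2x4 input samples
w1 = 0.5*randn(3,5);         % input (+bias row) -> hidden, 3x5
w2 = 0.5*randn(6,2);         % hidden (+bias row) -> output, 6x2
Xa    = [ones(1,4); X];      % 3x4, bias row of ones on top
h_bar = w1' * Xa;            % 5x4 hidden pre-activation
h     = sigmoid(h_bar);      % 5x4 hidden output
ha    = [ones(1,4); h];      % 6x4, augmented with a bias row
y_hat = sigmoid(w2' * ha);   % 2x4 predicted outputs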
We use the binary cross-entropy loss as the loss function:
$$L_{cross\_entropy} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log\hat y_i + (1-y_i)\log(1-\hat y_i)\right]$$
where the number of samples is $n = 4$ and $y_i$ is the true label of sample $i$.
$$\begin{aligned} L_{cross\_entropy} = -\frac{1}{4}\sum_{col}\sum_{row}\Bigg(&\begin{bmatrix} y_{1,1}\log\hat y_{1,1} & y_{1,2}\log\hat y_{1,2} & y_{1,3}\log\hat y_{1,3} & y_{1,4}\log\hat y_{1,4} \\ y_{2,1}\log\hat y_{2,1} & y_{2,2}\log\hat y_{2,2} & y_{2,3}\log\hat y_{2,3} & y_{2,4}\log\hat y_{2,4} \end{bmatrix} \\ +\ &\begin{bmatrix} (1-y_{1,1})\log(1-\hat y_{1,1}) & \cdots & (1-y_{1,4})\log(1-\hat y_{1,4}) \\ (1-y_{2,1})\log(1-\hat y_{2,1}) & \cdots & (1-y_{2,4})\log(1-\hat y_{2,4}) \end{bmatrix}\Bigg) \end{aligned}$$
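As a sketch, the same loss in MATLAB for 2×4 one-hot targets (the predictions here are made-up numbers; the small clip keeps log() away from 0 and 1, and the full script below uses a similar 1e-7 guard):

% Binary cross-entropy for one-hot targets Y (2x4) and predictions y_hat (2x4)
Y     = [1 0 0 1; 0 1 1 0];          % one-hot encoding of the labels [0 1 1 0]
y_hat = [0.8 0.3 0.2 0.7;            % example predictions (made up)
         0.2 0.7 0.8 0.3];
p = min(max(y_hat, 1e-12), 1 - 1e-12);
L = -sum(sum(Y.*log(p) + (1 - Y).*log(1 - p))) / size(Y, 2);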
So forward propagation is simply the network computing layer by layer from the first layer onward. Stacking layers and combining them with nonlinear activation functions (which bend the decision surface away from a straight line) is what gives the network the ability to handle linearly inseparable data.
Backpropagation can be seen as passing the gradient of the final loss backward layer by layer, tracing the error back through a computational graph (Computational Graph). Computational graphs are explained very well elsewhere (cs231n, Hung-yi Lee's ML course), so we will not repeat that here; the essential ingredient is just the chain rule for partial derivatives, which also shows up in the derivation below.
Recall the cross-entropy loss:
$$L_{cross\_entropy} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log\hat y_i + (1-y_i)\log(1-\hat y_i)\right]$$
We first compute the gradient of the loss with respect to the output layer's pre-activation input $\bar y_i$:
$$\begin{aligned} \frac{\partial L}{\partial \hat y_i} &= -\left( \frac{y_i}{\hat y_i} - \frac{1-y_i}{1-\hat y_i} \right) \\ \frac{\partial \hat y_i}{\partial \bar y_i} &= \frac{\partial}{\partial \bar y_i}\,\frac{1}{1+e^{-\bar y_i}} = \frac{e^{-\bar y_i}}{(1+e^{-\bar y_i})^2} = \hat y_i(1-\hat y_i) \\ \frac{\partial L}{\partial \bar y_i} &= \frac{\partial L}{\partial \hat y_i}\cdot\frac{\partial \hat y_i}{\partial \bar y_i} = -\left[ \frac{y_i}{\hat y_i} - \frac{1-y_i}{1-\hat y_i} \right]\hat y_i(1-\hat y_i) \\ &= -y_i(1-\hat y_i) + (1-y_i)\hat y_i \\ &= \hat y_i - y_i \end{aligned}$$
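If you want to convince yourself of the sigmoid-derivative step in the middle, here is a small standalone numeric check (nothing from the script below is needed):

% Numerical check of d/dz sigma(z) = sigma(z)*(1 - sigma(z))
sigmoid  = @(z) 1./(1 + exp(-z));
z  = linspace(-4, 4, 9);
dz = 1e-6;
numeric  = (sigmoid(z + dz) - sigmoid(z - dz)) / (2*dz);   % central difference
analytic = sigmoid(z).*(1 - sigmoid(z));
max(abs(numeric - analytic))                               % tiny, i.e. they agree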
For the output layer to hidden layer pair, and likewise for the hidden layer to input layer pair, the relation between the neurons of adjacent layers (O for a layer's output, I for its input) can be written as

$$O = \sigma(w^T I)$$
so the gradient of the loss with respect to a neuron of the earlier layer can be written as follows (where $\frac{\partial L}{\partial O}$ is already known from the layer above, and $O\odot(1-O)$ is the derivative of the sigmoid):

$$L_{◍} = \frac{\partial L}{\partial I} = w\cdot\left(O\odot(1-O)\odot\frac{\partial L}{\partial O}\right)$$
Take the hidden layer's output neurons and the input-layer neurons (◍) above as an example; here $h$ denotes the hidden activations used in the backward pass and $\frac{\partial L}{\partial h}$ is the gradient propagated back to them:
$$\begin{aligned} h &= \begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \\ h_5 \end{bmatrix}, \qquad h\odot(1-h) = \begin{bmatrix} h_1(1-h_1) \\ h_2(1-h_2) \\ h_3(1-h_3) \\ h_4(1-h_4) \\ h_5(1-h_5) \end{bmatrix} \\ L_{◍} &= \begin{bmatrix} L_{◍^1} \\ L_{◍^2} \end{bmatrix} = w\cdot\left( h\odot(1-h)\odot\frac{\partial L}{\partial h} \right) \\ &= \begin{bmatrix} w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} & w_{2,5} \\ w_{3,1} & w_{3,2} & w_{3,3} & w_{3,4} & w_{3,5} \end{bmatrix}_{2\times 5} \cdot \left( \begin{bmatrix} h_1(1-h_1) \\ h_2(1-h_2) \\ h_3(1-h_3) \\ h_4(1-h_4) \\ h_5(1-h_5) \end{bmatrix}_{5\times 1} \odot \frac{\partial L}{\partial h} \right) \end{aligned}$$
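The same computation for a single sample, as a hedged MATLAB sketch (all values are made-up placeholders; w(2:3,:) drops the bias row, exactly as in the formula above):

% Backpropagate the gradient from the 5 hidden neurons to the 2 input neurons
w     = 0.5*randn(3,5);                 % input (+bias) -> hidden weights
h     = rand(5,1);                      % hidden activations for one sample
dL_dh = randn(5,1);                     % gradient arriving from the output layer
L_in  = w(2:3,:) * (h.*(1-h).*dL_dh);   % 2x1 gradient at the input neurons (bias row dropped)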
You can verify why an element-wise product appears in one place and a matrix product in the other by following the connections between neurons in the figure (the blue, purple, and black lines).
From the step above, the gradient formulas for the remaining neurons ($L_{◍}$) follow in exactly the same way.
For chaining these gradients, we can also use the Jacobian matrix (Jacobian Matrix) formulation:
$$\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial g_1} & \frac{\partial y_1}{\partial g_2} & \cdots & \frac{\partial y_1}{\partial g_i} \\ \frac{\partial y_2}{\partial g_1} & \frac{\partial y_2}{\partial g_2} & \cdots & \frac{\partial y_2}{\partial g_i} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial g_1} & \frac{\partial y_m}{\partial g_2} & \cdots & \frac{\partial y_m}{\partial g_i} \end{bmatrix} \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \cdots & \frac{\partial g_1}{\partial x_n} \\ \frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \cdots & \frac{\partial g_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial g_i}{\partial x_1} & \frac{\partial g_i}{\partial x_2} & \cdots & \frac{\partial g_i}{\partial x_n} \end{bmatrix}$$
We have now obtained the gradient of the loss with respect to every neuron ($L_{◍}$). The next step is the gradient of the loss with respect to the weights $w$, which is what we need in order to update them.
Since a weight and the neuron it feeds in the next layer are related through that neuron's pre-activation,

$$\bar ◍ = w^T x,$$

the gradient of the loss with respect to the weights $w$ can be passed down from that neuron. Differentiating the expression above gives

$$\frac{\partial \bar ◍}{\partial w} = x, \qquad\text{so}\qquad \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \bar ◍}\cdot x.$$
Take the input layer as an example:
$$\begin{aligned} \left[\frac{\partial L}{\partial w}\right] &= \begin{bmatrix} \frac{\partial L}{\partial w_{1,1}} & \frac{\partial L}{\partial w_{1,2}} & \frac{\partial L}{\partial w_{1,3}} & \frac{\partial L}{\partial w_{1,4}} & \frac{\partial L}{\partial w_{1,5}} \\ \frac{\partial L}{\partial w_{2,1}} & \frac{\partial L}{\partial w_{2,2}} & \frac{\partial L}{\partial w_{2,3}} & \frac{\partial L}{\partial w_{2,4}} & \frac{\partial L}{\partial w_{2,5}} \\ \frac{\partial L}{\partial w_{3,1}} & \frac{\partial L}{\partial w_{3,2}} & \frac{\partial L}{\partial w_{3,3}} & \frac{\partial L}{\partial w_{3,4}} & \frac{\partial L}{\partial w_{3,5}} \end{bmatrix}_{3\times5} \\ &= \begin{bmatrix} \frac{\partial L}{\partial \bar h_1}\cdot 1 & \frac{\partial L}{\partial \bar h_2}\cdot 1 & \frac{\partial L}{\partial \bar h_3}\cdot 1 & \frac{\partial L}{\partial \bar h_4}\cdot 1 & \frac{\partial L}{\partial \bar h_5}\cdot 1 \\ \frac{\partial L}{\partial \bar h_1}\cdot x_1 & \frac{\partial L}{\partial \bar h_2}\cdot x_1 & \frac{\partial L}{\partial \bar h_3}\cdot x_1 & \frac{\partial L}{\partial \bar h_4}\cdot x_1 & \frac{\partial L}{\partial \bar h_5}\cdot x_1 \\ \frac{\partial L}{\partial \bar h_1}\cdot x_2 & \frac{\partial L}{\partial \bar h_2}\cdot x_2 & \frac{\partial L}{\partial \bar h_3}\cdot x_2 & \frac{\partial L}{\partial \bar h_4}\cdot x_2 & \frac{\partial L}{\partial \bar h_5}\cdot x_2 \end{bmatrix}_{3\times5} \end{aligned}$$
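For a single sample, this weight gradient is just the outer product of the augmented input with the gradient at the hidden pre-activations; a hedged sketch with placeholder values:

% Gradient of the loss w.r.t. the input->hidden weights for one sample
x        = [0; 1];                % one 2-D input sample
dL_dhbar = randn(5,1);            % dL/d(hidden pre-activation), 5x1 placeholder
dL_dw    = [1; x] * dL_dhbar';    % 3x5, entry (i,j) = [1;x](i) * dL/d hbar_j

The script at the end builds the same quantity sample by sample with kron, which simply flattens this outer product into the column layout of its weight vector.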
With the weight gradients in hand, all that remains is to update the weights iteratively with gradient descent.
Gradient descent itself was covered in the earlier post Gradient Descent (II) (梯度下降(二)), so it is not repeated here; this post uses Adam together with mini-batch gradient descent.
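For completeness, here is a stripped-down sketch of the Adam update on a toy one-dimensional quadratic, mirroring the bias-corrected update used in the script below (the script places its small epsilon inside the square root, which behaves essentially the same):

% Minimal Adam loop on a toy quadratic with minimum at w = 3
grad  = @(w) 2*(w - 3);              % stand-in gradient of (w-3)^2
w     = 0;                           % parameter to optimize
eta   = 0.1;  gamma = 0.9;  beta = 0.99;  epsv = 1e-8;
m = 0;  s = 0;
for t = 1:200
    g = grad(w);
    m = gamma*m + (1 - gamma)*g;     % first moment (momentum)
    s = beta*s  + (1 - beta)*g.^2;   % second moment
    m_hat = m / (1 - gamma^t);       % bias corrections
    s_hat = s / (1 - beta^t);
    w = w - eta * m_hat / (sqrt(s_hat) + epsv);
end
w                                    % close to 3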
Below is the result of classifying the XOR data with the sigmoid activation function (1):
Below is the result of classifying the concentric-circle data with the sigmoid activation function (2):
Below is the result of classifying the XOR data with the ReLU activation function:
Below is another run on the concentric-circle data with the sigmoid activation function:
Below is the result of classifying the concentric-circle data with the ReLU activation function:
If we plot the data after it has passed through the layers and their nonlinear transformations, we get the following:
As you can see, the data has become linearly separable.
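One way such a picture can be produced from the trained network in the listing below (assuming w, NNLayer, X, flag and acMethod are still in the workspace after training; valMatrix stores every layer's activations, with the two output neurons in its last two rows):

% Plot each sample's two output activations against each other
valMatrix = ForwardPropagation(X, w, NNLayer, acMethod);
out = valMatrix(end-1:end, :);                 % 2 x N output activations
figure(2); hold on;
plot(out(1, flag == 1), out(2, flag == 1), 'r+', 'linewidth', 2);
plot(out(1, flag == 0), out(2, flag == 0), 'bo', 'linewidth', 2);
xlabel('output neuron 1'); ylabel('output neuron 2');
hold off;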
The code below is based on voidbip's implementation; I modified the gradient descent and weight initialization, added a ReLU module, and made a few other changes:
% Reference : https://github.com/voidbip/matlab_nn
clc;clear;clf;
X = [0 0 1 1; 0 1 0 1];          % XOR data set: one 2-D sample per column (overwritten by the circle data below)
flag = [0 1 1 0];                % XOR labels
n = 100;                         % total number of points for the concentric-circle data set
a = linspace(0,2*pi,n/2);        % angles used to generate the two circles
u = [5*cos(a)+5 10*cos(a)+5]+1*rand(1,n);   % x-coordinates: radius-5 and radius-10 circles centred at (5,5), plus noise
v = [5*sin(a)+5 10*sin(a)+5]+1*rand(1,n);   % y-coordinates
X = [u;v];
flag = [zeros(1,n/2),ones(1,n/2)];
classNum = length(unique(flag)); % How many classes?
[row, col] = size(X); % row -> dimension, col -> size of dataset
NNLayer = [row 20 classNum]; % Layer sizes of the neural network: [input hidden output]
% [1] Initialize weights randomly
w = randInitWeights(NNLayer);
iteration = 10000; % Set our iterations
acMethod = 'SIGMOID'; % Set our activation functions
lambda = 0; % L2 regularization strength (0 disables regularization)
flagMatrix = zeros(classNum,col);
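% One-hot encode the labels: flagMatrix(c+1,i) = 1 when sample i belongs to class c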
for i = 1 : length(flag)
flagMatrix(flag(i)+1,i) = 1;
end
% - Mini-Batch Gradient Descent Params - %
batchSize = 4;
% - Adam Params - %
eta = 0.002; % Learning rate
s = 0; beta = 0.99; momentum = 0; gamma = 0.9; cnt = 0;
%- draw -%
Range = [-10, 20; -10, 20]; % plotting range: [xmin xmax; ymin ymax]
figure(1);
hold on;
posFlag = find(flag == 1);
negFlag = find(flag == 0);
plot(X(1,posFlag), X(2,posFlag), 'r+','linewidth',2);
plot(X(1,negFlag), X(2,negFlag), 'bo','linewidth',2);
[h_region1,h_region2] = drawRegion(Range,w,NNLayer,acMethod);
for i = 1 : iteration
if(mod(i,100)==0)
delete(h_region1);delete(h_region2);
wFinal = w;
[h_region1,h_region2] = drawRegion(Range,wFinal,NNLayer,acMethod);
title('Data Fitting Using Neural Networks');
legend('class 1','class 2','separated region');
xlabel('x');
ylabel('y')
drawnow;
end
% Mini-batch gradient descent + Adam (kept inline rather than wrapped in a function)
dataSize = length(X); % number of samples (columns of X)
k = fix(dataSize/batchSize); % number of full-size mini-batches
batchIdx = randperm(dataSize); % re-shuffle every epoch for sample diversity
flagBatch = flagMatrix(:,batchIdx(1:batchSize));
batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize); % indices of the full-size batches (one batch per row)
batchIdx2 = batchIdx(k*batchSize+1:end); % remaining, smaller batch (if any)
for b = 1 : k
valMatrix = ForwardPropagation(X(:,batchIdx1(b,:)),w,NNLayer,acMethod);
[j,jw] = BackwardPropagation(flagMatrix(:,batchIdx1(b,:)), valMatrix, w, lambda, NNLayer, acMethod);
cnt = cnt+1;
if j<0.01
break;
end
[sizeW,~] = size(jw);
eps = 10^-8*ones(sizeW,1);
s = beta*s + (1-beta)*jw.*jw; % Update s
momentum = gamma*momentum + (1-gamma).*jw; % Update momentum
momentum_bar = momentum/(1-gamma^cnt);
s_bar = s /(1-beta^cnt);
w = w - eta./sqrt(eps+s_bar).*momentum_bar; % Update parameters(theta)
end
if(~isempty(batchIdx2))
valMatrix = ForwardPropagation(X(:,batchIdx2),w,NNLayer,acMethod);
[j,jw] = BackwardPropagation(flagMatrix(:,batchIdx2), valMatrix, w, lambda, NNLayer, acMethod);
cnt = cnt+1;
%if j<0.01
% break;
%end
[sizeW,~] = size(jw);
eps = 10^-8*ones(sizeW,1);
s = beta*s + (1-beta)*jw.*jw; % Update s
momentum = gamma*momentum + (1-gamma).*jw; % Update momentum
momentum_bar = momentum/(1-gamma^cnt);
s_bar = s /(1-beta^cnt);
w = w - eta./sqrt(eps+s_bar).*momentum_bar; % Update parameters(theta)
end
% Batch gradient descent
% valMatrix = ForwardPropagation(X,w,NNLayer,acMethod);
% [j,jw] = BackwardPropagation(flagMatrix, valMatrix, w, lambda, NNLayer, acMethod);
% w = w-eta*jw;
% j
% if j<0.1
% break;
% end
end
hold off;
%% Initialize Weights Randomly
% input: [2 10 2]
% layer1: 2 neurons + 1 bias.
% layer2: 10 neurons + 1 bias.
% layer3: 2 neurons.
function [w] = randInitWeights(NNLayer)
Len = length(NNLayer); % Obtain the number of layers
shiftLayer = [0 ones(1,Len-1)+NNLayer(1:Len-1)]; % previous layer's size + 1 (bias), shifted right by one position
wCount = NNLayer.*shiftLayer; % number of weights feeding each layer: (previous size + bias) * layer size
w = zeros(sum(wCount),1); % Initialize weight vector
accWIdx = cumsum(wCount); % The index of each layer for weight vector
for i = 2 : Len
eps = sqrt(6)/sqrt(NNLayer(i) + shiftLayer(i));
w(accWIdx(i-1)+1:accWIdx(i)) = eps*(2*rand(wCount(i),1) - 1);
end
end
%% FeedForward Propagation
function [valMatrix] = ForwardPropagation(X, w, NNLayer,acMethod)
[dim, num] = size(X);
Len = length(NNLayer); % Obtain the number of layers
shiftLayer = [0 ones(1,Len-1)+NNLayer(1:Len-1)]; % previous layer's size + 1 (bias), shifted right by one position
accWIdx = NNLayer.*shiftLayer; % number of weights feeding each layer: (previous size + bias) * layer size
ws = cumsum(accWIdx); % end index of each layer's block inside the weight vector
accValIdx = [0 cumsum(NNLayer)];
if(dim ~= NNLayer(1))
error("dim of data != dim of input of NN");
end
valMatrix = zeros(sum(NNLayer),num);
valMatrix(1:dim,:) = X;
for i = 2: Len
%curLayerW = reshape(w(ws(i-1)+1:ws(i)),NNLayer(i),shiftLayer(i))';
curLayerW = reshape(w(ws(i-1)+1:ws(i)),shiftLayer(i),NNLayer(i));
valMatrix(accValIdx(i)+1:accValIdx(i+1),:) = activateFunc(curLayerW'*[ones(1,num);valMatrix(accValIdx(i-1)+1:accValIdx(i),:)],acMethod);
end
end
%% Backward Propagation
function [CELoss,jw] = BackwardPropagation(y, valMatrix, w, lambda, NNLayer, acMethod)
Len = length(NNLayer);
[~,num] = size(y);
gradX = zeros(sum(NNLayer(2:end)),num);
jw = zeros(length(w),1);
% CrossEntropy to calculate loss
% Output values: valMatrix(end-NNLayer(end)+1:end,:)
y_hat = valMatrix(end-NNLayer(end)+1:end,:) + 1e-7;
% This is Cross Entropy Loss value
CELoss = -sum(sum(y.*log(y_hat)+(1-y).*log(1-y_hat)))/num;
CELoss = CELoss + ((lambda*sum(w.^2))/(2*num)); % Regularization term
% Easy way for sigmoid function
gradX(end-NNLayer(end)+1:end,:) = y_hat - y; % Obtain the gradient of Cross Entropy / back to y_hat
%gradCE = -(y./y_hat-(1-y)./(1-y_hat));
%gradX(end-NNLayer(end)+1:end,:) = gradCE.*calculateGrad(y_hat,'Sigmoid');
shiftLayer = [0 ones(1,Len-1)+NNLayer(1:Len-1)]; % previous layer's size + 1 (bias), shifted right by one position
accWIdx = NNLayer.*shiftLayer; % number of weights feeding each layer: (previous size + bias) * layer size
ws = cumsum(accWIdx); % end index of each layer's block inside the weight vector
gradIdx = [0 cumsum(NNLayer(2:end))]; % Obtain the gradient for each neurons except which in the first layer
ai=[0 cumsum(NNLayer)];
% -- Calculate the gradient of neurons -- %
for i = Len:-1:3
%curLayerW = reshape(w(ws(i-1)+1: ws(i),:),NNLayer(i), shiftLayer(i))'; % Obtain weights between current adjacent layers
curLayerW = reshape(w(ws(i-1)+1: ws(i),:), shiftLayer(i),NNLayer(i)); % Obtain weights between current adjacent layers
curLayerW4X = curLayerW(2:end,:); % Remove the gradients of biases
gradBack = gradX(gradIdx(i-1)+1:gradIdx(i),:); % Get gradients from the next layer
%gradSigmoid = calculateGrad(valMatrix(ai(i-1)+1:ai(i),:),acMethod);
%gradX(gradIdx(i-2)+1:gradIdx(i-1),:) = curLayerW4X*gradBack.*gradSigmoid; % Calculate the gradient of neurons in current layer.
gradActiveFunc = calculateGrad(valMatrix(ai(i)+1:ai(i+1),:),acMethod);
gradX(gradIdx(i-2)+1:gradIdx(i-1),:) = curLayerW4X*(gradActiveFunc.*gradBack); % Calculate the gradient of neurons in current layer.
end
% -- Calculate the gradient for weights -- %
for i = Len:-1:2
temp = zeros(accWIdx(i),num);
for cnt = 1:num
%temp(:,cnt) = kron([1; valMatrix(ai(i-1)+1:ai(i),cnt)],gradX(gradIdx(i-1)+1:gradIdx(i),cnt));
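% kron flattens the outer product of the next layer's deltas with [1; activations]
% into one column per sample, matching the column layout of w for this layer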
temp(:,cnt) = kron(gradX(gradIdx(i-1)+1:gradIdx(i),cnt),[1; valMatrix(ai(i-1)+1:ai(i),cnt)]);
end
jw(1+ws(i-1):ws(i))= sum(temp,2);
end
jw = jw/num;
jw=jw + lambda*w/num;
end
function val = activateFunc(x,acMethod)
switch acMethod
case {'SIGMOID','sigmoid'}
val = 1.0./(1.0+exp(-x));
case {'TANH','tanh'}
val = tanh(x);
case {'ReLU','relu'}
val=max(0,x);
case {'tansig'}
val = 2./(1+exp(-2*x))-1; % element-wise tansig (equivalent to tanh)
otherwise
error('Unknown activation method: %s', acMethod);
end
end
function val = calculateGrad(x,acMethod)
switch acMethod
case {'SIGMOID','sigmoid'}
val = (1-x).*x;
case {'TANH','tanh'}
val = 1 - x.^2; % derivative of tanh, given the activation value x
case {'ReLU','relu'}
val = x>0;
case {'tansig'}
val = 1 - x.^2; % tansig is equivalent to tanh, so same derivative
otherwise
error('...'); % TODO...
end
end
function [h_region1, h_region2] = drawRegion(Range,w,NNLayer,acMethod)
% Draw the predicted class region over a grid covering Range
% (linear indexing: Range(1),Range(3) are the x-limits; Range(2),Range(4) the y-limits)
x_draw=Range(1):0.1:Range(3);
y_draw=Range(2):0.1:Range(4);
[meshX,meshY]=meshgrid(x_draw,y_draw);
[row, col] = size(meshX);
classes = zeros(row,col);
for i = 1:row
valMatrix = ForwardPropagation([meshX(i,:); meshY(i,:)],w,NNLayer,acMethod);
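% Classify each grid point by which of the two output neurons (last two rows of valMatrix) is larger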
val = valMatrix(end,:)-valMatrix(end-1,:);
classes(i,:) =(val>0)-(val<0); % class(pos) = 1, class(neg) = -1;
end
[row, col] = find(classes == 1);
h_region1 = scatter(x_draw(col),y_draw(row),'MarkerFaceColor','r','MarkerEdgeColor','r');
h_region1.MarkerFaceAlpha = 0.03;
h_region1.MarkerEdgeAlpha = 0.03;
[row, col] = find(classes == -1);
h_region2 = scatter(x_draw(col),y_draw(row),'MarkerFaceColor','b','MarkerEdgeColor','b');
h_region2.MarkerFaceAlpha = 0.03;
h_region2.MarkerEdgeAlpha = 0.03;
end