As network depth increases, the vanishing-gradient problem caused by backpropagation (BP) becomes increasingly severe (see the book 《Neural Networks and Deep Learning》 for a detailed treatment).
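To make this concrete, here is a small NumPy sketch (my own illustration, not taken from the book): it backpropagates a dummy loss through a stack of sigmoid layers and prints how the gradient norm collapses toward the early layers. The depth, width, and weight scale are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth, width = 30, 64
# Weights at a typical scale; the sigmoid derivative (<= 0.25 everywhere)
# still shrinks the gradient at every layer.
Ws = [rng.normal(0.0, 1.0, size=(width, width)) / np.sqrt(width) for _ in range(depth)]

# Forward pass through a chain of sigmoid layers, caching every activation.
h = rng.normal(size=width)
acts = [h]
for W in Ws:
    h = sigmoid(W @ h)
    acts.append(h)

# Backward pass for a dummy loss L = sum(h_last): propagate dL/dh layer by layer.
grad = np.ones(width)
norms = []
for W, a in zip(reversed(Ws), reversed(acts[1:])):
    grad = W.T @ (grad * a * (1.0 - a))   # chain rule through sigmoid(W @ h)
    norms.append(np.linalg.norm(grad))

print("grad norm just below the output layer:", norms[0])
print("grad norm at the input layer         :", norms[-1])  # orders of magnitude smaller
```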
How LSTM mitigates vanishing gradients
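Since this heading stands on its own here, a compact version of the standard argument (my summary, not quoted from any of the sources above): the LSTM cell state is updated additively through a forget gate, so the backward path through the cell state is not repeatedly multiplied by a weight matrix and a squashing-function derivative.

```latex
% LSTM cell-state update: an additive, gated blend of the old state and the candidate.
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

% Along the cell-state path (treating the gates as constants for intuition):
\frac{\partial c_T}{\partial c_k} \;\approx\; \prod_{t=k+1}^{T} \operatorname{diag}(f_t)

% A vanilla RNN instead contributes a factor W^\top \operatorname{diag}(\sigma'(z_t))
% per step, whose norm is typically below 1, so that product tends to decay
% exponentially; forget gates close to 1 keep the LSTM product from vanishing.
```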
The paper 《Deep Residual Learning for Image Recognition》 notes:
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].
The paper credits two techniques with largely fixing this (a minimal sketch of both follows the list):
normalized initialization
intermediate normalization layers
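As an illustration of how these two ideas are typically combined in practice, here is a PyTorch sketch of my own (not code from the paper; the layer width and depth are arbitrary): each block uses He/Kaiming initialization (reference [13]) for its weights and inserts a Batch Normalization layer (reference [16]) before the nonlinearity.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One hidden layer combining both remedies from the quoted passage:
    normalized (He/Kaiming) initialization + an intermediate normalization layer."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        # Normalized initialization (He et al., ICCV 2015 -- reference [13]).
        nn.init.kaiming_normal_(self.fc.weight, nonlinearity='relu')
        nn.init.zeros_(self.fc.bias)
        # Intermediate normalization layer (Batch Normalization -- reference [16]).
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):
        return torch.relu(self.bn(self.fc(x)))

# A plain (non-residual) 30-layer stack; with these two techniques it can at least
# start converging under SGD, which is the claim in the quoted paragraph.
net = nn.Sequential(*[Block(64) for _ in range(30)])
x = torch.randn(128, 64)
print(net(x).shape)  # torch.Size([128, 64])
```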
References cited above:
[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994.
[9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[23] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
[37] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.