先放一张来自Group Normalization原论文中的图,个人认为这个图很形象,以此图直观感受一下各种归一化的区别:
(注意:上图中,特征图的长和宽分别为W和H,由于我们的世界是3D的,直观只能展示3个维度,所以这里作者将H和W压缩成一个维度。则上图种每一个大方块展示的是一个Batch的特征图,其长宽高三个维度分别代表通道(Channel, C)、minibatch(BatchSize, N)、特征图(FeatureSize, (H,W)))
核心过程即如下算法流程:
BN可以加快深度神经网络的训练速度(可以在训练时使用更高的学习率,因为数据归一化之后,不会使得梯度有太大的波动),并给网络的权重提供了正则化,可以一定程度上防止过拟合
BN可以在任何场景下使用,你可以把它当作一个默认操作,后面再调整,但是这里要说一下调整的依据:
>>> # With Learnable Parameters
>>> m = nn.BatchNorm2d(100)
>>> # Without Learnable Parameters
>>> m = nn.BatchNorm2d(100, affine=False)
>>> input = torch.randn(20, 100, 35, 45)
>>> output = m(input)
前文已经大致描述了其他各种归一化方法的操作,后文就不再赘述了
>>> # NLP Example
>>> batch, sentence_length, embedding_dim = 20, 5, 10
>>> embedding = torch.randn(batch, sentence_length, embedding_dim)
>>> layer_norm = nn.LayerNorm(embedding_dim)
>>> # Activate module
>>> layer_norm(embedding)
>>>
>>> # Image Example
>>> N, C, H, W = 20, 5, 10, 10
>>> input = torch.randn(N, C, H, W)
>>> # Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
>>> # as shown in the image below
>>> layer_norm = nn.LayerNorm([C, H, W])
>>> output = layer_norm(input)
>>> # Without Learnable Parameters
>>> m = nn.InstanceNorm2d(100)
>>> # With Learnable Parameters
>>> m = nn.InstanceNorm2d(100, affine=True)
>>> input = torch.randn(20, 100, 35, 45)
>>> output = m(input)
>>> input = torch.randn(20, 6, 10, 10)
>>> # Separate 6 channels into 3 groups
>>> m = nn.GroupNorm(3, 6)
>>> # Separate 6 channels into 6 groups (equivalent with InstanceNorm)
>>> m = nn.GroupNorm(6, 6)
>>> # Put all 6 channels into a single group (equivalent with LayerNorm)
>>> m = nn.GroupNorm(1, 6)
>>> # Activating the module
>>> output = m(input)