Consider a three-class classification problem ($C=3$) with two samples ($n=2$) and true labels $y=[1, 2]$. The corresponding one-hot vectors are:
$$y=\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Label smoothing turns these hard labels into soft labels. With $\epsilon=0.1$ and the smoothing formula $y^{ls}=y(1-\epsilon)+\epsilon/C$, the smoothed labels become:
$$y^{ls}=\begin{bmatrix} 0.0333 & 0.9333 & 0.0333 \\ 0.0333 & 0.0333 & 0.9333 \end{bmatrix}$$
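The matrix above can be reproduced with a few lines of PyTorch (a minimal sketch; the variable names here are illustrative, not from any particular library):

import torch

C, eps = 3, 0.1
y = torch.tensor([[0., 1., 0.],
                  [0., 0., 1.]])   # one-hot labels for y = [1, 2]
y_ls = y * (1 - eps) + eps / C     # smoothing formula: y_ls = y*(1-eps) + eps/C
print(y_ls)
# tensor([[0.0333, 0.9333, 0.0333],
#         [0.0333, 0.0333, 0.9333]])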
According to the analysis in [1], label smoothing has both advantages and disadvantages.
The figure below shows a visualization method from [1] that compares the feature visualizations of models trained with and without label smoothing. One can observe that with label smoothing, each training example of a class becomes equidistant from the templates of the other classes, so the clusters arrange themselves into regular triangles. As the authors put it:
"We observe that now the clusters are much tighter, because label smoothing encourages that each example in training set to be equidistant from all the other class's templates. Therefore, when looking at the projections, the clusters organize in regular triangles when training with label smoothing, whereas the regular triangle structure is less discernible in the case of training with hard-targets (no label smoothing)."
Cross entropy is defined as:
$$H(\mathbf{y},\mathbf{p})=-\sum_{c=1}^{C}y_c\log p_c$$
where $y_c$ is the hard label (1 for the correct class, 0 otherwise) and $p_c$ is the predicted probability for class $c$. Applying label smoothing to $y_c$ gives
$$\boxed{y_c^{ls}=y_c(1-\epsilon)+\epsilon / C}$$
where $\epsilon$ is a small constant, typically 0.1 by default. The smoothed label $y_c^{ls}$ then replaces the hard label $y_c$ in the cross-entropy formula above.
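Expanding the cross entropy with the smoothed labels shows what the loss actually optimizes (a short derivation that follows directly from the two formulas above):

$$H(\mathbf{y}^{ls},\mathbf{p})=-\sum_{c=1}^{C}\Big[(1-\epsilon)\,y_c+\frac{\epsilon}{C}\Big]\log p_c=(1-\epsilon)\,H(\mathbf{y},\mathbf{p})+\frac{\epsilon}{C}\sum_{c=1}^{C}(-\log p_c)$$

That is, the smoothed loss is a weighted combination of the ordinary cross entropy and a uniform penalty over all classes, which is essentially how the timm implementation shown later computes it without materializing the soft targets. Below is a straightforward implementation that builds the soft targets explicitly: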
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelSmoothingLoss(nn.Module):
    """Cross entropy with label smoothing: y_c^ls = y_c*(1-epsilon) + epsilon/C."""
    def __init__(self, epsilon=0.1, reduction='mean'):
        super().__init__()
        self.epsilon = epsilon
        self.reduction = reduction

    def forward(self, logits, targets):
        num_classes = logits.size(-1)
        # Build one-hot encoding of the hard targets
        one_hot = torch.zeros_like(logits)
        one_hot.scatter_(1, targets.unsqueeze(1), 1)  # one_hot.scatter_(dim, index, value)
        # Vectorized smoothing formula: y_c^ls = y_c*(1-epsilon) + epsilon/C
        smoothed_targets = one_hot * (1 - self.epsilon) + self.epsilon / num_classes
        # Cross entropy with soft targets: -sum_c y_c^ls * log p_c
        neg_log_probs = -F.log_softmax(logits, dim=-1)
        loss = (smoothed_targets * neg_log_probs).sum(dim=-1)
        if self.reduction == 'mean':
            return loss.mean()
        elif self.reduction == 'sum':
            return loss.sum()
        return loss
epsilon = 0.1
num_classes = 4
bs = 10

ls_fn = LabelSmoothingLoss(epsilon=epsilon)
ce_fn = nn.CrossEntropyLoss()

logits = torch.randn(bs, num_classes)        # logits produced by the model
targets = torch.randint(num_classes, (bs,))  # integer class labels

loss_ls = ls_fn(logits, targets)  # e.g. tensor(1.4299)
loss_ce = ce_fn(logits, targets)  # e.g. tensor(1.3977)
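A quick sanity check (a sketch reusing the objects defined above): with epsilon=0 the smoothed targets reduce to the one-hot targets, so the custom loss should match nn.CrossEntropyLoss:

ls0_fn = LabelSmoothingLoss(epsilon=0.0)   # no smoothing
loss_ls0 = ls0_fn(logits, targets)
assert torch.allclose(loss_ls0, loss_ce)   # equals the standard cross entropy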
In general, the loss computed with label smoothing comes out slightly larger than the plain cross-entropy loss. Alternatively, you can use LabelSmoothingCrossEntropy from the timm package directly:
import torch
from timm.loss import LabelSmoothingCrossEntropy

epsilon = 0.1
num_classes = 4
bs = 10

lsc_fn = LabelSmoothingCrossEntropy(smoothing=epsilon)
logits = torch.randn(bs, num_classes)        # logits produced by the model
targets = torch.randint(num_classes, (bs,))
loss_lsc = lsc_fn(logits, targets)  # agrees with loss_ls above when given the same logits/targets
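To confirm that the two implementations agree, the sketch below (assuming the LabelSmoothingLoss class defined earlier is still in scope) evaluates both losses on the same inputs:

loss_custom = LabelSmoothingLoss(epsilon=epsilon)(logits, targets)
loss_timm = lsc_fn(logits, targets)
assert torch.allclose(loss_custom, loss_timm)   # identical up to floating-point error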
Overall, label smoothing is commonly used in classification tasks (where plain cross entropy is otherwise the default loss) and offers a modest boost to classification performance. For example, in domain adaptation, [2] also adopts a label-smoothing loss and notes:
"To further increase the discriminability of the source model and facilitate the following target data alignment, we propose to adopt the label smoothing (LS) technique as it encourages examples to lie in tight evenly separated clusters."
In other words, label smoothing is adopted to strengthen the source model's discriminability and to ease the subsequent alignment of target-domain data, by encouraging samples to form tight, evenly separated clusters.
References:
[1] When Does Label Smoothing Help? arXiv:1906.02629.
[2] Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation. arXiv:2002.08546.