In 2015, Kaiming He et al. introduced the residual network (ResNet) in the paper "Deep Residual Learning for Image Recognition". It delivered a striking result in that year's ILSVRC competition, reducing the Top-5 image-classification error rate to 3.57% and surpassing human-level performance.
ResNet was designed to address the degradation problem in deep neural networks: conventional wisdom held that deeper networks extract stronger features, yet in practice, once depth grows past a certain point, both training and test accuracy begin to drop. This phenomenon is not caused by overfitting. The paper attributes it to optimization difficulty: plain deep networks are simply hard to optimize, even when vanishing/exploding gradients are largely mitigated by normalized initialization and batch normalization.
The key innovation of ResNet is the residual learning framework. The core idea can be stated as follows:
Instead of learning the target mapping $H(x)$ directly, a residual network learns the residual mapping $F(x) = H(x) - x$, so the final output is $H(x) = F(x) + x$.
This structure is called a skip connection (or shortcut connection). It lets gradients flow directly back to earlier layers, effectively alleviating the vanishing-gradient problem.
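The identity shortcut can be sketched in a few lines of PyTorch. This is a minimal illustration only, not the full residual block defined later in this article; the two-layer `residual_branch` here is a stand-in for $F(x)$:

```python
import torch
import torch.nn as nn

class TinyResidual(nn.Module):
    """y = F(x) + x: the branch only has to learn the correction to the identity."""
    def __init__(self, dim):
        super().__init__()
        # stand-in for F(x); any shape-preserving stack of layers works
        self.residual_branch = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return self.residual_branch(x) + x  # the skip connection

x = torch.randn(4, 16)
y = TinyResidual(16)(x)
print(y.shape)  # torch.Size([4, 16])
```

Because the shortcut is an addition, the gradient of the output with respect to `x` always contains an identity term, which is exactly what keeps gradients from vanishing through depth.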
The basic building unit of ResNet is the residual block, which comes in two main types:
Basic block (used in ResNet-18/34):
Input → Conv(3×3) → BN → ReLU → Conv(3×3) → BN → (+input) → ReLU → Output
Bottleneck block (used in ResNet-50/101/152):
Input → Conv(1×1) → BN → ReLU → Conv(3×3) → BN → ReLU → Conv(1×1) → BN → (+input) → ReLU → Output
The 1×1 convolutions in the bottleneck design first reduce and then restore the channel dimension, greatly cutting the parameter count and computational cost.
When the input and output dimensions do not match, the shortcut uses a projection (a 1×1 convolution) to adjust the dimensions.
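To see why the bottleneck saves parameters, compare the weight counts of the two block types at 256 channels. A back-of-the-envelope calculation, ignoring BN parameters and biases (ResNet convolutions are bias-free):

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a bias-free k x k convolution."""
    return c_in * c_out * k * k

# basic block at 256 channels: two 3x3 convs
basic = conv_params(256, 256, 3) + conv_params(256, 256, 3)

# bottleneck: 1x1 down to 64, 3x3 at 64, 1x1 back up to 256
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

print(basic, bottleneck)  # 1179648 69632
```

The bottleneck uses roughly 17× fewer weights at this width, which is what makes 50+ layer networks affordable.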
The overall ResNet architecture stacks these residual blocks into four stages, preceded by a convolutional stem and followed by a classification head.
ResNet variants at different depths:
Network | Layers | Parameters | Blocks per stage | Block type |
---|---|---|---|---|
ResNet-18 | 18 | 11.7M | [2, 2, 2, 2] | Basic |
ResNet-34 | 34 | 21.8M | [3, 4, 6, 3] | Basic |
ResNet-50 | 50 | 25.6M | [3, 4, 6, 3] | Bottleneck |
ResNet-101 | 101 | 44.5M | [3, 4, 23, 3] | Bottleneck |
ResNet-152 | 152 | 60.2M | [3, 8, 36, 3] | Bottleneck |
Taking the widely used ResNet-50 as an example, here is its architecture in detail:
Input: 224×224×3 image
Initial processing:
- Conv1: 7×7, 64, stride=2 → 112×112×64
- MaxPool: 3×3, stride=2 → 56×56×64
Stage 1 (56×56):
- 3× bottleneck blocks [64-64-256] → 56×56×256
Stage 2 (28×28):
- 1 downsampling bottleneck block [256-128-512] → 28×28×512
- 3× bottleneck blocks [512-128-512] → 28×28×512
Stage 3 (14×14):
- 1 downsampling bottleneck block [512-256-1024] → 14×14×1024
- 5× bottleneck blocks [1024-256-1024] → 14×14×1024
Stage 4 (7×7):
- 1 downsampling bottleneck block [1024-512-2048] → 7×7×2048
- 2× bottleneck blocks [2048-512-2048] → 7×7×2048
Output processing:
- Global average pooling → 1×1×2048
- Fully connected layer → 1000 classes (ImageNet)
The bottleneck notation [a-b-c] means: a is the input channel count, b the channel count of the middle 3×3 convolution, and c the output channel count.
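The spatial sizes in the walkthrough above follow from the standard convolution output formula, floor((n + 2p − k) / s) + 1. A quick sanity check of the 224 → 112 → 56 → 28 → 14 → 7 progression:

```python
def out_size(n, k, s, p):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

n = 224
n = out_size(n, k=7, s=2, p=3)   # Conv1: 224 -> 112
n = out_size(n, k=3, s=2, p=1)   # MaxPool: 112 -> 56
sizes = [n]
for _ in range(3):               # stages 2-4 each halve via a stride-2 conv
    n = out_size(n, k=3, s=2, p=1)
    sizes.append(n)
print(sizes)  # [56, 28, 14, 7]
```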
Batch normalization: BN is applied immediately after every convolution, which aids convergence and acts as a regularizer
Dimension matching: when a residual block's input and output dimensions differ, the paper considers two options: (A) an identity shortcut with zero-padding for the extra channels, and (B) a projection shortcut using a 1×1 convolution; option B is what standard implementations use
Downsampling: spatial halving is implemented by strided convolutions rather than pooling layers
Weight initialization: He initialization keeps the variance of signals stable through both the forward and backward pass
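Option B, the projection shortcut, can be sketched directly: a stride-2 1×1 convolution brings the identity to the branch output's shape before the addition. An illustrative fragment for the Stage 1 → Stage 2 transition of ResNet-50 (64 → 256 channels here stands in for the general case):

```python
import torch
import torch.nn as nn

# the identity path must match the branch output:
# here 64 -> 256 channels and 56x56 -> 28x28 spatially
downsample = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(256),
)

x = torch.randn(1, 64, 56, 56)
identity = downsample(x)
print(identity.shape)  # torch.Size([1, 256, 28, 28])
```

This is exactly the `downsample` module that `_make_layer` constructs in the reference implementation later in this article.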
In 2016, He et al. proposed an improved version of ResNet (ResNet-V2), which mainly changes where the activations sit:
Original ResNet: Input → Conv → BN → ReLU → Conv → BN → (+input) → ReLU → Output
ResNet-V2: Input → BN → ReLU → Conv → BN → ReLU → Conv → (+input) → Output
This "pre-activation" design leaves the identity path free of any transformation, improving both information flow and gradient flow and making very deep networks easier to train.
ResNeXt introduces grouped convolutions and the notion of cardinality into the residual block, widening the network without a large increase in parameters:
┌→ 1×1 → 3×3 → 1×1 →┐
Input → ├→ 1×1 → 3×3 → 1×1 →┤ → Sum → Output
└→ 1×1 → 3×3 → 1×1 →┘
This grouped structure of parallel paths strengthens feature expressiveness while remaining computationally efficient (the identity shortcut is omitted from the diagram for clarity).
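In practice, the parallel paths collapse into a single grouped convolution via the `groups` argument of `nn.Conv2d`. With cardinality 32, the 3×3 convolution operates on 32 independent channel groups, cutting its weight count by that factor:

```python
import torch.nn as nn

dense = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)

# each grouped filter sees only 128/32 = 4 input channels
print(dense.weight.numel(), grouped.weight.numel())  # 147456 4608
```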
Res2Net builds multi-scale features at a finer granularity by arranging hierarchical residual-like connections inside a single residual block:
Input → split channels into s groups → group 1 passes through unchanged
→ group 2 goes through a 3×3 convolution
→ group 3 is added to group 2's output, then goes through its own 3×3 convolution
→ ...
→ all group outputs are concatenated → Output
This design lets the network capture features at multiple scales within one block, improving the representation of complex scenes.
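The hierarchical split above can be sketched as follows (scale s=4; this illustrates the wiring only, not a complete tuned Res2Net module, which also keeps the surrounding 1×1 convolutions of the bottleneck):

```python
import torch
import torch.nn as nn

class Res2Split(nn.Module):
    def __init__(self, channels, scale=4):
        super().__init__()
        self.width = channels // scale
        # one 3x3 conv per group, except the first group which passes through
        self.convs = nn.ModuleList(
            nn.Conv2d(self.width, self.width, 3, padding=1, bias=False)
            for _ in range(scale - 1)
        )

    def forward(self, x):
        groups = torch.split(x, self.width, dim=1)
        outs = [groups[0]]                         # y1 = x1, passed through
        prev = None
        for conv, g in zip(self.convs, groups[1:]):
            inp = g if prev is None else g + prev  # add the previous group's output
            prev = conv(inp)                       # y_i = K_i(x_i + y_{i-1})
            outs.append(prev)
        return torch.cat(outs, dim=1)              # concatenate all group outputs

y = Res2Split(64)(torch.randn(1, 64, 8, 8))
print(y.shape)  # torch.Size([1, 64, 8, 8])
```

Each successive group's features have passed through one more 3×3 convolution than the last, so the concatenated output mixes several receptive-field sizes.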
ResNeSt combines the strengths of ResNeXt and attention mechanisms, introducing the Split-Attention block:
The feature map is split into K groups, which are processed in parallel
Channel attention is applied within each group
This design captures relationships both across feature-map channels and across spatial positions, and performs strongly on a range of vision tasks.
A standard ResNet implementation in PyTorch (the bottleneck block plus the full network):
```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv3(out)
        out = self.bn3(out)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out

class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=1000):
        super(ResNet, self).__init__()
        self.inplanes = 64
        # initial processing (stem)
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # four stages of residual blocks
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        # classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)
        # He initialization for convs, constant initialization for BN
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        # create a projection shortcut when the dimensions change
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )
        layers = []
        # the first block may downsample
        layers.append(block(self.inplanes, planes, stride, downsample))
        # update the running channel count
        self.inplanes = planes * block.expansion
        # add the remaining blocks
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

# build ResNet-50
def resnet50(num_classes=1000, pretrained=False):
    model = ResNet(Bottleneck, [3, 4, 6, 3], num_classes=num_classes)
    if pretrained:
        # load pretrained weights here
        pass
    return model
```
To deploy ResNet on resource-constrained devices, common optimization techniques include:
Knowledge distillation: a trained large model (the teacher) guides a small model (the student)
Quantization: converting 32-bit floating-point weights to 8-bit integer or mixed-precision representations
Pruning: removing unimportant connections or channels to reduce parameters and computation
Low-rank decomposition: factoring convolution kernels into smaller components to reduce computation
ONNX export: exporting the model to a common intermediate representation for cross-platform deployment
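As one concrete example of the quantization item, PyTorch's dynamic quantization converts `nn.Linear` weights to int8 in a single call. A sketch on a toy classifier head (convolutional layers need the heavier static or quantization-aware workflows instead):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 10))

# weights are stored as int8; activations are quantized on the fly at inference
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 2048)
print(qmodel(x).shape)  # torch.Size([1, 10])
```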
ResNet was originally designed for image classification and excels on datasets such as ImageNet and CIFAR.
Example: medical image diagnosis
In a skin-lesion classification task, ResNet-50 is used to distinguish benign from malignant melanoma:
```python
# ResNet transfer-learning snippet for medical image classification
import torch
import torch.nn as nn
import torchvision.models as models

# load a pretrained ResNet-50
model = models.resnet50(pretrained=True)

# replace the final layer to fit the new task
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 2)  # binary classification: benign / malignant

# freeze the early layers; leave only the last ~28 parameter tensors
# (roughly the final stage plus the head) trainable
for param in list(model.parameters())[:-28]:
    param.requires_grad = False

# weighted cross-entropy to handle class imbalance
weights = torch.tensor([1.0, 3.5])  # assuming malignant samples are rarer
criterion = nn.CrossEntropyLoss(weight=weights)
```
ResNet commonly serves as the backbone of object-detection systems such as Faster R-CNN, SSD, and YOLO.
Example: industrial quality inspection
In manufacturing, ResNet-based detection systems are used to identify product defects:
```python
# building a Faster R-CNN detector with a ResNet backbone
import torch.nn as nn
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.models.detection.backbone_utils import BackboneWithFPN
from collections import OrderedDict

# load a pretrained ResNet-50 and keep only the layers used for feature maps
resnet = torchvision.models.resnet50(pretrained=True)
backbone = nn.Sequential(OrderedDict([
    ('conv1', resnet.conv1),
    ('bn1', resnet.bn1),
    ('relu', resnet.relu),
    ('maxpool', resnet.maxpool),
    ('layer1', resnet.layer1),
    ('layer2', resnet.layer2),
    ('layer3', resnet.layer3),
    ('layer4', resnet.layer4),
]))

# wrap the backbone with an FPN over the four stage outputs
fpn_backbone = BackboneWithFPN(
    backbone,
    return_layers={'layer1': '0', 'layer2': '1', 'layer3': '2', 'layer4': '3'},
    in_channels_list=[256, 512, 1024, 2048],
    out_channels=256
)

# one anchor size per FPN level (the FPN adds a fifth 'pool' level)
anchor_generator = AnchorGenerator(
    sizes=((32,), (64,), (128,), (256,), (512,)),
    aspect_ratios=((0.5, 1.0, 2.0),) * 5
)
roi_pooler = torchvision.ops.MultiScaleRoIAlign(
    featmap_names=['0', '1', '2', '3'],
    output_size=7,
    sampling_ratio=2
)
model = FasterRCNN(
    fpn_backbone,
    num_classes=5,  # background + 4 defect types
    rpn_anchor_generator=anchor_generator,
    box_roi_pool=roi_pooler,
    min_size=600,
    max_size=1000
)
```
ResNet is also widely used as the encoder in semantic-segmentation networks such as FCN, PSPNet, and DeepLab.
Example: remote-sensing image segmentation
In satellite-image analysis, ResNet-based segmentation networks perform land-cover classification:
```python
# semantic segmentation with ResNet as the DeepLabv3 encoder
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models.segmentation as segmentation

# load a pretrained model and customize the number of output classes
model = segmentation.deeplabv3_resnet101(pretrained=True)
model.classifier[4] = nn.Conv2d(256, 7, kernel_size=1)  # 7 land-cover classes

# custom combined loss: cross-entropy plus Dice
class CombinedLoss(nn.Module):
    def __init__(self, weight=None):
        super(CombinedLoss, self).__init__()
        self.ce = nn.CrossEntropyLoss(weight=weight)

    def forward(self, inputs, targets):
        ce_loss = self.ce(inputs, targets)
        # Dice loss
        inputs_soft = F.softmax(inputs, dim=1)
        batch_size = inputs.size(0)
        dice_loss = 0
        # per-class Dice coefficient, skipping the background class
        for i in range(1, inputs.size(1)):
            inputs_cls = inputs_soft[:, i].contiguous().view(batch_size, -1)
            target_cls = (targets == i).float().view(batch_size, -1)
            intersection = (inputs_cls * target_cls).sum(1)
            dice_coef = (2. * intersection) / (inputs_cls.sum(1) + target_cls.sum(1) + 1e-6)
            dice_loss += (1 - dice_coef).mean()
        dice_loss /= (inputs.size(1) - 1)
        # combine the two terms
        return ce_loss + dice_loss
```
ResNet has also seen great success in face recognition, especially when paired with specialized loss functions.
Example: face recognition for security
In surveillance systems, ResNet-based face-recognition models perform identity verification:
```python
# ResNet-based face recognition with an ArcFace head
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class ArcFaceLayer(nn.Module):
    def __init__(self, in_features, out_features, s=30.0, m=0.50):
        super(ArcFaceLayer, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s  # scale factor
        self.m = m  # angular margin
        self.weight = nn.Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x, label=None):
        # cosine similarity between L2-normalized embeddings and class weights
        cosine = F.linear(F.normalize(x), F.normalize(self.weight))
        if label is None:  # at test time, return the scaled cosine similarities
            return cosine * self.s
        # cos(theta + m): the margin is applied only to the target-class logits below
        theta = torch.acos(torch.clamp(cosine, -1. + 1e-7, 1. - 1e-7))
        phi = torch.cos(theta + self.m)
        # replace the target-class entries of the logits with the margin version
        one_hot = torch.zeros_like(cosine)
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        return output * self.s

# face-recognition model
class FaceRecognitionModel(nn.Module):
    def __init__(self, num_classes):
        super(FaceRecognitionModel, self).__init__()
        # shrink the first conv's kernel and stride for small face crops
        resnet = models.resnet50(pretrained=True)
        resnet.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        resnet.maxpool = nn.Identity()  # drop max pooling to preserve detail
        # remove the original classification head
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.embedding = nn.Linear(2048, 512)
        self.arc_face = ArcFaceLayer(512, num_classes)

    def forward(self, x, labels=None):
        x = self.backbone(x)
        x = torch.flatten(x, 1)
        embedding = self.embedding(x)
        if self.training:
            return self.arc_face(embedding, labels)
        else:
            return F.normalize(embedding)  # unit vectors for similarity comparison
```
Spatiotemporal extensions of ResNet perform well in video analysis.
Example: retail customer-behavior analysis
In retail environments, 3D-ResNet action-recognition systems analyze patterns of customer behavior:
```python
# video action recognition with a 3D-ResNet
import torch
import torch.nn as nn

class Conv3DSimple(nn.Conv3d):
    def __init__(self, in_planes, out_planes, kernel_size, stride=1, padding=0):
        super(Conv3DSimple, self).__init__(
            in_planes, out_planes, kernel_size=(3, kernel_size, kernel_size),
            stride=(1, stride, stride), padding=(1, padding, padding))

class BasicBlock3D(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock3D, self).__init__()
        # first 3D conv block
        self.conv1 = Conv3DSimple(inplanes, planes, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm3d(planes)
        self.relu = nn.ReLU(inplace=True)
        # second 3D conv block
        self.conv2 = Conv3DSimple(planes, planes, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out

# 3D-ResNet model for video analysis
class ResNet3D(nn.Module):
    def __init__(self, block, layers, num_classes=400):
        super(ResNet3D, self).__init__()
        self.inplanes = 64
        # initial 3D convolutional layer
        self.conv1 = nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
        self.bn1 = nn.BatchNorm3d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
        # residual stages
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        # temporal attention
        self.temporal_pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time dimension
        self.temporal_attention = nn.Sequential(
            nn.Conv1d(512 * block.expansion, 128, kernel_size=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 512 * block.expansion, kernel_size=1),
            nn.Sigmoid()
        )
        # classification head
        self.avgpool = nn.AdaptiveAvgPool3d((1, 1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        # spatial-only downsampling on the shortcut, matching Conv3DSimple's striding
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv3d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=(1, stride, stride), bias=False),
                nn.BatchNorm3d(planes * block.expansion),
            )
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes))
        return nn.Sequential(*layers)
```