torch_npu Inference Configuration on Ascend Devices

1. NPU inference on the Ascend 310B1

  To run PyTorch-based inference on the Ascend 310B1 NPU, the existing GPU/CUDA operations are replaced through torch_npu.

  torch_npu technical reference: Ascend Extension for PyTorch
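
  As a minimal sketch of that substitution (assuming torch_npu and a matching CANN toolkit are installed; MyModel is a hypothetical placeholder for your own network), the usual CUDA boilerplate carries over with "npu" in place of "cuda":

    import torch
    import torch_npu  # registers the "npu" device type with PyTorch

    # "npu" replaces "cuda" in device selection
    device = torch.device("npu:0" if torch.npu.is_available() else "cpu")

    model = MyModel().to(device).eval()  # MyModel: hypothetical placeholder
    x = torch.randn(1, 3, 224, 224, dtype=torch.float32).to(device)

    with torch.no_grad():
        y = model(x)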


2. Problems you may hit during inference, and their solutions

1. The NPU does not support the double data type

  • The error log reports:

    Warning: Device do not support double dtype now, dtype cast repalce with float.

       This indicates that the NPU (Ascend 310B1) does not support the double (i.e. float64) data type, so the runtime automatically casts it to float (i.e. float32).

  • Solution
    Make sure every tensor is float32 rather than double. The dtype can be specified explicitly at tensor creation:

    x = torch.randn(2, 3, dtype=torch.float32).to(device)
    y = torch.randn(2, 3, dtype=torch.float32).to(device)
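
    If the float64 values come from an existing model or from NumPy (whose arrays default to float64), a minimal sketch of casting everything down first (model and device as defined elsewhere in your script):

    import numpy as np
    import torch

    model = model.float()        # cast all parameters and buffers to float32

    arr = np.random.rand(2, 3)   # NumPy defaults to float64
    t = torch.from_numpy(arr).float().to(device)

    # optionally make float32 the default dtype for newly created tensors
    torch.set_default_dtype(torch.float32)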

2. Environment variables not configured correctly

  • The error log reports:

    ImportError: libhccl.so: cannot open shared object file: No such file or directory

       A missing libhccl.so is usually an environment-variable problem: the CANN environment setup command must be run before executing any code that imports torch_npu.

  • Solution
      Before launching the Python script, run the CANN environment setup command in the shell, as shown below.
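
      The exact command depends on the install path; with a default CANN install (the /usr/local/Ascend/ascend-toolkit path visible in the logs later in this article), it is typically:

    source /usr/local/Ascend/ascend-toolkit/set_env.sh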

3. The NPU does not support certain operators

  • The error log reports:

    RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is MaxPoolWithArgmaxV1.

       The likely cause is that MaxPoolWithArgmaxV1 is not supported by the current version of the CANN toolkit.

  • Solution
      When an operator is unsupported, a functionally similar operator can often stand in for it. Here, since MaxPool2d is what triggers the unsupported MaxPoolWithArgmaxV1 kernel, AvgPool2d can be used instead, as sketched below.
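
      A minimal sketch of the swap, keeping the same kernel/stride/padding so output shapes are unchanged (average pooling computes different values than max pooling, so accuracy should be re-validated after the substitution):

    import torch.nn as nn

    # before: the layer that triggered the MaxPoolWithArgmaxV1 error above
    # pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    # after: identical output shape, but averages instead of taking the max
    pool = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)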

  • The error log reports:

    [W compiler_depend.ts:387] Warning: E40021: Failed to compile Op [DropOutDoMask]. (oppath: [Compile /usr/local/Ascend/ascend-toolkit/7.0.RC1/opp/built-in/op_impl/ai_core/tbe/impl/drop_out_do_mask.py failed with errormsg/stack: File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/tbe/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
        raise get_last_ffi_error()
    

       The likely cause is that DropOutDoMask is not supported by the current version of the CANN toolkit.

  • Solution
      Dropout is simple enough to implement by hand from basic tensor operations, as in the layer below.

    # Custom DropoutLayer, built from elementwise primitives only
    import torch
    import torch.nn as nn

    class DropoutLayer(nn.Module):
        def __init__(self, p=0.5):
            super(DropoutLayer, self).__init__()
            self.p = p

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            if self.training:  # apply dropout only in training mode
                # build the mask: uniform random numbers, keep positions above p
                mask = (torch.rand_like(x) > self.p).float()
                # zero the dropped elements and rescale the survivors
                return x * mask / (1 - self.p)
            return x  # identity in eval mode
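
  Because the mask is built from rand_like, a comparison, and a multiply, only elementwise kernels are involved and the unsupported DropOutDoMask op is never reached; and since the layer is an identity in eval() mode, substituting it changes nothing at inference time.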

3. Measured CPU and NPU performance
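
The per-module latencies below were measured layer by layer. As a rough sketch of how such numbers can be collected (an assumption about methodology, not the exact harness used here; torch.npu.synchronize() drains the device queue so that wall-clock timing is meaningful, and the model's top-level children are assumed to chain output-to-input):

    import time
    import torch
    import torch_npu  # registers the "npu" device with PyTorch

    def time_children(model, x, warmup=3, iters=10):
        """Rough per-child latency in ms; model and x must already be on the NPU."""
        results = {}
        with torch.no_grad():
            for name, child in model.named_children():
                for _ in range(warmup):      # warm up operator compilation caches
                    out = child(x)
                torch.npu.synchronize()      # drain queued NPU work before timing
                t0 = time.perf_counter()
                for _ in range(iters):
                    out = child(x)
                torch.npu.synchronize()      # make sure all iterations finished
                results[name] = (time.perf_counter() - t0) / iters * 1000.0
                x = out                      # feed the output to the next child
        return results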

1. AlexNet

The model structure is as follows:

------------------------------------------------------------------
1-Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(1, 1)) 

------------------------------------------------------------------
2-Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2)) 

------------------------------------------------------------------
3-Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) 

------------------------------------------------------------------
4-Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) 

------------------------------------------------------------------
5-Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) 

------------------------------------------------------------------
6-Linear(in_features=9216, out_features=4096, bias=True) 

------------------------------------------------------------------
7-Linear(in_features=4096, out_features=4096, bias=True) 

------------------------------------------------------------------
8-Linear(in_features=4096, out_features=1000, bias=True) 
Per-module latency (all times in ms):

Device\Module        1        2        3        4        5         6         7        8   Total (ms)
CPU              5.349    8.966    5.186    5.349    4.071   186.742    81.339    9.404      306.406
NPU              0.456    0.922    3.265    4.786    2.954    13.936     7.239    1.836       35.394

2. LeNet

The model structure is as follows:

------------------------------------------------------------------
1-Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2)) 

------------------------------------------------------------------
2-Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1)) 

------------------------------------------------------------------
3-Linear(in_features=46656, out_features=1024, bias=True) 

------------------------------------------------------------------
4-Linear(in_features=1024, out_features=1024, bias=True) 

------------------------------------------------------------------
5-Linear(in_features=1024, out_features=1000, bias=True) 
Per-module latency (all times in ms):

Device\Module        1        2         3        4        5   Total (ms)
CPU              6.490    4.001   226.313    2.632    1.814      241.250
NPU              0.434    0.245    11.871    0.446    0.416       13.412

3. ResNet

The model structure is as follows (the block layout matches ResNet-34; note the AvgPool2d in the stem, which is the MaxPool2d workaround from Section 2):

------------------------------------------------------------------
1-Sequential(
  (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): AvgPool2d(kernel_size=3, stride=2, padding=1)
) 

------------------------------------------------------------------
2-BasicBlock(
  (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
3-BasicBlock(
  (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
4-BasicBlock(
  (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
5-BasicBlock(
  (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (downsample): Sequential(
    (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
    (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
) 

------------------------------------------------------------------
6-BasicBlock(
  (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
7-BasicBlock(
  (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
8-BasicBlock(
  (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
9-BasicBlock(
  (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (downsample): Sequential(
    (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
    (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
) 

------------------------------------------------------------------
10-BasicBlock(
  (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
11-BasicBlock(
  (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
12-BasicBlock(
  (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
13-BasicBlock(
  (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
14-BasicBlock(
  (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
15-BasicBlock(
  (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (downsample): Sequential(
    (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
) 

------------------------------------------------------------------
16-BasicBlock(
  (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
17-BasicBlock(
  (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
) 

------------------------------------------------------------------
18-Sequential(
  (0): AdaptiveAvgPool2d(output_size=(1, 1))
  (1): Flatten(start_dim=1, end_dim=-1)
  (2): Linear(in_features=512, out_features=1000, bias=True)
) 



Per-module latency for the 18 modules listed above:

Module    CPU (ms)    NPU (ms)
1           16.252       3.313
2           16.860       1.062
3           16.917       0.993
4           17.021       0.979
5           16.515       1.633
6           15.840       1.520
7           15.983       1.499
8           16.053       1.509
9           22.174       4.218
10          24.503       5.088
11          26.659       5.087
12          24.864       5.074
13          25.237       5.075
14          25.101       5.087
15          63.421      16.119
16          57.389      21.140
17          57.276      21.282
18           1.259       0.364
Total      458.324     101.042
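
End to end, the NPU comes out roughly 8.7× faster than the CPU on AlexNet (35.394 ms vs. 306.406 ms), about 18× on LeNet, and about 4.5× on ResNet; the largest single-module gains are on the large fully connected layers (AlexNet's module 6 drops from 186.742 ms to 13.936 ms).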
