Programming Tensor Cores: NATIVE VOLTA TENSOR CORES WITH CUTLASS

Programming Tensor Cores: Native Volta Tensor Cores with CUTLASS is based on the GTC Silicon Valley 2019 talk of the same name and describes how CUTLASS 1.3 implements efficient matrix multiplication on Volta Tensor Cores. The main content covers three points:

  • the mma.sync instruction introduced in CUDA 10.1;
  • 128-bit accesses along the Global Memory → Shared Memory → RF path;
  • conflict-free transposes in Shared Memory.

Double buffering is not covered.

mma

Both the slides and the source code take a bottom-up approach, choosing the tiling strategy at each level according to the hardware's capabilities. A Volta Tensor Core performs a 4x4x4 matrix multiply-accumulate; one HMMA.884.F16.F16 instruction uses two Tensor Cores and takes two passes.

[Figure 1]
The mma instruction wrapped by CUTLASS computes an m16n16k4 matrix multiplication.
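
For reference, here is a minimal sketch (not CUTLASS's own wrapper) of issuing the underlying PTX instruction directly: mma.sync.aligned.m8n8k4 with f16 accumulation, which compiles to HMMA.884.F16.F16 on sm_70. The fragment sizes follow the PTX ISA section on matrix fragments for mma.m8n8k4 cited in the references; the helper name is made up here.

```cuda
// Minimal sketch: one Volta mma.sync.aligned.m8n8k4 issued via inline PTX.
// Each thread of a quad pair supplies 4 halfs of A and 4 halfs of B (packed
// into two .b32 registers each) and carries 8 halfs of accumulator (four .b32).
__device__ void mma_m8n8k4_f16(unsigned const (&a)[2],
                               unsigned const (&b)[2],
                               unsigned (&acc)[4])
{
    asm volatile(
        "mma.sync.aligned.m8n8k4.row.col.f16.f16.f16.f16 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6,%7}, {%0,%1,%2,%3};\n"
        : "+r"(acc[0]), "+r"(acc[1]), "+r"(acc[2]), "+r"(acc[3])
        : "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(b[1]));
}
```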

[Figure 2]
Following the description in Modeling Deep Learning Accelerator Enabled GPUs: four consecutive threads within a warp form a threadgroup, and two threadgroups form an octet. Each octet serially computes one Quad Pair (QP), and data is reused when computing different QPs, as shown below:

[Figure 3]
The figure below shows the mapping between threads and data within QP0.

[Figure 4]
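
As a hedged sketch of the index math implied by this grouping (the pairing of quads i and i+4 into quad pair i follows the PTX ISA and the microbenchmarking papers cited in the references, not a statement in the slides themselves):

```cuda
// Hypothetical helpers: four consecutive lanes form a threadgroup; threadgroups
// i and i+4 (lanes 4*i..4*i+3 and 4*i+16..4*i+19) form octet / quad pair i.
__device__ int threadgroup_of(int lane) { return lane / 4; }
__device__ int quad_pair_of(int lane)   { return (lane % 16) / 4; }
```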

Permuted Shared Memory Tiles

For a column-major matrix A in global memory, each thread loads 8 elements, so a warp loads an m64k4 tile. However, given the thread-to-data mapping described above, writing the tile straight into Shared Memory would cause bank conflicts when the threads read it back. CUTLASS therefore uses a conflict-free shared-memory permutation to realize the transpose.
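
As a rough illustration of the 128-bit access pattern (the exact thread-to-element mapping in CUTLASS follows the figures below; the mapping used here is a simplified stand-in):

```cuda
// Illustrative sketch, not the CUTLASS iterator: each thread fetches 8
// consecutive half elements of a column-major A tile with a single 128-bit
// (uint4) load, so a 32-thread warp covers an m64k4 half tile per access.
// Assumes A + col * lda + row is 16-byte aligned.
#include <cuda_fp16.h>

__device__ uint4 load_a_fragment_128b(const __half* A, int lda, int lane)
{
    int row = (lane % 8) * 8;   // 8 threads cover 64 rows, 8 rows each
    int col = lane / 8;         // 4 columns -> k = 4
    return *reinterpret_cast<const uint4*>(A + col * lda + row);
}
```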

[Figure 5]
[Figure 6]
The second group of threads

[Figure 7]
The third group of threads

[Figure 8]
The fourth group of threads

[Figure 9]

Pointer Offsets For Permuted Shared Memory

Volta884Multiplicand defines the iterators for the multiplicands (A and B):

  • TileLoadIterator: iteratively loads data from Global Memory into registers;
  • Volta884ThreadblockMultiplicandStoreIterator: handles the placement into permuted Shared Memory; Volta884ThreadblockMultiplicandStoreIterator::ThreadOffset corresponds to the figure below (a sketch of this kind of offset computation follows the figure);
  • Volta884WarpMultiplicandLoadIterator: reads the data back from Shared Memory.

[Figure 10]
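
Below is a generic XOR-swizzle sketch of the kind of pointer arithmetic such a store iterator performs. It is not the exact Volta884 permutation of CUTLASS 1.3; it only illustrates the idea that the slot a thread writes to is permuted by its row so that later reads land in distinct banks.

```cuda
// Hypothetical permuted-store offset: the tile is stored as rows of eight
// 128-bit (8-half) vectors, and the vector column is XOR-ed with the row
// index so that a logical column of vectors maps to eight distinct slots.
__device__ int permuted_smem_offset(int row, int vec_col)
{
    int swizzled = vec_col ^ (row % 8);   // permute within the row
    return row * 8 + swizzled;            // offset in units of 8-half vectors
}
```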

Conflict-Free Shared Memory Loads

Loading data from Shared Memory into thread registers again takes four steps. The threads access the same data in the first two steps, so in total a 32x4 tile of A is loaded, and each element in Shared Memory is consumed twice.

[Figure 11]
[Figure 12]
[Figure 13]
[Figure 14]
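
A small host-side check of the hypothetical swizzle above (not of CUTLASS's actual layout), verifying that one load step in which eight threads read one logical vector column touches every 4-byte bank exactly once:

```cuda
// Illustrative sanity check: with the XOR permutation, the eight 128-bit
// accesses of one logical vector column start in eight distinct 16-byte
// segments, so together they cover all 32 four-byte banks without conflicts.
#include <cassert>
#include <set>

int permuted_smem_offset_host(int row, int vec_col) { return row * 8 + (vec_col ^ (row % 8)); }

int main() {
    for (int vec_col = 0; vec_col < 8; ++vec_col) {        // any logical column
        std::set<int> banks;
        for (int row = 0; row < 8; ++row) {                 // eight reading threads
            int byte_off = permuted_smem_offset_host(row, vec_col) * 16;
            for (int b = 0; b < 4; ++b)                     // a 128-bit access spans 4 banks
                banks.insert((byte_off / 4 + b) % 32);
        }
        assert(banks.size() == 32);                         // every bank hit exactly once
    }
    return 0;
}
```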

Spatially Interleaved

As noted above, each thread reads 8 elements from Shared Memory, while an mma instruction consumes only 4 elements per thread, so the computed output is spatially interleaved. By further tiling A and B, one load can feed four mma computations.

[Figure 15]
[Figure 16]
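
An illustrative sketch of this interleaving, reusing the mma_m8n8k4_f16 helper sketched earlier; the fragment packing is hypothetical rather than CUTLASS's exact layout:

```cuda
// Each thread's 8-element A and B fragments (one 128-bit load each) are split
// into two 4-element halves; the four half-combinations feed four m8n8k4 mma
// issues that accumulate into four spatially interleaved output fragments.
__device__ void interleaved_mma(const unsigned (&a_frag)[4],   // 8 halfs of A
                                const unsigned (&b_frag)[4],   // 8 halfs of B
                                unsigned (&acc)[4][4])          // 4 output fragments
{
    for (int i = 0; i < 2; ++i) {          // which half of the A fragment
        for (int j = 0; j < 2; ++j) {      // which half of the B fragment
            const unsigned a_half[2] = { a_frag[2 * i], a_frag[2 * i + 1] };
            const unsigned b_half[2] = { b_frag[2 * j], b_frag[2 * j + 1] };
            mma_m8n8k4_f16(a_half, b_half, acc[2 * i + j]);
        }
    }
}
```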

References:

  • [DOC] Where does cutlass’ detailed GEMM kernel? #526
  • Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
  • Modeling Deep Learning Accelerator Enabled GPUs
  • gpgpu-sim_distribution
  • 理解Tensor Core
  • Flexible Performant GEMM Kernels on GPUs
  • CUDA Tensor Core编程
  • PROGRAMMING TENSOR CORES: NATIVE VOLTA TENSOR CORES WITH CUTLASS
  • The NVIDIA Titan V Deep Learning Deep Dive: It’s All About The Tensor Cores
  • 9.7.13.4.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type
  • Numerical Behavior of NVIDIA Tensor Cores
  • CUDA Ampere Tensor Core HGEMM 矩阵乘法优化笔记 —— Up To 131 TFLOPS!
  • If we have two or four memory requests by a warp, do they need coalesced access/contiguity? #328
  • Do bank conflicts increase when using more shared memory?
  • How does parameter computeType affect the computation?
  • 2.1.10. GEMM Algorithms Numerical Behavior
  • cuBLAS的使用
  • RAFT在Knowhere上的一些评估测试[1]
  • cudnn-frontend/tree/main/samples/samples/conv_sample.cpp
  • Is a union in C++ actually a class?
  • A Generalized Micro-kernel Abstraction for GPU Linear Algebra
  • Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs
  • Double-buffering in shared memory, details? #227
  • Efficient GEMM in CUDA
  • Thread synchronization with syncwarp
  • Using CUDA Warp-Level Primitives
  • CUDA微架构与指令集(3)-SASS指令集分类
  • VOLTA Architecture and performance optimization
  • How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
  • Determining registers holding the data after executing LDG.E.128
