IREE::Util::createSimplifyGlobalAccessesPass
This pass mainly does the following: it hoists loads of global tensors to the beginning of the block, sinks stores of global tensors (where safe) to the end of the block, and then simplifies redundant accesses.
For a load after a store, the load is replaced with the store's source. For example,
store %0, @p
%1 = load @p
return %1
becomes
store %0, @p
return %0
For a store after a store, the earlier store is removed:
store %0, @p
store %1, @p
becomes
store %1, @p
For a load after a load, the later load is removed:
%0 = load @p
%1 = load @p
return %1
becomes
%0 = load @p
return %0
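These rewrites can be sketched as a toy peephole over a flat list of global accesses. This is a hypothetical simplification for illustration only; the real pass works on MLIR blocks and also handles the hoisting/sinking not modeled here.

```python
# A toy peephole over a flat list of accesses to global variables, sketching
# (hypothetically) the load/store simplifications this pass applies per block.
def simplify(ops):
    """ops: list of ("store", global, value) or ("load", global, result)."""
    known = {}   # global -> SSA value currently known to be in it
    remap = {}   # load result -> value it can be replaced with
    out = []
    for op in ops:
        kind, g, v = op
        if kind == "store":
            # store-after-store: the earlier store to the same global is dead
            out = [o for o in out if not (o[0] == "store" and o[1] == g)]
            known[g] = v
            out.append(op)
        else:  # load
            if g in known:
                # load-after-store / load-after-load: reuse the known value
                remap[v] = known[g]
            else:
                known[g] = v
                out.append(op)
    return out, remap

# store %0, @p ; %1 = load @p  ==>  the load is replaced by %0
ops, remap = simplify([("store", "@p", "%0"), ("load", "@p", "%1")])
assert ops == [("store", "@p", "%0")] and remap == {"%1": "%0"}
```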
IREE::Util::createApplyPatternsPass
Applies the canonicalization patterns defined in the IREE::Util dialect ODS, and also simplifies block arguments and branch operands. For example,
br ^bb1(%0, %0 : index, index)
^bb1(%arg0: index, %arg1: index):
...
Folding the identical operands simplifies this to
br ^bb1(%0 : index)
^bb1(%arg0: index): // %arg1 remapped to %arg0
...
Block arguments that merely forward a dominating value are also eliminated:
func.func @foo(%arg0: index) {
br ^bb1(%arg0 : index)
^bb1(%0: index):
...
}
After eliminating the argument, this becomes
func.func @foo(%arg0: index) {
br ^bb1
^bb1: // %0 remapped to %arg0
...
}
IREE::Util::createFoldGlobalsPass
This pass further optimizes loads and stores of global tensors. Its main rewrites include:
Folding a store of a constant into the global's initial value:
util.global mutable @a : i32
func.func @fool {
%c5 = arith.constant 5 : i32
util.global.store %c5, @a : i32
return
}
becomes
util.global @a = 5 : i32
Inlining loads of an immutable constant global:
util.global @a = 5 : i32
func.func @fool {
%1 = util.global.load @a : i32
...
}
becomes
func.func @fool {
%1 = arith.constant 5 : i32
...
}
It also changes a mutable global tensor to immutable when it is only ever stored to from the initializer, removes unused global tensors, and merges identical immutable global tensors.
IREE::Flow::createTensorPadToTensorInsertSlicePass
Converts tensor.pad into linalg.fill + tensor.insert_slice. For example,
func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x1xf32>
%padded = tensor.pad %0 low[1, 2] high[3, 4] {
^bb0(%arg1: index, %arg2: index):
tensor.yield %cst : f32
} : tensor<1x1xf32> to tensor<5x7xf32>
%1 = hal.tensor.export %padded : tensor<5x7xf32> -> !hal.buffer_view
return %1 : !hal.buffer_view
}
becomes
func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x1xf32>
%1 = tensor.empty() : tensor<5x7xf32>
%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<5x7xf32>) -> tensor<5x7xf32>
%inserted_slice = tensor.insert_slice %0 into %2[1, 2] [1, 1] [1, 1] : tensor<1x1xf32> into tensor<5x7xf32>
%3 = hal.tensor.export %inserted_slice : tensor<5x7xf32> -> !hal.buffer_view
return %3 : !hal.buffer_view
}
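The equivalence of the two forms can be checked numerically with NumPy; this is a sketch of the semantics, not of the pass implementation:

```python
import numpy as np

x = np.array([[42.0]], dtype=np.float32)        # tensor<1x1xf32>

# tensor.pad low[1, 2] high[3, 4] with constant 0.0 yields a 5x7 result ...
padded = np.pad(x, [(1, 3), (2, 4)], constant_values=0.0)

# ... which equals linalg.fill followed by tensor.insert_slice:
dest = np.full((5, 7), 0.0, dtype=np.float32)   # linalg.fill with %cst = 0.0
dest[1:2, 2:3] = x                              # insert_slice at offsets [1, 2]

assert np.array_equal(padded, dest)
```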
mlir::createConvertElementwiseToLinalgPass
Converts elementwise ops (ops carrying the Elementwise trait) into linalg.generic ops so that later passes can fuse them. Ops in the arith and math dialects are Elementwise, so in practice this pass lowers tensor-typed arith and math ops into the linalg dialect. For example,
func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
%1 = arith.addf %0, %0 : tensor<2x3xf32>
%2 = hal.tensor.export %1 : tensor<2x3xf32> -> !hal.buffer_view
return %2 : !hal.buffer_view
}
becomes
func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
%1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%0, %0 : tensor<2x3xf32>, tensor<2x3xf32>) outs(%0 : tensor<2x3xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%3 = arith.addf %in, %in_0 : f32
linalg.yield %3 : f32
} -> tensor<2x3xf32>
%2 = hal.tensor.export %1 : tensor<2x3xf32> -> !hal.buffer_view
return %2 : !hal.buffer_view
}
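In loop form, the linalg.generic above is a point-by-point application of the arith.addf body; a NumPy sketch of its semantics:

```python
import numpy as np

x = np.arange(6, dtype=np.float32).reshape(2, 3)   # tensor<2x3xf32>

# the linalg.generic above: identity indexing maps, all-parallel iterators,
# and an arith.addf body applied point by point
out = np.empty_like(x)
for d0 in range(2):
    for d1 in range(3):
        out[d0, d1] = x[d0, d1] + x[d0, d1]

assert np.array_equal(out, x + x)
```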
mlir::createLinalgFoldUnitExtentDimsPass
Eliminates dimensions and loops of extent 1. For example,
func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x3xf32>
%1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%0 : tensor<1x3xf32>) outs(%0 : tensor<1x3xf32>) {
^bb0(%in: f32, %out: f32):
%3 = arith.addf %in, %in : f32
linalg.yield %3 : f32
} -> tensor<1x3xf32>
%2 = hal.tensor.export %1 : tensor<1x3xf32> -> !hal.buffer_view
return %2 : !hal.buffer_view
}
becomes
func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x3xf32>
%collapsed = tensor.collapse_shape %0 [[0, 1]] : tensor<1x3xf32> into tensor<3xf32>
%collapsed_0 = tensor.collapse_shape %0 [[0, 1]] : tensor<1x3xf32> into tensor<3xf32>
%1 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%collapsed : tensor<3xf32>) outs(%collapsed_0 : tensor<3xf32>) {
^bb0(%in: f32, %out: f32):
%3 = arith.addf %in, %in : f32
linalg.yield %3 : f32
} -> tensor<3xf32>
%expanded = tensor.expand_shape %1 [[0, 1]] : tensor<3xf32> into tensor<1x3xf32>
%2 = hal.tensor.export %expanded : tensor<1x3xf32> -> !hal.buffer_view
return %2 : !hal.buffer_view
}
The linalg.generic is reduced from a 2-level loop nest to a single loop.
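A NumPy sketch of why the collapse is safe: iterating the collapsed 1-D shape and expanding the result back gives the same values as the original 2-D iteration.

```python
import numpy as np

x = np.arange(3, dtype=np.float32).reshape(1, 3)   # tensor<1x3xf32>

two_d = x + x                                      # original 2-level loop nest

# collapse_shape -> 1-D generic -> expand_shape
one_d = (x.reshape(3) + x.reshape(3)).reshape(1, 3)

assert np.array_equal(two_d, one_d)
```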
createInterchangeGenericOpsPass
Interchanges loop dimensions, moving reduction dimensions innermost and the corresponding parallel dimensions outward. For example,
// sum(%arg0: tensor<2x3xf32>, 0) -> tensor<3xf32>
func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
%1 = tensor.empty() : tensor<3xf32>
%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<3xf32>) -> tensor<3xf32>
%3 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>], iterator_types = ["reduction", "parallel"]} ins(%0 : tensor<2x3xf32>) outs(%2 : tensor<3xf32>) {
^bb0(%in: f32, %out: f32):
%5 = arith.addf %in, %out : f32
linalg.yield %5 : f32
} -> tensor<3xf32>
%4 = hal.tensor.export %3 : tensor<3xf32> -> !hal.buffer_view
return %4 : !hal.buffer_view
}
After interchanging the loops, this becomes
func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
%1 = tensor.empty() : tensor<3xf32>
%2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<3xf32>) -> tensor<3xf32>
%3 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%0 : tensor<2x3xf32>) outs(%2 : tensor<3xf32>) {
^bb0(%in: f32, %out: f32):
%5 = arith.addf %in, %out : f32
linalg.yield %5 : f32
} -> tensor<3xf32>
%4 = hal.tensor.export %3 : tensor<3xf32> -> !hal.buffer_view
return %4 : !hal.buffer_view
}
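Written as explicit loops (following the affine maps and iterator types above), the interchange swaps the nesting order without changing the result:

```python
import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)   # tensor<2x3xf32>

# before: iterator_types = ["reduction", "parallel"], out[d1] += a[d0, d1]
out1 = np.zeros(3, dtype=np.float32)
for d0 in range(2):            # reduction, outermost
    for d1 in range(3):        # parallel, innermost
        out1[d1] += a[d0, d1]

# after: iterator_types = ["parallel", "reduction"], out[d0] += a[d1, d0]
out2 = np.zeros(3, dtype=np.float32)
for d0 in range(3):            # parallel, outermost
    for d1 in range(2):        # reduction, innermost
        out2[d0] += a[d1, d0]

assert np.array_equal(out1, out2)   # both compute the column sums of a
```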
memref::createResolveShapedTypeResultDimsPass
mlir::createCanonicalizerPass
mlir::createCSEPass
createFusionOfTensorOpsPass
Mainly performs elementwise fusion; it also converts tensor.expand_shape into linalg.generic ops to expose more fusion opportunities.
A condition for elementwise fusion: the producer and the consumer are both linalg.generic ops with tensor semantics. For example,
// reduce(mul(arg0, arg1), 0)
// for (int d0 = 0; d0 < n; ++d0) {
// temp[d0] = arg0[d0] * arg1[d0];
// }
// result = 0;
// for (int d0 = 0; d0 < n; ++d0) {
// result += temp[d0];
// }
func.func @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2xf32>
%1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2xf32>
%2 = tensor.empty() : tensor<2xf32>
%3 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%0, %1 : tensor<2xf32>, tensor<2xf32>) outs(%2 : tensor<2xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%8 = arith.mulf %in, %in_0 : f32
linalg.yield %8 : f32
} -> tensor<2xf32>
%4 = tensor.empty() : tensor<f32>
%5 = linalg.fill ins(%cst : f32) outs(%4 : tensor<f32>) -> tensor<f32>
%6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"]} ins(%3 : tensor<2xf32>) outs(%5 : tensor<f32>) {
^bb0(%in: f32, %out: f32):
%8 = arith.addf %in, %out : f32
linalg.yield %8 : f32
} -> tensor<f32>
%7 = hal.tensor.export %6 : tensor<f32> -> !hal.buffer_view
return %7 : !hal.buffer_view
}
After fusing the mul into the reduce, this becomes
// result = 0;
// for (int d0 = 0; d0 < n; ++d0) {
// result += arg0[d0] * arg1[d0];
// }
func.func @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2xf32>
%1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2xf32>
%2 = tensor.empty() : tensor<f32>
%3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<f32>) -> tensor<f32>
%4 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"]} ins(%0, %1 : tensor<2xf32>, tensor<2xf32>) outs(%3 : tensor<f32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%6 = arith.mulf %in, %in_0 : f32
%7 = arith.addf %6, %out : f32
linalg.yield %7 : f32
} -> tensor<f32>
%5 = hal.tensor.export %4 : tensor<f32> -> !hal.buffer_view
return %5 : !hal.buffer_view
}
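The two pseudocode loops above can be run directly to confirm that fusion preserves the result; a minimal sketch of the dot-product example:

```python
import numpy as np

arg0 = np.array([1.0, 2.0], dtype=np.float32)
arg1 = np.array([3.0, 4.0], dtype=np.float32)

# unfused: materialize temp = mul(arg0, arg1), then reduce it
temp = arg0 * arg1
unfused = temp.sum()

# fused: one reduction loop with a multiply-accumulate body
fused = np.float32(0.0)
for d0 in range(2):
    fused += arg0[d0] * arg1[d0]

assert fused == unfused == 11.0
```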
mlir::createLinalgDetensorizePass
Converts 0-D tensors into their underlying element type.
mlir::createCanonicalizerPass
mlir::createCSEPass
createSplitReductionPass
Splits the single reduction of a matmul or topk into two reductions (a batched matmul followed by an elementwise add). Disabled by default; set --iree-flow-split-matmul-reduction to >= 2 to enable it. For example,
func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
%1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
%2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
%3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
%4 = linalg.matmul ins(%0, %1 : tensor<128x256xf32>, tensor<256x256xf32>) outs(%3 : tensor<128x256xf32>) -> tensor<128x256xf32>
%5 = hal.tensor.export %4 : tensor<128x256xf32> -> !hal.buffer_view
return %5 : !hal.buffer_view
}
With --iree-flow-split-matmul-reduction=2, this becomes
func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
%1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
%2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
%3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
%4 = tensor.expand_shape %0 [[0], [1, 2]] : tensor<128x256xf32> into tensor<128x2x128xf32>
%5 = tensor.expand_shape %1 [[0, 1], [2]] : tensor<256x256xf32> into tensor<2x128x256xf32>
%6 = linalg.init_tensor [2, 128, 256] : tensor<2x128x256xf32>
%7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<2x128x256xf32>) -> tensor<2x128x256xf32>
%8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d0, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%4, %5 : tensor<128x2x128xf32>, tensor<2x128x256xf32>) outs(%7 : tensor<2x128x256xf32>) attrs = {__internal_linalg_transform__ = "SPLIT", linalg.memoized_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]} {
^bb0(%arg2: f32, %arg3: f32, %arg4: f32):
%11 = arith.mulf %arg2, %arg3 : f32
%12 = arith.addf %arg4, %11 : f32
linalg.yield %12 : f32
} -> tensor<2x128x256xf32>
%9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>], iterator_types = ["reduction", "parallel", "parallel"]} ins(%8 : tensor<2x128x256xf32>) outs(%3 : tensor<128x256xf32>) attrs = {__internal_linalg_transform__ = "SPLIT"} {
^bb0(%arg2: f32, %arg3: f32):
%11 = arith.addf %arg2, %arg3 : f32
linalg.yield %11 : f32
} -> tensor<128x256xf32>
%10 = hal.tensor.export %9 : tensor<128x256xf32> -> !hal.buffer_view
return %10 : !hal.buffer_view
}
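The split can be checked with NumPy: expand K=256 into [2, 128] on both operands (the two tensor.expand_shape ops), do the batched multiply-accumulate in parallel over the split dimension, then reduce the partial results. A sketch of the semantics:

```python
import numpy as np

m, k, n, split = 128, 256, 256, 2
a = np.random.rand(m, k).astype(np.float32)
b = np.random.rand(k, n).astype(np.float32)

# tensor.expand_shape: 128x256 -> 128x2x128 and 256x256 -> 2x128x256
a3 = a.reshape(m, split, k // split)
b3 = b.reshape(split, k // split, n)

# first generic: batched matmul, parallel over the split dimension s
partial = np.einsum('msk,skn->smn', a3, b3)
# second generic: elementwise reduction over the split dimension
result = partial.sum(axis=0)

assert np.allclose(result, a @ b, atol=1e-2)
```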
createInterchangeGenericOpsPass
Interchanges loop dimensions again, moving reduction dimensions innermost and the corresponding parallel dimensions outward.
createInterchangeTransposeGenericOpsPass
When an input's indexing map is a permutation, interchanges the loop dimensions so that the input's indexing map becomes the identity; the goal is to make input memory accesses as contiguous as possible.
createDispatchWithTransformDialect
Schedules and dispatches ops according to the transform dialect. This requires separately loading a transform dialect module file; the transformation is not applied by default. The transform dialect defines a set of scheduling rules that drive transformations of the target IR, such as loop unrolling and tiling.
createFormDispatchRegionsPass
Takes a linalg op containing a reduction loop, or a named linalg op, as the root and, following certain rules, merges its producers and consumers to carve out dispatch region subgraphs. A dispatch region is the atomic execution unit in IREE: inside a region, input and output memory can be reused directly, which avoids internal memory allocation; allocation happens only at region boundaries, and synchronization is inserted automatically between dispatch regions. For example,
func.func @predict(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x10xf32>
%1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<10x5xf32>
%2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<5xf32>
%3 = tensor.empty() : tensor<2x5xf32>
%4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<2x5xf32>) -> tensor<2x5xf32>
%5 = linalg.matmul ins(%0, %1 : tensor<2x10xf32>, tensor<10x5xf32>) outs(%4 : tensor<2x5xf32>) -> tensor<2x5xf32>
%6 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%5, %2 : tensor<2x5xf32>, tensor<5xf32>) outs(%3 : tensor<2x5xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%8 = arith.addf %in, %in_0 : f32
linalg.yield %8 : f32
} -> tensor<2x5xf32>
%7 = hal.tensor.export %6 : tensor<2x5xf32> -> !hal.buffer_view
return %7 : !hal.buffer_view
}
becomes
func.func @predict(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x10xf32>
%1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<10x5xf32>
%2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<5xf32>
%3 = tensor.empty() : tensor<2x5xf32>
%4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<2x5xf32>) -> tensor<2x5xf32>
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%c1_0 = arith.constant 1 : index
%5 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0, %c2, %c1_0]
%c0_1 = arith.constant 0 : index
%c5 = arith.constant 5 : index
%c1_2 = arith.constant 1 : index
%6 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1, %c5, %c1_2]
%7 = flow.dispatch.region[%5, %6] -> (tensor<2x5xf32>) {
%9 = linalg.matmul ins(%0, %1 : tensor<2x10xf32>, tensor<10x5xf32>) outs(%4 : tensor<2x5xf32>) -> tensor<2x5xf32>
%10 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%9, %2 : tensor<2x5xf32>, tensor<5xf32>) outs(%3 : tensor<2x5xf32>) {
^bb0(%in: f32, %in_3: f32, %out: f32):
%11 = arith.addf %in, %in_3 : f32
linalg.yield %11 : f32
} -> tensor<2x5xf32>
flow.return %10 : tensor<2x5xf32>
} count(%arg3: index, %arg4: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
flow.return %x, %y, %z : index, index, index
}
%8 = hal.tensor.export %7 : tensor<2x5xf32> -> !hal.buffer_view
return %8 : !hal.buffer_view
}
createFormDispatchWorkgroupsPass
Converts dispatch regions into dispatch workgroup form and clones cloneable ops (such as linalg.fill, tensor.empty, etc.) into the workgroups. If tiling was performed at the linalg level, this pass also converts the tensor.extract_slice and tensor.insert_slice ops introduced by tiling into flow.tensor.slice and flow.tensor.update where possible; those that cannot be converted are later lowered to flow.dispatch.tensor.load and flow.dispatch.tensor.store.
createCaptureDispatchDynamicDimsPass
Because dynamically shaped tensors in the arguments of flow.dispatch.workgroups have been replaced with !flow.dispatch.tensor values plus their dynamic-dimension indices, this pass captures those dynamic-dimension indices from the workgroup arguments and inserts flow.dispatch.tie_shape ops to bind the indices to the corresponding !flow.dispatch.tensor arguments.
mlir::createCanonicalizerPass
createCSEPass
createInitializeEmptyTensorsPass
If a tensor.empty op has any user that is not a linalg or IREE LinalgExt op, converts that tensor.empty into a flow.tensor.empty or flow.tensor.splat op.
IREE::Flow::createOutlineDispatchRegionsPass
Outlines each dispatch region into a flow.executable plus a flow.dispatch op.
IREE::Util::createStripDebugOpsPass
Removes DebugOnly ops.
mlir::createCanonicalizerPass
IREE::Flow::createDeduplicateExecutablesPass
Deduplicates identical flow.executable ops.
IREE::Flow::createInjectDispatchTracingPass
Injects ops that trace the inputs and outputs of dispatch functions at runtime. Disabled by default.
IREE::Flow::createCleanupTensorShapesPass
Removes flow.tensor.tie_shape ops and checks that the module no longer contains the shape-query ops tensor.dim and tensor.rank.
mlir::createCanonicalizerPass
mlir::createCSEPass
mlir::createCanonicalizerPass
mlir::createCSEPass
mlir::createSymbolDCEPass
To be continued…