Some approaches to CUDA acceleration: FlashAttention, PagedAttention, LightSeq, ByteTransformer
FlashAttention

FlashAttention usually refers to the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". That said, the paper "Transformer Quality in Linear Time" insists that FLASH = Fast Linear Attention with a Single Head, which is a rather exasperating naming collision. As for the details of FLASH
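To give a rough intuition for what "memory-efficient exact attention" means, here is a minimal sketch (my own illustration, not code from either paper; the function name, block size, and sanity check are made up) of the tiling plus online-softmax idea that FlashAttention implements inside a fused, IO-aware CUDA kernel: softmax(QKᵀ/√d)V is computed block by block with a running max and running sum, so the full N×N attention matrix is never materialized.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Exact softmax attention computed tile by tile (FlashAttention-style
    online softmax). Illustrative only: the real kernel keeps each tile in
    GPU SRAM to avoid reading/writing the N x N matrix from HBM."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)             # running (unnormalized) output
    m = np.full(N, -np.inf)          # running row-wise max of the logits
    l = np.zeros(N)                  # running row-wise sum of exp(logits - m)
    for start in range(0, N, block):
        Kb = K[start:start + block]           # one tile of keys
        Vb = V[start:start + block]           # matching tile of values
        S = (Q @ Kb.T) * scale                # logits for this tile, shape (N, b)
        m_new = np.maximum(m, S.max(axis=1))  # updated running max
        correction = np.exp(m - m_new)        # rescale earlier partial sums
        P = np.exp(S - m_new[:, None])        # tile-local unnormalized softmax
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]

# Sanity check against the naive implementation that materializes the full matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

The result is bit-for-bit the same attention output (it is exact, not an approximation); the win comes from replacing O(N²) reads/writes of the attention matrix with tile-sized traffic, which is why the paper frames it as an IO-awareness problem rather than a FLOP problem.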