This document gives an in-depth account of the pad token in deep learning: its technical principles, implementation mechanisms, and engineering applications. It is intended as a theoretical and practical reference for algorithm engineers.
Research question: why do deep learning models need to unify variable-length sequences to a fixed length?
Technical challenge: how can padding preserve computational efficiency without degrading model performance?
Modern GPUs are built on SIMD (Single Instruction, Multiple Data) architectures. The defining trait of SIMD is that one instruction operates on many data elements in parallel, which requires the data to be laid out regularly in memory:
Memory layout example:
Irregular data (cannot be processed in parallel):
Address: 0x1000 [101, 102, 103] <- length 3
Address: 0x100C [104, 105] <- length 2
Address: 0x1014 [106, 107, 108, 109] <- length 4
Regular data (parallel-friendly):
Address: 0x2000 [101, 102, 103, 0] <- uniform length 4
Address: 0x2010 [104, 105, 0, 0] <- uniform length 4
Address: 0x2020 [106, 107, 108, 109] <- uniform length 4
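The same contrast is easy to reproduce in PyTorch. A minimal sketch (the values are illustrative): constructing a tensor from ragged nested lists fails outright, while the padded, rectangular version forms a single batch tensor.

import torch

ragged = [[101, 102, 103], [104, 105], [106, 107, 108, 109]]
try:
    torch.tensor(ragged)  # ragged rows cannot form a rectangular tensor
except ValueError as err:
    print("ragged batch rejected:", err)

# After padding every row to length 4, the batch is a regular (3, 4) tensor
padded = torch.tensor([
    [101, 102, 103, 0],
    [104, 105, 0, 0],
    [106, 107, 108, 109],
])
print(padded.shape)  # torch.Size([3, 4])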
The core operations of a neural network follow the rules of matrix algebra:
Rule: the matrix product A × B requires the number of columns of A to equal the number of rows of B.
Let the input sequence matrix be X ∈ R^(batch_size × seq_len)
Let the weight matrix be W ∈ R^(seq_len × hidden_dim)
Then Y = X × W ∈ R^(batch_size × hidden_dim)
Constraint: the seq_len of X must equal the first dimension of W.
Definition: batched tensor operation
Let the batch be B = {x₁, x₂, ..., xₙ}, where xᵢ ∈ R^(lᵢ × d)
Goal: construct a unified tensor X ∈ R^(n × L × d), where L = max(l₁, l₂, ..., lₙ)
Padding function: pad(xᵢ, L) = [xᵢ; 0_{(L−lᵢ)×d}]
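This definition translates directly into code. A minimal sketch in PyTorch (function and variable names are illustrative): append zero rows to each xᵢ up to L, then stack the results into X ∈ R^(n × L × d).

import torch

def pad_to_length(x, L):
    """pad(x, L) = [x; 0_{(L-l)×d}]: append zero rows until length L."""
    l, d = x.shape
    return torch.cat([x, torch.zeros(L - l, d, dtype=x.dtype)], dim=0)

batch = [torch.randn(3, 8), torch.randn(2, 8), torch.randn(4, 8)]
L = max(x.shape[0] for x in batch)
X = torch.stack([pad_to_length(x, L) for x in batch])
print(X.shape)  # torch.Size([3, 4, 8]) = (n, L, d)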
Strategy | Implementation | Typical use | Example |
---|---|---|---|
Post-padding | append PAD at the tail | causal language models (GPT) | [w1, w2, w3, PAD, PAD] |
Pre-padding | prepend PAD at the head | alignment for generation | [PAD, PAD, w1, w2, w3] |
Bidirectional padding | symmetric PAD at both ends | center alignment | [PAD, w1, w2, w3, PAD] |
Static padding:
# Pad to a fixed maximum length
MAX_LENGTH = 512
padded_sequence = sequence + [PAD] * (MAX_LENGTH - len(sequence))
Dynamic padding:
# Pad only to the maximum length within the batch
batch_max_len = max(len(seq) for seq in batch)
padded_batch = [seq + [PAD] * (batch_max_len - len(seq)) for seq in batch]
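In training pipelines, dynamic padding is typically wired in as a DataLoader collate_fn. A sketch using torch.nn.utils.rnn.pad_sequence (the toy dataset and pad id are illustrative; id 0 is assumed never to appear as a real token):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_ID = 0  # assumed pad token id

def collate(batch):
    """Pad each batch only up to its own longest sequence."""
    seqs = [torch.tensor(ids, dtype=torch.long) for ids in batch]
    input_ids = pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)
    attention_mask = (input_ids != PAD_ID).long()
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

loader = DataLoader([[5, 6, 7], [8, 9], [10]], batch_size=2, collate_fn=collate)
for batch in loader:
    print(batch['input_ids'].shape)  # padded to the batch max, not to 512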
Mask function definition:
mask(attention_scores, mask_matrix) = {
attention_scores[i,j] if mask_matrix[i,j] = 1
-∞ if mask_matrix[i,j] = 0
}
Effect after softmax normalization:
exp(-∞) = 0, so padded positions receive exactly zero attention weight.
import torch
import torch.nn.functional as F

class MaskedAttention:
    @staticmethod
    def apply_mask(attention_scores, attention_mask):
        """
        Args:
            attention_scores: (batch_size, seq_len, seq_len)
            attention_mask: (batch_size, seq_len) - 1 marks valid tokens, 0 marks padding
        """
        # Expand the mask so it broadcasts over the key dimension of the scores
        mask_expanded = attention_mask.unsqueeze(1).expand_as(attention_scores)
        # Set padded key positions to negative infinity
        masked_scores = attention_scores.masked_fill(
            mask_expanded == 0,
            float('-inf')
        )
        # Softmax turns the -inf entries into zero weights
        attention_weights = F.softmax(masked_scores, dim=-1)
        return attention_weights
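A quick behavioral check (illustrative shapes): keys at padded positions receive zero weight, and each row still sums to 1.

torch.manual_seed(0)
scores = torch.randn(1, 4, 4)           # (batch_size, seq_len, seq_len)
mask = torch.tensor([[1, 1, 1, 0]])     # last position is padding
weights = MaskedAttention.apply_mask(scores, mask)
print(weights[0, 0])         # final entry is exactly 0.0
print(weights[0, 0].sum())   # tensor(1.)
# Note: rows for padded *queries* still carry weights; downstream code
# should ignore or zero them using the same attention_mask.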
Token type | Symbol | Function | Typical use |
---|---|---|---|
PAD token | [PAD] | sequence padding, carries no semantics | batch alignment |
CLS token | [CLS] | classification, whole-sequence representation | BERT classification |
SEP token | [SEP] | sequence separator | sentence-pair tasks |
EOS token | [EOS] | end-of-sequence marker | generation tasks |
UNK token | [UNK] | unknown word | out-of-vocabulary words |
Problem: semantic conflict when the PAD token and the EOS token are shared
# ❌ Problematic configuration
tokenizer.pad_token = tokenizer.eos_token
# Failure mode: misinterpretation in generation tasks
input_text = "Hello world"
# Encoding: [101, 102, 103, 2, 2]  # 2 serves as both EOS and PAD
# Model's reading: Hello world [end] [end] → generation may stop prematurely
Solution: independent token design
# ✅ Correct configuration
class TokenizerConfig:
    def setup_special_tokens(self, tokenizer):
        special_tokens = {
            'pad_token': '[PAD]',
            'unk_token': '[UNK]',
            'cls_token': '[CLS]',
            'sep_token': '[SEP]',
            'eos_token': '[EOS]'
        }
        # Ensure every special token gets its own id
        tokenizer.add_special_tokens(special_tokens)
        return tokenizer
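One follow-up step is easy to miss. Assuming a Hugging Face transformers model (the gpt2 checkpoint below is only an example; GPT-2 ships without a pad token), add_special_tokens can grow the vocabulary, so the embedding matrix must be resized to cover the new ids:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

num_added = tokenizer.add_special_tokens({'pad_token': '[PAD]'})
if num_added > 0:
    # Without this, the new [PAD] id indexes past the embedding table
    model.resize_token_embeddings(len(tokenizer))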
class LengthAwareBatcher:
    def __init__(self, bucket_boundaries=(32, 64, 128, 256, 512)):
        # A tuple default avoids Python's mutable-default-argument pitfall
        self.bucket_boundaries = list(bucket_boundaries)
        self.buckets = {boundary: [] for boundary in self.bucket_boundaries}

    def add_sequence(self, sequence, metadata=None):
        """Place the sequence into the smallest bucket that fits it."""
        seq_len = len(sequence)
        target_bucket = next(
            (boundary for boundary in self.bucket_boundaries if boundary >= seq_len),
            self.bucket_boundaries[-1]  # over-long sequences go to the largest bucket
        )
        self.buckets[target_bucket].append({
            'sequence': sequence,
            'original_length': seq_len,
            'metadata': metadata
        })

    def get_batches(self, batch_size=32):
        """Build padding-efficient batches, one bucket at a time."""
        batches = []
        for bucket_size, sequences in self.buckets.items():
            if not sequences:
                continue
            # Split the bucket into batch-sized groups
            for i in range(0, len(sequences), batch_size):
                batch = sequences[i:i+batch_size]
                batches.append({
                    'bucket_size': bucket_size,
                    'sequences': batch,
                    'padding_efficiency': self._calculate_efficiency(batch, bucket_size)
                })
        return batches

    def _calculate_efficiency(self, batch, bucket_size):
        """Padding efficiency: valid tokens / total padded positions."""
        total_tokens = len(batch) * bucket_size
        effective_tokens = sum(item['original_length'] for item in batch)
        return effective_tokens / total_tokens if total_tokens > 0 else 0
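Usage sketch (synthetic lengths): each sequence lands in the smallest bucket that fits, and every batch reports how much of it is real content.

batcher = LengthAwareBatcher()
for n in (20, 30, 100, 120):
    batcher.add_sequence([1] * n)
for b in batcher.get_batches(batch_size=2):
    print(b['bucket_size'], len(b['sequences']), f"{b['padding_efficiency']:.2%}")
# bucket 32:  (20+30)/64   ≈ 78% efficiency
# bucket 128: (100+120)/256 ≈ 86% efficiency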
class AdaptiveLengthManager:
    def __init__(self, target_efficiency=0.85, window_size=1000):
        self.target_efficiency = target_efficiency
        self.window_size = window_size
        self.length_history = []
        self.efficiency_history = []

    def update_statistics(self, batch_lengths, actual_length):
        """Record the batch lengths and the resulting padding efficiency."""
        efficiency = sum(batch_lengths) / (len(batch_lengths) * actual_length)
        self.length_history.extend(batch_lengths)
        self.efficiency_history.append(efficiency)
        # Keep the sliding windows bounded
        if len(self.length_history) > self.window_size:
            self.length_history = self.length_history[-self.window_size//2:]
        if len(self.efficiency_history) > self.window_size//10:
            self.efficiency_history = self.efficiency_history[-self.window_size//20:]

    def get_optimal_length(self, current_batch_lengths):
        """Choose a target length for the current batch."""
        if not self.length_history:
            return max(current_batch_lengths)
        import numpy as np
        # 90th percentile of historically observed lengths
        historical_optimal = np.percentile(self.length_history, 90)
        # What the current batch actually needs
        current_max = max(current_batch_lengths)
        # Adjust against the efficiency target
        recent_efficiency = np.mean(self.efficiency_history[-10:]) if self.efficiency_history else 1.0
        if recent_efficiency < self.target_efficiency:
            # Efficiency is lagging: lean toward a tighter length
            optimal_length = min(historical_optimal, current_max * 1.1)
        else:
            # Efficiency is healthy: allow some slack
            optimal_length = max(historical_optimal, current_max)
        return int(optimal_length)
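A short simulation of the feedback loop (random lengths, purely illustrative): the first call falls back to the batch maximum, and later calls blend in the historical 90th percentile.

import random

random.seed(0)
manager = AdaptiveLengthManager(target_efficiency=0.85)
for _ in range(50):
    lengths = [random.randint(20, 200) for _ in range(16)]
    target = manager.get_optimal_length(lengths)
    manager.update_statistics(lengths, target)
print(manager.get_optimal_length([random.randint(20, 200) for _ in range(16)]))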
import torch

class InPlacePadding:
    @staticmethod
    def pad_tensor_inplace(tensor_list, target_length, pad_value=0):
        """Pad into one preallocated tensor to avoid per-sequence allocations."""
        batch_size = len(tensor_list)
        # Preallocate the target tensor
        device = tensor_list[0].device
        dtype = tensor_list[0].dtype
        if tensor_list[0].dim() == 1:
            # 1D token-id sequences
            result = torch.full(
                (batch_size, target_length),
                pad_value,
                dtype=dtype,
                device=device
            )
        else:
            # Multi-dimensional sequences, shape (seq_len, *feature_dims)
            seq_len, *feature_dims = tensor_list[0].shape
            result = torch.full(
                (batch_size, target_length, *feature_dims),
                pad_value,
                dtype=dtype,
                device=device
            )
        # Copy the original data into the front of each row
        for i, tensor in enumerate(tensor_list):
            seq_len = tensor.shape[0]
            result[i, :seq_len] = tensor
        return result
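Quick sanity check (toy tensors):

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
print(InPlacePadding.pad_tensor_inplace(seqs, target_length=4))
# tensor([[1, 2, 3, 0],
#         [4, 5, 0, 0]])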
class PaddingMemoryPool:
    def __init__(self, max_batch_size=64, max_seq_length=512):
        self.max_batch_size = max_batch_size
        self.max_seq_length = max_seq_length
        self.tensor_cache = {}

    def get_padded_tensor(self, batch_size, seq_length, dtype=torch.long, device='cpu'):
        """Fetch a padded tensor from the pool, creating it on first use."""
        cache_key = (batch_size, seq_length, dtype, device)
        if cache_key not in self.tensor_cache:
            self.tensor_cache[cache_key] = torch.zeros(
                batch_size, seq_length,
                dtype=dtype,
                device=device
            )
        # Re-zero on reuse so stale data from a previous batch never leaks
        return self.tensor_cache[cache_key].zero_()

    def clear_cache(self):
        """Release pooled tensors."""
        self.tensor_cache.clear()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # also release cached GPU blocks
class DynamicBatchSizer:
    def __init__(self, target_memory_mb=8192, base_batch_size=32):
        self.target_memory_mb = target_memory_mb
        self.base_batch_size = base_batch_size
        self.memory_measurements = []

    def estimate_memory_usage(self, seq_length, hidden_dim, num_layers):
        """Rough per-sample memory estimate in MB (empirical Transformer heuristic)."""
        memory_per_token = (
            hidden_dim * 4 +                      # embedding + Q/K/V projections
            hidden_dim * hidden_dim * 3 / 1000 +  # attention matrices (scaled heuristic)
            hidden_dim * 4                        # FFN layers
        ) * num_layers
        return seq_length * memory_per_token * 4 / (1024 * 1024)  # 4 bytes per float32

    def get_optimal_batch_size(self, seq_length, model_config):
        """Pick the largest batch size that should fit the memory budget."""
        memory_per_sample = self.estimate_memory_usage(
            seq_length,
            model_config.hidden_dim,
            model_config.num_layers
        )
        if memory_per_sample <= 0:
            return self.base_batch_size
        optimal_batch_size = int(self.target_memory_mb / memory_per_sample)
        # Clamp the batch size to a sane range
        return max(1, min(optimal_batch_size, self.base_batch_size * 4))
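Usage sketch; ModelConfig here is a hypothetical stand-in for whatever object carries these two fields:

from dataclasses import dataclass

@dataclass
class ModelConfig:  # hypothetical config carrier
    hidden_dim: int = 768
    num_layers: int = 12

sizer = DynamicBatchSizer(target_memory_mb=8192)
print(sizer.get_optimal_batch_size(seq_length=128, model_config=ModelConfig()))
print(sizer.get_optimal_batch_size(seq_length=512, model_config=ModelConfig()))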
import multiprocessing as mp
from functools import partial

class ParallelPadding:
    @staticmethod
    def pad_sequence_chunk(sequence_chunk, target_length, pad_value):
        """Pad one chunk of sequences (runs inside a worker process)."""
        return [
            seq + [pad_value] * (target_length - len(seq))
            for seq in sequence_chunk
        ]

    def parallel_pad_sequences(self, sequences, num_workers=None):
        """Pad sequences across multiple processes."""
        if num_workers is None:
            num_workers = mp.cpu_count()
        target_length = max(len(seq) for seq in sequences)
        chunk_size = len(sequences) // num_workers
        if chunk_size == 0:
            # Too few sequences to be worth forking; pad directly
            return self.pad_sequence_chunk(sequences, target_length, 0)
        # Split into chunks
        chunks = [
            sequences[i:i+chunk_size]
            for i in range(0, len(sequences), chunk_size)
        ]
        # Fan out to the worker pool
        with mp.Pool(num_workers) as pool:
            pad_func = partial(
                self.pad_sequence_chunk,
                target_length=target_length,
                pad_value=0
            )
            results = pool.map(pad_func, chunks)
        # Flatten the per-chunk results
        return [seq for chunk_result in results for seq in chunk_result]
import torch
from torch.utils.checkpoint import checkpoint

class GradientCheckpointedPadding(torch.nn.Module):
    def __init__(self, max_seq_length=512):
        super().__init__()
        self.max_seq_length = max_seq_length

    def forward(self, input_ids, attention_mask):
        """Forward pass that pads under gradient checkpointing."""
        def create_padded_representations(input_ids, attention_mask):
            # Computed only when needed, saving activation memory
            batch_size, seq_len = input_ids.shape
            if seq_len < self.max_seq_length:
                # Dynamically pad up to the required length (pad id assumed to be 0)
                pad_length = self.max_seq_length - seq_len
                input_ids = torch.cat([
                    input_ids,
                    torch.zeros(batch_size, pad_length, dtype=input_ids.dtype, device=input_ids.device)
                ], dim=1)
                attention_mask = torch.cat([
                    attention_mask,
                    torch.zeros(batch_size, pad_length, dtype=attention_mask.dtype, device=attention_mask.device)
                ], dim=1)
            return input_ids, attention_mask
        # Checkpointing trades recomputation for lower peak memory
        return checkpoint(
            create_padded_representations,
            input_ids,
            attention_mask,
            use_reentrant=False
        )
Problem 1: padding wastes a large share of the compute
Diagnostic metric:
def calculate_padding_efficiency(batch):
    """Fraction of positions that hold real tokens."""
    total_positions = batch['input_ids'].numel()
    valid_positions = batch['attention_mask'].sum().item()
    efficiency = valid_positions / total_positions
    print(f"Padding efficiency: {efficiency:.2%}")
    print(f"Wasted compute: {(1 - efficiency):.2%}")
    return efficiency
Solution: bucket sequences by length (see LengthAwareBatcher above) and pad dynamically to the batch maximum instead of a fixed global length.
Problem 2: excessive GPU memory usage
Solution:
class MemoryEfficientBatch:
    def __init__(self, gradient_accumulation_steps=4):
        self.grad_accum_steps = gradient_accumulation_steps

    def process_large_batch(self, large_batch, model):
        """Process a large batch via gradient accumulation."""
        total_loss = 0
        effective_batch_size = len(large_batch) // self.grad_accum_steps
        for i in range(self.grad_accum_steps):
            start_idx = i * effective_batch_size
            end_idx = (i + 1) * effective_batch_size
            mini_batch = large_batch[start_idx:end_idx]
            # Forward/backward on the mini-batch
            outputs = model(mini_batch)
            loss = outputs.loss / self.grad_accum_steps
            loss.backward()
            total_loss += loss.item()
        return total_loss
Problem 3: the PAD token bleeds into model outputs
Detection method:
def detect_pad_token_leakage(model_outputs, attention_mask):
    """Check whether padded positions influence the outputs."""
    # Locate the padded positions
    pad_positions = (attention_mask == 0)
    # Outputs at padded positions should be zero
    if pad_positions.any():
        pad_outputs = model_outputs[pad_positions]
        non_zero_pads = (pad_outputs != 0).any(dim=-1)
        if non_zero_pads.any():
            print(f"Warning: {non_zero_pads.sum()} padded positions produced non-zero outputs")
            return False
    return True
Fixes:
class PadTokenLeakageFix:
    @staticmethod
    def zero_pad_positions(outputs, attention_mask):
        """Force outputs at padded positions to zero."""
        pad_mask = (attention_mask == 0).unsqueeze(-1)
        return outputs.masked_fill(pad_mask, 0.0)

    @staticmethod
    def apply_strong_masking(attention_scores, attention_mask):
        """Apply a strong mask so padded positions get (near-)zero weight."""
        mask = attention_mask.unsqueeze(1).unsqueeze(2)
        attention_scores = attention_scores.masked_fill(mask == 0, -1e10)
        return attention_scores
import matplotlib.pyplot as plt

class PaddingVisualizer:
    @staticmethod
    def plot_length_distribution(sequence_lengths, bins=50):
        """Visualize the distribution of sequence lengths."""
        plt.figure(figsize=(12, 6))
        plt.subplot(1, 2, 1)
        plt.hist(sequence_lengths, bins=bins, alpha=0.7, edgecolor='black')
        plt.xlabel('Sequence length')
        plt.ylabel('Frequency')
        plt.title('Sequence length distribution')
        plt.subplot(1, 2, 2)
        plt.boxplot(sequence_lengths)
        plt.ylabel('Sequence length')
        plt.title('Sequence length box plot')
        plt.tight_layout()
        plt.show()

    @staticmethod
    def plot_padding_efficiency(batch_efficiencies):
        """Visualize padding efficiency."""
        plt.figure(figsize=(10, 6))
        plt.subplot(1, 2, 1)
        plt.plot(batch_efficiencies, marker='o')
        plt.xlabel('Batch index')
        plt.ylabel('Padding efficiency')
        plt.title('Padding efficiency per batch')
        plt.subplot(1, 2, 2)
        plt.hist(batch_efficiencies, bins=20, alpha=0.7)
        plt.xlabel('Padding efficiency')
        plt.ylabel('Number of batches')
        plt.title('Padding efficiency distribution')
        plt.tight_layout()
        plt.show()
class PaddingPerformanceMonitor:
    def __init__(self):
        self.metrics = {
            'batch_count': 0,
            'total_tokens': 0,
            'valid_tokens': 0,
            'padding_ratios': [],
            'processing_times': []
        }

    def log_batch(self, input_ids, attention_mask, processing_time):
        """Record metrics for one processed batch."""
        batch_size, seq_len = input_ids.shape
        total_tokens = batch_size * seq_len
        valid_tokens = attention_mask.sum().item()
        padding_ratio = 1 - (valid_tokens / total_tokens)
        self.metrics['batch_count'] += 1
        self.metrics['total_tokens'] += total_tokens
        self.metrics['valid_tokens'] += valid_tokens
        self.metrics['padding_ratios'].append(padding_ratio)
        self.metrics['processing_times'].append(processing_time)

    def get_summary(self):
        """Summarize the collected metrics."""
        import numpy as np
        overall_efficiency = self.metrics['valid_tokens'] / self.metrics['total_tokens']
        avg_padding_ratio = np.mean(self.metrics['padding_ratios'])
        avg_processing_time = np.mean(self.metrics['processing_times'])
        return {
            'overall_efficiency': overall_efficiency,
            'average_padding_ratio': avg_padding_ratio,
            'average_processing_time': avg_processing_time,
            'total_batches': self.metrics['batch_count'],
            'total_tokens_processed': self.metrics['total_tokens']
        }
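Wiring the monitor into a loop (synthetic batches; the forward pass is elided):

import time
import torch

monitor = PaddingPerformanceMonitor()
for _ in range(3):
    input_ids = torch.zeros(4, 64, dtype=torch.long)
    attention_mask = torch.randint(0, 2, (4, 64))
    start = time.perf_counter()
    # ... model forward pass would run here ...
    monitor.log_batch(input_ids, attention_mask, time.perf_counter() - start)
print(monitor.get_summary())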
Technical background: conventional padding becomes very inefficient on extremely long sequences
Solution: techniques such as FlashAttention
# Pseudocode: variable-length attention
def variable_length_attention(queries, keys, values, sequence_lengths):
    """Efficient attention over packed variable-length sequences."""
    # Index with cumulative offsets instead of padding
    cumulative_lengths = torch.cumsum(
        torch.cat([torch.zeros(1, dtype=sequence_lengths.dtype), sequence_lengths[:-1]]),
        dim=0
    )
    # Attend to each sequence separately; no padded positions exist
    attention_outputs = []
    for start, length in zip(cumulative_lengths.tolist(), sequence_lengths.tolist()):
        end = start + length
        q_i = queries[start:end]
        k_i = keys[start:end]
        v_i = values[start:end]
        # efficient_attention is a placeholder for the fused kernel
        attn_i = efficient_attention(q_i, k_i, v_i)
        attention_outputs.append(attn_i)
    return attention_outputs
Concept: build the computation graph dynamically based on input sequence length
class AdaptiveComputationGraph:
    def __init__(self, base_model):
        self.base_model = base_model
        self.computation_cache = {}

    def forward(self, input_sequences):
        """Process inputs grouped by sequence length."""
        # Group by length
        length_groups = self._group_by_length(input_sequences)
        results = {}
        for length, sequences in length_groups.items():
            # Build (and cache) a dedicated graph per length group
            if length not in self.computation_cache:
                self.computation_cache[length] = self._build_computation_graph(length)
            computation_graph = self.computation_cache[length]
            results[length] = computation_graph(sequences)
        return self._merge_results(results)
Technical direction: dedicated padding units on FPGAs/ASICs
class HardwareAcceleratedPadding:
    """Simulates hardware-accelerated padding."""
    def __init__(self, device_type='fpga'):
        self.device_type = device_type
        self.parallel_units = 1024 if device_type == 'fpga' else 128

    def accelerated_pad(self, sequences, target_length):
        """Dispatch to the simulated hardware path."""
        if self.device_type == 'fpga':
            return self._fpga_parallel_pad(sequences, target_length)
        else:
            return self._gpu_optimized_pad(sequences, target_length)

    def _fpga_parallel_pad(self, sequences, target_length):
        """Simulated FPGA pipeline padding."""
        # At least one sequence per chunk, even with few inputs
        chunk_size = max(1, len(sequences) // self.parallel_units)
        processed_chunks = []
        for i in range(0, len(sequences), chunk_size):
            chunk = sequences[i:i+chunk_size]
            # Chunks would be processed in parallel on real hardware
            padded_chunk = self._hardware_pad_chunk(chunk, target_length)
            processed_chunks.extend(padded_chunk)
        return processed_chunks
class ContentAwarePadding:
    def __init__(self, model_embeddings):
        self.embeddings = model_embeddings
        self.semantic_pad_tokens = self._generate_semantic_pads()

    def _generate_semantic_pads(self):
        """Derive semantically meaningful padding tokens."""
        import numpy as np
        # Cluster the word-embedding space
        from sklearn.cluster import KMeans
        # Vocabulary embeddings as a numpy matrix
        vocab_embeddings = self.embeddings.weight.data.cpu().numpy()
        # One cluster per semantic padding token
        kmeans = KMeans(n_clusters=10)
        clusters = kmeans.fit(vocab_embeddings)
        # For each cluster, pick the actual token closest to its center
        semantic_pads = []
        for center in clusters.cluster_centers_:
            distances = np.linalg.norm(vocab_embeddings - center, axis=1)
            closest_token = np.argmin(distances)
            semantic_pads.append(closest_token)
        return semantic_pads

    def intelligent_pad(self, sequence, target_length, context=None):
        """Content-aware padding."""
        if len(sequence) >= target_length:
            return sequence[:target_length]
        import numpy as np
        # Summarize the sequence's semantics with a mean embedding
        sequence_embedding = self.embeddings(torch.tensor(sequence)).mean(dim=0)
        # Choose the most similar semantic padding token
        pad_similarities = []
        for pad_token in self.semantic_pad_tokens:
            pad_embedding = self.embeddings(torch.tensor([pad_token]))
            similarity = torch.cosine_similarity(
                sequence_embedding.unsqueeze(0),
                pad_embedding,
                dim=1
            )
            pad_similarities.append(similarity.item())
        best_pad_token = self.semantic_pad_tokens[np.argmax(pad_similarities)]
        # Pad with the chosen token
        padding_needed = target_length - len(sequence)
        return sequence + [best_pad_token] * padding_needed
import torch
import numpy as np
from typing import List, Dict, Any, Optional, Union
from dataclasses import dataclass

@dataclass
class PaddingConfig:
    """Padding configuration."""
    max_length: int = 512
    padding_side: str = 'right'  # 'left' or 'right'
    pad_token_id: int = 0
    truncation: bool = True
    return_attention_mask: bool = True
    return_tensors: str = 'pt'  # 'pt', 'np', or 'list'

class UniversalPadding:
    """General-purpose padding utility."""
    def __init__(self, config: PaddingConfig):
        self.config = config
        self.statistics = {
            'total_sequences': 0,
            'total_tokens': 0,  # tokens actually emitted (post-truncation)
            'padding_tokens': 0,
            'truncated_sequences': 0
        }
    def pad_sequences(
        self,
        sequences: List[List[int]],
        max_length: Optional[int] = None
    ) -> Dict[str, Union[torch.Tensor, List]]:
        """
        Pad sequences to a uniform length.
        Args:
            sequences: list of input sequences
            max_length: target length; None means use the batch maximum
        Returns:
            dict with the padded ids and, optionally, the attention mask
        """
        if max_length is None:
            max_length = min(
                max(len(seq) for seq in sequences),
                self.config.max_length
            )
        padded_sequences = []
        attention_masks = []
        for sequence in sequences:
            self.statistics['total_sequences'] += 1
            # Handle over-long sequences first
            if len(sequence) > max_length:
                if self.config.truncation:
                    sequence = sequence[:max_length]
                    self.statistics['truncated_sequences'] += 1
                else:
                    raise ValueError(f"Sequence length {len(sequence)} exceeds max length {max_length}")
            # Count tokens after truncation so the statistics stay consistent
            self.statistics['total_tokens'] += len(sequence)
            # How much padding is needed
            padding_length = max_length - len(sequence)
            self.statistics['padding_tokens'] += padding_length
            # Apply padding on the configured side
            if self.config.padding_side == 'right':
                padded_seq = sequence + [self.config.pad_token_id] * padding_length
                attention_mask = [1] * len(sequence) + [0] * padding_length
            else:  # left padding
                padded_seq = [self.config.pad_token_id] * padding_length + sequence
                attention_mask = [0] * padding_length + [1] * len(sequence)
            padded_sequences.append(padded_seq)
            attention_masks.append(attention_mask)
        # Convert to the requested output format
        result = {
            'input_ids': self._convert_to_tensor(padded_sequences),
        }
        if self.config.return_attention_mask:
            result['attention_mask'] = self._convert_to_tensor(attention_masks)
        return result
    def _convert_to_tensor(self, data: List[List[int]]):
        """Convert to the configured output format."""
        if self.config.return_tensors == 'pt':
            return torch.tensor(data, dtype=torch.long)
        elif self.config.return_tensors == 'np':
            return np.array(data, dtype=np.int64)
        else:
            return data
    def get_padding_statistics(self) -> Dict[str, Any]:
        """Return accumulated padding statistics."""
        # Emitted positions = valid (post-truncation) tokens + padding tokens
        emitted_positions = self.statistics['total_tokens'] + self.statistics['padding_tokens']
        if emitted_positions > 0:
            efficiency = self.statistics['total_tokens'] / emitted_positions
        else:
            efficiency = 0.0
        return {
            'total_sequences': self.statistics['total_sequences'],
            'total_tokens': self.statistics['total_tokens'],
            'padding_tokens': self.statistics['padding_tokens'],
            'truncated_sequences': self.statistics['truncated_sequences'],
            'padding_efficiency': efficiency,
            'average_sequence_length': (
                self.statistics['total_tokens'] / self.statistics['total_sequences']
                if self.statistics['total_sequences'] > 0 else 0
            )
        }
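End-to-end usage of the utility (toy sequences):

config = PaddingConfig(max_length=8, padding_side='right', pad_token_id=0)
padder = UniversalPadding(config)
batch = padder.pad_sequences([[5, 6, 7], [8, 9], [1, 2, 3, 4, 5]])
print(batch['input_ids'])       # padded to the batch max (5), not to 8
print(batch['attention_mask'])
print(padder.get_padding_statistics())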
import time
import torch
import numpy as np
import matplotlib.pyplot as plt
from typing import List
from memory_profiler import profile

class PaddingBenchmark:
    """Padding performance test suite."""
    def __init__(self):
        self.results = {}

    def generate_test_data(self, num_sequences: int, length_range: tuple = (10, 500)) -> List[List[int]]:
        """Generate random test sequences."""
        sequences = []
        for _ in range(num_sequences):
            length = np.random.randint(length_range[0], length_range[1])
            sequence = np.random.randint(1, 1000, length).tolist()
            sequences.append(sequence)
        return sequences

    def benchmark_padding_methods(self, test_sizes: List[int] = [100, 500, 1000, 5000]):
        """Compare the performance of different padding methods."""
        methods = {
            'native_padding': self._native_padding,
            'torch_pad_sequence': self._torch_pad_sequence,
            'optimized_padding': self._optimized_padding
        }
        for size in test_sizes:
            print(f"\nNumber of sequences: {size}")
            test_data = self.generate_test_data(size)
            for method_name, method_func in methods.items():
                # Time the method
                start_time = time.time()
                result = method_func(test_data)
                end_time = time.time()
                execution_time = end_time - start_time
                # Record the result
                if method_name not in self.results:
                    self.results[method_name] = {'sizes': [], 'times': []}
                self.results[method_name]['sizes'].append(size)
                self.results[method_name]['times'].append(execution_time)
                print(f"  {method_name}: {execution_time:.4f}s")

    def _native_padding(self, sequences):
        """Plain-Python padding."""
        max_len = max(len(seq) for seq in sequences)
        return [seq + [0] * (max_len - len(seq)) for seq in sequences]

    def _torch_pad_sequence(self, sequences):
        """PyTorch pad_sequence."""
        from torch.nn.utils.rnn import pad_sequence
        tensor_sequences = [torch.tensor(seq) for seq in sequences]
        return pad_sequence(tensor_sequences, batch_first=True, padding_value=0)

    def _optimized_padding(self, sequences):
        """The optimized implementation above."""
        config = PaddingConfig(max_length=512, return_tensors='pt')
        padder = UniversalPadding(config)
        return padder.pad_sequences(sequences)

    def plot_performance_results(self):
        """Plot the performance comparison."""
        plt.figure(figsize=(12, 8))
        for method_name, data in self.results.items():
            plt.plot(data['sizes'], data['times'], marker='o', label=method_name)
        plt.xlabel('Number of sequences')
        plt.ylabel('Execution time (s)')
        plt.title('Padding method performance comparison')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()

    @profile
    def memory_usage_test(self, num_sequences: int = 1000):
        """Memory usage test."""
        test_data = self.generate_test_data(num_sequences)
        # Measure the optimized path
        config = PaddingConfig(max_length=512, return_tensors='pt')
        padder = UniversalPadding(config)
        result = padder.pad_sequences(test_data)
        return result
Vaswani et al. (2017). Attention Is All You Need.
Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.