GitHub repo: GitHub - microsoft/LoRA: Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
Fine-tuning a model the size of GPT-3 (175B) is prohibitively expensive. The authors propose Low-Rank Adaptation (LoRA): freeze the pretrained model's weights and inject trainable rank decomposition matrices into every layer of the Transformer architecture.
✅ LoRA fine-tunes dense layers indirectly, by optimizing low-rank decomposition matrices that represent the change in the dense layers' weights, without directly updating the original weights.
✅ Reduces the number of trainable parameters by 10,000× and the GPU memory requirement by 3×.
✅ Although deep models are trained with a huge number of parameters (i.e. they are over-parameterized), the knowledge the final model learns (its representation in parameter space) actually occupies only a small effective dimension (the intrinsic dimension).
❓ Why not just add a module instead, e.g. insert a small bottleneck (adapter) module into every layer?
The main drawback of such methods is that they trade model quality against efficiency: the added modules introduce inference latency, which is a limitation for real-time inference and for larger-scale models.
During the update the original weights are kept frozen as $W_0$, and a trainable update $\Delta W = BA$ is introduced; multiplying both by the input $x$ gives
$$h = W_0 x + BAx$$
where $A$ is initialized from a random Gaussian and $B$ is initialized to zero, so that $\Delta W = BA$ is zero at the start of training.
$\Delta W$ is then scaled (so that $\Delta W$ does not disturb the original model and destabilize training):
$$h = W_0 x + \frac{\alpha}{r} \cdot BAx$$
The authors point out that adjusting $\alpha$ is roughly equivalent to adjusting the learning rate.
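To make the update rule concrete, here is a minimal sketch of a LoRA-augmented linear layer in PyTorch (illustrative only, not the official loralib implementation): $W_0$ is frozen, $A$ gets a Gaussian init, $B$ starts at zero, and the output is scaled by $\alpha / r$. The class name `LoRALinear` and the placeholder initialization of $W_0$ are assumptions for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Illustrative LoRA-augmented linear layer (not the official loralib code)."""

    def __init__(self, in_features, out_features, r=8, alpha=32):
        super().__init__()
        # Frozen pretrained weight W_0 (randomly filled here as a placeholder; bias omitted)
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # A: random Gaussian init, B: zeros, so BA = 0 at the start of training
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r  # the alpha / r scaling factor

    def forward(self, x):
        # h = W_0 x + (alpha / r) * B A x
        frozen_out = F.linear(x, self.weight)
        lora_out = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return frozen_out + self.scaling * lora_out
```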
In the Transformer architecture the self-attention module has four weight matrices ($W_q$, $W_k$, $W_v$, $W_o$) and the MLP module has two. For efficiency, LoRA is added only to the self-attention module.
In a multi-task setting $BA$ cannot be merged into $W$: merging into $W$ only supports a single task, because serving a different task would require swapping in a different pair of low-rank matrices (see the sketch below).
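For single-task deployment, $BA$ can be folded into the frozen weight before inference, so the LoRA branch adds no latency; switching tasks means undoing that merge. A hedged sketch of the algebra, assuming the `LoRALinear` class from the sketch above:

```python
import torch


@torch.no_grad()
def merge_lora(layer):
    # Fold the update into the frozen weight: W <- W_0 + (alpha / r) * B A.
    # After merging, the LoRA branch in forward() must be skipped to avoid double-counting.
    layer.weight += layer.scaling * (layer.lora_B @ layer.lora_A)


@torch.no_grad()
def unmerge_lora(layer):
    # Recover W_0 so a different task's B'A' pair can be merged in instead.
    layer.weight -= layer.scaling * (layer.lora_B @ layer.lora_A)
```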
import torch
from transformers import AutoModelForCausalLM, GenerationConfig
model = AutoModelForCausalLM.from_pretrained('./deepseek-ai/deepseek-llm-7b-chat/', trust_remote_code=True, torch_dtype=torch.half, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained('./deepseek-ai/deepseek-llm-7b-chat/')
model.generation_config.pad_token_id = model.generation_config.eos_token_id
model.enable_input_require_grads()  # enable gradients on the input embeddings, needed when fine-tuning adapters while the base weights stay frozen
from peft import LoraConfig, TaskType, get_peft_model
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],  # modules to wrap with LoRA
    inference_mode=False,  # training mode
    r=8,  # rank of the decomposition matrices
    lora_alpha=32,  # scaling factor alpha
    lora_dropout=0.1  # dropout on the LoRA branch
)
model = get_peft_model(model, config)
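After wrapping the model, PEFT can report how few parameters are actually trainable; the printed figures below are illustrative placeholders, not measured output.

```python
# Only the injected A/B matrices are trainable; the base model weights stay frozen.
model.print_trainable_parameters()
# Prints a line of the form (figures illustrative):
# trainable params: ... || all params: ... || trainable%: ...
```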
The snippets below are from the loralib README (GitHub - microsoft/LoRA: Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"):
# ===== Before =====
# layer = nn.Linear(in_features, out_features)
# ===== After ======
import loralib as lora
# Add a pair of low-rank adaptation matrices with rank r=16
layer = lora.Linear(in_features, out_features, r=16)
import loralib as lora
model = BigModel()
# This sets requires_grad to False for all parameters without the string "lora_" in their names
lora.mark_only_lora_as_trainable(model)
import torch.nn as nn

# ===== Before =====
# qkv_proj = nn.Linear(d_model, 3*d_model)
# ===== After =====
# Break it up (remember to modify the pretrained checkpoint accordingly)
q_proj = lora.Linear(d_model, d_model, r=8)
k_proj = nn.Linear(d_model, d_model)
v_proj = lora.Linear(d_model, d_model, r=8)
# Alternatively, use lora.MergedLinear (recommended)
qkv_proj = lora.MergedLinear(d_model, 3*d_model, r=8, enable_lora=[True, False, True])
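To complete the loralib workflow, only the LoRA parameters need to be checkpointed. A sketch in the same style as the snippets above; the checkpoint file names are illustrative.

```python
import torch
import loralib as lora

# Save only the LoRA matrices -- a checkpoint far smaller than the full model
torch.save(lora.lora_state_dict(model), 'ckpt_lora.pt')

# Restore: load the pretrained weights first, then the LoRA weights.
# strict=False because each file covers only part of the full state dict.
model.load_state_dict(torch.load('ckpt_pretrained.pt'), strict=False)
model.load_state_dict(torch.load('ckpt_lora.pt'), strict=False)
```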