If your hardware falls way short, don't bother reading this. Before using any of these models, take the time to figure out whether your GPU is actually up to the job.
The key really just comes down to using deepspeed plus tuning the various parameters. Below is the deepspeed command I used, jotted down so I can tweak the parameters later.
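A quick way to check whether the card qualifies is to query the VRAM with nvidia-smi, for example:
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv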
deepspeed --num_gpus 2 \
/mcm/LLaMA-Factory/src/train.py --deepspeed \
/mcm/LLaMA-Factory/examples/deepspeed/ds_z3_config.json \
--stage sft \
--model_name_or_path \
/mcm/Meta-Llama-3-8B-Instruct \
--do_train \
--dataset identity,adgen_local,alpaca_gpt4_zh \
--dataset_dir /mcm/all_about_testing/llamaFactoryLesson/data \
--template llama3 \
--finetuning_type full \
--output_dir ./saves/llama3-8b/lora/full \
--overwrite_cache \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 500 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--plot_loss --bf16
I'll give DeepSeek a try later; that should save quite a bit of hardware resources.
Asked DeepSeek about it: for fine-tuning DeepSeek in this kind of setup, accelerate can still be used.
2025-04-27: One RTX 3090 24G paired with two E5-2696 v4 CPUs. After fine-tuning with accelerate, no matter how I loaded the model it kept running on the CPU. Only then did I learn that a model fine-tuned with offload like this has to be merged first, so the pieces that were offloaded to the CPU get exported, before webchat can run with the GPU doing the main work.
The loss curve it produced still declines in a jagged sawtooth even after smoothing, so clearly this wasn't done well, and the final loss is also way too high. Looked into it: the tuning hyperparameters still aren't right, so it's back to more alchemy.
Leaving the imperfect configs here for now. Honestly, given the training speed and the results, I really don't want to train a 14B model on hardware like this; it seems to top out at around 8B level, which is a bit disheartening. Will revisit later. On top of that, the merged fine-tuned model is still slow and hogs a bit too many resources.
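Once the LoRA has been merged (the export yaml for that is further down), pointing webchat at the merged directory is what finally gets inference running on the GPU instead of the CPU. A rough sketch, reusing the export_dir from the merge yaml below and the qwen template:
llamafactory-cli webchat \
--model_name_or_path /mcm/all_about_testing/llamaFactoryLesson/merged_output/qwen2_lora_sft \
--template qwen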
For the accelerate side, run the accelerate config command; a wizard walks you through it, then hand-edit a few spots and it's done.
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 8
  #gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  #zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: CUDAGRAPHS
  dynamo_mode: default
  dynamo_use_dynamic: true
  dynamo_use_fullgraph: false
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
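With that config saved (accelerate config writes it to ~/.cache/huggingface/accelerate/default_config.yaml by default), the run goes through accelerate launch instead of the deepspeed launcher. A minimal sketch, assuming train.py takes the sft yaml below as its single argument (the yaml path here is a placeholder):
accelerate launch /mcm/LLaMA-Factory/src/train.py /path/to/dp_sk_lora_sft.yaml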
The fine-tuning yaml used with llamafactory-cli train:
Usually the first thing you want to train is an identity dataset; I'd suggest training it separately with a much higher epoch count, say 50. That run can be done purely on the GPU without accelerate. Once it's done, take the trained result and train something else on top of it (see the sketch after the yaml below).
### model
model_name_or_path: /mcm/downloaded_data_batch/complete_trained_model/DeepSeek-R1-Distill-Qwen-14B
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
#lora_alpha: 16
#lora_target: all
#quantization_method: bitsandbytes
#quantization_device_map: auto
#double_quantization: false
quantization_bit: 8 # a must here, otherwise CUDA OOM
#vllm_gpu_util: 1
### dataset
dataset: adgen_local,alpaca_gpt4_zh
template: deepseek3
cutoff_len: 1024
max_samples: 3000
overwrite_cache: true
preprocessing_num_workers: 60
dataloader_num_workers: 20
### output
output_dir: ./saves/dp-sk/lora/sft
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]
### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
#max_grad_norm: 1.0
#learning_rate: 1.0e-4
learning_rate: 2.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 20
#warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
flash_attn: fa2
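To actually run the yaml above, plus the identity warm-up mentioned earlier, the flow looks roughly like this. The yaml file names are made up; the identity copy is just the same yaml with dataset changed to identity and num_train_epochs bumped to around 50, and (my assumption) adapter_name_or_path in the second yaml pointing at the identity adapter so it keeps training on top of it:
# 1) identity-only warm-up, pure GPU, no accelerate
llamafactory-cli train dp_sk_identity.yaml
# 2) main run with the yaml above, continuing from the identity adapter
llamafactory-cli train dp_sk_lora_sft.yaml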
The merge yaml needed for llamafactory-cli export:
### Note: DO NOT use quantized model or quantization_bit when merging lora adapters
### model
model_name_or_path: /mcm/downloaded_data_batch/complete_trained_model/DeepSeek-R1-Distill-Qwen-14B
adapter_name_or_path: /mcm/all_about_testing/llamaFactoryLesson/saves/dp-sk/lora/sft
template: qwen
finetuning_type: lora
trust_remote_code: true
### export
export_dir: /mcm/all_about_testing/llamaFactoryLesson/merged_output/qwen2_lora_sft
export_size: 10 # arbitrary value; apparently it usually just needs to be greater than 2
export_device: cpu # choices: [cpu, auto]
export_legacy_format: false
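The merge itself is then a single command pointing llamafactory-cli export at this yaml (the file name is made up):
llamafactory-cli export merge_dp_sk.yaml
After that, the merged model under export_dir can be loaded directly, e.g. for the webchat sketch earlier.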