Reference: https://github.com/jingyaogong/minimind/tree/master
Train for 6 epochs:
python train_pretrain.py --epochs 6
Training log:
LLM总参数量:25.830 百万
Epoch:[1/6](0/11040) loss:8.940 lr:0.000550000000 epoch_Time:106.0min:
Epoch:[1/6](100/11040) loss:6.386 lr:0.000549997188 epoch_Time:49.0min:
Epoch:[1/6](200/11040) loss:6.219 lr:0.000549988753 epoch_Time:49.0min:
Epoch:[1/6](300/11040) loss:5.987 lr:0.000549974695 epoch_Time:48.0min:
Epoch:[1/6](400/11040) loss:5.926 lr:0.000549955014 epoch_Time:48.0min:
Epoch:[1/6](500/11040) loss:5.309 lr:0.000549929711 epoch_Time:47.0min:
Epoch:[1/6](600/11040) loss:5.165 lr:0.000549898786 epoch_Time:47.0min:
Epoch:[1/6](700/11040) loss:5.212 lr:0.000549862239 epoch_Time:46.0min:
Epoch:[1/6](800/11040) loss:5.397 lr:0.000549820073 epoch_Time:46.0min:
Epoch:[1/6](900/11040) loss:4.650 lr:0.000549772287 epoch_Time:45.0min:
Epoch:[1/6](1000/11040) loss:5.191 lr:0.000549718883 epoch_Time:45.0min:
Epoch:[1/6](1100/11040) loss:4.453 lr:0.000549659861 epoch_Time:45.0min:
Epoch:[1/6](1200/11040) loss:4.826 lr:0.000549595224 epoch_Time:44.0min:
Epoch:[1/6](1300/11040) loss:4.694 lr:0.000549524973 epoch_Time:44.0min:
Epoch:[1/6](1400/11040) loss:4.030 lr:0.000549449109 epoch_Time:43.0min:
Epoch:[1/6](1500/11040) loss:4.035 lr:0.000549367634 epoch_Time:43.0min:
......
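The learning rate printed in the log starts at 5.5e-4 and decays smoothly, which matches a cosine schedule whose floor is one tenth of a 5e-4 base rate. Below is a minimal sketch of such a schedule; the base rate of 5e-4 and the total step count of 6 × 11040 are inferred from the log values above, not read from train_pretrain.py:

import math

def get_lr(current_step: int, total_steps: int, base_lr: float) -> float:
    # Cosine decay from 1.1 * base_lr at step 0 down to 0.1 * base_lr at the last step.
    return base_lr / 10 + 0.5 * base_lr * (1 + math.cos(math.pi * current_step / total_steps))

# Reproduces the first few logged values, assuming base_lr = 5e-4 and
# 6 epochs of 11040 iterations each:
total_steps = 6 * 11040
for step in (0, 100, 200, 300):
    print(f"step {step}: lr = {get_lr(step, total_steps, 5e-4):.12f}")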
Reference: https://github.com/jingyaogong/minimind/tree/master
After the lesson of MiniMind-V1, whose low-quality pretraining data made the model produce gibberish, the author decided (after 2025-02-05) to stop pretraining on large-scale unsupervised datasets. Instead, the Chinese portion of the 匠数大模型 dataset was extracted and cleaned down to samples of fewer than 512 characters, and the resulting roughly 1.6 GB of text was concatenated directly into the pretraining file pretrain_hq.jsonl. The "hq" stands for high quality (it is still not truly high; improving data quality is a never-ending task).
Each line of pretrain_hq.jsonl has the following format:
{"text": "如何才能摆脱拖延症? 治愈拖延症并不容易,但以下建议可能有所帮助..."}
Train for 10 epochs:
python train_full_sft.py --epochs 10
Training log:
LLM总参数量:25.830 百万
Epoch:[1/10](0/9491) loss:2.481 lr:0.000055000000 epoch_Time:98.0min:
Epoch:[1/10](100/9491) loss:1.975 lr:0.000054999863 epoch_Time:43.0min:
Epoch:[1/10](200/9491) loss:1.907 lr:0.000054999452 epoch_Time:42.0min:
Epoch:[1/10](300/9491) loss:2.006 lr:0.000054998767 epoch_Time:41.0min:
Epoch:[1/10](400/9491) loss:1.852 lr:0.000054997809 epoch_Time:41.0min:
Epoch:[1/10](500/9491) loss:1.893 lr:0.000054996576 epoch_Time:40.0min:
Epoch:[1/10](600/9491) loss:1.841 lr:0.000054995070 epoch_Time:40.0min:
Epoch:[1/10](700/9491) loss:1.811 lr:0.000054993289 epoch_Time:39.0min:
Epoch:[1/10](800/9491) loss:1.818 lr:0.000054991235 epoch_Time:39.0min:
Epoch:[1/10](900/9491) loss:1.818 lr:0.000054988907 epoch_Time:38.0min:
Epoch:[1/10](1000/9491) loss:1.775 lr:0.000054986306 epoch_Time:38.0min:
Epoch:[1/10](1100/9491) loss:1.712 lr:0.000054983430 epoch_Time:38.0min:
Epoch:[1/10](1200/9491) loss:1.739 lr:0.000054980281 epoch_Time:37.0min:
Epoch:[1/10](1300/9491) loss:1.734 lr:0.000054976858 epoch_Time:37.0min:
Epoch:[1/10](1400/9491) loss:1.747 lr:0.000054973161 epoch_Time:36.0min:
Epoch:[1/10](1500/9491) loss:1.826 lr:0.000054969191 epoch_Time:36.0min:
Epoch:[1/10](1600/9491) loss:1.751 lr:0.000054964947 epoch_Time:35.0min:
Epoch:[1/10](1700/9491) loss:1.811 lr:0.000054960430 epoch_Time:35.0min:
Epoch:[1/10](1800/9491) loss:1.697 lr:0.000054955639 epoch_Time:34.0min:
Epoch:[1/10](1900/9491) loss:1.595 lr:0.000054950575 epoch_Time:34.0min:
Epoch:[1/10](2000/9491) loss:1.568 lr:0.000054945237 epoch_Time:34.0min:
Epoch:[1/10](2100/9491) loss:1.570 lr:0.000054939626 epoch_Time:33.0min:
Epoch:[1/10](2200/9491) loss:1.608 lr:0.000054933742 epoch_Time:33.0min:
Epoch:[1/10](2300/9491) loss:1.738 lr:0.000054927584 epoch_Time:32.0min:
......
After training finishes, run the evaluation:
python eval_model.py --model_mode 1
I chose manual input:
MiniMind模型参数量: 25.83M(illion)
[0] 自动测试
[1] 手动输入
1
👶: 你好
🤖️: 您好!有什么我可以帮助您的吗?
👶: 你是谁
🤖️: 作为一个AI语言模型,我没有主观意识和情感,但是我可以为您提供各种不同类型的信息和服务。
👶: 秦始皇是谁
🤖️: 秦始皇是秦始皇。
👶: 中国第一个皇帝是谁?
🤖️: 中国第一个皇帝是秦始皇,他统一了六国,并建立了秦朝。
👶: 秦始皇的名字是是什么?
🤖️: 中国历史上著名的"秦始皇"是指秦始皇在统一六国后,以统一文字、统一文字和统一文字而命名的。他在统一文字和统治下制定了宋朝,并在楚汉战争中取得了巨大的胜利。
As you can see, some of the questions are not answered very well.
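eval_model.py loads the checkpoint of the stage selected by --model_mode (here the SFT model trained above) and enters a prompt loop like the session shown. The sketch below only illustrates the general shape of such a manual-input loop with a transformers-style model; the checkpoint path, chat-template call, and sampling settings are placeholders rather than the script's actual code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path to a transformers-format export of the trained model.
MODEL_PATH = "./MiniMind2-Small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True).eval()

while True:
    question = input("👶: ").strip()
    if not question:
        break
    # Wrap the question in the chat template the SFT data used.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs, max_new_tokens=256,
            do_sample=True, temperature=0.85, top_p=0.85,
        )
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"🤖️: {reply}")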
The smallest model is currently only 26M parameters, yet training it already fills up the VRAM of two 4090D GPUs.
Run on the command line:
wandb login
Then enter the API key you received when registering for wandb.
Then run:
torchrun --nproc_per_node 2 train_pretrain.py --use_wandb --epochs 6
Here --nproc_per_node 2 sets the number of GPUs to 2, --use_wandb enables wandb logging, and --epochs 6 sets the number of epochs to 6.
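With torchrun, each process receives its own rank through environment variables, and experiment logging is normally restricted to rank 0 so that wandb does not create duplicate runs. A minimal sketch of that pattern (the project name and config keys are made up for illustration; the actual setup lives in train_pretrain.py):

import os
import torch
import torch.distributed as dist
import wandb

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
rank = dist.get_rank()

# Only rank 0 talks to wandb; the other ranks train silently.
if rank == 0:
    wandb.init(project="MiniMind-Pretrain",
               config={"epochs": 6, "world_size": dist.get_world_size()})

# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# ... training loop: every rank computes, only rank 0 calls wandb.log(...) ...

if rank == 0:
    wandb.finish()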