5 Training
This section describes the training regime for our models.
5.1 Training Data and Batching
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
[Analysis]
The sentence "For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]" reduces to the skeleton "we used A and split tokens into B", where A is the significantly larger WMT 2014 English-French dataset and B is a 32000 word-piece vocabulary. "Consisting of 36M sentences" is a present-participle phrase acting as a postmodifier of "English-French dataset"; "consist of" is a fixed phrase meaning "to be made up of", and "split ... into ..." means "to divide ... into ...".
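[Code sketch]
To make the batching scheme above concrete, here is a minimal Python sketch of grouping sentence pairs by approximate length under a budget of roughly 25000 source and 25000 target tokens per batch. It is an illustration only, not the authors' data pipeline; the function name `batch_by_tokens` and the assumed input format (lists of BPE token ids) are not from the paper.

```python
def batch_by_tokens(pairs, max_tokens=25000):
    """Group sentence pairs of similar length into batches that hold
    roughly `max_tokens` source tokens and `max_tokens` target tokens.

    `pairs` is assumed to be a list of (src_ids, tgt_ids) token-id lists.
    """
    # Sort by approximate sequence length so each batch contains
    # sentence pairs of similar length.
    pairs = sorted(pairs, key=lambda p: (len(p[0]), len(p[1])))

    batches, batch = [], []
    src_count = tgt_count = 0
    for src, tgt in pairs:
        # Close the current batch once either token budget would be exceeded.
        if batch and (src_count + len(src) > max_tokens
                      or tgt_count + len(tgt) > max_tokens):
            batches.append(batch)
            batch, src_count, tgt_count = [], 0, 0
        batch.append((src, tgt))
        src_count += len(src)
        tgt_count += len(tgt)
    if batch:
        batches.append(batch)
    return batches
```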
5.2 Hardware and Schedule
We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
5.3 Optimizer
We used the Adam optimizer [20] with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. We varied the learning rate over the course of training, according to the formula:
$$lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}) \tag{3}$$
This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps = 4000$.
[Analysis]
The sentence "This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number" reduces to "This corresponds to A and B". The verb phrase "corresponds to" is the predicate and means "is equivalent to"; "and" joins the two parallel objects. A is "increasing the learning rate linearly for the first $warmup\_steps$ training steps", whose head is "increasing the learning rate", with the adverb "linearly" and the prepositional phrase "for the first $warmup\_steps$ training steps" acting as adverbials. B is "decreasing it thereafter proportionally to the inverse square root of the step number", whose head is "decreasing it"; "thereafter" is a time adverbial, "proportionally to ..." means "in proportion to ...", and "the inverse square root" means the reciprocal of the square root.
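[Code sketch]
The schedule in Equation (3) is straightforward to reproduce. The sketch below is a hedged illustration (the helper name `noam_lrate` and the printed example steps are assumptions, not code from the paper): the learning rate rises linearly for the first 4000 steps and then decays with the inverse square root of the step number; in training, the resulting value would be fed to Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$ at every step.

```python
def noam_lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate of Equation (3): linear warmup for `warmup_steps`
    steps, then decay proportional to the inverse square root of the step."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The two branches of min(...) meet exactly at step == warmup_steps.
for step in (100, 4000, 100000):
    print(step, noam_lrate(step))
```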
5.4 Regularization
We employ three types of regularization during training:
Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$.
Label Smoothing During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
[Analysis]
The backbone of the sentence "In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks" is "we apply dropout to the sums" (we apply the dropout operation to these sums). The sentence-initial phrase "In addition" is an adverbial meaning "moreover". The prepositional phrase "of the embeddings and the positional encodings" is a postmodifier of "the sums", and the phrase "in both the encoder and decoder stacks" is an adverbial indicating where the operation is applied.
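[Code sketch]
To show where the residual dropout sits, here is a hedged PyTorch-style sketch (assuming a generic `sublayer` callable and standard `torch.nn` modules; it illustrates the description above rather than reproducing the reference implementation): dropout is applied to the sub-layer output before the residual addition and layer normalization, and likewise to the sum of the embeddings and positional encodings.

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection where dropout is applied to the sub-layer
    output before it is added to the input and normalized."""
    def __init__(self, d_model=512, p_drop=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # LayerNorm(x + Dropout(Sublayer(x))), as described in the text.
        return self.norm(x + self.dropout(sublayer(x)))

# Dropout is also applied to the sums of embeddings and positional encodings
# at the bottom of the encoder and decoder stacks, e.g.:
#   x = nn.Dropout(0.1)(token_embedding(tokens) + positional_encoding)
```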
[Supplementary note]
Perplexity (PPL) is a common evaluation metric in natural language processing, used especially for language models. It measures how well a probability model predicts a body of text: the lower the value, the better the model fits the text.
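[Code sketch]
As an illustration of label smoothing with $\epsilon_{ls} = 0.1$, the sketch below computes cross-entropy against a smoothed target distribution. It is an assumed formulation that spreads the smoothing mass uniformly over the non-target classes, not the authors' code.

```python
import torch
import torch.nn.functional as F

def label_smoothed_loss(logits, target, eps_ls=0.1):
    """Cross-entropy against a smoothed distribution: the true class
    receives 1 - eps_ls, the remaining classes share eps_ls uniformly."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, eps_ls / (n_classes - 1))
    smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - eps_ls)
    return -(smooth * log_probs).sum(dim=-1).mean()

# Example: a batch of 2 predictions over a 5-token vocabulary.
logits = torch.randn(2, 5)
target = torch.tensor([1, 3])
print(label_smoothed_loss(logits, target))
```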