DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Paper Walkthrough)

Table of Contents

  • Preface
  • I. Abstract
  • II. Introduction
  • III. Contributions
    • 1. Contributions
      • Post-Training: Large-Scale Reinforcement Learning on the Base Model
      • Distillation: Smaller Models Can Be Powerful Too
    • 2. Summary of Evaluation Results
      • Reasoning tasks
      • Knowledge
      • Others
  • IV. Method
    • 1. Overview
    • 2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
      • Reinforcement Learning Algorithm (GRPO, the key part)
      • Reward Modeling
      • Training Template
      • Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
        • Performance of DeepSeek-R1-Zero
        • Self-evolution Process of DeepSeek-R1-Zero
        • Aha Moment of DeepSeek-R1-Zero
        • Drawback of DeepSeek-R1-Zero
    • 3. DeepSeek-R1: Reinforcement Learning with Cold Start
      • Cold Start
      • Reasoning-Oriented Reinforcement Learning
      • Rejection Sampling and Supervised Fine-Tuning
        • Reasoning data
        • Non-reasoning data
        • Reinforcement Learning for All Scenarios
    • 4. Distillation: Empower Small Models with Reasoning Capability
  • V. Experiments
    • Benchmarks
    • Evaluation Prompts
    • Baselines
    • Evaluation Setup
    • 1. DeepSeek-R1 Evaluation
    • 2. Distilled Model Evaluation
  • VI. Conclusion

Preface

DeepSeek-R1 is a large language model built on reinforcement learning that has drawn wide attention for its technical breakthroughs and open-source release. It successfully reproduces the deep reasoning capability of top-tier models, and because it is open source, researchers and developers around the world can freely use and improve it. DeepSeek-R1 not only excels in areas such as mathematics and algorithms, but also transfers to other tasks such as writing. With limited resources, it achieves efficient performance through algorithmic innovation, lowering the barrier to entry, and one-click deployment simplifies getting started. Thanks to this combination of technical innovation, ease of use, and broad applicability, DeepSeek-R1 quickly won strong market recognition, topped the app download charts, and is widely seen as a product that could reshape the global AI competitive landscape.

Paper: https://arxiv.org/pdf/2501.12948

I. Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

We introduce the first-generation reasoning models DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, and it demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally exhibits many powerful and intriguing reasoning behaviors. However, it also suffers from problems such as poor readability and language mixing. To address these issues and further improve reasoning performance, DeepSeek-R1 is introduced, which adds multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 on top of Qwen and Llama are open-sourced.

II. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI).

Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources against pre-training. In the context of reasoning capabilities, OpenAI’s o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge of effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI’s o1 series models.
In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.

However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on the reasoning benchmarks among dense models.

In recent years, large language models (LLMs), driven by organizations such as Anthropic (2024), Google (2024), and OpenAI (2024a), have undergone rapid iteration and evolution, progressively narrowing the gap to artificial general intelligence (AGI).

Recently, post-training has become an important component of the full training pipeline. It has been shown to improve accuracy on reasoning tasks, align models with social values, and adapt them to user preferences, while requiring relatively little compute compared to pre-training. On the reasoning side, OpenAI's o1 series (OpenAI, 2024b) was the first to introduce inference-time scaling by lengthening the Chain-of-Thought reasoning process. This approach has produced significant gains on a variety of reasoning tasks such as mathematics, coding, and scientific reasoning. However, effective test-time scaling remains an open problem for the research community. Prior work has explored several approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). None of these methods, however, has reached general reasoning performance comparable to OpenAI's o1 series.

In this paper, the authors take a first step toward improving language-model reasoning with pure reinforcement learning (RL). The goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Concretely, DeepSeek-V3-Base is used as the base model and GRPO (Shao et al., 2024) as the RL framework to improve reasoning performance. During training, DeepSeek-R1-Zero naturally developed many powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero achieves outstanding results on reasoning benchmarks. For example, its pass@1 score on AIME 2024 rises from 15.6% to 71.0%, and with majority voting the score further improves to 86.7%, matching OpenAI-o1-0912.
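
The introduction only names GRPO (Group Relative Policy Optimization); the details come later in the Methods section. As a quick orientation, here is a minimal NumPy sketch of the idea at GRPO's core, assuming the standard formulation from Shao et al. (2024): for each prompt a group of G completions is sampled, and each completion's advantage is its reward normalized by the group's mean and standard deviation, so no separate value (critic) model is needed. Function and variable names are illustrative, and the KL penalty to the reference model is omitted.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's critic-free advantage: normalize each completion's reward
    against the mean and std of its own sampled group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(advantages, logp_new, logp_old, clip_eps=0.2):
    """PPO-style clipped objective averaged over the group
    (the KL penalty against the reference policy is left out here)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()  # maximize this quantity

# Example: a group of G = 4 sampled answers to one prompt, rewarded 1 if correct else 0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [ 1. -1. -1.  1.]
```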

However, DeepSeek-R1-Zero suffers from problems such as poor readability and language mixing. To address these issues and further improve reasoning performance, DeepSeek-R1 is introduced, combining a small amount of cold-start data with a multi-stage training pipeline. Concretely, thousands of cold-start examples are first collected to fine-tune the DeepSeek-V3-Base model. Reasoning-oriented RL is then run as in DeepSeek-R1-Zero. When the RL process is close to convergence, new SFT data is created via rejection sampling on the RL checkpoint (checkpoint: here this simply means the saved model weights), combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and the DeepSeek-V3-Base model is retrained. After fine-tuning on the new data, the checkpoint goes through an additional RL stage that takes prompts from all scenarios into account. Through these steps, the authors obtain the checkpoint called DeepSeek-R1, whose performance is on par with OpenAI-o1-1217.
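
To keep the four stages straight, the outline below restates the pipeline described above in code form. Every function is a placeholder standing in for a full training or data-curation step, not an implementation; names are illustrative.

```python
# Illustrative outline of the DeepSeek-R1 training pipeline described above.
# Each function is a placeholder for an entire training/data stage.

def cold_start_sft(base_model, cold_start_data):
    """Stage 1: fine-tune DeepSeek-V3-Base on thousands of long-CoT cold-start examples."""

def reasoning_oriented_rl(model):
    """Stage 2: large-scale reasoning-oriented RL (as in DeepSeek-R1-Zero), run until near convergence."""

def rejection_sampling_sft(base_model, rl_checkpoint, non_reasoning_data):
    """Stage 3: rejection-sample the RL checkpoint to build new reasoning SFT data,
    mix in writing / factual-QA / self-cognition data, then retrain the base model."""

def all_scenario_rl(model):
    """Stage 4: a second RL stage over prompts from all scenarios."""

def build_deepseek_r1(base_model, cold_start_data, non_reasoning_data):
    m = cold_start_sft(base_model, cold_start_data)
    m = reasoning_oriented_rl(m)
    m = rejection_sampling_sft(base_model, m, non_reasoning_data)  # note: retrains from the base model
    return all_scenario_rl(m)
```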

The authors further explore distilling DeepSeek-R1 into smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, distilling directly from DeepSeek-R1 outperforms applying RL to that base model itself, which shows that the reasoning patterns discovered by a larger base model are crucial for improving reasoning ability. The distilled Qwen and Llama (Dubey et al., 2024) series are open-sourced. Notably, the distilled 14B model outperforms the state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set new records on reasoning benchmarks among dense models.

III. Contributions

1. Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

Post-Training: Large-Scale Reinforcement Learning on the Base Model
• We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
• We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model’s reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

  • RL is applied directly to the base model, without relying on supervised fine-tuning (SFT) as a preliminary step. This approach lets the model explore chain-of-thought (CoT) for solving complex problems, and it produced DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates abilities such as self-verification, reflection, and generating long CoTs, marking an important milestone for the research community. Notably, this is the first open research showing that the reasoning capabilities of LLMs can be incentivized purely through RL, without SFT (a toy sketch of the kind of rule-based reward this relies on is given after this list). This breakthrough paves the way for future advances in this area.

  • The pipeline used to develop DeepSeek-R1 is presented. It contains two RL stages, aimed at discovering improved reasoning patterns and aligning with human preferences, and two SFT stages, which serve as the seed for the model's reasoning and non-reasoning capabilities. The authors believe this pipeline will benefit the industry by producing better models.
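
The "pure RL without SFT" recipe above works because the rewards for DeepSeek-R1-Zero are simple rules rather than a learned reward model: an accuracy reward for a verifiably correct final answer plus a format reward for wrapping the reasoning and answer in <think>…</think> and <answer>…</answer> tags (details appear later in the Reward Modeling section). Below is a toy sketch of such a reward; the weighting and the exact-match check are illustrative assumptions, not the paper's actual implementation.

```python
import re

TEMPLATE = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: a format bonus for following the <think>/<answer>
    template plus an accuracy bonus when the extracted answer matches the reference."""
    match = TEMPLATE.search(completion)
    if match is None:
        return 0.0                                   # template not followed: no reward
    format_reward = 0.5                              # illustrative weight, not from the paper
    predicted = match.group(2).strip()
    accuracy_reward = 1.0 if predicted == reference_answer.strip() else 0.0
    return format_reward + accuracy_reward

print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # -> 1.5
```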

Distillation: Smaller Models Can Be Powerful Too

Distillation: Smaller Models Can Be Powerful Too

• We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.

• Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

  • The reasoning patterns of larger models can be distilled into smaller ones, yielding better performance than the reasoning patterns discovered by running RL on the small models themselves. The open-source DeepSeek-R1 and its API will help the research community distill better small models in the future.

  • Using reasoning data generated by DeepSeek-R1, several dense models that are widely used in the research community were fine-tuned. The evaluation results show that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B reaches 55.5% on AIME 2024, surpassing QwQ-32B-Preview. In addition, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results clearly outperform previous open-source models and are comparable to o1-mini. The distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on the Qwen2.5 and Llama3 series are open-sourced to the community (a sketch of how such distillation data can be assembled follows this list).
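
Distillation here is plain supervised fine-tuning on reasoning traces generated by DeepSeek-R1 (the paper curates roughly 800k such samples); no RL is applied to the distilled students. The sketch below shows only the data-construction side, under the assumption of two hypothetical helpers: `teacher_generate`, which samples a long-CoT response from DeepSeek-R1, and `is_correct`, a rule-based check of the final answer.

```python
from typing import Callable, Dict, List

def build_distillation_sft_set(
    prompts: List[Dict[str, str]],            # each item: {"question": ..., "reference": ...}
    teacher_generate: Callable[[str], str],   # hypothetical: sample a long-CoT response from DeepSeek-R1
    is_correct: Callable[[str, str], bool],   # hypothetical: rule-based check of the final answer
    samples_per_prompt: int = 4,
) -> List[Dict[str, str]]:
    """Collect teacher reasoning traces, keep only verifiably correct ones,
    and format them as ordinary SFT pairs for a smaller dense student model."""
    sft_examples: List[Dict[str, str]] = []
    for item in prompts:
        for _ in range(samples_per_prompt):
            trace = teacher_generate(item["question"])
            if is_correct(trace, item["reference"]):          # simple rejection filter
                sft_examples.append({"prompt": item["question"], "completion": trace})
                break                                         # keep one good trace per prompt in this sketch
    return sft_examples
```

The student is then fine-tuned on these pairs with a standard next-token cross-entropy loss; in the paper the distilled checkpoints receive only this kind of SFT, with no additional RL stage.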

2. Summary of Evaluation Results

1.2. Summary of Evaluation Results

Reasoning tasks

• Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (A sketch of how Pass@1 and majority voting are estimated is given below.)
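
For reference on how these numbers are produced: Pass@1 in this paper is estimated by sampling several responses per question with a non-zero temperature and averaging their correctness, while the consensus (majority-voting, e.g. cons@64) score takes the most frequent extracted answer across the samples. A minimal sketch, where `is_correct` and `extract_answer` are hypothetical helpers:

```python
from collections import Counter
from typing import Callable, List

def pass_at_1(samples: List[str], reference: str,
              is_correct: Callable[[str, str], bool]) -> float:
    """Estimate Pass@1 for one question: average correctness over k sampled responses."""
    return sum(is_correct(s, reference) for s in samples) / len(samples)

def majority_vote(samples: List[str], extract_answer: Callable[[str], str]) -> str:
    """Consensus answer across the k samples (cons@k): the most common extracted final answer."""
    answers = [extract_answer(s) for s in samples]
    return Counter(answers).most_common(1)[0][0]
```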
