Reinforcement Learning Series: The PPO Algorithm

  • The PPO Algorithm
    • 1. Background: Policy Gradients & the Advantage Function
    • 2. Introducing Importance Sampling
    • 3. Deriving the PPO-Clip Objective
    • ✅ 4. Summary Formula (At a Glance)
    • References
  • PPO Example Implementation
  • Supplement: Importance Sampling
    • 1. Problem Setting: Estimating an Expectation
      • ❗ The problem:
    • 2. Introducing Importance Sampling
    • 3. Discrete Sampling Form (Monte Carlo Estimate)
    • 4. Normalized Importance Sampling
    • 5. Application in Reinforcement Learning (PPO / Policy Gradients)
    • 6. Drawbacks of Importance Sampling
    • 7. References

The PPO Algorithm

PPO (Proximal Policy Optimization) is an advanced policy-gradient method in reinforcement learning. Its core idea is to stabilize training by limiting how much the policy is allowed to change in each update. PPO combines the advantage function with an off-policy training idea, namely importance sampling.

Recommended reference video: 零基础学习强化学习算法:PPO (Learning Reinforcement Learning Algorithms from Scratch: PPO).


1. Background: Policy Gradients & the Advantage Function

Policy optimization objective (on-policy). The goal of policy optimization is to maximize the expected cumulative reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \gamma^t r(s_t, a_t) \right]$$

By the policy gradient theorem, we have:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$

where $A_t = Q(s_t, a_t) - V(s_t)$ is the advantage function, which measures how much better an action is than the policy's average behavior.
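
As a minimal illustrative sketch (not part of the original post), the policy-gradient objective above can be turned into a loss in PyTorch. Here logprobs and advantages are assumed to come from a rollout collected with the current policy:

import torch

def policy_gradient_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # Maximizing E[ log pi_theta(a_t|s_t) * A_t ] is equivalent to minimizing its negative.
    # The advantage is detached so gradients only flow through the log-probabilities.
    return -(logprobs * advantages.detach()).mean()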


2. Introducing Importance Sampling

Since sampling directly from the current policy $\pi_\theta$ can be inefficient, we introduce an old policy $\pi_{\theta_{\text{old}}}$ and use importance sampling to correct for the mismatch:

$$\mathbb{E}_{\pi_\theta}[\,\cdot\,] = \mathbb{E}_{\pi_{\theta_{\text{old}}}}\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \cdot (\,\cdot\,) \right]$$

The policy gradient then becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_{\theta_{\text{old}}}}\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$

Letting the ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, this simplifies to:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_{\text{old}}}\left[ r_t(\theta)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$

We can then construct the following surrogate objective:

$$L^{\text{PG}}(\theta) = \mathbb{E}_t\left[ r_t(\theta) \cdot A_t \right]$$

Maximizing this objective tends to increase the probability of actions with high advantage (i.e., good actions).
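
A minimal sketch of this surrogate objective (an illustration with my own naming, not from the original post): in practice the ratio $r_t(\theta)$ is computed from log-probabilities for numerical stability, where new_logprobs comes from the current policy and old_logprobs was stored when the data was collected:

import torch

def surrogate_objective(new_logprobs: torch.Tensor,
                        old_logprobs: torch.Tensor,
                        advantages: torch.Tensor) -> torch.Tensor:
    # r_t(theta) = pi_theta(a_t|s_t) / pi_old(a_t|s_t) = exp(log pi_theta - log pi_old)
    ratio = torch.exp(new_logprobs - old_logprobs.detach())
    # L^PG(theta) = E_t[ r_t(theta) * A_t ], returned as a negative loss to minimize
    return -(ratio * advantages.detach()).mean()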


3. Deriving the PPO-Clip Objective

PPO introduces a clipped policy objective that limits how far the policy can move:

Clipped policy objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta) A_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t \right) \right]$$

Explanation:

  • When $r_t(\theta)$ deviates too far from 1 (i.e., the new policy has changed too much), the clip function truncates its effect on the loss.
  • $\epsilon$ is typically set to 0.1 or 0.2.
  • This suppresses the instability caused by overly large policy updates (a short code sketch of the clip term follows this list).
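
As a short sketch of the clipping mechanism itself (ratio and advantages are assumed to be per-timestep tensors, as in the full example implementation later in this post):

import torch

def clipped_surrogate(ratio: torch.Tensor, advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # Unclipped term: r_t(theta) * A_t
    unclipped = ratio * advantages
    # Clipped term: clip(r_t(theta), 1 - eps, 1 + eps) * A_t
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic elementwise minimum, averaged over the batch (this is L^CLIP)
    return torch.min(unclipped, clipped).mean()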

Full PPO objective (with entropy regularization):

To encourage exploration, an entropy term is usually added to the objective:

$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[ L^{\text{CLIP}}(\theta) - c_1 \left( V_\theta(s_t) - R_t \right)^2 + c_2\, S[\pi_\theta](s_t) \right]$$

Meaning of each term (a short code sketch combining them follows this list):

  • $L^{\text{CLIP}}(\theta)$: the policy objective (main term)
  • $(V_\theta(s_t) - R_t)^2$: the value-function loss (used to train the critic)
  • $S[\pi_\theta](s_t)$: the policy entropy (encourages exploration)
  • $c_1, c_2$: weighting hyperparameters
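
A minimal sketch of how the three terms combine into a single loss to minimize (this mirrors the full example implementation below; the coefficient values c1 = 0.5 and c2 = 0.01 are illustrative assumptions):

import torch
import torch.nn.functional as F

def ppo_total_loss(ratio, advantages, values_pred, returns, entropy,
                   eps=0.2, c1=0.5, c2=0.01):
    # L^CLIP: clipped policy objective (to be maximized, hence negated below)
    policy_term = torch.min(ratio * advantages,
                            torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    # (V_theta(s_t) - R_t)^2: value-function loss for the critic
    value_loss = F.mse_loss(values_pred, returns)
    # S[pi_theta](s_t): policy entropy, encourages exploration
    entropy_bonus = entropy.mean()
    # Minimize: -L^CLIP + c1 * value loss - c2 * entropy
    return -policy_term + c1 * value_loss - c2 * entropy_bonus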

✅ 4. Summary Formula (At a Glance)

$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) - c_1 \left( V_\theta(s_t) - R_t \right)^2 + c_2\, S[\pi_\theta](s_t) \right]$$


References

  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017).
    Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
    https://arxiv.org/abs/1707.06347

  2. Sutton, R. S., & Barto, A. G. (2018).
    Reinforcement learning: An introduction. MIT Press.

  3. Achiam, J. (2018).
    Spinning Up in Deep RL.
    https://spinningup.openai.com/en/latest/algorithms/ppo.html


PPO Example Implementation

Below is a concise PyTorch implementation of PPO (Proximal Policy Optimization) on CartPole-v1 (a discrete action space). It uses the classic Clip-PPO objective and is intended for building intuition and quick experimentation. Note that it assumes the classic Gym API (gym versions before 0.26), in which env.reset() returns only the observation and env.step() returns four values.


✅ Install dependencies

pip install torch gym numpy

✅ PPO implementation in PyTorch (Clip version)

import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# --- Hyperparameters ---
GAMMA = 0.99
EPS_CLIP = 0.2
LR = 2.5e-4
K_EPOCHS = 4
BATCH_SIZE = 64
EPISODES = 1000
UPDATE_INTERVAL = 2000  # number of environment steps between policy updates

# --- PPO network: shared trunk with actor and critic heads ---
class PPO(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PPO, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh()
        )
        self.actor = nn.Linear(64, action_dim)
        self.critic = nn.Linear(64, 1)

    def forward(self, x):
        x = self.fc(x)
        return self.actor(x), self.critic(x)

    def act(self, state):
        logits, value = self.forward(state)
        probs = torch.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action), value

    def evaluate(self, states, actions):
        logits, values = self.forward(states)
        probs = torch.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action_logprobs = dist.log_prob(actions)
        entropy = dist.entropy()
        return action_logprobs, torch.squeeze(values), entropy

# --- Experience buffer ---
class Memory:
    def __init__(self):
        self.clear()

    def clear(self):
        self.states = []
        self.actions = []
        self.logprobs = []
        self.rewards = []
        self.dones = []
        self.values = []

    def store(self, state, action, logprob, reward, done, value):
        self.states.append(state)
        self.actions.append(action)
        self.logprobs.append(logprob)
        self.rewards.append(reward)
        self.dones.append(done)
        self.values.append(value)

# --- Return and advantage computation (simplified GAE) ---
def compute_returns_and_advantages(rewards, values, dones, gamma=GAMMA, lam=0.95):
    returns = []
    advs = []
    gae = 0
    next_value = 0
    for step in reversed(range(len(rewards))):
        mask = 1.0 - dones[step]
        delta = rewards[step] + gamma * next_value * mask - values[step]
        gae = delta + gamma * lam * mask * gae
        returns.insert(0, gae + values[step])
        advs.insert(0, gae)
        next_value = values[step]
    return torch.tensor(returns), torch.tensor(advs)

# --- Initialization ---
env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
policy = PPO(state_dim, action_dim)
optimizer = optim.Adam(policy.parameters(), lr=LR)
memory = Memory()

# --- Training loop ---
step_count = 0
state = env.reset()
for episode in range(EPISODES):
    state = env.reset()
    total_reward = 0

    while True:
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        action, logprob, value = policy.act(state_tensor)

        next_state, reward, done, _ = env.step(action)
        memory.store(state, action, logprob, reward, done, value.item())
        state = next_state
        total_reward += reward
        step_count += 1

        if step_count % UPDATE_INTERVAL == 0:
            # --- Prepare training data ---
            states = torch.FloatTensor(np.array(memory.states))
            actions = torch.LongTensor(memory.actions)
            old_logprobs = torch.stack(memory.logprobs).detach()
            values = torch.FloatTensor(memory.values)
            returns, advantages = compute_returns_and_advantages(
                memory.rewards, memory.values, memory.dones
            )
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

            # --- PPO update: K epochs over the collected batch ---
            for _ in range(K_EPOCHS):
                for i in range(0, len(states), BATCH_SIZE):
                    idx = slice(i, i + BATCH_SIZE)
                    batch_states = states[idx]
                    batch_actions = actions[idx]
                    batch_old_logprobs = old_logprobs[idx]
                    batch_returns = returns[idx]
                    batch_advs = advantages[idx]

                    logprobs, values_pred, entropy = policy.evaluate(batch_states, batch_actions)
                    ratio = torch.exp(logprobs - batch_old_logprobs)

                    surr1 = ratio * batch_advs
                    surr2 = torch.clamp(ratio, 1 - EPS_CLIP, 1 + EPS_CLIP) * batch_advs
                    actor_loss = -torch.min(surr1, surr2).mean()
                    critic_loss = nn.MSELoss()(values_pred, batch_returns)
                    entropy_bonus = -0.01 * entropy.mean()

                    loss = actor_loss + 0.5 * critic_loss + entropy_bonus

                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

            memory.clear()

        if done:
            print(f"Episode {episode+1}, Reward: {total_reward}")
            break

env.close()

✅ Notes

  • Memory: stores a window of experience used for each policy update
  • act(): samples an action and returns its log_prob and the critic's value estimate
  • evaluate(): recomputes log_prob, entropy, and value estimates during the policy update
  • compute_returns_and_advantages(): estimates advantages with a simplified form of GAE
  • loss: combines the clipped policy loss, the value-function loss, and the entropy bonus

✅ Possible extensions

  • ✔️ Support continuous action spaces (using a Gaussian policy)
  • ✔️ Use full GAE(λ)
  • ✔️ Run multiple environments in parallel with gym.vector
  • ✔️ Save models and visualize training curves (e.g., with matplotlib or TensorBoard)

Supplement: Importance Sampling

Importance sampling is a reweighting technique commonly used in probabilistic computation. When we cannot sample directly from a target distribution, it lets us sample from a distribution that is easy to sample from and correct with weights to estimate expectations under the target.


1. Problem Setting: Estimating an Expectation

Suppose we have a target distribution $p(x)$ and want to estimate the expectation of some function $f(x)$ under $p(x)$:

$$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx$$

❗ The problem:

  • We may not be able to sample from $p(x)$ directly;
  • but we can sample from another distribution $q(x)$, called the proposal (sampling) distribution.

2. Introducing Importance Sampling

We apply a simple transformation:

$$\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_{x \sim q}\left[ f(x) \cdot \frac{p(x)}{q(x)} \right]$$

This gives the core identity of the importance sampling estimator:

$$\boxed{\ \mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[ f(x) \cdot w(x) \right], \quad \text{where } w(x) = \frac{p(x)}{q(x)} \text{ is the importance weight}\ }$$


3. Discrete Sampling Form (Monte Carlo Estimate)

Suppose we draw $N$ samples $x_1, \dots, x_N$ from $q(x)$. Then:

$$\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) \cdot w(x_i)$$

where:

$$w(x_i) = \frac{p(x_i)}{q(x_i)}$$
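
As a small self-contained illustration (my own toy example, not from the original text), the estimator above can be checked numerically: we estimate $\mathbb{E}_{x \sim p}[x^2] = 1$ for a standard normal target $p$ while sampling only from a wider normal proposal $q$:

import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
N = 100_000

samples = rng.normal(0.0, 2.0, size=N)                                   # draw from q = N(0, 2)
weights = normal_pdf(samples, 0.0, 1.0) / normal_pdf(samples, 0.0, 2.0)  # w(x) = p(x) / q(x)
estimate = np.mean(samples**2 * weights)                                 # (1/N) * sum f(x_i) w(x_i)
print(estimate)  # close to E_p[x^2] = 1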


4. Normalized Importance Sampling

To reduce the variance of the estimate, a self-normalized form of importance sampling is often used:

$$\mathbb{E}_{x \sim p}[f(x)] \approx \sum_{i=1}^{N} \frac{w(x_i)}{\sum_{j=1}^{N} w(x_j)}\, f(x_i)$$

This form is especially useful when we do not care about constant factors in the densities (e.g., in Bayesian inference or policy optimization).
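
A minimal sketch of the self-normalized estimator, reusing the toy setup above but pretending the target density is only known up to its normalizing constant (again my own illustration, not from the original text):

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 2.0, size=100_000)             # draw from q = N(0, 2)

unnorm_p = np.exp(-0.5 * samples**2)                     # p = N(0, 1) up to a constant
q_pdf = np.exp(-0.5 * (samples / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

weights = unnorm_p / q_pdf
weights = weights / weights.sum()                        # self-normalize the weights

estimate = np.sum(weights * samples**2)                  # still estimates E_p[x^2] = 1
print(estimate)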


5. Application in Reinforcement Learning (PPO / Policy Gradients)

In policy optimization, we want to estimate an expectation under the current policy $\pi_\theta$:

$$\mathbb{E}_{\pi_\theta}\left[ f(s, a) \right]$$

But our samples come from an old policy $\pi_{\text{old}}$. What can we do?

Use importance sampling:

$$\mathbb{E}_{\pi_\theta}[f(s, a)] = \mathbb{E}_{\pi_{\text{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} \cdot f(s, a) \right]$$

where:

  • $w(s, a) = \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}$ is the importance weight;
  • in PPO, $f(s, a)$ is typically the advantage function $A(s, a)$, so the objective becomes:

$$L^{\text{PG}}(\theta) = \mathbb{E}_{\pi_{\text{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} \cdot A(s, a) \right]$$

This is exactly how PPO uses importance sampling to correct for the mismatch between the data-collecting policy and the policy being optimized.


6. Drawbacks of Importance Sampling

  • High variance: when $q(x)$ is far from $p(x)$, the weights $w(x)$ fluctuate wildly and the estimate becomes unstable (see the small demo below).
  • Normalization trades exactness for stability: the self-normalized estimator is biased, but the bias is usually small and the variance is lower.
  • Sensitivity to extreme values: a few samples with very large weights can dominate the whole estimate.
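
A tiny numerical demo of the first point (my own example, not from the original text): the further the proposal is from the target, the larger the empirical variance of the importance weights.

import numpy as np

def weight_variance(q_scale, n=100_000, seed=0):
    # Target p = N(0, 1), proposal q = N(0, q_scale); returns the empirical
    # variance of the importance weights w(x) = p(x) / q(x) under q.
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, q_scale, size=n)
    p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    q = np.exp(-0.5 * (x / q_scale)**2) / (q_scale * np.sqrt(2 * np.pi))
    return (p / q).var()

print(weight_variance(1.1))  # proposal close to the target: small weight variance
print(weight_variance(3.0))  # proposal far from the target: much larger variance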

7. References

  1. Rubinstein, R. Y., & Kroese, D. P. (2016).
    Simulation and the Monte Carlo Method (3rd ed.). Wiley.
