In reinforcement learning, PPO (Proximal Policy Optimization) is a widely used policy-gradient method whose core idea is to stabilize training by limiting how much the policy can change at each update. PPO combines the advantage function with importance sampling, which lets it reuse data collected by a slightly older policy.
Recommended video reference: 零基础学习强化学习算法:ppo (a beginner-friendly walkthrough of the PPO algorithm).
Policy optimization objective (on-policy): the goal of policy optimization is to maximize the expected cumulative reward:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r(s_t, a_t) \right]$$
By the policy gradient theorem, we have:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$
where $A_t = Q(s_t, a_t) - V(s_t)$ is the advantage function, which measures how much better an action is than the policy's average behavior.
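As a small illustration (not part of the original derivation), the advantage can be approximated from a learned value function with a one-step TD estimate; the tensors below (`rewards`, `values`, `next_values`, `dones`) are hypothetical rollout data, and the training script later in this post uses a simplified GAE instead:

```python
import torch

# A minimal sketch of a one-step advantage estimate:
#   A_t ≈ r_t + γ V(s_{t+1}) - V(s_t)
gamma = 0.99
rewards = torch.tensor([1.0, 1.0, 1.0, 0.0])        # hypothetical rollout rewards
values = torch.tensor([0.8, 0.9, 0.7, 0.2])          # V(s_t) from the critic
next_values = torch.tensor([0.9, 0.7, 0.2, 0.0])     # V(s_{t+1})
dones = torch.tensor([0.0, 0.0, 0.0, 1.0])           # 1 when the episode ends

advantages = rewards + gamma * next_values * (1 - dones) - values
print(advantages)
```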
Since sampling directly from the current policy $\pi_\theta$ can be inefficient, we introduce an old policy $\pi_{\theta_{\text{old}}}$ and correct the resulting bias with importance sampling:
$$\mathbb{E}_{\pi_\theta}[\,\cdot\,] = \mathbb{E}_{\pi_{\theta_{\text{old}}}} \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \cdot (\,\cdot\,) \right]$$
The policy gradient then becomes:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_{\theta_{\text{old}}}} \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$
Let the ratio be $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$; the gradient then simplifies to:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_{\text{old}}} \left[ r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$
From this we can construct the following surrogate objective:
$$L^{\text{PG}}(\theta) = \mathbb{E}_{t} \left[ r_t(\theta) \cdot A_t \right]$$
Maximizing this objective pushes up the probability of actions with high advantage (i.e., good actions).
PPO proposes a clipped surrogate objective that restricts how far the policy can move in a single update.
Clipped policy objective:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) A_t, \ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t \right) \right]$$
Explanation: when $A_t > 0$ the ratio is capped at $1 + \epsilon$, so the update cannot push a good action's probability up too aggressively; when $A_t < 0$ it is floored at $1 - \epsilon$, so a bad action's probability cannot be pushed down too aggressively. Taking the minimum keeps the objective a pessimistic lower bound, which stabilizes training.
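A minimal PyTorch sketch of the clipped surrogate, assuming per-timestep log-probabilities and advantages are already available (`old_logprobs`, `new_logprobs`, `advantages` are illustrative placeholders; the full training script below computes the same quantity from real rollouts):

```python
import torch

eps = 0.2  # clipping range ε

# hypothetical per-timestep quantities from a rollout
old_logprobs = torch.tensor([-0.7, -1.2, -0.4])                        # log π_θ_old(a_t|s_t), fixed
new_logprobs = torch.tensor([-0.5, -1.0, -0.9], requires_grad=True)    # log π_θ(a_t|s_t)
advantages = torch.tensor([1.0, -0.5, 2.0])

ratio = torch.exp(new_logprobs - old_logprobs)             # r_t(θ)
surr1 = ratio * advantages                                 # unclipped term
surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clipped term
clip_objective = torch.min(surr1, surr2).mean()            # L^CLIP (to be maximized)
loss = -clip_objective                                     # the optimizer minimizes the negative
```

When `ratio` stays inside $[1-\epsilon, 1+\epsilon]$ the clipped and unclipped terms coincide; outside that range the gradient through the clipped branch vanishes, which is what keeps each update small.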
Full PPO objective (with entropy regularization):
To encourage exploration, an entropy term is usually added to the objective:
$$L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ L^{\text{CLIP}}(\theta) - c_1 \left( V_\theta(s_t) - R_t \right)^2 + c_2 S[\pi_\theta](s_t) \right]$$
Meaning of each term: $L^{\text{CLIP}}(\theta)$ is the clipped policy term; $c_1 (V_\theta(s_t) - R_t)^2$ is the value-function (critic) loss with weight $c_1$; $S[\pi_\theta](s_t)$ is the policy entropy at $s_t$, weighted by $c_2$. Written out in full:
$$L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) A_t, \ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) - c_1 \left( V_\theta(s_t) - R_t \right)^2 + c_2 S[\pi_\theta](s_t) \right]$$
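As a hedged sketch of how the combined objective is usually turned into a training loss (the coefficients $c_1 = 0.5$ and $c_2 = 0.01$ match the ones used in the script below; `clip_objective`, `values_pred`, `returns`, and `entropy` are stand-ins for the quantities defined above):

```python
import torch

c1, c2 = 0.5, 0.01                                         # value-loss and entropy coefficients
clip_objective = torch.tensor(0.12, requires_grad=True)    # stand-in for L^CLIP
values_pred = torch.tensor([0.9, 0.4, 1.1], requires_grad=True)  # V_θ(s_t)
returns = torch.tensor([1.0, 0.5, 1.5])                    # empirical returns R_t
entropy = torch.tensor([0.60, 0.70, 0.65])                 # per-state entropy S[π_θ](s_t)

value_loss = (values_pred - returns).pow(2).mean()         # (V_θ(s_t) - R_t)^2
# L^PPO is maximized, so the optimizer minimizes its negative:
loss = -clip_objective + c1 * value_loss - c2 * entropy.mean()
loss.backward()
```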
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
Achiam, J. (2018). Spinning Up in Deep RL. https://spinningup.openai.com/en/latest/algorithms/ppo.html
Below is a concise PyTorch implementation of PPO on CartPole-v1 (discrete action space). It follows the classic clipped-objective PPO and is meant for learning and quick experiments. Note that it uses the classic gym API; with gym >= 0.26 or gymnasium, `reset()` and `step()` return slightly different values.
✅ Install dependencies
pip install torch gym numpy
✅ PPO implementation in PyTorch (clip version)
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
# --- Hyperparameters ---
GAMMA = 0.99
EPS_CLIP = 0.2
LR = 2.5e-4
K_EPOCHS = 4
BATCH_SIZE = 64
EPISODES = 1000
UPDATE_INTERVAL = 2000  # number of environment steps between policy updates
# --- PPO network: shared trunk with actor and critic heads ---
class PPO(nn.Module):
def __init__(self, state_dim, action_dim):
super(PPO, self).__init__()
self.fc = nn.Sequential(
nn.Linear(state_dim, 64),
nn.Tanh(),
nn.Linear(64, 64),
nn.Tanh()
)
self.actor = nn.Linear(64, action_dim)
self.critic = nn.Linear(64, 1)
def forward(self, x):
x = self.fc(x)
return self.actor(x), self.critic(x)
    def act(self, state):
        # Sample an action from the current policy; the log-prob is detached and
        # squeezed to a 0-dim tensor so the rollout buffer stores plain scalars
        # and does not keep the computation graph alive.
        logits, value = self.forward(state)
        probs = torch.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action).detach().squeeze(0), value

    def evaluate(self, states, actions):
        # Re-evaluate stored (state, action) pairs under the current parameters.
        logits, values = self.forward(states)
        probs = torch.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        action_logprobs = dist.log_prob(actions)
        entropy = dist.entropy()
        return action_logprobs, torch.squeeze(values, -1), entropy
# --- Rollout buffer ---
class Memory:
def __init__(self):
self.clear()
def clear(self):
self.states = []
self.actions = []
self.logprobs = []
self.rewards = []
self.dones = []
self.values = []
def store(self, state, action, logprob, reward, done, value):
self.states.append(state)
self.actions.append(action)
self.logprobs.append(logprob)
self.rewards.append(reward)
self.dones.append(done)
self.values.append(value)
# --- Return and advantage estimation (simplified GAE) ---
def compute_returns_and_advantages(rewards, values, dones, gamma=GAMMA, lam=0.95):
returns = []
advs = []
gae = 0
    next_value = 0  # bootstrap value after the last stored step (simplification: assumed 0)
for step in reversed(range(len(rewards))):
mask = 1.0 - dones[step]
delta = rewards[step] + gamma * next_value * mask - values[step]
gae = delta + gamma * lam * mask * gae
returns.insert(0, gae + values[step])
advs.insert(0, gae)
next_value = values[step]
return torch.tensor(returns), torch.tensor(advs)
# --- Initialization ---
env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
policy = PPO(state_dim, action_dim)
optimizer = optim.Adam(policy.parameters(), lr=LR)
memory = Memory()
# --- Training loop ---
step_count = 0
for episode in range(EPISODES):
state = env.reset()
total_reward = 0
while True:
state_tensor = torch.FloatTensor(state).unsqueeze(0)
action, logprob, value = policy.act(state_tensor)
        next_state, reward, done, _ = env.step(action)  # classic gym 4-tuple API; gym>=0.26/gymnasium returns 5 values
memory.store(state, action, logprob, reward, done, value.item())
state = next_state
total_reward += reward
step_count += 1
if step_count % UPDATE_INTERVAL == 0:
            # --- Prepare training data ---
            states = torch.FloatTensor(np.array(memory.states))
            actions = torch.LongTensor(memory.actions)
            old_logprobs = torch.stack(memory.logprobs).detach()  # shape (N,); already detached in act()
            values = torch.FloatTensor(memory.values)
returns, advantages = compute_returns_and_advantages(
memory.rewards, memory.values, memory.dones
)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
            # --- PPO update: K epochs of minibatch SGD over the collected buffer ---
for _ in range(K_EPOCHS):
for i in range(0, len(states), BATCH_SIZE):
idx = slice(i, i + BATCH_SIZE)
batch_states = states[idx]
batch_actions = actions[idx]
batch_old_logprobs = old_logprobs[idx]
batch_returns = returns[idx]
batch_advs = advantages[idx]
                    logprobs, values_pred, entropy = policy.evaluate(batch_states, batch_actions)
                    # probability ratio r_t(θ) = π_θ / π_θ_old, computed in log space
                    ratio = torch.exp(logprobs - batch_old_logprobs)
                    surr1 = ratio * batch_advs
                    surr2 = torch.clamp(ratio, 1 - EPS_CLIP, 1 + EPS_CLIP) * batch_advs
                    # clipped surrogate objective (negated because the optimizer minimizes)
                    actor_loss = -torch.min(surr1, surr2).mean()
                    critic_loss = nn.MSELoss()(values_pred, batch_returns)
                    entropy_bonus = -0.01 * entropy.mean()  # entropy term encourages exploration
                    loss = actor_loss + 0.5 * critic_loss + entropy_bonus
optimizer.zero_grad()
loss.backward()
optimizer.step()
memory.clear()
if done:
print(f"Episode {episode+1}, Reward: {total_reward}")
break
env.close()
✅ Notes
Module | Purpose |
---|---|
Memory | Stores a rollout of experience used for each update |
act() | Samples an action and returns its log-prob and the critic value |
evaluate() | Recomputes log-probs, entropy, and value estimates during the policy update |
compute_returns_and_advantages() | Estimates advantages with a simplified form of GAE |
loss | Combines the clipped policy loss, the value-function loss, and the entropy bonus |
✅ Possible extensions
- Run multiple environments in parallel with gym.vector
- Visualize training curves with matplotlib or TensorBoard

Importance sampling is a re-weighting technique commonly used when we cannot sample directly from a target distribution: we instead sample from a distribution that is easy to sample from and correct the estimate with weights so that it still targets the desired expectation.
Suppose we have a target distribution $p(x)$ and want to estimate the expectation of a function $f(x)$ under $p(x)$:
$$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx$$
Introducing a proposal distribution $q(x)$ that is easy to sample from, we rewrite the integral:
$$\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_{x \sim q} \left[ f(x) \cdot \frac{p(x)}{q(x)} \right]$$
This is the core identity behind the importance sampling estimator:
$$\boxed{ \mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q} \left[ f(x) \cdot w(x) \right], \quad \text{where } w(x) = \frac{p(x)}{q(x)} \text{ is the importance weight} }$$
If we draw $N$ samples $x_1, \dots, x_N$ from $q(x)$, then:
$$\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) \cdot w(x_i)$$
where:
$$w(x_i) = \frac{p(x_i)}{q(x_i)}$$
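A small, self-contained numerical check of this estimator (the choice of target $p = \mathcal{N}(0,1)$, proposal $q = \mathcal{N}(1,2)$, and $f(x) = x^2$ is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# target p = N(0, 1), proposal q = N(1, 2); estimate E_p[x^2] = 1
def p(x): return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
def q(x): return np.exp(-(x - 1)**2 / (2 * 2**2)) / np.sqrt(2 * np.pi * 2**2)

x = rng.normal(loc=1.0, scale=2.0, size=N)   # samples from q
w = p(x) / q(x)                              # importance weights
estimate = np.mean(x**2 * w)                 # ≈ E_p[x^2] = 1
print(estimate)
```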
To reduce the variance of the estimate, a normalized (self-normalized) importance sampling estimator is often used:
$$\mathbb{E}_{x \sim p}[f(x)] \approx \sum_{i=1}^{N} \frac{w(x_i)}{\sum_{j=1}^{N} w(x_j)} f(x_i)$$
This form is especially useful when constant factors do not matter (e.g., in Bayesian inference or policy optimization).
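A sketch of the self-normalized variant, using an unnormalized target density $\tilde{p}(x) \propto e^{-x^2/2}$ to show that the normalizing constant is never needed (same illustrative proposal as above):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

p_tilde = lambda x: np.exp(-x**2 / 2)                              # unnormalized target, constant dropped
q_pdf = lambda x: np.exp(-(x - 1)**2 / 8) / np.sqrt(8 * np.pi)     # N(1, 2) density

x = rng.normal(1.0, 2.0, size=N)             # samples from q
w = p_tilde(x) / q_pdf(x)                    # unnormalized weights
estimate = np.sum(w * x**2) / np.sum(w)      # self-normalized estimate ≈ 1
print(estimate)
```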
In policy optimization, we want to estimate an expectation with respect to the current policy $\pi_\theta$:
$$\mathbb{E}_{\pi_\theta} \left[ f(s, a) \right]$$
But our samples come from an old policy $\pi_{\text{old}}$. What do we do?
Apply importance sampling:
$$\mathbb{E}_{\pi_\theta}[f(s, a)] = \mathbb{E}_{\pi_{\text{old}}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} \cdot f(s, a) \right]$$
Taking $f(s, a)$ to be the advantage $A(s, a)$ gives the importance-sampled policy-gradient surrogate:
$$L^{\text{PG}}(\theta) = \mathbb{E}_{\pi_{\text{old}}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} \cdot A(s, a) \right]$$
This is exactly the importance-sampling correction PPO uses to account for sampling from the old policy.
Drawback | Explanation |
---|---|
High variance | When $q(x)$ differs too much from $p(x)$, the weights $w(x)$ fluctuate wildly and the estimate becomes unstable. |
Normalization sacrifices exactness | The self-normalized estimator is biased, but the bias is usually small and the variance is lower. |
Sensitivity to extreme values | A few samples can receive very large weights and dominate the overall estimate. |
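One convenient way to see the high-variance problem is the effective sample size $\text{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$; the hedged sketch below compares a well-matched and a poorly-matched proposal for the same $\mathcal{N}(0,1)$ target (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # target N(0, 1)

def ess(proposal_mean, proposal_std):
    q = lambda x: (np.exp(-(x - proposal_mean)**2 / (2 * proposal_std**2))
                   / (proposal_std * np.sqrt(2 * np.pi)))
    x = rng.normal(proposal_mean, proposal_std, size=N)
    w = p(x) / q(x)
    return w.sum()**2 / (w**2).sum()   # effective sample size

print(ess(0.0, 1.2))   # q close to p: ESS stays near N
print(ess(3.0, 0.5))   # q far from p: ESS collapses; a few weights dominate
```

An ESS far below $N$ means a handful of weights dominate the estimate, which is exactly the instability described in the table above.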