Reinforcement learning is an important branch of machine learning and one of the areas where deep learning has achieved major breakthroughs in recent years, showing up everywhere from Go to robot control.
Today's material is organized into two main parts: Q-learning and the Bellman equation, followed by their deep-learning extension, DQN.
Reinforcement learning involves an agent interacting with an environment and learning an optimal policy through trial and error. Let's first pin down some basic concepts:
A reinforcement learning problem is usually formalized as a Markov decision process (MDP), which consists of a set of states $S$, a set of actions $A$, transition probabilities $P(s'|s,a)$, a reward function $R(s,a,s')$, and a discount factor $\gamma$.
In an MDP, the next state depends only on the current state and action, not on the full history of earlier states; this is the "Markov property".
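To make these elements concrete, here is a minimal sketch of a tiny two-state MDP written out as plain Python dictionaries; the states, actions, probabilities, and rewards are invented purely for illustration.

```python
# A toy MDP: 2 states, 2 actions (all numbers invented for illustration).
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

# P[(s, a)] is a list of (probability, next_state) pairs
P = {
    ("s0", "stay"): [(1.0, "s0")],
    ("s0", "move"): [(0.8, "s1"), (0.2, "s0")],
    ("s1", "stay"): [(1.0, "s1")],
    ("s1", "move"): [(1.0, "s0")],
}

# R[(s, a, s')] is the immediate reward for that transition
R = {
    ("s0", "stay", "s0"): 0.0,
    ("s0", "move", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "move", "s0"): 0.0,
}
```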
There are two main value functions:
State-value function $V^\pi(s)$: the expected return when starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t r_t \mid s_0=s\right]$$
State-action value function $Q^\pi(s,a)$: the expected return when starting from state $s$, taking action $a$, and thereafter following policy $\pi$:

$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t r_t \mid s_0=s, a_0=a\right]$$
The two functions are related by:

$$V^\pi(s) = \sum_{a \in A} \pi(a|s)\, Q^\pi(s,a)$$
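This identity is easy to check numerically; here is a minimal numpy sketch with invented numbers:

```python
import numpy as np

# Invented Q-values for one state with three actions, and a stochastic policy pi(a|s)
q_s = np.array([1.0, 4.0, 2.0])   # Q^pi(s, a) for a = 0, 1, 2
pi_s = np.array([0.2, 0.5, 0.3])  # pi(a|s), sums to 1

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
v_s = np.dot(pi_s, q_s)
print(v_s)  # 0.2*1.0 + 0.5*4.0 + 0.3*2.0 = 2.8
```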
The Bellman equation is at the core of reinforcement learning theory: it expresses the recursive relationship between the value of the current state and the values of its successor states.
For the state-value function, the Bellman equation is:

$$V^\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} P(s'|s,a)\,[R(s,a,s') + \gamma V^\pi(s')]$$
For the state-action value function, the Bellman equation is:

$$Q^\pi(s,a) = \sum_{s' \in S} P(s'|s,a)\,\Big[R(s,a,s') + \gamma \sum_{a' \in A} \pi(a'|s') Q^\pi(s',a')\Big]$$
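Because $V^\pi$ appears on both sides, the Bellman equation can be solved by simply iterating it until it stops changing (iterative policy evaluation). Here is a minimal sketch on the invented toy MDP from above, redefined so the snippet runs on its own:

```python
# Iterative policy evaluation on the invented toy MDP.
# P[(s, a)] = list of (probability, next_state); R[(s, a, s')] = immediate reward.
P = {
    ("s0", "stay"): [(1.0, "s0")], ("s0", "move"): [(0.8, "s1"), (0.2, "s0")],
    ("s1", "stay"): [(1.0, "s1")], ("s1", "move"): [(1.0, "s0")],
}
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "move", "s1"): 1.0, ("s0", "move", "s0"): 0.0,
    ("s1", "stay", "s1"): 2.0, ("s1", "move", "s0"): 0.0,
}
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9

# An arbitrary fixed policy pi(a|s) to evaluate (numbers invented)
pi = {"s0": {"stay": 0.5, "move": 0.5}, "s1": {"stay": 0.9, "move": 0.1}}

V = {s: 0.0 for s in states}
for _ in range(200):  # repeatedly apply the Bellman expectation backup
    V = {
        s: sum(
            pi[s][a] * sum(p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in P[(s, a)])
            for a in actions
        )
        for s in states
    }
print(V)  # approximate V^pi(s) for each state
```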
The optimal value functions are defined as the best achievable values over all policies:

$$V^*(s) = \max_\pi V^\pi(s)$$

$$Q^*(s,a) = \max_\pi Q^\pi(s,a)$$
The corresponding Bellman optimality equations are:

$$V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s'|s,a)\,[R(s,a,s') + \gamma V^*(s')]$$

$$Q^*(s,a) = \sum_{s' \in S} P(s'|s,a)\,[R(s,a,s') + \gamma \max_{a' \in A} Q^*(s',a')]$$
Once $Q^*$ is known, the optimal policy follows directly: $\pi^*(s) = \arg\max_{a \in A} Q^*(s,a)$.
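When the model is known, the Bellman optimality equation also suggests a simple planning algorithm: repeatedly apply its right-hand side as an update (value iteration) and then read off the greedy policy. A minimal sketch, again on the invented toy MDP (redefined so it runs on its own):

```python
# Value iteration on the invented toy MDP.
P = {
    ("s0", "stay"): [(1.0, "s0")], ("s0", "move"): [(0.8, "s1"), (0.2, "s0")],
    ("s1", "stay"): [(1.0, "s1")], ("s1", "move"): [(1.0, "s0")],
}
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "move", "s1"): 1.0, ("s0", "move", "s0"): 0.0,
    ("s1", "stay", "s1"): 2.0, ("s1", "move", "s0"): 0.0,
}
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9

def lookahead(s, a, V):
    """One-step lookahead: sum_s' P(s'|s,a) * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in P[(s, a)])

V = {s: 0.0 for s in states}
for _ in range(200):
    # Bellman optimality backup: V(s) <- max_a one-step lookahead
    V = {s: max(lookahead(s, a, V) for a in actions) for s in states}

# Greedy policy extraction: pi*(s) = argmax_a Q*(s, a)
policy = {s: max(actions, key=lambda a: lookahead(s, a, V)) for s in states}
print(V, policy)
```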
Q-learning is a temporal-difference (TD) method that learns the optimal Q-function directly, without needing to know the environment's transition probabilities or reward function.
From the Bellman optimality equation we have:

$$Q^*(s,a) = \sum_{s' \in S} P(s'|s,a)\,[R(s,a,s') + \gamma \max_{a' \in A} Q^*(s',a')]$$
Since $P(s'|s,a)$ and $R(s,a,s')$ are unknown, we estimate this expectation from sampled experience and nudge the Q-value toward the sampled target (a stochastic-approximation style update):

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$$
where $\alpha$ is the learning rate, $r$ is the observed reward, $s'$ is the observed next state, and $\gamma$ is the discount factor.
This is the core update rule of Q-learning.
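As a quick worked example (all numbers invented): suppose $Q(s,a)=2.0$, the observed reward is $r=1$, $\gamma=0.9$, $\max_{a'}Q(s',a')=3.0$, and $\alpha=0.1$. One update step gives

$$Q(s,a) \leftarrow 2.0 + 0.1\,[\,1 + 0.9 \times 3.0 - 2.0\,] = 2.0 + 0.1 \times 1.7 = 2.17,$$

so the estimate moves a small step toward the sampled target $1 + 0.9 \times 3.0 = 3.7$.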
Now let's implement a simple Q-learning agent to solve the taxi problem (Taxi-v3) from Gymnasium (the maintained successor to OpenAI Gym):
```python
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
import time
import seaborn as sns

# Create the Taxi environment
env = gym.make("Taxi-v3")

# Initialize the Q-table
action_size = env.action_space.n
state_size = env.observation_space.n
q_table = np.zeros((state_size, action_size))

# Hyperparameters
alpha = 0.1            # learning rate
gamma = 0.99           # discount factor
epsilon = 1.0          # exploration rate
epsilon_min = 0.1      # minimum exploration rate
epsilon_decay = 0.995  # exploration rate decay
episodes = 1000        # number of episodes

# Training logs
rewards = []
epsilons = []
# Train the Q-table
for episode in range(episodes):
    # Reset the environment
    state, _ = env.reset()
    done = False
    total_reward = 0

    while not done:
        # Explore or exploit
        if np.random.random() < epsilon:
            # Explore: choose a random action
            action = env.action_space.sample()
        else:
            # Exploit: choose the action with the highest Q-value
            # (break ties randomly when several actions share the maximum)
            max_q = np.max(q_table[state])
            action = np.random.choice([a for a in range(action_size) if q_table[state, a] == max_q])

        # Take the action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Update the Q-table
        old_value = q_table[state, action]
        # Q-learning update derived from the Bellman optimality equation
        next_max = np.max(q_table[next_state])
        new_value = old_value + alpha * (reward + gamma * next_max - old_value)
        q_table[state, action] = new_value

        # Move to the next state and accumulate the reward
        state = next_state
        total_reward += reward

    # Decay the exploration rate
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Record data
    rewards.append(total_reward)
    epsilons.append(epsilon)

    # Print progress every 100 episodes
    if (episode + 1) % 100 == 0:
        avg_reward = np.mean(rewards[-100:])
        print(f"Episode: {episode + 1}, Average Reward: {avg_reward:.2f}, Epsilon: {epsilon:.2f}")
# Visualize the training process
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(rewards)
plt.title('Total Reward per Episode')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.subplot(1, 2, 2)
plt.plot(epsilons)
plt.title('Epsilon per Episode')
plt.xlabel('Episode')
plt.ylabel('Epsilon')
plt.tight_layout()
plt.savefig('q_learning_training.png')
plt.show()
# Evaluate the trained agent
def evaluate_agent(env, q_table, n_eval_episodes=100, max_steps=100):
    """Evaluate the Q-learning agent with a greedy policy."""
    eval_rewards = []
    for episode in range(n_eval_episodes):
        state, _ = env.reset()
        done = False
        total_reward = 0
        steps = 0
        while not done and steps < max_steps:
            # Choose the action with the highest Q-value (greedy policy)
            action = np.argmax(q_table[state])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
            state = next_state
            steps += 1
        eval_rewards.append(total_reward)
    return np.mean(eval_rewards), np.std(eval_rewards)

# Evaluate the agent
mean_reward, std_reward = evaluate_agent(env, q_table)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
# Visualize the Q-table
def visualize_q_table(q_table, action_names=['South', 'North', 'East', 'West', 'Pickup', 'Dropoff']):
    """Visualize a random sample of the Q-table as a heatmap."""
    # Only take a subset of states for visualization
    sample_states = np.random.choice(q_table.shape[0], size=min(10, q_table.shape[0]), replace=False)
    plt.figure(figsize=(12, 8))
    sns.heatmap(q_table[sample_states], annot=True, cmap="YlGnBu", yticklabels=sample_states, xticklabels=action_names)
    plt.title('Q-table Sample')
    plt.ylabel('State')
    plt.xlabel('Action')
    plt.savefig('q_table_sample.png')
    plt.show()

visualize_q_table(q_table)
# Demonstrate the agent's behavior
def show_agent_behavior(env, q_table, max_steps=100, delay=0.5):
    """Run one greedy episode and render every step."""
    state, _ = env.reset()
    env.render()
    done = False
    steps = 0
    total_reward = 0
    while not done and steps < max_steps:
        time.sleep(delay)  # pause so each step can be observed
        # Choose the action with the highest Q-value
        action = np.argmax(q_table[state])
        # Take the action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Move to the next state and accumulate the reward
        state = next_state
        total_reward += reward
        steps += 1
        # Render the environment
        env.render()
        print(f"Step: {steps}, Action: {action}, Reward: {reward}, Total Reward: {total_reward}")
    print("Episode finished!")
    print(f"Total steps: {steps}")
    print(f"Total reward: {total_reward}")

# Note: running this code directly in a plain console may not display the rendering correctly.
# To watch the agent, call this function in an environment that supports rendering
# (for example, create the env with render_mode="human").
# show_agent_behavior(env, q_table)

env.close()
```
When the state space is very large or continuous, a traditional Q-table is no longer practical. Deep Q-Networks (DQN) use a neural network to approximate the Q-function and are the deep-learning extension of Q-learning.
DQN's main innovations are experience replay (storing transitions in a buffer and training on random mini-batches to break correlations between consecutive samples) and a separate target network (a slowly updated copy of the Q-network used to compute the TD target), both of which appear in the implementation below.
Below is a basic DQN implementation:
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
from collections import deque
import gymnasium as gym

# Set random seeds for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)

# Device configuration: use the GPU if available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Q-network model
class QNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Experience replay buffer
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
# DQN agent
class DQNAgent:
    def __init__(self, state_size, action_size, hidden_size=64, lr=1e-3, gamma=0.99,
                 buffer_size=10000, batch_size=64, update_every=4, tau=1e-3):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.batch_size = batch_size
        self.update_every = update_every
        self.tau = tau

        # Q-networks: the online network and the target network
        self.q_network = QNetwork(state_size, action_size, hidden_size).to(device)
        self.target_network = QNetwork(state_size, action_size, hidden_size).to(device)
        self.target_network.load_state_dict(self.q_network.state_dict())

        # Optimizer
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)

        # Experience replay buffer
        self.memory = ReplayBuffer(buffer_size)

        # Counter used to decide when to learn
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        # Store the experience in the replay buffer
        self.memory.add(state, action, reward, next_state, done)

        # Learn once every `update_every` steps
        self.t_step = (self.t_step + 1) % self.update_every
        if self.t_step == 0 and len(self.memory) > self.batch_size:
            self.learn()

    def act(self, state, epsilon=0.0):
        """Choose an action according to the current policy."""
        state = torch.FloatTensor(state).unsqueeze(0).to(device)

        # Evaluation mode - no gradients needed
        self.q_network.eval()
        with torch.no_grad():
            action_values = self.q_network(state)
        self.q_network.train()

        # Epsilon-greedy policy
        if random.random() > epsilon:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))
    def learn(self):
        """Update the value parameters from a batch of replayed experience."""
        # Randomly sample a batch from memory
        states, actions, rewards, next_states, dones = self.memory.sample(self.batch_size)

        # Convert to tensors
        states = torch.FloatTensor(np.array(states)).to(device)
        actions = torch.LongTensor(actions).unsqueeze(1).to(device)
        rewards = torch.FloatTensor(rewards).unsqueeze(1).to(device)
        next_states = torch.FloatTensor(np.array(next_states)).to(device)
        dones = torch.FloatTensor(dones).unsqueeze(1).to(device)

        # Current Q-value estimates for the actions that were taken
        q_values = self.q_network(states).gather(1, actions)

        # Compute the target Q-values using the target network
        with torch.no_grad():
            next_q_values = self.target_network(next_states).max(1, keepdim=True)[0]
            target_q_values = rewards + self.gamma * next_q_values * (1 - dones)

        # Compute the loss
        loss = nn.MSELoss()(q_values, target_q_values)

        # Optimize the model
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Soft-update the target network
        self.soft_update()

        return loss.item()

    def soft_update(self):
        """Soft-update the target network: θ_target = τ*θ_local + (1-τ)*θ_target."""
        for target_param, local_param in zip(self.target_network.parameters(), self.q_network.parameters()):
            target_param.data.copy_(self.tau * local_param.data + (1.0 - self.tau) * target_param.data)
# Train a DQN agent
def train_dqn(env_name="CartPole-v1", num_episodes=1000,
              hidden_size=64, lr=1e-3, gamma=0.99, tau=1e-3,
              buffer_size=10000, batch_size=64, update_every=4):
    """Train a DQN agent."""
    # Create the environment
    env = gym.make(env_name)

    # Get the sizes of the state and action spaces
    if isinstance(env.observation_space, gym.spaces.Discrete):
        # (a discrete observation would additionally need to be one-hot encoded
        #  before being fed to the network)
        state_size = env.observation_space.n
    else:
        state_size = env.observation_space.shape[0]
    action_size = env.action_space.n

    # Create the agent
    agent = DQNAgent(state_size, action_size, hidden_size, lr, gamma,
                     buffer_size, batch_size, update_every, tau)

    # Initialize epsilon
    epsilon = 1.0
    epsilon_min = 0.01
    epsilon_decay = 0.995

    # Training log
    scores = []
    for i_episode in range(1, num_episodes + 1):
        state, _ = env.reset()
        score = 0
        done = False

        while not done:
            action = agent.act(state, epsilon)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward

        # Decay the exploration rate
        epsilon = max(epsilon_min, epsilon * epsilon_decay)

        # Record the score
        scores.append(score)

        # Print training progress
        if i_episode % 100 == 0:
            mean_score = np.mean(scores[-100:])
            print(f"Episode {i_episode}/{num_episodes} | Average Score: {mean_score:.2f} | Epsilon: {epsilon:.2f}")

    return agent, scores
# Evaluate the agent
def evaluate_dqn_agent(agent, env_name, num_episodes=10, render=False):
    """Evaluate a trained DQN agent with a greedy policy."""
    env = gym.make(env_name, render_mode='human' if render else None)
    scores = []
    for i in range(num_episodes):
        state, _ = env.reset()
        score = 0
        done = False
        while not done:
            action = agent.act(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            state = next_state
            score += reward
            if render:
                env.render()
        scores.append(score)
    env.close()
    return np.mean(scores), np.std(scores)
# Example usage:
# running this file directly trains a DQN agent on the CartPole environment
if __name__ == "__main__":
    agent, scores = train_dqn(env_name="CartPole-v1", num_episodes=500)

    # Plot the scores collected during training
    import matplotlib.pyplot as plt
    plt.figure(figsize=(10, 6))
    plt.plot(scores)
    plt.title('DQN Training - CartPole-v1')
    plt.xlabel('Episode')
    plt.ylabel('Score')
    plt.grid(True)
    plt.savefig('dqn_training.png')
    plt.show()

    # Evaluate the agent
    mean_score, std_score = evaluate_dqn_agent(agent, "CartPole-v1", num_episodes=10)
    print(f"Evaluation: Mean Score = {mean_score:.2f} ± {std_score:.2f}")

    # Visualize the agent's behavior (optional)
    # mean_score, std_score = evaluate_dqn_agent(agent, "CartPole-v1", num_episodes=3, render=True)
```
To understand Q-learning and the Bellman equation, keep the following key points in mind:
The core idea behind the Bellman equation is optimal substructure: the optimal solution to a problem contains the optimal solutions to its subproblems. For reinforcement learning, this means the value of a state under the optimal policy can be defined recursively in terms of the values of the next states.
Imagine a simple grid world in which the agent moves from a start cell to a goal cell. The Bellman equation says that the value of the current cell equals the weighted average of the values of the cells it may move to next, plus the immediate reward.
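As a quick numeric illustration (numbers invented): if every step yields a reward of $-1$, $\gamma = 0.9$, and the action "move right" leads deterministically to a cell with value $V^\pi(s') = 10$, then that action contributes

$$-1 + 0.9 \times 10 = 8$$

to the backup, and $V^\pi(s)$ is the average of such terms weighted by how often the policy chooses each action.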
Q-learning directly approximates the Bellman optimality equation without knowing or learning a model of the environment (the transition probabilities and reward function). This makes it a **model-free** algorithm.
The Q-learning update can be seen as using a single sampled transition to estimate the expectation on the right-hand side of the Bellman optimality equation: the TD target $r + \gamma \max_{a'} Q(s',a')$ is a one-sample estimate of that expectation. The table below contrasts Q-learning with its on-policy counterpart, SARSA:
| Algorithm | Update rule | Characteristics | Suitable when |
|---|---|---|---|
| Q-learning | $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$ | Off-policy; learns the optimal policy directly | Exploration carries low risk |
| SARSA | $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma Q(s',a') - Q(s,a)]$ | On-policy; accounts for the policy actually being followed | Exploration carries high risk |
Q-learning is "aggressive": it always assumes the best action will be chosen in the future. SARSA is "conservative": it accounts for the action actually chosen under the current policy.
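The difference is easiest to see side by side in code. Here is a minimal sketch (the function names and the numpy Q-table layout are my own choices, mirroring the tabular code above):

```python
import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha, gamma):
    # Off-policy target: bootstrap from the best action in s_next,
    # regardless of which action the behaviour policy will actually take there.
    target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (target - q_table[s, a])

def sarsa_update(q_table, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy target: bootstrap from the action a_next that the current
    # (e.g. epsilon-greedy) policy actually selected in s_next.
    target = r + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (target - q_table[s, a])
```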
In theory, Q-learning converges to the optimal Q-values when every state-action pair is visited infinitely often and the learning rates satisfy the Robbins-Monro conditions ($\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$).
In practice, we usually use a decaying learning rate or a small constant learning rate, and make sure there is enough exploration, as sketched below.
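One common tabular choice that satisfies these step-size conditions is a per-pair decaying learning rate such as $\alpha_t(s,a) = 1/N(s,a)$, where $N(s,a)$ counts visits. A minimal sketch (the counting scheme is just one illustrative option):

```python
import numpy as np

n_states, n_actions = 500, 6  # e.g. the Taxi-v3 sizes used above
visit_count = np.zeros((n_states, n_actions))

def decayed_alpha(s, a):
    """Return 1/N(s,a); this schedule satisfies sum(alpha)=inf and sum(alpha^2)<inf."""
    visit_count[s, a] += 1
    return 1.0 / visit_count[s, a]
```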
In complex problems the state space can be enormous or even continuous, and a tabular Q-function is no longer practical. We then need function approximation (such as a neural network) to estimate Q-values, which is exactly what the DQN implementation above does.