Keras Environment Reproduction Code (Part 2)

PPO CartPole Control Algorithm in Practice

  • Experiment requirements
  • Objective: learn and implement the PPO (Proximal Policy Optimization) algorithm, an improved policy-gradient method that stabilizes training by limiting the size of each policy update.
  • Principle: PPO is a policy-gradient reinforcement-learning algorithm designed to address the unstable policy updates that traditional policy-gradient methods (such as REINFORCE) can suffer from during training. PPO introduces an update mechanism that bounds how far each update can move the policy, improving both stability and sample efficiency. Its core ingredients are:
  • 1. Policy gradients: PPO is built on the policy-gradient framework; it updates the policy parameters along the gradient of an objective defined directly on the policy, so that the policy is optimized to produce a higher expected return.
  • 2. Bounded policy updates: PPO improves training stability by limiting how much the policy can change in a single update. This is achieved with a clipped surrogate objective that bounds the probability ratio between the new and the old policy (see the sketch after this list).
  • 3. Importance sampling: the surrogate objective weights the advantage by the probability ratio between the new policy and the old (data-collecting) policy. The advantage function measures how much better an action is than the average behaviour in a given state; the importance ratio is what allows several gradient steps to be taken on data sampled under the old policy.
  • 4. Value function estimation: PPO simultaneously learns a value function that estimates the value of each state. This estimate is used to compute the advantages (here via Generalized Advantage Estimation) and also helps stabilize the policy updates.
  • 5. Optimizer: PPO updates the policy and value-function parameters with a gradient-based optimizer (Adam in this experiment). The choice of optimizer and learning rates has a significant effect on performance and convergence speed.
  • 6. Trajectory buffer: PPO stores the agent-environment interactions of the current epoch in an on-policy trajectory buffer (not an off-policy experience-replay buffer). The collected batch is reused for several policy and value-function updates and then discarded, which improves data utilization while keeping the updates on-policy.
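
To make item 2 concrete, the snippet below is a minimal NumPy-only sketch of the PPO-Clip surrogate loss (written for illustration; the function name ppo_clip_loss and its arguments logp_new, logp_old, adv are hypothetical and not part of the experiment listing):

import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, clip_ratio=0.2):
    # Negative clipped surrogate objective L^CLIP, averaged over a batch.
    #   logp_new: log pi_theta(a|s) under the policy being optimized
    #   logp_old: log pi_theta_old(a|s) under the policy that collected the data
    #   adv:      advantage estimate for each (state, action) pair
    ratio = np.exp(logp_new - logp_old)                      # importance ratio r_t(theta)
    clipped = np.clip(ratio, 1 - clip_ratio, 1 + clip_ratio)  # bound the ratio
    # Elementwise minimum of the unclipped and clipped terms, negated so a
    # gradient-based optimizer can minimize it.
    return -np.mean(np.minimum(ratio * adv, clipped * adv))

# Tiny usage example with made-up numbers:
print(ppo_clip_loss(np.array([-0.1, -0.5]), np.array([-0.2, -0.4]), np.array([1.0, -0.5])))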

  • Experiment code

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
from keras import layers

import numpy as np
import tensorflow as tf
import gymnasium as gym
import scipy.signal

def discounted_cumulative_sums(x, discount):
    # Discounted cumulative sums of vectors for computing rewards-to-go and advantage estimates
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]


class Buffer:
    # Buffer for storing trajectories
    def __init__(self, observation_dimensions, size, gamma=0.99, lam=0.95):
        # Buffer initialization
        self.observation_buffer = np.zeros(
            (size, observation_dimensions), dtype=np.float32
        )
        self.action_buffer = np.zeros(size, dtype=np.int32)
        self.advantage_buffer = np.zeros(size, dtype=np.float32)
        self.reward_buffer = np.zeros(size, dtype=np.float32)
        self.return_buffer = np.zeros(size, dtype=np.float32)
        self.value_buffer = np.zeros(size, dtype=np.float32)
        self.logprobability_buffer = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.pointer, self.trajectory_start_index = 0, 0

    def store(self, observation, action, reward, value, logprobability):
        # Append one step of agent-environment interaction
        self.observation_buffer[self.pointer] = observation
        self.action_buffer[self.pointer] = action
        self.reward_buffer[self.pointer] = reward
        self.value_buffer[self.pointer] = value
        self.logprobability_buffer[self.pointer] = logprobability
        self.pointer += 1

    def finish_trajectory(self, last_value=0):
        # Finish the trajectory by computing advantage estimates and rewards-to-go
        path_slice = slice(self.trajectory_start_index, self.pointer)
        rewards = np.append(self.reward_buffer[path_slice], last_value)
        values = np.append(self.value_buffer[path_slice], last_value)

        deltas = rewards[:-1] + self.gamma * values[1:] - values[:-1]

        self.advantage_buffer[path_slice] = discounted_cumulative_sums(
            deltas, self.gamma * self.lam
        )
        self.return_buffer[path_slice] = discounted_cumulative_sums(
            rewards, self.gamma
        )[:-1]

        self.trajectory_start_index = self.pointer

    def get(self):
        # Get all data of the buffer and normalize the advantages
        self.pointer, self.trajectory_start_index = 0, 0
        advantage_mean, advantage_std = (
            np.mean(self.advantage_buffer),
            np.std(self.advantage_buffer),
        )
        self.advantage_buffer = (self.advantage_buffer - advantage_mean) / advantage_std
        return (
            self.observation_buffer,
            self.action_buffer,
            self.advantage_buffer,
            self.return_buffer,
            self.logprobability_buffer,
        )


def mlp(x, sizes, activation=keras.activations.tanh, output_activation=None):
    # Build a feedforward neural network
    for size in sizes[:-1]:
        x = layers.Dense(units=size, activation=activation)(x)
    return layers.Dense(units=sizes[-1], activation=output_activation)(x)


def logprobabilities(logits, a):
    # Compute the log-probabilities of taking actions a by using the logits (i.e. the output of the actor)
    logprobabilities_all = keras.ops.log_softmax(logits)
    logprobability = keras.ops.sum(
        keras.ops.one_hot(a, num_actions) * logprobabilities_all, axis=1
    )
    return logprobability


seed_generator = keras.random.SeedGenerator(1337)


# Sample action from actor
@tf.function
def sample_action(observation):
    logits = actor(observation)
    action = keras.ops.squeeze(
        keras.random.categorical(logits, 1, seed=seed_generator), axis=1
    )
    return logits, action


# Train the policy by maximizing the PPO-Clip objective
@tf.function
def train_policy(
    observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        ratio = keras.ops.exp(
            logprobabilities(actor(observation_buffer), action_buffer)
            - logprobability_buffer
        )
        min_advantage = keras.ops.where(
            advantage_buffer > 0,
            (1 + clip_ratio) * advantage_buffer,
            (1 - clip_ratio) * advantage_buffer,
        )

        policy_loss = -keras.ops.mean(
            keras.ops.minimum(ratio * advantage_buffer, min_advantage)
        )
    policy_grads = tape.gradient(policy_loss, actor.trainable_variables)
    policy_optimizer.apply_gradients(zip(policy_grads, actor.trainable_variables))

    kl = keras.ops.mean(
        logprobability_buffer
        - logprobabilities(actor(observation_buffer), action_buffer)
    )
    kl = keras.ops.sum(kl)
    return kl


# Train the value function by regression on mean-squared error
@tf.function
def train_value_function(observation_buffer, return_buffer):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        value_loss = keras.ops.mean((return_buffer - critic(observation_buffer)) ** 2)
    value_grads = tape.gradient(value_loss, critic.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, critic.trainable_variables))

# Hyperparameters of the PPO algorithm
steps_per_epoch = 4000
epochs = 30
gamma = 0.99
clip_ratio = 0.2
policy_learning_rate = 3e-4
value_function_learning_rate = 1e-3
train_policy_iterations = 80
train_value_iterations = 80
lam = 0.97
target_kl = 0.01
hidden_sizes = (64, 64)

# True if you want to render the environment
render = False

# Initialize the environment and get the dimensionality of the
# observation space and the number of possible actions
env = gym.make("CartPole-v1")
observation_dimensions = env.observation_space.shape[0]
num_actions = env.action_space.n

# Initialize the buffer
buffer = Buffer(observation_dimensions, steps_per_epoch)

# Initialize the actor and the critic as keras models
observation_input = keras.Input(shape=(observation_dimensions,), dtype="float32")
logits = mlp(observation_input, list(hidden_sizes) + [num_actions])
actor = keras.Model(inputs=observation_input, outputs=logits)
value = keras.ops.squeeze(mlp(observation_input, list(hidden_sizes) + [1]), axis=1)
critic = keras.Model(inputs=observation_input, outputs=value)

# Initialize the policy and the value function optimizers
policy_optimizer = keras.optimizers.Adam(learning_rate=policy_learning_rate)
value_optimizer = keras.optimizers.Adam(learning_rate=value_function_learning_rate)

# Initialize the observation, episode return and episode length
observation, _ = env.reset()
episode_return, episode_length = 0, 0

# Iterate over the number of epochs
for epoch in range(epochs):
    # Initialize the sum of the returns, lengths and number of episodes for each epoch
    sum_return = 0
    sum_length = 0
    num_episodes = 0

    # Iterate over the steps of each epoch
    for t in range(steps_per_epoch):
        if render:
            env.render()

        # Get the logits, action, and take one step in the environment.
        # Note that the gymnasium `truncated` flag is discarded here, so an
        # episode only ends when the pole falls or the epoch's step budget runs out.
        observation = observation.reshape(1, -1)
        logits, action = sample_action(observation)
        observation_new, reward, done, _, _ = env.step(action[0].numpy())
        episode_return += reward
        episode_length += 1

        # Get the value and log-probability of the action
        value_t = critic(observation)
        logprobability_t = logprobabilities(logits, action)

        # Store obs, act, rew, v_t, logp_pi_t
        buffer.store(observation, action, reward, value_t, logprobability_t)

        # Update the observation
        observation = observation_new

        # Finish trajectory if reached to a terminal state
        terminal = done
        if terminal or (t == steps_per_epoch - 1):
            last_value = 0 if done else critic(observation.reshape(1, -1))
            buffer.finish_trajectory(last_value)
            sum_return += episode_return
            sum_length += episode_length
            num_episodes += 1
            observation, _ = env.reset()
            episode_return, episode_length = 0, 0

    # Get values from the buffer
    (
        observation_buffer,
        action_buffer,
        advantage_buffer,
        return_buffer,
        logprobability_buffer,
    ) = buffer.get()

    # Update the policy and implement early stopping using KL divergence
    for _ in range(train_policy_iterations):
        kl = train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        )
        if kl > 1.5 * target_kl:
            # Early Stopping
            break

    # Update the value function
    for _ in range(train_value_iterations):
        train_value_function(observation_buffer, return_buffer)

    # Print mean return and length for each epoch
    print(
        f" Epoch: {epoch + 1}. Mean Return: {sum_return / num_episodes}. Mean Length: {sum_length / num_episodes}"
    )
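
After the training loop completes, the trained actor can be sanity-checked by acting greedily instead of sampling. The snippet below is a hedged usage sketch rather than part of the original listing; eval_env, greedy_action and eval_return are names introduced here for illustration:

# Roll out one evaluation episode with the greedy action argmax(logits).
eval_env = gym.make("CartPole-v1")
obs, _ = eval_env.reset(seed=0)
eval_return, finished = 0.0, False
while not finished:
    logits = actor(obs.reshape(1, -1))
    greedy_action = int(keras.ops.argmax(logits, axis=1)[0])
    obs, reward, terminated, truncated, _ = eval_env.step(greedy_action)
    eval_return += reward
    finished = terminated or truncated
print("Greedy evaluation return:", eval_return)
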
  • Experiment observations
  • From the code, the model consists of two networks: an actor MLP that outputs action logits and a critic MLP that outputs the state value. They consume the same observation input but do not share hidden layers. The training objective has two parts: the clipped policy-gradient (PPO-Clip) surrogate loss and the mean-squared error of the value function; this implementation does not add an entropy bonus. Advantages are computed with Generalized Advantage Estimation (GAE) over trajectories stored in an on-policy buffer (see the sketch after the list below). As training proceeds, the agent keeps the pole balanced for longer and longer, and the average return per epoch increases accordingly.
  1. In the first few epochs (1 to 5), the agent's Mean Return and Mean Length grow quickly, showing that the agent learns and improves its policy rapidly in the early stage.
  2. From Epoch 6 onward, mean return and length rise markedly, reaching 133.33 and above; the agent is starting to master a more effective policy for controlling the cart-pole.
  3. In Epochs 7 to 10 the performance fluctuates but stays at a fairly high level (roughly 160 to 285), suggesting the agent is still exploring better policies while consolidating what it has already learned.
  4. From Epoch 15 onward, the agent reaches and holds its best performance, with mean return and length equal to 4000. Because 4000 equals steps_per_epoch and the truncation flag is ignored, this means the pole never falls during an entire 4000-step data-collection epoch.
  5. The agent reaches this level within 15 of the 30 training epochs, which illustrates how sample-efficient PPO is on this problem.
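
As a complement to the GAE remark above, the following is a minimal sketch of what Buffer.finish_trajectory computes, using the same scipy.signal.lfilter trick as discounted_cumulative_sums (the three-step trajectory values below are made up purely for illustration):

import numpy as np
import scipy.signal

def discounted_cumulative_sums(x, discount):
    # y[t] = sum_k discount^k * x[t + k], same helper as in the listing
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]

gamma, lam = 0.99, 0.95
rewards = np.array([1.0, 1.0, 1.0], dtype=np.float32)  # made-up 3-step trajectory
values = np.array([0.5, 0.6, 0.7], dtype=np.float32)   # critic estimates for the 3 states
last_value = 0.0                                       # bootstrap value (0 at a terminal state)

r = np.append(rewards, last_value)
v = np.append(values, last_value)

# TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
deltas = r[:-1] + gamma * v[1:] - v[:-1]

# GAE advantages: A_t = sum_k (gamma * lam)^k * delta_{t+k}
advantages = discounted_cumulative_sums(deltas, gamma * lam)

# Rewards-to-go, used as regression targets for the critic
returns = discounted_cumulative_sums(r, gamma)[:-1]

print("advantages:", advantages)
print("returns:   ", returns)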

  • Experiment results and analysis
        Epoch: 1. Mean Return: 25.31645569620253. Mean Length: 25.31645569620253
        Epoch: 2. Mean Return: 29.62962962962963. Mean Length: 29.62962962962963
        Epoch: 3. Mean Return: 45.45454545454545. Mean Length: 45.45454545454545
        Epoch: 4. Mean Return: 64.51612903225806. Mean Length: 64.51612903225806
        Epoch: 5. Mean Return: 105.26315789473684. Mean Length: 105.26315789473684
        Epoch: 6. Mean Return: 133.33333333333334. Mean Length: 133.33333333333334
        Epoch: 7. Mean Return: 160.0. Mean Length: 160.0
        Epoch: 8. Mean Return: 210.52631578947367. Mean Length: 210.52631578947367
        Epoch: 9. Mean Return: 285.7142857142857. Mean Length: 285.7142857142857
        Epoch: 10. Mean Return: 250.0. Mean Length: 250.0
        Epoch: 11. Mean Return: 250.0. Mean Length: 250.0
        Epoch: 12. Mean Return: 666.6666666666666. Mean Length: 666.6666666666666
        Epoch: 13. Mean Return: 444.44444444444446. Mean Length: 444.44444444444446
        Epoch: 14. Mean Return: 500.0. Mean Length: 500.0
        Epoch: 15. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 16. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 17. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 18. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 19. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 20. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 21. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 22. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 23. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 24. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 25. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 26. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 27. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 28. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 29. Mean Return: 4000.0. Mean Length: 4000.0
        Epoch: 30. Mean Return: 4000.0. Mean Length: 4000.0

  • Experiment summary
  • The results show that PPO can effectively train an agent to solve the CartPole task. The agent not only learns to balance the pole, it reaches its best performance within a relatively small number of training epochs. By limiting the size of each policy update, PPO stabilizes the training process, which shows up in the experiment as rapid performance gains with comparatively little fluctuation. The agent also exhibits a good balance between exploration and exploitation: it keeps exploring new behaviour while exploiting what it has already learned to improve its performance. Thanks to this stability and efficiency, PPO is applicable not only to simple tasks such as CartPole but also to more complex reinforcement-learning problems.
