Reinforcement Learning: From Policy Gradient to PPO
1. Overview of Reinforcement Learning
Reinforcement learning (RL) is a major branch of machine learning. It studies how an agent learns an optimal policy through interaction with its environment, so as to maximize cumulative reward.
Core Concepts
- Agent: the entity that takes actions in the environment
- Environment: the external world the agent interacts with
- State: the current situation of the environment
- Action: a move the agent can take
- Reward: the environment's feedback on the agent's action
- Policy: the agent's mapping from states to actions
- Value function: estimates the value of a state or state-action pair
- Q-function: estimates the value of a state-action pair
2. Policy Gradient Methods
2.1 Basic Principle
Policy gradient methods optimize the policy function directly: they compute the gradient of the expected return and update the policy parameters along it.
Core ideas:
- The policy is parameterized as a neural network
- The gradient is estimated from sampled trajectories
- Parameters are updated along the gradient of the expected return
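The idea above can be written as a formula. Via the log-derivative trick, the gradient of the expected return $J(\theta)$ with respect to the policy parameters is:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \right]
```

where $G_t$ is the discounted return from step $t$ onward. This expectation is exactly what Monte Carlo policy gradient methods estimate from sampled episodes.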
2.2 Monte Carlo Policy Gradient (REINFORCE)

```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, action_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.softmax(self.fc2(x), dim=-1)
        return x

class REINFORCE:
    def __init__(self, state_dim, action_dim, lr=0.001, gamma=0.99):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma

    def select_action(self, state):
        state = torch.FloatTensor(state)
        probs = self.policy(state)
        action = torch.multinomial(probs, 1).item()
        # Return the log-probability so the update step can use it directly.
        return action, torch.log(probs[action])

    def update(self, rewards, log_probs):
        # Compute discounted returns, working backwards through the episode.
        discounted_rewards = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            discounted_rewards.insert(0, G)
        discounted_rewards = torch.FloatTensor(discounted_rewards)
        # Normalize returns to reduce gradient variance.
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)
        policy_loss = []
        for log_prob, G in zip(log_probs, discounted_rewards):
            policy_loss.append(-log_prob * G)
        self.optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()
```

2.3 Advantage Function
The advantage function is used to reduce the variance of the gradient estimate:

```python
def compute_advantage(rewards, values, gamma=0.99, lambda_=0.95):
    """Generalized Advantage Estimation (GAE) over a single episode."""
    advantages = []
    advantage = 0
    for i in reversed(range(len(rewards))):
        # TD error; the value after the final step is taken to be 0.
        delta = rewards[i] + gamma * (values[i + 1] if i < len(rewards) - 1 else 0) - values[i]
        advantage = delta + gamma * lambda_ * advantage
        advantages.insert(0, advantage)
    return advantages
```

3. Actor-Critic Methods
3.1 Basic Principle
Actor-Critic methods combine the strengths of value functions and policy gradients:
- Actor: learns the policy and selects actions
- Critic: learns the value function and evaluates the chosen actions
3.2 Implementation Example
```python
class ValueNetwork(nn.Module):
    def __init__(self, state_dim):
        super(ValueNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

class ActorCritic:
    def __init__(self, state_dim, action_dim, lr_actor=0.0001, lr_critic=0.0005, gamma=0.99):
        self.actor = PolicyNetwork(state_dim, action_dim)
        self.critic = ValueNetwork(state_dim)
        self.optimizer_actor = optim.Adam(self.actor.parameters(), lr=lr_actor)
        self.optimizer_critic = optim.Adam(self.critic.parameters(), lr=lr_critic)
        self.gamma = gamma

    def select_action(self, state):
        state = torch.FloatTensor(state)
        probs = self.actor(state)
        action = torch.multinomial(probs, 1).item()
        return action, probs[action]

    def update(self, states, actions, rewards, next_states, dones):
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones)

        # Compute values and advantages; the bootstrap target is detached
        # so the critic regresses toward a fixed target.
        values = self.critic(states).squeeze()
        next_values = self.critic(next_states).squeeze()
        targets = rewards + self.gamma * next_values.detach() * (1 - dones)
        advantages = targets - values

        # Update the critic
        critic_loss = advantages.pow(2).mean()
        self.optimizer_critic.zero_grad()
        critic_loss.backward()
        self.optimizer_critic.step()

        # Update the actor
        probs = self.actor(states)
        log_probs = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze()
        actor_loss = -(log_probs * advantages.detach()).mean()
        self.optimizer_actor.zero_grad()
        actor_loss.backward()
        self.optimizer_actor.step()
```

4. TRPO and PPO
4.1 TRPO (Trust Region Policy Optimization)
TRPO ensures approximately monotonic policy improvement by constraining the size of each policy update:
- Constrains the update with a KL-divergence bound
- Solves the constrained optimization problem with the conjugate gradient method
- Computationally expensive, but stable
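To make the KL constraint concrete, here is a minimal sketch of the quantity TRPO bounds between the old and new policies. `categorical_kl` is a hypothetical helper for discrete action distributions, not TRPO's full conjugate-gradient machinery:

```python
import torch

def categorical_kl(old_probs, new_probs, eps=1e-8):
    # Mean KL(old || new) over a batch of discrete action distributions --
    # the quantity TRPO keeps below a small threshold at every update.
    kl = old_probs * (torch.log(old_probs + eps) - torch.log(new_probs + eps))
    return kl.sum(dim=-1).mean()

# Identical policies have zero KL; a shifted policy yields a positive value.
p = torch.tensor([[0.7, 0.3]])
q = torch.tensor([[0.5, 0.5]])
print(float(categorical_kl(p, p)))  # ~0.0
print(float(categorical_kl(p, q)))  # positive
```

When the KL between old and new policies exceeds the trust-region radius, TRPO shrinks the step; PPO replaces this explicit constraint with the clipped objective described next.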
4.2 PPO (Proximal Policy Optimization)
PPO is a simplification of TRPO that constrains the policy update with a clipped objective function:
- Simple to compute and easy to implement
- Performance comparable to or better than TRPO
- Has become the mainstream algorithm in reinforcement learning
4.3 PPO Implementation
```python
class PPO:
    def __init__(self, state_dim, action_dim, lr=0.0003, gamma=0.99,
                 gae_lambda=0.95, eps_clip=0.2, K_epochs=4):
        self.actor = PolicyNetwork(state_dim, action_dim)
        self.critic = ValueNetwork(state_dim)
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr)
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.eps_clip = eps_clip
        self.K_epochs = K_epochs

    def select_action(self, state):
        state = torch.FloatTensor(state)
        probs = self.actor(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob.item()

    def compute_gae(self, rewards, values, next_values, dones):
        advantages = []
        advantage = 0
        for i in reversed(range(len(rewards))):
            delta = rewards[i] + self.gamma * next_values[i] * (1 - dones[i]) - values[i]
            advantage = delta + self.gamma * self.gae_lambda * advantage * (1 - dones[i])
            advantages.insert(0, advantage)
        return advantages

    def update(self, memory):
        states = torch.FloatTensor(memory.states)
        actions = torch.LongTensor(memory.actions)
        old_log_probs = torch.FloatTensor(memory.log_probs)
        rewards = torch.FloatTensor(memory.rewards)
        next_states = torch.FloatTensor(memory.next_states)
        dones = torch.FloatTensor(memory.dones)

        # Compute values and advantages
        values = self.critic(states).squeeze()
        next_values = self.critic(next_states).squeeze()
        advantages = self.compute_gae(rewards, values.detach().numpy(),
                                      next_values.detach().numpy(), dones.numpy())
        advantages = torch.FloatTensor(advantages)

        # Value targets, built before the advantages are normalized
        returns = advantages + values.detach()

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-9)

        # Several update epochs over the same batch
        for _ in range(self.K_epochs):
            # New action probabilities under the current policy
            probs = self.actor(states)
            dist = torch.distributions.Categorical(probs)
            new_log_probs = dist.log_prob(actions)

            # Probability ratio between new and old policies
            ratio = torch.exp(new_log_probs - old_log_probs)

            # Clipped PPO objective
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()

            # Critic loss
            new_values = self.critic(states).squeeze()
            critic_loss = (new_values - returns).pow(2).mean()

            # Total loss
            loss = actor_loss + 0.5 * critic_loss

            # Update
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

# Rollout buffer
class Memory:
    def __init__(self):
        self.states = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.next_states = []
        self.dones = []

    def add(self, state, action, log_prob, reward, next_state, done):
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.next_states.append(next_state)
        self.dones.append(done)

    def clear(self):
        self.states = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.next_states = []
        self.dones = []
```

5. Continuous Action Spaces
5.1 Gaussian Policies
For continuous action spaces, the policy is usually parameterized with a Gaussian distribution:
```python
class ContinuousPolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(ContinuousPolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        mean = self.fc2(x)
        std = torch.exp(self.log_std)
        return mean, std

class ContinuousPPO:
    def __init__(self, state_dim, action_dim, lr=0.0003, gamma=0.99,
                 gae_lambda=0.95, eps_clip=0.2, K_epochs=4):
        self.actor = ContinuousPolicyNetwork(state_dim, action_dim)
        self.critic = ValueNetwork(state_dim)
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr)
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.eps_clip = eps_clip
        self.K_epochs = K_epochs

    def select_action(self, state):
        state = torch.FloatTensor(state)
        mean, std = self.actor(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum()
        return action.numpy(), log_prob.item()

    def compute_gae(self, rewards, values, next_values, dones):
        # Same GAE recursion as in the discrete PPO above.
        advantages = []
        advantage = 0
        for i in reversed(range(len(rewards))):
            delta = rewards[i] + self.gamma * next_values[i] * (1 - dones[i]) - values[i]
            advantage = delta + self.gamma * self.gae_lambda * advantage * (1 - dones[i])
            advantages.insert(0, advantage)
        return advantages

    def update(self, memory):
        states = torch.FloatTensor(memory.states)
        actions = torch.FloatTensor(memory.actions)
        old_log_probs = torch.FloatTensor(memory.log_probs)
        rewards = torch.FloatTensor(memory.rewards)
        next_states = torch.FloatTensor(memory.next_states)
        dones = torch.FloatTensor(memory.dones)

        # Compute values and advantages
        values = self.critic(states).squeeze()
        next_values = self.critic(next_states).squeeze()
        advantages = self.compute_gae(rewards, values.detach().numpy(),
                                      next_values.detach().numpy(), dones.numpy())
        advantages = torch.FloatTensor(advantages)

        # Value targets, built before the advantages are normalized
        returns = advantages + values.detach()

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-9)

        # Several update epochs over the same batch
        for _ in range(self.K_epochs):
            # New log-probabilities under the current policy
            mean, std = self.actor(states)
            dist = torch.distributions.Normal(mean, std)
            new_log_probs = dist.log_prob(actions).sum(dim=1)

            # Probability ratio
            ratio = torch.exp(new_log_probs - old_log_probs)

            # Clipped PPO objective
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()

            # Critic loss
            new_values = self.critic(states).squeeze()
            critic_loss = (new_values - returns).pow(2).mean()

            # Total loss
            loss = actor_loss + 0.5 * critic_loss

            # Update
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```

6. Training Tips
6.1 Hyperparameter Tuning
- Learning rate: typically between 1e-4 and 3e-4
- Batch size: adjust to available memory; 1024 or 2048 is common
- K_epochs: typically 4-10
- eps_clip: typically 0.1-0.3
- gamma: typically 0.99
- gae_lambda: typically 0.95
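Collecting the ranges above in one place, a typical starting configuration might look like this (illustrative defaults, not tuned for any particular task):

```python
# Illustrative PPO hyperparameter defaults drawn from the ranges above.
ppo_config = {
    "lr": 3e-4,          # learning rate
    "batch_size": 2048,  # rollout batch size
    "K_epochs": 4,       # update epochs per batch
    "eps_clip": 0.2,     # clipping range
    "gamma": 0.99,       # discount factor
    "gae_lambda": 0.95,  # GAE smoothing
}
print(ppo_config)
```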
6.2 Reward Shaping
A well-designed reward function is crucial to training:
- Sparse rewards: give large rewards for key events
- Shaping: add intermediate rewards to guide learning
- Normalization: standardize rewards to stabilize training
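The normalization point can be sketched as a running-statistics wrapper. `RewardNormalizer` is an illustrative helper (using Welford's online algorithm), not part of any library:

```python
import math

class RewardNormalizer:
    """Online reward normalizer using Welford's running mean/variance.
    Illustrative sketch of the normalization trick described above."""
    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations
        self.eps = eps

    def normalize(self, reward):
        # Update the running statistics with the new sample.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        var = self.m2 / self.count if self.count > 1 else 1.0
        # Return the standardized reward.
        return (reward - self.mean) / (math.sqrt(var) + self.eps)

norm = RewardNormalizer()
scaled = [norm.normalize(r) for r in [1.0, 5.0, 3.0, 2.0]]
print(scaled)
```

Rewards would pass through `normalize` before being stored in the rollout buffer, keeping their scale roughly constant as the return distribution drifts during training.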
6.3 Exploration Strategies
- Noise injection: add noise to the actions
- ε-greedy: choose a random action with some probability
- Entropy regularization: reward the policy for remaining stochastic, encouraging exploration
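As a sketch of entropy regularization: PyTorch's `Categorical` distribution exposes `entropy()` directly, so the bonus is one extra term in the actor loss. The log-probs and advantages here are made-up values for illustration:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, 3, requires_grad=True)  # 4 states, 3 discrete actions
dist = torch.distributions.Categorical(logits=logits)

# Hypothetical per-sample data, for illustration only.
actions = torch.tensor([0, 1, 2, 0])
advantages = torch.tensor([1.0, -0.5, 0.3, 0.2])

log_probs = dist.log_prob(actions)
entropy_bonus = dist.entropy().mean()  # high when the policy is near-uniform

entropy_coef = 0.01  # typical small coefficient
# Subtracting the bonus rewards stochastic policies, encouraging exploration.
actor_loss = -(log_probs * advantages).mean() - entropy_coef * entropy_bonus
actor_loss.backward()
print(float(entropy_bonus))
```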
7. Environments and Tools
7.1 OpenAI Gym
OpenAI Gym is the standard environment library for reinforcement learning:
```python
import gym

# Create the environment
env = gym.make('CartPole-v1')

# Reset the environment
state = env.reset()

# Step through it with random actions
for _ in range(1000):
    action = env.action_space.sample()  # random action
    next_state, reward, done, info = env.step(action)
    if done:
        state = env.reset()
    else:
        state = next_state

# Close the environment
env.close()
```

7.2 Stable Baselines3
Stable Baselines3 is a popular reinforcement learning library providing implementations of many algorithms:
```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create the environment
env = make_vec_env('CartPole-v1', n_envs=1)

# Initialize the model
model = PPO('MlpPolicy', env, verbose=1)

# Train the model
model.learn(total_timesteps=10000)

# Test the model
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:
        obs = env.reset()
```

7.3 RLlib
RLlib is Ray's reinforcement learning library, with support for distributed training:
```python
import gym
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# Configuration
config = {
    "env": "CartPole-v1",
    "framework": "torch",
    "num_workers": 2,
    "num_gpus": 0,
    "lr": 5e-4,
    "gamma": 0.99,
    "lambda": 0.95,
    "clip_param": 0.2,
    "kl_coeff": 0.1,
    "num_sgd_iter": 4,
    "sgd_minibatch_size": 128,
    "train_batch_size": 1024,
}

# Training
trainer = PPOTrainer(config=config)
for i in range(10):
    result = trainer.train()
    print(f"Iteration {i}, reward: {result['episode_reward_mean']}")

# Testing
env = gym.make(config["env"])
obs = env.reset()
done = False
reward_total = 0
while not done:
    action = trainer.compute_action(obs)
    obs, reward, done, info = env.step(action)
    reward_total += reward
print(f"Total reward: {reward_total}")
```

8. Practical Applications
8.1 Robot Control
```python
import gym
from stable_baselines3 import PPO

# Create the environment
env = gym.make('BipedalWalker-v3')

# Initialize the model
model = PPO('MlpPolicy', env, verbose=1)

# Train the model
model.learn(total_timesteps=1000000)

# Save the model
model.save("bipedal_walker_ppo")

# Load the model
model = PPO.load("bipedal_walker_ppo")

# Test
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:
        obs = env.reset()
```

8.2 Game AI
```python
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Create the environment
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, SIMPLE_MOVEMENT)
env = DummyVecEnv([lambda: env])

# Initialize the model
model = PPO('CnnPolicy', env, verbose=1)

# Train the model
model.learn(total_timesteps=1000000)

# Test
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:
        obs = env.reset()
```

8.3 Financial Trading
```python
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO

class TradingEnv(gym.Env):
    def __init__(self, data):
        super(TradingEnv, self).__init__()
        self.data = data
        self.current_step = 0
        self.balance = 10000
        self.shares = 0
        # Action space: 0 = hold, 1 = buy, 2 = sell
        self.action_space = spaces.Discrete(3)
        # Observation space: price, balance, shares held
        self.observation_space = spaces.Box(
            low=0, high=np.inf, shape=(3,), dtype=np.float32
        )

    def reset(self):
        self.current_step = 0
        self.balance = 10000
        self.shares = 0
        return self._get_observation()

    def _get_observation(self):
        return np.array([
            self.data[self.current_step],
            self.balance,
            self.shares
        ], dtype=np.float32)

    def step(self, action):
        current_price = self.data[self.current_step]

        # Execute the action
        if action == 1:  # buy
            if self.balance > current_price:
                self.shares += 1
                self.balance -= current_price
        elif action == 2:  # sell
            if self.shares > 0:
                self.shares -= 1
                self.balance += current_price

        # Advance to the next step
        self.current_step += 1
        done = self.current_step >= len(self.data) - 1

        # Reward: total assets relative to the initial capital
        total_asset = self.balance + self.shares * current_price
        reward = total_asset - 10000

        return self._get_observation(), reward, done, {}

# Create the environment
data = np.random.randn(1000) + 100  # simulated price data
env = TradingEnv(data)

# Initialize the model
model = PPO('MlpPolicy', env, verbose=1)

# Train the model
model.learn(total_timesteps=10000)

# Test
def test_trading_agent(model, env):
    obs = env.reset()
    total_reward = 0
    while True:
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    print(f"Total reward: {total_reward}")
    print(f"Final balance: {env.balance}")
    print(f"Final shares: {env.shares}")

test_trading_agent(model, env)
```

9. Challenges and Solutions
9.1 Common Challenges
- Low sample efficiency: large amounts of interaction data are required
- Sparse rewards: complex tasks are hard to learn
- Instability: the training process can be unstable
- Hyperparameter sensitivity: performance depends heavily on hyperparameter choices
- Exploration-exploitation dilemma: balancing exploration against exploitation
9.2 Solutions
- Experience replay: reuse past experience
- Prioritized experience replay: learn from important experience first
- Curiosity-driven exploration: explore based on intrinsic curiosity
- Curriculum learning: start with simple tasks and gradually increase difficulty
- Imitation learning: incorporate expert demonstrations
- Model-based prediction: use a learned model to predict future states
10. Conclusion
Reinforcement learning is a powerful machine learning paradigm. The progression from policy gradients to PPO has made it possible to solve increasingly complex tasks. PPO, one of the most popular algorithms today, is the first choice for many reinforcement learning applications thanks to its simplicity and effectiveness.
Key Takeaways
- Policy gradients: optimize the policy directly; applicable to continuous action spaces
- Actor-Critic: combines the strengths of value functions and policy gradients
- PPO: constrains policy updates with a clipped objective, balancing stability and sample efficiency
- Continuous actions: parameterize the policy with a Gaussian distribution
- Training tips: hyperparameter tuning, reward shaping, exploration strategies
- Tooling ecosystem: OpenAI Gym, Stable Baselines3, RLlib
Future Directions
- Multi-agent reinforcement learning: interaction and cooperation among multiple agents
- Hierarchical reinforcement learning: learning hierarchically structured policies
- Model-based reinforcement learning: using model predictions to improve sample efficiency
- Offline reinforcement learning: learning from static datasets
- Multi-task reinforcement learning: transferring knowledge across tasks
- Interpretability: making reinforcement learning models easier to explain
With continued research and practice, reinforcement learning will play an important role in many more domains, such as robot control, game AI, financial trading, and autonomous driving, providing a powerful tool for complex decision-making problems.