Reinforcement Learning: From Policy Gradient to PPO

张小明

Front-End Development Engineer


1. Overview of Reinforcement Learning

Reinforcement learning (RL) is an important branch of machine learning. It studies how an agent learns an optimal policy by interacting with an environment so as to maximize cumulative reward.

Core Concepts

  • Agent: the entity that takes actions in the environment
  • Environment: the external world the agent interacts with
  • State: the current situation of the environment
  • Action: a move the agent can make
  • Reward: feedback from the environment on the agent's action
  • Policy: the agent's mapping from states to actions
  • Value Function: evaluates how good a state is (defined formally after this list)
  • Q-Function: evaluates how good a state-action pair is
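
These quantities can be written compactly. Using γ as the discount factor, the discounted return, the state value function, and the action value (Q) function under a policy π are, in the standard formulation:

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid s_t = s], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, a_t = a]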

2. Policy Gradient Methods

2.1 Basic Principle

Policy gradient methods optimize the policy function directly: they estimate the gradient of the expected return with respect to the policy parameters and update the parameters along that gradient, as written out below.
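
In standard notation, the objective is the expected discounted return of trajectories sampled from the policy, and its gradient (the REINFORCE form of the policy gradient theorem) is:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^t r_t\Big], \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]

where G_t is the discounted return from step t onward.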

Core Ideas

  • The policy is parameterized by a neural network
  • Gradients are estimated from sampled trajectories
  • Parameters are updated along the gradient of the expected return

2.2 Monte Carlo Policy Gradient (REINFORCE)

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, action_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.softmax(self.fc2(x), dim=-1)
        return x

class REINFORCE:
    def __init__(self, state_dim, action_dim, lr=0.001, gamma=0.99):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma

    def select_action(self, state):
        state = torch.FloatTensor(state)
        probs = self.policy(state)
        action = torch.multinomial(probs, 1).item()
        # Return the log-probability so it can be used directly in the loss
        return action, torch.log(probs[action])

    def update(self, rewards, log_probs):
        # Compute discounted returns G_t by iterating backwards over the episode
        discounted_rewards = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            discounted_rewards.insert(0, G)
        discounted_rewards = torch.FloatTensor(discounted_rewards)
        # Normalize returns to reduce gradient variance
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)
        policy_loss = []
        for log_prob, G in zip(log_probs, discounted_rewards):
            policy_loss.append(-log_prob * G)
        self.optimizer.zero_grad()
        # Each element is a 0-dim tensor, so stack (not cat) before summing
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()

2.3 Advantage Function

The advantage function is used to reduce the variance of the gradient estimate; the quantity computed by the code below is the generalized advantage estimate (GAE):
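
In equations, the TD error and the GAE advantage that the function below accumulates are:

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l} = \delta_t + \gamma\lambda\, \hat{A}_{t+1}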

def compute_advantage(rewards, values, gamma=0.99, lambda_=0.95):
    # Generalized Advantage Estimation: accumulate TD errors backwards over the episode
    advantages = []
    advantage = 0
    for i in reversed(range(len(rewards))):
        delta = rewards[i] + gamma * (values[i + 1] if i < len(rewards) - 1 else 0) - values[i]
        advantage = delta + gamma * lambda_ * advantage
        advantages.insert(0, advantage)
    return advantages

3. Actor-Critic Methods

3.1 Basic Principle

Actor-Critic methods combine the strengths of value-function methods and policy gradient methods:

  • Actor: learns the policy and selects actions
  • Critic: learns the value function and evaluates the chosen actions

3.2 Example Implementation

class ValueNetwork(nn.Module):
    def __init__(self, state_dim):
        super(ValueNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

class ActorCritic:
    def __init__(self, state_dim, action_dim, lr_actor=0.0001, lr_critic=0.0005, gamma=0.99):
        self.actor = PolicyNetwork(state_dim, action_dim)
        self.critic = ValueNetwork(state_dim)
        self.optimizer_actor = optim.Adam(self.actor.parameters(), lr=lr_actor)
        self.optimizer_critic = optim.Adam(self.critic.parameters(), lr=lr_critic)
        self.gamma = gamma

    def select_action(self, state):
        state = torch.FloatTensor(state)
        probs = self.actor(state)
        action = torch.multinomial(probs, 1).item()
        return action, probs[action]

    def update(self, states, actions, rewards, next_states, dones):
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones)

        # Compute values and advantages; the bootstrap target is detached so the
        # critic is not trained through the next-state value
        values = self.critic(states).squeeze()
        next_values = self.critic(next_states).squeeze().detach()
        targets = rewards + self.gamma * next_values * (1 - dones)
        advantages = targets - values

        # Update the critic
        critic_loss = advantages.pow(2).mean()
        self.optimizer_critic.zero_grad()
        critic_loss.backward()
        self.optimizer_critic.step()

        # Update the actor (advantages are detached so actor gradients do not flow into the critic)
        probs = self.actor(states)
        log_probs = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze()
        actor_loss = -(log_probs * advantages.detach()).mean()
        self.optimizer_actor.zero_grad()
        actor_loss.backward()
        self.optimizer_actor.step()

4. TRPO and PPO

4.1 TRPO (Trust Region Policy Optimization)

TRPO constrains the size of each policy update so that policy improvement is (approximately) monotonic; the constrained problem it solves is sketched after this list:

  • Constrains the policy update with a KL-divergence bound
  • Solves the resulting optimization problem with the conjugate gradient method
  • Computationally expensive, but stable in practice
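
In the standard formulation, each TRPO update maximizes the importance-weighted advantage subject to a trust-region constraint of size δ:

\max_\theta \ \mathbb{E}_{s,a \sim \pi_{\theta_{old}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}\, \hat{A}(s,a)\right] \quad \text{s.t.} \quad \mathbb{E}_s\!\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta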

4.2 PPO (Proximal Policy Optimization)

PPO is a simplified version of TRPO that constrains the policy update with a clipped surrogate objective (written out after this list):

  • Simple to compute and easy to implement
  • Performance comparable to or better than TRPO
  • Has become the mainstream algorithm in reinforcement learning
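
The clipped objective replaces TRPO's hard KL constraint. With the probability ratio r_t(θ) and clip range ε (eps_clip in the implementation below), it reads:

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}, \qquad L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\hat{A}_t\big)\right]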

4.3 PPO Implementation

class PPO:
    def __init__(self, state_dim, action_dim, lr=0.0003, gamma=0.99,
                 gae_lambda=0.95, eps_clip=0.2, K_epochs=4):
        self.actor = PolicyNetwork(state_dim, action_dim)
        self.critic = ValueNetwork(state_dim)
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr)
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.eps_clip = eps_clip
        self.K_epochs = K_epochs

    def select_action(self, state):
        state = torch.FloatTensor(state)
        probs = self.actor(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob.item()

    def compute_gae(self, rewards, values, next_values, dones):
        # Generalized Advantage Estimation, iterating backwards over the rollout
        advantages = []
        advantage = 0
        for i in reversed(range(len(rewards))):
            delta = rewards[i] + self.gamma * next_values[i] * (1 - dones[i]) - values[i]
            advantage = delta + self.gamma * self.gae_lambda * advantage * (1 - dones[i])
            advantages.insert(0, advantage)
        return advantages

    def update(self, memory):
        states = torch.FloatTensor(memory.states)
        actions = torch.LongTensor(memory.actions)
        old_log_probs = torch.FloatTensor(memory.log_probs)
        rewards = torch.FloatTensor(memory.rewards)
        next_states = torch.FloatTensor(memory.next_states)
        dones = torch.FloatTensor(memory.dones)

        # Compute values and advantages
        values = self.critic(states).squeeze()
        next_values = self.critic(next_states).squeeze()
        advantages = self.compute_gae(rewards.numpy(),
                                      values.detach().numpy(),
                                      next_values.detach().numpy(),
                                      dones.numpy())
        advantages = torch.FloatTensor(advantages)
        # Value targets are built from the unnormalized advantages
        returns = advantages + values.detach()
        # Normalize advantages to stabilize training
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-9)

        # Several optimization epochs on the same rollout
        for _ in range(self.K_epochs):
            # New log-probabilities under the current policy
            probs = self.actor(states)
            dist = torch.distributions.Categorical(probs)
            new_log_probs = dist.log_prob(actions)

            # Probability ratio between new and old policies
            ratio = torch.exp(new_log_probs - old_log_probs)

            # Clipped PPO surrogate objective
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()

            # Critic loss against the fixed return targets
            new_values = self.critic(states).squeeze()
            critic_loss = (new_values - returns).pow(2).mean()

            # Total loss and update
            loss = actor_loss + 0.5 * critic_loss
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()


# Rollout buffer
class Memory:
    def __init__(self):
        self.states = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.next_states = []
        self.dones = []

    def add(self, state, action, log_prob, reward, next_state, done):
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.next_states.append(next_state)
        self.dones.append(done)

    def clear(self):
        self.states = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.next_states = []
        self.dones = []

5. Continuous Action Spaces

5.1 Gaussian Policy

For continuous action spaces, the policy is usually parameterized as a Gaussian distribution:
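
For a diagonal Gaussian policy with a state-dependent mean μ_θ(s) and a learned standard deviation σ (the structure used by ContinuousPolicyNetwork below), the log-probability of a d-dimensional action is the sum over dimensions:

\log \pi_\theta(a \mid s) = \sum_{i=1}^{d}\left[-\frac{(a_i - \mu_{\theta,i}(s))^2}{2\sigma_i^2} - \log \sigma_i - \tfrac{1}{2}\log 2\pi\right]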

class ContinuousPolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(ContinuousPolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, action_dim)
        # State-independent log standard deviation, learned as a free parameter
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        mean = self.fc2(x)
        std = torch.exp(self.log_std)
        return mean, std

class ContinuousPPO:
    def __init__(self, state_dim, action_dim, lr=0.0003, gamma=0.99,
                 gae_lambda=0.95, eps_clip=0.2, K_epochs=4):
        self.actor = ContinuousPolicyNetwork(state_dim, action_dim)
        self.critic = ValueNetwork(state_dim)
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr)
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.eps_clip = eps_clip
        self.K_epochs = K_epochs

    def select_action(self, state):
        state = torch.FloatTensor(state)
        mean, std = self.actor(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        # Sum the per-dimension log-probabilities of the diagonal Gaussian
        log_prob = dist.log_prob(action).sum()
        return action.numpy(), log_prob.item()

    def compute_gae(self, rewards, values, next_values, dones):
        # Same GAE computation as the discrete PPO agent above
        advantages = []
        advantage = 0
        for i in reversed(range(len(rewards))):
            delta = rewards[i] + self.gamma * next_values[i] * (1 - dones[i]) - values[i]
            advantage = delta + self.gamma * self.gae_lambda * advantage * (1 - dones[i])
            advantages.insert(0, advantage)
        return advantages

    def update(self, memory):
        states = torch.FloatTensor(memory.states)
        actions = torch.FloatTensor(memory.actions)
        old_log_probs = torch.FloatTensor(memory.log_probs)
        rewards = torch.FloatTensor(memory.rewards)
        next_states = torch.FloatTensor(memory.next_states)
        dones = torch.FloatTensor(memory.dones)

        # Compute values and advantages
        values = self.critic(states).squeeze()
        next_values = self.critic(next_states).squeeze()
        advantages = self.compute_gae(rewards.numpy(),
                                      values.detach().numpy(),
                                      next_values.detach().numpy(),
                                      dones.numpy())
        advantages = torch.FloatTensor(advantages)
        # Value targets from the unnormalized advantages
        returns = advantages + values.detach()
        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-9)

        # Several optimization epochs on the same rollout
        for _ in range(self.K_epochs):
            # New log-probabilities under the current Gaussian policy
            mean, std = self.actor(states)
            dist = torch.distributions.Normal(mean, std)
            new_log_probs = dist.log_prob(actions).sum(dim=1)

            # Probability ratio
            ratio = torch.exp(new_log_probs - old_log_probs)

            # Clipped PPO objective
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()

            # Critic loss
            new_values = self.critic(states).squeeze()
            critic_loss = (new_values - returns).pow(2).mean()

            # Total loss and update
            loss = actor_loss + 0.5 * critic_loss
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

6. Training Tips

6.1 Hyperparameter Tuning

  • Learning rate: typically between 1e-4 and 3e-4 (an example configuration follows this list)
  • Batch size: adjust to available memory, commonly 1024 or 2048
  • K_epochs: typically 4-10
  • eps_clip: typically 0.1-0.3
  • gamma: typically 0.99
  • gae_lambda: typically 0.95
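
As a reference point only, these ranges can be collected into a configuration dictionary. This is a minimal sketch: the key names and the state_dim/action_dim values (sized for CartPole) are illustrative, and the PPO class is the one from Section 4.3.

# Hypothetical default configuration; values follow the ranges listed above
ppo_config = {
    "lr": 3e-4,            # learning rate
    "batch_size": 2048,    # rollout / batch size
    "K_epochs": 4,         # optimization epochs per rollout
    "eps_clip": 0.2,       # PPO clip range
    "gamma": 0.99,         # discount factor
    "gae_lambda": 0.95,    # GAE lambda
}

agent = PPO(state_dim=4, action_dim=2,
            lr=ppo_config["lr"], gamma=ppo_config["gamma"],
            gae_lambda=ppo_config["gae_lambda"],
            eps_clip=ppo_config["eps_clip"], K_epochs=ppo_config["K_epochs"])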

6.2 Reward Shaping

A well-designed reward function is critical to training performance; a small normalization wrapper is sketched after this list:

  • Sparse rewards: give large rewards for key events
  • Reward shaping: add intermediate rewards to guide learning
  • Normalization: standardize rewards to stabilize training
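
A minimal sketch of reward normalization, assuming a Gym-style environment: the wrapper below (a hypothetical NormalizeReward class) keeps running statistics with Welford's algorithm and rescales each reward by the running standard deviation.

import gym
import numpy as np

class NormalizeReward(gym.RewardWrapper):
    # Hypothetical wrapper: scales rewards by a running estimate of their std
    def __init__(self, env, epsilon=1e-8):
        super().__init__(env)
        self.epsilon = epsilon
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def reward(self, reward):
        # Update the running mean/variance, then rescale the reward
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.epsilon
        return reward / std

env = NormalizeReward(gym.make('CartPole-v1'))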

6.3 Exploration Strategies

  • Noise injection: add noise to the actions
  • ε-greedy: choose a random action with some probability
  • Entropy regularization: add an entropy bonus to encourage exploration (a sketch follows this list)
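
A minimal sketch of entropy regularization: the helper below adds an entropy bonus to a PPO-style loss. The function name and the 0.01 coefficient are illustrative, not part of any library API.

import torch

def ppo_loss_with_entropy(actor_loss, critic_loss, probs, entropy_coef=0.01):
    # Average policy entropy over the batch; subtracting it rewards more-random policies
    dist = torch.distributions.Categorical(probs)
    entropy = dist.entropy().mean()
    return actor_loss + 0.5 * critic_loss - entropy_coef * entropy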

7. Environments and Tools

7.1 OpenAI Gym

OpenAI Gym is the standard environment library for reinforcement learning:

import gym

# Create the environment
env = gym.make('CartPole-v1')

# Reset the environment
state = env.reset()

# Step the environment
for _ in range(1000):
    action = env.action_space.sample()  # random action
    next_state, reward, done, info = env.step(action)
    if done:
        state = env.reset()
    else:
        state = next_state

# Close the environment
env.close()

7.2 Stable Baselines3

Stable Baselines3 is a popular reinforcement learning library that provides implementations of many algorithms:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create the environment
env = make_vec_env('CartPole-v1', n_envs=1)

# Initialize the model
model = PPO('MlpPolicy', env, verbose=1)

# Train the model
model.learn(total_timesteps=10000)

# Test the model
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:
        obs = env.reset()

7.3 RLlib

RLlib is Ray's reinforcement learning library and supports distributed training:

import ray
from ray.rllib.agents.ppo import PPOTrainer
import gym

ray.init()

# Configuration
config = {
    "env": "CartPole-v1",
    "framework": "torch",
    "num_workers": 2,
    "num_gpus": 0,
    "lr": 5e-4,
    "gamma": 0.99,
    "lambda": 0.95,
    "clip_param": 0.2,
    "kl_coeff": 0.1,
    "num_sgd_iter": 4,
    "sgd_minibatch_size": 128,
    "train_batch_size": 1024,
}

# Train
trainer = PPOTrainer(config=config)
for i in range(10):
    result = trainer.train()
    print(f"Iteration {i}, reward: {result['episode_reward_mean']}")

# Evaluate the trained policy on a fresh environment
env = gym.make(config["env"])
obs = env.reset()
done = False
reward_total = 0
while not done:
    action = trainer.compute_action(obs)
    obs, reward, done, info = env.step(action)
    reward_total += reward
print(f"Total reward: {reward_total}")

8. Practical Application Examples

8.1 Robot Control

import gym
from stable_baselines3 import PPO

# Create the environment
env = gym.make('BipedalWalker-v3')

# Initialize the model
model = PPO('MlpPolicy', env, verbose=1)

# Train the model
model.learn(total_timesteps=1000000)

# Save the model
model.save("bipedal_walker_ppo")

# Load the model
model = PPO.load("bipedal_walker_ppo")

# Test
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:
        obs = env.reset()

8.2 Game AI

import gym
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Create the environment
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, SIMPLE_MOVEMENT)
env = DummyVecEnv([lambda: env])

# Initialize the model
model = PPO('CnnPolicy', env, verbose=1)

# Train the model
model.learn(total_timesteps=1000000)

# Test
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:
        obs = env.reset()

8.3 Financial Trading

import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO

class TradingEnv(gym.Env):
    def __init__(self, data):
        super(TradingEnv, self).__init__()
        self.data = data
        self.current_step = 0
        self.balance = 10000
        self.shares = 0
        # Action space: 0 = hold, 1 = buy, 2 = sell
        self.action_space = spaces.Discrete(3)
        # Observation space: price, balance, shares held
        self.observation_space = spaces.Box(
            low=0, high=np.inf, shape=(3,), dtype=np.float32
        )

    def reset(self):
        self.current_step = 0
        self.balance = 10000
        self.shares = 0
        return self._get_observation()

    def _get_observation(self):
        return np.array([
            self.data[self.current_step],
            self.balance,
            self.shares
        ], dtype=np.float32)

    def step(self, action):
        current_price = self.data[self.current_step]

        # Execute the action
        if action == 1:  # buy
            if self.balance > current_price:
                self.shares += 1
                self.balance -= current_price
        elif action == 2:  # sell
            if self.shares > 0:
                self.shares -= 1
                self.balance += current_price

        # Advance to the next step
        self.current_step += 1
        done = self.current_step >= len(self.data) - 1

        # Reward: profit relative to the initial capital
        total_asset = self.balance + self.shares * current_price
        reward = total_asset - 10000

        return self._get_observation(), reward, done, {}

# Create the environment with simulated price data
data = np.random.randn(1000) + 100
env = TradingEnv(data)

# Initialize and train the model
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

# Test
def test_trading_agent(model, env):
    obs = env.reset()
    total_reward = 0
    while True:
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    print(f"Total reward: {total_reward}")
    print(f"Final balance: {env.balance}")
    print(f"Final shares: {env.shares}")

test_trading_agent(model, env)

9. Challenges and Solutions

9.1 Common Challenges

  • Low sample efficiency: large amounts of interaction data are needed
  • Sparse rewards: complex tasks are hard to learn from rare feedback
  • Instability: the training process can be unstable
  • Hyperparameter sensitivity: performance depends strongly on hyperparameters
  • Exploration-exploitation dilemma: balancing exploration against exploitation

9.2 Solutions

  • Experience replay: reuse past experience
  • Prioritized experience replay: learn preferentially from important experience
  • Curiosity-driven exploration: exploration driven by intrinsic curiosity
  • Curriculum learning: start with simple tasks and gradually increase difficulty
  • Imitation learning: incorporate expert demonstrations
  • Model-based prediction: use a learned model to predict future states

10. Conclusion

Reinforcement learning is a powerful machine learning paradigm, and the progression from Policy Gradient to PPO has made it possible to tackle increasingly complex tasks. As one of today's most popular algorithms, PPO has become the default choice for many reinforcement learning applications thanks to its simplicity and effectiveness.

Key Takeaways

  1. Policy gradients: optimize the policy directly; well suited to continuous action spaces
  2. Actor-Critic: combines the strengths of value functions and policy gradients
  3. PPO: constrains policy updates with a clipped objective, balancing stability and sample efficiency
  4. Continuous actions: parameterize the policy with a Gaussian distribution
  5. Training tips: hyperparameter tuning, reward shaping, exploration strategies
  6. Tooling ecosystem: OpenAI Gym, Stable Baselines3, RLlib

Future Directions

  • Multi-agent reinforcement learning: interaction and cooperation among multiple agents
  • Hierarchical reinforcement learning: learning hierarchical policies
  • Model-based reinforcement learning: using learned models to improve sample efficiency
  • Offline reinforcement learning: learning from static datasets
  • Multi-task reinforcement learning: transferring knowledge across tasks
  • Interpretability: making reinforcement learning models easier to interpret

With continued research and practice, reinforcement learning will play an increasingly important role in areas such as robot control, game AI, financial trading, and autonomous driving, providing a powerful tool for complex decision-making problems.
