Preface: This tutorial is designed to help robotics beginners get up to speed with MuJoCo and reinforcement learning. It progresses step by step, from environment setup, to simple motion control, to autonomous exploration with reinforcement learning, gradually raising the difficulty so that beginners can build a learning path and a mental framework, and then innovate on top of it and explore custom functionality. Reference code for every project is provided (https://github.com/gimmyy/Tutorial_Mujoco_RL/tree/main); combining theory with practice and doing your own secondary development is encouraged. Reference links are listed at the end.
Note: I use Windows 10. Because of OS differences, some code may fail if copied verbatim, so adapt it to your own operating system. All training runs on CPU only, which is very friendly to beginners without a GPU or with limited compute.
I. MuJoCo Installation and Environment Setup
There are old and new versions of the MuJoCo package. Many tutorials you will find online still describe installing the old MuJoCo 200/210, which requires importing a license key and manually configuring the install location, and can be misleading. In fact, since 2022 DeepMind has been releasing new versions of MuJoCo with official Python bindings; the old MuJoCo 2.x toolchain (with the mujoco-py bindings formerly maintained by OpenAI) is no longer updated. We will not install the old version. The new version is a standard Python package and can be installed automatically with just a couple of commands.
1. Create and activate a conda virtual environment
```bash
conda create -n orca python=3.11
conda activate orca
```

2. Install the MuJoCo package
```bash
pip install mujoco
```

pip automatically installs the mujoco build matching your Python version.
3. Test
Type python at the command line to enter the Python interpreter:
```bash
python
```

Then enter the following commands line by line:
```python
import mujoco
print(mujoco.__version__)
```

The printed version here is 3.6.0, confirming the installation succeeded.
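Beyond printing the version, a quick way to confirm the physics engine itself works is to load a tiny model from an XML string and step it a few times. This is a minimal sketch using only the official mujoco bindings; the MJCF string is a made-up free-falling box, not a model from any of the later projects.

```python
import mujoco

# A minimal hand-written MJCF: one free-falling box above a ground plane
xml = """
<mujoco>
  <worldbody>
    <geom type="plane" size="1 1 0.1"/>
    <body pos="0 0 0.5">
      <freejoint/>
      <geom type="box" size="0.05 0.05 0.05"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(xml)
data = mujoco.MjData(model)

# Step the simulation for 100 timesteps and watch the box fall under gravity
for _ in range(100):
    mujoco.mj_step(model, data)
print("box height after 100 steps:", data.qpos[2])
```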
II. Practice 1: Loading a Robot Model in MuJoCo for Simple Control
Learning points: building a MuJoCo project, modifying and importing a robot model file (MJCF), the simulation workflow, and visualizing the results
This section follows the blog post https://blog.csdn.net/Vint_LU/article/details/147948556
The project uses a UR5e arm to push an object placed at a fixed position.
Reproducing that article walks you through the complete loop: building a MuJoCo project, modifying the MJCF robot model file, importing the model, running the simulation, and visualizing the results.
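As a reference for what that closed loop looks like in code, here is a minimal sketch of the pattern used throughout the later projects: load an MJCF file, write actuator commands, step the physics, and keep a passive viewer in sync. The file path and the constant control value are placeholders, not the actual UR5e scene from the referenced post.

```python
import time
import mujoco
import mujoco.viewer

# Placeholder path: replace with your own MJCF scene (e.g. the ur5e scene from the referenced post)
model = mujoco.MjModel.from_xml_path("./model/ur5e/scene.xml")
data = mujoco.MjData(model)

with mujoco.viewer.launch_passive(model, data) as viewer:
    while viewer.is_running():
        step_start = time.time()
        data.ctrl[:] = 0.0          # placeholder command written to every actuator
        mujoco.mj_step(model, data)
        viewer.sync()
        # Sleep so the simulation runs roughly in real time
        time_until_next = model.opt.timestep - (time.time() - step_start)
        if time_until_next > 0:
            time.sleep(time_until_next)
```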
III. Practice 2: MuJoCo + RL Inverted Pendulum
Learning points: setting up a reinforcement learning training framework, adapting it to the MuJoCo physics environment, and training an RL agent
This section follows the official gymnasium tutorial https://gymnasium.org.cn/tutorials/training_agents/mujoco_reinforce/
The official code does not visualize the simulation, so I added a visualization module; see https://github.com/gimmyy/Tutorial_Mujoco_RL/tree/main/2tutorial_InvertedPendulum
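If you just want an on-screen view without an extra module, one common approach (a sketch, assuming the standard gymnasium rendering mechanism) is to create the environment with render_mode="human" so MuJoCo's window opens automatically while stepping:

```python
import gymnasium as gym

# render_mode="human" opens MuJoCo's interactive window; omit it during training for speed
env = gym.make("InvertedPendulum-v4", render_mode="human")
obs, info = env.reset(seed=42)
for _ in range(200):
    action = env.action_space.sample()   # random action, just to see the pendulum move
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```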
The project uses reinforcement learning to control the InvertedPendulum in the MuJoCo physics simulator.
The reinforcement learning algorithm is the policy gradient (REINFORCE), with a custom model Policy_Network.
```python
import torch
import torch.nn as nn


class Policy_Network(nn.Module):
    """Parametrized Policy Network."""

    def __init__(self, obs_space_dims: int, action_space_dims: int):
        """Initializes a neural network that estimates the mean and standard deviation
        of a normal distribution from which an action is sampled from.

        Args:
            obs_space_dims: Dimension of the observation space
            action_space_dims: Dimension of the action space
        """
        super().__init__()

        hidden_space1 = 16  # Nothing special with 16, feel free to change
        hidden_space2 = 32  # Nothing special with 32, feel free to change

        # Shared Network
        self.shared_net = nn.Sequential(
            nn.Linear(obs_space_dims, hidden_space1),
            nn.Tanh(),
            nn.Linear(hidden_space1, hidden_space2),
            nn.Tanh(),
        )

        # Policy Mean specific Linear Layer
        self.policy_mean_net = nn.Sequential(
            nn.Linear(hidden_space2, action_space_dims)
        )

        # Policy Std Dev specific Linear Layer
        self.policy_stddev_net = nn.Sequential(
            nn.Linear(hidden_space2, action_space_dims)
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """Conditioned on the observation, returns the mean and standard deviation
        of a normal distribution from which an action is sampled from.

        Args:
            x: Observation from the environment

        Returns:
            action_means: predicted mean of the normal distribution
            action_stddevs: predicted standard deviation of the normal distribution
        """
        shared_features = self.shared_net(x.float())

        action_means = self.policy_mean_net(shared_features)
        action_stddevs = torch.log(
            1 + torch.exp(self.policy_stddev_net(shared_features))
        )

        return action_means, action_stddevs
```

The InvertedPendulum-v4 environment wrapper is already built into the gym library.
env = gym.make("InvertedPendulum-v4") wrapped_env = gym.wrappers.RecordEpisodeStatistics(env, 50) # Records episode-rewardstep函数、奖励函数也不需要自己写,官方为InvertedPendulum-v4提供了现成的函数,直接调用即可。
```python
obs, reward, terminated, truncated, info = wrapped_env.step(action)
```

Run the code and you can watch the pendulum's state in real time in the MuJoCo simulation window.
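To make the connection between Policy_Network and the environment concrete, here is a condensed sketch of a REINFORCE-style update in the spirit of the official tutorial. The learning rate, episode count and the small epsilon on the standard deviation are illustrative choices, not the tutorial's exact settings; see the repository above for the full agent.

```python
import numpy as np
import torch
from torch.distributions import Normal

obs_dims = wrapped_env.observation_space.shape[0]   # 4 for InvertedPendulum-v4
act_dims = wrapped_env.action_space.shape[0]        # 1
policy = Policy_Network(obs_dims, act_dims)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)
gamma = 0.99

for episode in range(1000):
    obs, info = wrapped_env.reset(seed=episode)
    log_probs, rewards, done = [], [], False
    while not done:
        mean, std = policy(torch.tensor(np.asarray(obs)))
        dist = Normal(mean, std + 1e-6)               # epsilon keeps std strictly positive
        action = dist.sample()
        log_probs.append(dist.log_prob(action).sum())
        obs, reward, terminated, truncated, info = wrapped_env.step(action.numpy())
        rewards.append(float(reward))
        done = terminated or truncated
    # Discounted returns, then the REINFORCE loss: -sum(log_prob * return)
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -torch.stack([lp * ret for lp, ret in zip(log_probs, returns)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```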
We find that gymnasium's MuJoCo integration is excellent, which makes reinforcement learning for more complex control tasks later on much more convenient.
IV. Practice 3: MuJoCo + RL for Obstacle-Avoiding Path Planning
Learning points: writing a custom robot environment class, designing the reward function, and using the reinforcement learning models packaged in stable_baselines3
This section follows https://github.com/LitchiCheng/mujoco-learning
| rl_panda_reach_target_high_profile.py | MuJoCo simulation + PPO reinforcement learning: plan a path for the arm's end effector to reach a target position |
The project uses PPO to steer the arm's end effector around an obstacle to a target position. Unlike the previous section, thanks to stable_baselines3 the reinforcement learning model no longer has to be hand-written: you instantiate it directly from the library and change its architecture and hyperparameters through arguments. What you still write yourself is the env environment class and the reward function, designed around the control objective. This lets you focus on the robot-control side instead of agonizing over the design of the RL model itself.
The environment class inherits from gym.Env and matches the model interface provided by stable_baselines3, so it can be passed directly as the env argument when the PPO model is created.
```python
class PandaObstacleEnv(gym.Env):
    def __init__(self, visualize: bool = False):
        super(PandaObstacleEnv, self).__init__()
        # check_flag_file / write_flag_file are utility functions defined elsewhere in the referenced repository
        if not check_flag_file():
            write_flag_file()
            self.visualize = visualize
        else:
            self.visualize = False
        self.handle = None
        self.model = mujoco.MjModel.from_xml_path('./model/franka_emika_panda/scene_pos_with_obstacles.xml')
        self.data = mujoco.MjData(self.model)
        if self.visualize:
            self.handle = mujoco.viewer.launch_passive(self.model, self.data)
            self.handle.cam.distance = 3.0
            self.handle.cam.azimuth = 0.0
            self.handle.cam.elevation = -30.0
            self.handle.cam.lookat = np.array([0.2, 0.0, 0.4])
        self.np_random = np.random.default_rng(None)
        self.end_effector_id = mujoco.mj_name2id(self.model, mujoco.mjtObj.mjOBJ_BODY, 'ee_center_body')
        self.home_joint_pos = np.array(self.model.key_qpos[0][:7], dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(7,), dtype=np.float32)
        # 7 joint angles, goal position, obstacle position and sphere diameter
        self.obs_size = 7 + 3 + 3 + 1
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(self.obs_size,), dtype=np.float32)
        # self.goal_position = np.array([0.4, 0.3, 0.4], dtype=np.float32)
        self.goal_position = np.array([0.4, -0.3, 0.4], dtype=np.float32)
        self.goal_arrival_threshold = 0.005
        self.goal_visu_size = 0.02
        self.goal_visu_rgba = [0.1, 0.3, 0.3, 0.8]
        for i in range(self.model.ngeom):
            name = mujoco.mj_id2name(self.model, mujoco.mjtObj.mjOBJ_GEOM, i)
            if name == "obstacle_0":
                self.obstacle_id_1 = i
                self.obstacle_position = np.array(self.model.geom_pos[self.obstacle_id_1], dtype=np.float32)
                self.obstacle_size = self.model.geom_size[self.obstacle_id_1][0]
        self.last_action = self.home_joint_pos

    def _render_scene(self) -> None:
        if not self.visualize or self.handle is None:
            return
        self.handle.user_scn.ngeom = 0
        total_geoms = 1
        self.handle.user_scn.ngeom = total_geoms
        mujoco.mjv_initGeom(
            self.handle.user_scn.geoms[0],
            mujoco.mjtGeom.mjGEOM_SPHERE,
            size=[self.goal_visu_size, 0.0, 0.0],
            pos=self.goal_position,
            mat=np.eye(3).flatten(),
            rgba=np.array(self.goal_visu_rgba, dtype=np.float32)
        )

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None) -> tuple[np.ndarray, dict]:
        super().reset(seed=seed)
        if seed is not None:
            self.np_random = np.random.default_rng(seed)
        # Reset the joints to the home pose
        mujoco.mj_resetData(self.model, self.data)
        self.data.qpos[:7] = self.home_joint_pos
        self.data.qpos[7:] = [0.04, 0.04]
        mujoco.mj_forward(self.model, self.data)
        self.goal_position = np.array([self.goal_position[0], self.np_random.uniform(-0.3, 0.3), self.goal_position[2]], dtype=np.float32)
        self.obstacle_position = np.array([self.obstacle_position[0], self.np_random.uniform(-0.3, 0.3), self.obstacle_position[2]], dtype=np.float32)
        self.model.geom_pos[self.obstacle_id_1] = self.obstacle_position
        mujoco.mj_step(self.model, self.data)
        if self.visualize:
            self._render_scene()
        obs = self._get_observation()
        self.start_t = time.time()
        return obs, {}

    def _get_observation(self) -> np.ndarray:
        joint_pos = self.data.qpos[:7].copy().astype(np.float32)
        return np.concatenate([joint_pos,
                               self.goal_position,
                               self.obstacle_position + np.random.normal(0, 0.001, size=3),
                               np.array([self.obstacle_size], dtype=np.float32)])

    def _calc_reward(self, joint_angles: np.ndarray, action: np.ndarray) -> tuple[float, float, bool]:
        now_ee_pos = self.data.body(self.end_effector_id).xpos.copy()
        dist_to_goal = np.linalg.norm(now_ee_pos - self.goal_position)
        # Non-linear distance reward
        if dist_to_goal < self.goal_arrival_threshold:
            distance_reward = 20.0 * (1.0 + (1.0 - (dist_to_goal / self.goal_arrival_threshold)))
        elif dist_to_goal < 2 * self.goal_arrival_threshold:
            distance_reward = 10.0 * (1.0 + (1.0 - (dist_to_goal / 2 * self.goal_arrival_threshold)))
        elif dist_to_goal < 3 * self.goal_arrival_threshold:
            distance_reward = 5.0 * (1.0 + (1.0 - (dist_to_goal / 3 * self.goal_arrival_threshold)))
        else:
            distance_reward = 1.0 / (1.0 + dist_to_goal)
        # Action smoothness penalty
        smooth_penalty = 0.001 * np.linalg.norm(action - self.last_action)
        # Collision penalty
        contact_reward = 10.0 * self.data.ncon
        # Joint-limit penalty
        joint_penalty = 0.0
        for i in range(7):
            min_angle, max_angle = self.model.jnt_range[:7][i]
            if joint_angles[i] < min_angle:
                joint_penalty += 0.5 * (min_angle - joint_angles[i])
            elif joint_angles[i] > max_angle:
                joint_penalty += 0.5 * (joint_angles[i] - max_angle)
        time_penalty = 0.001 * (time.time() - self.start_t)
        total_reward = (distance_reward - contact_reward - smooth_penalty - joint_penalty - time_penalty)
        self.last_action = action.copy()
        return total_reward, dist_to_goal, self.data.ncon > 0

    def step(self, action: np.ndarray) -> tuple[np.ndarray, np.float32, bool, bool, dict]:
        joint_ranges = self.model.jnt_range[:7]
        scaled_action = np.zeros(7, dtype=np.float32)
        for i in range(7):
            scaled_action[i] = joint_ranges[i][0] + (action[i] + 1) * 0.5 * (joint_ranges[i][1] - joint_ranges[i][0])
        self.data.ctrl[:7] = scaled_action
        self.data.qpos[7:] = [0.04, 0.04]
        mujoco.mj_step(self.model, self.data)
        reward, dist_to_goal, collision = self._calc_reward(self.data.qpos[:7], action)
        terminated = False
        if collision:
            reward -= 10.0
            terminated = True
        if dist_to_goal < self.goal_arrival_threshold:
            terminated = True
            print(f"[Success] distance to goal: {dist_to_goal:.3f}, reward: {reward:.3f}")
        if not terminated:
            if time.time() - self.start_t > 20.0:
                reward -= 10.0
                print("[Timeout] took too long, applying reward penalty")
                terminated = True
        if self.visualize and self.handle is not None:
            self.handle.sync()
            time.sleep(0.01)
        obs = self._get_observation()
        info = {
            'is_success': not collision and terminated and (dist_to_goal < self.goal_arrival_threshold),
            'distance_to_goal': dist_to_goal,
            'collision': collision
        }
        return obs, reward.astype(np.float32), terminated, False, info

    def seed(self, seed: Optional[int] = None) -> list[Optional[int]]:
        self.np_random = np.random.default_rng(seed)
        return [seed]

    def close(self) -> None:
        if self.visualize and self.handle is not None:
            self.handle.close()
            self.handle = None
```
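Before handing a hand-written environment like this to PPO, it can save a lot of debugging time to run stable_baselines3's environment checker on it. This is a small optional sketch, not part of the referenced repository:

```python
from stable_baselines3.common.env_checker import check_env

# Instantiates the environment once and verifies that reset()/step() return types,
# observation/action space shapes and dtypes match the Gym API that PPO expects.
env = PandaObstacleEnv(visualize=False)
check_env(env, warn=True)
env.close()
```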
The PPO model is instantiated directly from the library, with the hyperparameters set as arguments.

```python
env = make_vec_env(
    env_id=lambda: PandaObstacleEnv(**ENV_KWARGS),
    n_envs=n_envs,
    seed=42,
    vec_env_cls=SubprocVecEnv,
    vec_env_kwargs={"start_method": "spawn"}  # "fork"
)
model = PPO(
    policy="MlpPolicy",
    env=env,
    policy_kwargs=POLICY_KWARGS,
    verbose=1,
    n_steps=2048,
    batch_size=2048,
    n_epochs=10,
    gamma=0.99,
    # ent_coef=0.02,  # a larger entropy coefficient keeps late-stage exploration and improves generalization
    ent_coef=0.001,
    clip_range=0.15,          # limit the size of each policy update
    max_grad_norm=0.5,        # gradient clipping to prevent exploding gradients
    learning_rate=lambda f: 1e-4 * (1 - f),  # linear learning-rate decay (starts at 1e-4, lowered over training)
    device="cuda" if torch.cuda.is_available() else "cpu",
    tensorboard_log="./tensorboard/panda_obstacle_avoidance/"
)
```

Testing the trained model: the small sphere near the end effector marks the goal position, the large grey sphere is the obstacle, and the end effector moves around the obstacle to reach the goal.
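The training run and the test rollout that produce that behaviour follow the standard stable_baselines3 pattern. In this sketch the total_timesteps value and the save path are illustrative, not the settings used in the referenced repository:

```python
# Train, save, then roll out the policy in a single visualized environment
model.learn(total_timesteps=1_000_000)
model.save("ppo_panda_obstacle")

test_env = PandaObstacleEnv(visualize=True)
obs, info = test_env.reset()
for _ in range(2000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = test_env.step(action)
    if terminated or truncated:
        obs, info = test_env.reset()
test_env.close()
```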
V. Practice 4: MuJoCo + OrcaHand Cube Flipping
Learning points: understanding the orcahand model files, building a custom scene and training framework
This section follows the orcahand open-source GitHub repository https://github.com/orcahand/orca_sim. The authors only provide a random policy for the cube_orientation task; building on that task framework, we train a reinforcement learning model so that the dexterous hand flips the cube.
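The upstream repository's random policy is essentially a sanity-check rollout. A sketch of what such a baseline looks like, assuming the standard gymnasium reset/step signatures, is shown below; the class OrcaHandRightCubeOrientation is introduced in the next step, and its constructor arguments are omitted here and may differ from the actual repository:

```python
# Random-policy baseline: sample actions uniformly from the action space.
# Any learned policy should clearly beat the episode returns printed here.
env = OrcaHandRightCubeOrientation()
obs, info = env.reset(seed=0)
episode_return, steps = 0.0, 0
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    steps += 1
    if terminated or truncated:
        print(f"episode return {episode_return:.2f} over {steps} steps")
        obs, info = env.reset()
        episode_return, steps = 0.0, 0
env.close()
```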
First, define the environment classes. Because there is a lot of code, only the key points are listed here. Start with the parent class BaseOrcaHandEnv, which inherits from gym.Env.
```python
class BaseOrcaHandEnv(gym.Env[np.ndarray, np.ndarray]):
```

Then define the cube-flipping task as a subclass of BaseOrcaHandEnv.
```python
class OrcaHandRightCubeOrientation(BaseOrcaHandEnv):
```

This class defines the observation and the reward function.
```python
def _get_obs(self) -> np.ndarray:
    base = super()._get_obs()
    if not hasattr(self, "_cube_qpos_adr"):
        return base
    return np.concatenate([base,
                           self._cube_red_face_world_normal(),
                           np.array([self._red_face_up_alignment()], dtype=np.float64)])

def _get_reward(self) -> float:
    align = 0.5 * (self._red_face_up_alignment() + 1)
    lift = np.clip(self.data.xpos[self._cube_body_id, 2] - 0.12, 0, 0.12) / 0.12
    drop = 1.0 if self._cube_dropped() else 0.0
    return float(align + 0.1 * lift - drop)
```
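The reward relies on the cube helpers _red_face_up_alignment() and _cube_red_face_world_normal(); their exact implementation in the repository is not shown here, but conceptually they reduce to rotating the red face's local normal into the world frame and comparing it with the world z-axis. A hedged sketch of that idea, assuming the red face's local normal is +x and using MuJoCo's per-body rotation matrix:

```python
def _cube_red_face_world_normal(self) -> np.ndarray:
    # Rotate the face's local normal (assumed +x here) into the world frame
    rot = self.data.xmat[self._cube_body_id].reshape(3, 3)
    return rot @ np.array([1.0, 0.0, 0.0])

def _red_face_up_alignment(self) -> float:
    # Cosine between the red-face normal and the world z-axis:
    # +1 when the red face points straight up, -1 when it points straight down
    return float(self._cube_red_face_world_normal() @ np.array([0.0, 0.0, 1.0]))
```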
The PPO model used for training:

```python
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(activation_fn=nn.LeakyReLU, net_arch=dict(pi=[256, 128], vf=[256, 128])),
    learning_rate=3e-4,
    batch_size=256,
    n_steps=1024,
    gamma=0.99,
    verbose=1
)
```

Training result: the fingers keep moving so that the red face of the cube ends up facing upward.
With all of the above, you should now have a working mental framework for MuJoCo + RL; from here you can train and tune on your own, based on your target task and choice of robot!
References
[1] https://blog.csdn.net/Vint_LU/article/details/147948556
[2] https://gymnasium.org.cn/tutorials/training_agents/mujoco_reinforce/
[3] https://github.com/LitchiCheng/mujoco-learning
[4] https://github.com/orcahand/orca_sim