Reinforcement Learning Project 7: LunarLanderContinuous-v3 (DDPG)
Environment
This project uses the continuous-action version of the classic LunarLander environment from OpenAI Gym / Gymnasium. Unlike the discrete version covered in the PPO notes, the action space here consists of continuous values.
Official documentation: https://gymnasium.farama.org/environments/box2d/lunar_lander/
Action Space (Continuous)
The action is a 2-dimensional vector $a \in [-1, 1]^2$ (see the short example after this list):
- Main engine (first component):
  - $-1 \sim 0$: engine off
  - $0 \sim +1$: engine on; larger values give more thrust (throttle scales from 50% to 100% power)
- Side engines (second component):
  - $-1 \sim -0.5$: right thruster fires (pushes the lander to the left)
  - $-0.5 \sim 0.5$: both off
  - $0.5 \sim 1$: left thruster fires (pushes the lander to the right)
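For intuition, the short sketch below (not from the original post) steps the environment once with a hand-picked action; the specific values are illustrative only.

```python
import gymnasium as gym
import numpy as np

env = gym.make("LunarLander-v3", continuous=True)
state, _ = env.reset(seed=0)

# Illustrative action: main engine at high throttle, side thrusters off
action = np.array([0.8, 0.0], dtype=np.float32)
next_state, reward, terminated, truncated, info = env.step(action)
```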
State Vector
Same as the discrete version, the state has dimension 8:
$$s = [x, y, \dot{x}, \dot{y}, \theta, \dot{\theta}, l, r]^T$$
Reward Function
- The logic is essentially the same as the discrete version (reward for approaching the landing pad, penalty for crashing, etc.).
- Difference: in the continuous version, the fuel penalty is computed from the continuous action values, which encourages precise throttle control rather than rapidly toggling the engines on and off.
Setting Up the Environment
Note that continuous=True must be specified.
```python
import gymnasium as gym

# continuous=True must be specified
env = gym.make("LunarLander-v3", continuous=True, render_mode="human")

state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])  # typically 1.0
```
The DDPG Algorithm
DDPG (Deep Deterministic Policy Gradient) is an Actor-Critic algorithm designed for continuous action spaces. It combines ideas from DQN (experience replay and target networks) with the deterministic policy gradient.
Core Components
- Actor network ($\mu$): takes the state $s$ as input and directly outputs a deterministic action $a$.
- Critic network ($Q$): takes the state $s$ and action $a$ as input and outputs the action value $Q(s, a)$.
- Target networks ($\mu'$ and $Q'$): used to compute the TD target and keep training stable.
- Replay buffer: stores $(s, a, r, s', done)$ transitions and breaks temporal correlation in the data (a minimal sketch of a compatible buffer follows this list).
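The ReplayBuffer used by the DDPG class is not shown in this post. Below is a minimal sketch compatible with how it is called later (constructor ReplayBuffer(state_dim, action_dim, capacity), an add() method, a size attribute, and sample(batch_size, device=...) returning tensors); the internal layout is an assumption.

```python
import numpy as np
import torch


class ReplayBuffer:
    """Fixed-size circular buffer storing (s, a, r, s', done) transitions."""
    def __init__(self, state_dim, action_dim, capacity):
        self.capacity = capacity
        self.ptr = 0   # next write position
        self.size = 0  # current number of stored transitions
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros((capacity, action_dim), dtype=np.float32)
        self.rewards = np.zeros((capacity, 1), dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones = np.zeros((capacity, 1), dtype=np.float32)

    def add(self, state, action, reward, next_state, done):
        self.states[self.ptr] = state
        self.actions[self.ptr] = action
        self.rewards[self.ptr] = reward
        self.next_states[self.ptr] = next_state
        self.dones[self.ptr] = float(done)
        self.ptr = (self.ptr + 1) % self.capacity  # overwrite oldest when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, device='cpu'):
        idx = np.random.randint(0, self.size, size=batch_size)
        to_tensor = lambda x: torch.as_tensor(x[idx], device=device)
        return (to_tensor(self.states), to_tensor(self.actions), to_tensor(self.rewards),
                to_tensor(self.next_states), to_tensor(self.dones))
```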
Loss Functions
1. Critic Loss (Value Loss)
The Critic's objective is to minimize the mean squared error between the predicted Q value and the TD target:
$$L = \frac{1}{N} \sum \left( y_i - Q(s_i, a_i | \theta^Q) \right)^2$$
where the target value $y_i$ is computed by the target networks:
$$y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} | \theta^{\mu'}) | \theta^{Q'}) \cdot (1 - d_i)$$
2. Actor Loss (Policy Loss)
The Actor's objective is to maximize the Critic's score for the actions it outputs. Under gradient descent, this is implemented by minimizing the negative Q value:
$$J(\theta^\mu) = -\frac{1}{N} \sum Q(s_i, \mu(s_i | \theta^\mu) | \theta^Q)$$
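To make the mapping from these two formulas to code explicit, here is a hypothetical helper (not part of the original code) that mirrors the train() method of the full DDPG class shown later:

```python
import torch
import torch.nn.functional as F


def ddpg_losses(actor, critic, target_actor, target_critic, batch, gamma=0.99):
    """Compute the critic (MSE to TD target) and actor (-Q) losses for one minibatch."""
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) * (1 - d_i)
        next_actions = target_actor(next_states)
        td_targets = rewards + gamma * target_critic(next_states, next_actions) * (1 - dones)
    critic_loss = F.mse_loss(critic(states, actions), td_targets)  # value loss
    actor_loss = -critic(states, actor(states)).mean()             # policy loss (negated Q)
    return critic_loss, actor_loss
```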
Exploration Strategy
Because DDPG uses a deterministic policy, noise is added to the action during training so that the agent explores the environment:
$$a_{\text{exec}} = \text{clip}\left(\mu(s) + \mathcal{N},\ -a_{\max},\ a_{\max}\right)$$
This project uses Gaussian noise, whose scale is decayed as training progresses.
Gaussian noise code (generated by gemini3pro):
```python
import numpy as np


class GaussianNoise:
    def __init__(self, action_dim, sigma=0.1):
        self.action_dim = action_dim
        self.sigma = sigma  # standard deviation, controls the noise magnitude

    def sample(self):
        # Draw standard normal noise scaled by sigma
        return np.random.normal(0, self.sigma, size=self.action_dim)
```
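A minimal usage sketch (not from the original post) showing how this noise plugs into the clipping formula above; mu_s is a placeholder for the actor's output.

```python
import numpy as np

noise = GaussianNoise(action_dim=2, sigma=0.1)
mu_s = np.array([0.3, -0.7])  # placeholder for the actor's deterministic output mu(s)
a_exec = np.clip(mu_s + noise.sample(), -1.0, 1.0)  # clip to [-a_max, a_max]
```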
Soft Update
Unlike DQN's hard target-network update, DDPG uses a soft update to slowly move the target network parameters toward the online network parameters:
$$\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$$
where $\tau$ is typically very small (e.g., 0.005; in my tests, $\tau = 0.01$ trained poorly in this environment).
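As a standalone illustration (the DDPG class below inlines the same loop), the soft update can be sketched as a small helper over two networks with identically ordered parameters:

```python
def soft_update(online_net, target_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter pair."""
    for param, target_param in zip(online_net.parameters(), target_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
```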
Code Implementation
Model Definition (Actor & Critic)
Note that the Critic outputs $Q(s, a)$, so its input layer has state_dim + action_dim units, the sum of the state and action dimensions.
```python
import torch
from torch import nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action, hidden_dim=256):
        super(Actor, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        # tanh output in [-1, 1], rescaled to [-max_action, max_action]
        return self.net(state) * self.max_action

    def act(self, state):
        return self.forward(state)


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super(Critic, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action):
        # Q(s, a): concatenate state and action along the feature dimension
        return self.net(torch.cat([state, action], dim=1))
```
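As a quick sanity check (not part of the original post), the sketch below instantiates both networks with the LunarLander dimensions and verifies the output shapes; the batch size of 4 is arbitrary.

```python
import torch

# Hypothetical smoke test: state_dim=8, action_dim=2, max_action=1.0 for LunarLanderContinuous
actor = Actor(state_dim=8, action_dim=2, max_action=1.0)
critic = Critic(state_dim=8, action_dim=2)

states = torch.randn(4, 8)          # batch of 4 states
actions = actor(states)             # -> shape (4, 2), bounded by tanh * max_action
q_values = critic(states, actions)  # -> shape (4, 1)
print(actions.shape, q_values.shape)
```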
Full DDPG Class
```python
import torch
import torch.nn.functional as F
import torch.optim as optim


class DDPG:
    def __init__(self, state_dim, action_dim, max_action,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 hidden_dim=256, batch_size=256, gamma=0.99, tau=0.001,
                 replay_buffer_size=5000, actor_lr=3e-4, critic_lr=3e-4):
        self.device = device
        self.batch_size = batch_size
        self.gamma = gamma
        self.tau = tau
        self.max_action = max_action
        self.replay_buffer = ReplayBuffer(state_dim, action_dim, replay_buffer_size)

        self.actor = Actor(state_dim, action_dim, max_action, hidden_dim).to(self.device)
        self.target_actor = Actor(state_dim, action_dim, max_action, hidden_dim).to(self.device)
        self.target_actor.load_state_dict(self.actor.state_dict())

        self.critic = Critic(state_dim, action_dim, hidden_dim).to(self.device)
        self.target_critic = Critic(state_dim, action_dim, hidden_dim).to(self.device)
        self.target_critic.load_state_dict(self.critic.state_dict())

        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=critic_lr)

    def act(self, state):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        with torch.no_grad():
            action = self.actor(state)
        return action.cpu().numpy().flatten()

    def store_transition(self, state, action, reward, next_state, done):
        self.replay_buffer.add(state, action, reward, next_state, done)

    def sample(self):
        return self.replay_buffer.sample(self.batch_size)

    def train(self):
        if self.replay_buffer.size < self.batch_size:
            return
        states, actions, rewards, next_states, dones = self.replay_buffer.sample(
            self.batch_size, device=self.device)

        # TD target from the target networks (no gradients)
        with torch.no_grad():
            next_actions = self.target_actor(next_states)
            td_targets = rewards + self.gamma * self.target_critic(next_states, next_actions) * (1 - dones)

        # Critic loss: MSE between Q(s, a) and the TD target
        current_q = self.critic(states, actions)
        critic_loss = F.mse_loss(current_q, td_targets)

        # Actor loss: maximize Q(s, mu(s)) by minimizing its negation
        actor_loss = -self.critic(states, self.actor(states)).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Soft-update the target networks
        for param, target_param in zip(self.actor.parameters(), self.target_actor.parameters()):
            target_param.data.copy_(param.data * self.tau + target_param.data * (1 - self.tau))
        for param, target_param in zip(self.critic.parameters(), self.target_critic.parameters()):
            target_param.data.copy_(param.data * self.tau + target_param.data * (1 - self.tau))

    def save(self, filename):
        """Save all network parameters and optimizer states to a single file."""
        torch.save({
            'actor': self.actor.state_dict(),
            'critic': self.critic.state_dict(),
            'target_actor': self.target_actor.state_dict(),
            'target_critic': self.target_critic.state_dict(),
            'actor_optimizer': self.actor_optimizer.state_dict(),
            'critic_optimizer': self.critic_optimizer.state_dict(),
        }, filename)

    def load(self, filename):
        """Load model parameters."""
        # map_location ensures a GPU-trained checkpoint can also be loaded on CPU, and vice versa
        checkpoint = torch.load(filename, map_location=self.device)
        self.actor.load_state_dict(checkpoint['actor'])
        self.critic.load_state_dict(checkpoint['critic'])
        self.target_actor.load_state_dict(checkpoint['target_actor'])
        self.target_critic.load_state_dict(checkpoint['target_critic'])
        self.actor_optimizer.load_state_dict(checkpoint['actor_optimizer'])
        self.critic_optimizer.load_state_dict(checkpoint['critic_optimizer'])
```
Training Loop
The training loop adds Gaussian noise decay and a warm-up phase to balance exploration and exploitation.
Warm-up: during the first 5000 environment steps, actions are sampled randomly from the action space and no model updates are performed.
```python
import gymnasium as gym
import torch
import numpy as np
import matplotlib.pyplot as plt
from ALG.DRL.DDPG import DDPG
from tqdm import tqdm
from Utils.Noise import GaussianNoise
from Utils.Smooth import Smooth

env = gym.make('LunarLander-v3', continuous=True, render_mode=None)
state_dim = env.observation_space.shape[0] if len(env.observation_space.shape) == 1 else env.observation_space.n
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])

model = DDPG(state_dim, action_dim, max_action, replay_buffer_size=100000,
             tau=0.005, actor_lr=1e-4, batch_size=512)
noise = GaussianNoise(action_dim)

scores = []
episodes = 3000
step = 0
warmup_steps = 5000
noise_decay = 0.999
min_noise = 0.01
max_value = 200
update_interval = 4

pbar = tqdm(range(episodes), desc="Training")
for episode in pbar:
    # Decay exploration noise once warm-up is over
    if step > warmup_steps:
        noise.sigma = max(min_noise, noise.sigma * noise_decay)
    done = False
    state, _ = env.reset()
    score = 0
    while not done:
        step += 1
        if step <= warmup_steps:
            # Warm-up: random actions, no training
            action = env.action_space.sample()
        else:
            action = model.act(state)
            action = (action + noise.sample()).clip(-max_action, max_action)
        next_state, reward, termination, truncated, _ = env.step(action)
        done = termination or truncated
        score += reward
        model.store_transition(state, action, reward, next_state, done)
        state = next_state
        if step > warmup_steps and step % update_interval == 0:
            model.train()
    scores.append(score)
    pbar.set_postfix(ep=episode, score=f"{score:.2f}", avg100=f"{np.mean(scores[-100:]):.2f}")
    if np.mean(scores[-100:]) > max_value:
        model.save("../../model/lunarLanderContinuous-DDPG.pth")

smooth = Smooth(scores)
smooth.show(title="DDPG in LunarLander-v3-continuous")
```
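The Smooth utility imported from Utils.Smooth is not shown in the original post. The sketch below assumes it plots the raw episode scores together with a moving-average curve via matplotlib; the window size of 50 and the plot styling are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt


class Smooth:
    """Plot raw episode scores together with a moving-average curve (assumed behavior)."""
    def __init__(self, scores, window=50):
        self.scores = np.asarray(scores, dtype=np.float32)
        self.window = max(1, min(window, len(self.scores)))  # guard against short runs

    def show(self, title=""):
        # Moving average over the last `window` episodes
        kernel = np.ones(self.window) / self.window
        smoothed = np.convolve(self.scores, kernel, mode='valid')
        plt.plot(self.scores, alpha=0.3, label='raw score')
        plt.plot(np.arange(self.window - 1, len(self.scores)), smoothed,
                 label=f'{self.window}-episode average')
        plt.xlabel('Episode')
        plt.ylabel('Score')
        plt.title(title)
        plt.legend()
        plt.show()
```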
Training Results
After tuning the hyperparameters (increasing the batch size to 512 and the replay buffer to 100,000), the model converged successfully.
As the curve shows, the model scores poorly during the early exploration phase (roughly the first 800 episodes); once warm-up ends and the buffer is sufficiently full, the score rises quickly and eventually stabilizes above 200, achieving a smooth landing.