Asynchronous Advantage Actor-Critic

1. Derivation

The $G_t^n$ required by the policy gradient is (in expectation) exactly the Q of Q-Learning, i.e.:
$$
\begin{aligned}
E[G_t^n] &= Q^{\pi_\theta}(s_t^n, a_t^n) \\
b &= V^{\pi_\theta}(s_t^n)
\end{aligned}
$$

The parenthesized term in equation (1) of the review section can therefore be written as:
$$
\left(\sum_{t'=t}^{T_n}\gamma^{t'-t} r_{t'}^n - b\right) = Q^{\pi_\theta}(s_t^n, a_t^n) - V^{\pi_\theta}(s_t^n) \tag{1}
$$

Combining this with the relation (which holds in expectation; the expectation is dropped here as an approximation):
$$
Q^{\pi_\theta}(s_t^n, a_t^n) = r_t^n + V^{\pi_\theta}(s_{t+1}^n) \tag{2}
$$

Substituting equation (2) into equation (1) gives:
$$
\left(\sum_{t'=t}^{T_n}\gamma^{t'-t} r_{t'}^n - b\right) = r_t^n + V^{\pi_\theta}(s_{t+1}^n) - V^{\pi_\theta}(s_t^n) \tag{3}
$$

Substituting equation (3) into $\nabla \bar R_\theta$ gives:
$$
\nabla \bar R_\theta \approx \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^{T_n}\left(r_t^n + V^{\pi_\theta}(s_{t+1}^n) - V^{\pi_\theta}(s_t^n)\right)\nabla\log p_\theta(a_t^n \mid s_t^n) \tag{4}
$$
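As a concrete reference, here is a minimal PyTorch sketch of the actor update implied by equation (4). The names `policy_net` and `value_net` (a discrete-action policy head and a scalar value head) are assumptions for illustration, not part of the original notes.

```python
import torch

def actor_loss(policy_net, value_net, states, actions, rewards, next_states):
    """Policy-gradient loss for one batch of sampled transitions (equation (4))."""
    with torch.no_grad():
        v_s  = value_net(states).squeeze(-1)        # V^{pi_theta}(s_t^n)
        v_s1 = value_net(next_states).squeeze(-1)   # V^{pi_theta}(s_{t+1}^n)
        # Advantage from equation (3): r_t^n + V(s_{t+1}^n) - V(s_t^n).
        # (In practice a discount factor is usually applied to V(s_{t+1}^n);
        # it is omitted here to match equation (2).)
        advantage = rewards + v_s1 - v_s
    dist = torch.distributions.Categorical(logits=policy_net(states))
    log_prob = dist.log_prob(actions)               # log p_theta(a_t^n | s_t^n)
    # Minimizing this loss performs gradient ascent on equation (4).
    return -(advantage * log_prob).mean()
```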

2. Training Procedure

  1. $\pi$ interacts with the environment and samples labelled data
  2. Learn $V^{\pi_\theta}(s)$ from the labelled data with TD or MC (a sketch of this step follows the list)
  3. Update $\pi$ using $V^{\pi_\theta}(s)$ according to equation (4)
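A minimal sketch of step 2, assuming the same hypothetical `value_net` as above: the critic is regressed onto the bootstrapped TD target $r_t^n + V^{\pi_\theta}(s_{t+1}^n)$ from equation (2); with MC, the sampled return $G_t^n$ would be the target instead.

```python
import torch
import torch.nn.functional as F

def critic_loss(value_net, states, rewards, next_states):
    """TD regression loss for learning V^{pi_theta}(s)."""
    with torch.no_grad():
        # Bootstrapped target r_t^n + V(s_{t+1}^n); with MC, use the return G_t^n instead.
        target = rewards + value_net(next_states).squeeze(-1)
    return F.mse_loss(value_net(states).squeeze(-1), target)
```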

Note:
An entropy regularization term is added for the policy output $\pi(\cdot \mid s)$, so that its entropy tends to be larger (encouraging exploration)
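One possible way to implement this, continuing the hypothetical `policy_net` / `actor_loss` above and using an assumed coefficient `beta`: subtract the mean policy entropy from the actor loss, so that higher-entropy (more exploratory) policies are preferred.

```python
import torch

def entropy_bonus(policy_net, states, beta=0.01):
    """Entropy term to subtract from the actor loss; beta is an assumed coefficient."""
    dist = torch.distributions.Categorical(logits=policy_net(states))
    return beta * dist.entropy().mean()

# total actor objective: actor_loss(...) - entropy_bonus(policy_net, states)
```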

3. Asynchronous

Spawn $N$ "shadow clones" (parallel workers) that train simultaneously to speed up training; each worker runs the loop below (a code sketch follows the list)

for every worker:

  1. copy the global parameters
  2. independently sample data and compute the gradient $\nabla\theta$
  3. push $\nabla\theta$ to the global network
  4. the global network updates its parameters
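A simplified, thread-based sketch of this worker loop. A real A3C implementation runs separate processes with a shared model (e.g. torch.multiprocessing, Hogwild-style updates); here the random vector only stands in for the actor/critic gradient a worker would compute from its own samples.

```python
import threading
import numpy as np

def worker(global_theta, lock, num_updates=1000, lr=1e-3):
    for _ in range(num_updates):
        local_theta = global_theta.copy()          # 1. copy the global parameters
        # 2. sample data with local_theta and compute grad(theta);
        #    the random vector below is only a placeholder for that gradient
        grad = np.random.randn(*local_theta.shape)
        with lock:                                 # 3. push grad(theta) to global
            global_theta -= lr * grad              # 4. global updates its parameters (in place)

theta = np.zeros(10)                               # shared global parameters
lock = threading.Lock()
workers = [threading.Thread(target=worker, args=(theta, lock)) for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```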
