L11-Unit_8-Part_1_Proximal_Policy_Optimization_(PPO)-B1-The_intuition_behind_PPO

Original course link: https://huggingface.co/deep-rl-course/unit2/two-types-value-based-methods?fw=pt

The intuition behind PPO

The idea behind Proximal Policy Optimization (PPO) is to improve the training stability of the policy by limiting the change we make to the policy at each training epoch: we want to avoid policy updates that are too large.

For two reasons:

  • We know empirically that smaller policy updates during training are more likely to converge to an optimal solution.
  • A step that is too big in a policy update can result in falling “off the cliff” (getting a bad policy), taking a long time to recover, or even never recovering at all.

[Figure: the policy update “cliff” — taking smaller policy updates improves training stability. Modified version from “RL — Proximal Policy Optimization (PPO) Explained” by Jonathan Hui.]
So with PPO, we update the policy conservatively. To do so, we measure how much the current policy has changed compared to the former one, using the ratio between the current and former policy. We then clip this ratio to the range $[1 - \epsilon, 1 + \epsilon]$, which removes the incentive for the current policy to go too far from the old one (hence the term proximal policy).
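
To make the clipping idea concrete, here is a minimal sketch (not the course's official implementation) of PPO's clipped surrogate objective in PyTorch. It assumes you already have tensors `log_probs_new`, `log_probs_old`, and `advantages` computed elsewhere, and it uses epsilon = 0.2 as an illustrative clipping value.

```python
import torch

def ppo_clipped_objective(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s),
    # computed in log space for numerical stability.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Unclipped surrogate objective: ratio * advantage.
    unclipped = ratio * advantages

    # Clip the ratio to [1 - epsilon, 1 + epsilon] so the new policy
    # has no incentive to move too far from the old one.
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # PPO maximizes the pessimistic (element-wise minimum) bound;
    # returned negated so it can be minimized as a loss.
    return -torch.min(unclipped, clipped).mean()
```

In this sketch, clamping the ratio is exactly the "proximal" constraint described above: once the ratio leaves $[1 - \epsilon, 1 + \epsilon]$, the gradient through the clipped term vanishes, so the update stops pushing the policy further away.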
