Unit 8, Part 1: Proximal Policy Optimization (PPO) - Introduction
Original course link: https://huggingface.co/deep-rl-course/unit2/what-is-rl?fw=pt
Introduction
In Unit 6, we learned about Advantage Actor-Critic (A2C), a hybrid architecture combining value-based and policy-based methods that helps stabilize training by reducing variance with two components (sketched in code after the list):
- An Actor that controls how our agent behaves (policy-based method).
- A Critic that measures how good the action taken is (value-based method).
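To make that recap concrete, here is a minimal, hypothetical PyTorch sketch of an actor-critic module. The class name, layer sizes, and discrete-action assumption are illustrative choices, not the CleanRL code you will write later in the unit.

```python
import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    """Minimal actor-critic network (illustrative sizes, discrete actions)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Actor: maps an observation to action preferences (policy-based part)
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )
        # Critic: maps an observation to a single state-value estimate (value-based part)
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor):
        logits = self.actor(obs)   # unnormalized action preferences
        value = self.critic(obs)   # estimate of V(s)
        return torch.distributions.Categorical(logits=logits), value
```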
Today we’ll learn about Proximal Policy Optimization (PPO), an architecture that improves our agent’s training stability by avoiding too large policy updates. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio to a specific range [1 − ϵ, 1 + ϵ].
Doing this will ensure that our policy update will not be too large and that the training is more stable.
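To see what this clipping looks like in practice, below is a minimal sketch of the clipped surrogate objective in PyTorch. The function name is hypothetical and ϵ = 0.2 is just a commonly used default; the full, optimized version is what you will build with CleanRL in the hands-on.

```python
import torch


def ppo_clipped_objective(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    # Probability ratio r(theta) = pi_theta(a|s) / pi_theta_old(a|s),
    # computed from log-probabilities for numerical stability.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Unclipped surrogate objective.
    surr_unclipped = ratio * advantages
    # Clipped surrogate: the ratio is constrained to [1 - epsilon, 1 + epsilon].
    surr_clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # PPO maximizes the minimum of the two surrogates, so the policy loss
    # (to be minimized with gradient descent) is the negated mean.
    return -torch.min(surr_unclipped, surr_clipped).mean()
```

Because the clipped term caps how much a single update can profit from moving the ratio away from 1, the resulting policy step stays small even when the advantage estimate is large.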
This Unit is in two parts:
- In this first part, you’ll learn the theory behind PPO and code your PPO agent from scratch using the CleanRL implementation, then test its robustness with LunarLander-v2. LunarLander-v2 is the first environment you used when you started this course. At that time, you didn’t know how PPO worked, and now, you can code it from scratch and train it. How incredible is that 🤩.
- In the second part, we’ll go deeper into PPO optimization by using Sample-Factory and train an agent playing ViZDoom (an open source version of Doom).

These are the environments you’re going to use to train your agents: the ViZDoom environments.
Sounds exciting? Let’s get started! 🚀