Unit 6: Actor-Critic Methods with Robotics Environments - Introduction
Original course link: https://huggingface.co/deep-rl-course/unitbonus3/curriculum-learning?fw=pt
Introduction
In Unit 4, we learned about our first Policy-Based algorithm, called Reinforce.
In Policy-Based methods, we aim to optimize the policy directly without using a value function. More precisely, Reinforce is part of a subclass of Policy-Based Methods called Policy-Gradient methods. This subclass optimizes the policy directly by estimating the weights of the optimal policy using Gradient Ascent.
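As a quick reminder of what Gradient Ascent means here: the policy parameters $\theta$ are nudged in the direction that increases the expected return $J(\theta)$,

$$\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate.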
We saw that Reinforce worked well. However, because we use Monte-Carlo sampling to estimate return (we use an entire episode to calculate the return), we have significant variance in policy gradient estimation.
Remember that the policy gradient estimation is the direction of the steepest increase in return. In other words, it tells us how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will study further in this unit, leads to slower training since we need a lot of samples to mitigate it.
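Recall the Monte-Carlo policy gradient estimate from Unit 4: the return of the whole sampled episode, $R(\tau)$, scales every term of the update, so a single lucky or unlucky episode shifts the entire gradient. That is where the variance comes from:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$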
So, today we’ll study Actor-Critic methods, a hybrid architecture combining Value-Based and Policy-Based methods that helps stabilize training by reducing the variance (a short preview of the math follows the list below):
- An Actor that controls how our agent behaves (Policy-Based method)
- A Critic that measures how good the taken action is (Value-Based method)
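To give a first intuition of how the two fit together (we’ll derive this properly in this unit): the Critic’s value estimate replaces the noisy full-episode return. In A2C, the Actor is updated with the Advantage, i.e. how much better an action is than what the Critic expects on average in that state:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t), \qquad \nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)$$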
We’ll study one of these hybrid methods, Advantage Actor Critic (A2C), and train our agent using Stable-Baselines3 in robotic environments (a minimal training sketch follows the list). We’ll train two robots:
- A spider 🕷️ to learn to move.
- A robotic arm 🦾 to move in the correct position.
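If you want a feel for what the training code will look like, here is a minimal sketch of A2C with Stable-Baselines3. The environment id below is a generic placeholder chosen for illustration; the hands-on will introduce the actual robotic environments and how to install them.

```python
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

# Run several copies of the environment in parallel: A2C collects short
# rollouts from each of them, which also helps smooth the gradient estimate.
env = make_vec_env("Pendulum-v1", n_envs=4)  # placeholder continuous-control task

# "MlpPolicy" builds both networks: the Actor (policy) and the Critic (value function).
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
model.save("a2c_demo")
```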

Sounds exciting? Let’s get started!