E4-Unit_2-Introduction_to_Q_Learning-H7-Introducing_Q_Learning

Original course link: https://huggingface.co/deep-rl-course/unit5/hands-on?fw=pt

Introducing Q-Learning


What is Q-Learning?


Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function:


  • Off-policy: we’ll talk about that at the end of this unit.
  • Value-based method: finds the optimal policy indirectly by training a value or action-value function that will tell us the value of each state or each state-action pair.
  • Uses a TD approach: updates its action-value function at each step instead of at the end of the episode.

Q-Learning is the algorithm we use to train our Q-function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.


Q-function
Given a state and an action, our Q-function outputs a state-action value (also called Q-value).
The Q comes from “the Quality” (the value) of that action at that state.


Let’s recap the difference between value and reward:


  • The value of a state, or a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
  • The reward is the feedback we get from the environment after performing an action at a state.

Internally, our Q-function has a Q-table, a table where each cell corresponds to a state-action pair value. Think of this Q-table as the memory or cheat sheet of our Q-function.

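As a minimal sketch (the environment size and the variable and function names here are illustrative, not from the course), the Q-function of a small discrete environment can literally be a 2-D array lookup:

```python
import numpy as np

# Assumed sizes for a small discrete environment (e.g. a 4x4 grid with 4 moves).
n_states, n_actions = 16, 4

# The Q-table: one cell (Q-value) per state-action pair -- the "cheat sheet".
q_table = np.zeros((n_states, n_actions))

def q_function(state: int, action: int) -> float:
    """Return the Q-value stored in the Q-table for this state-action pair."""
    return float(q_table[state, action])
```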

Let’s go through an example of a maze.


Maze example
The Q-table is initialized. That's why all the values are 0. This table contains, for each state, the four state-action values.


Maze example
Here we see that the state-action value of the initial state and going up is 0:


Maze example
Therefore, our Q-function contains a Q-table that has the value of each state-action pair. And given a state and an action, our Q-function searches inside its Q-table and outputs the value.


Q-function
If we recap, Q-Learning is the RL algorithm that:


  • Trains a Q-function (an action-value function), which internally is a Q-table that contains all the state-action pair values.
  • Given a state and action, our Q-function will look up the corresponding value in its Q-table.
  • When the training is done, we have an optimal Q-function, which means we have an optimal Q-table.
  • And if we have an optimal Q-function, we have an optimal policy, since we know, for each state, the best action to take.

Link value policy
But, in the beginning, our Q-table is useless since it gives arbitrary values for each state-action pair (most of the time, we initialize the Q-table to 0). As the agent explores the environment and we update the Q-table, it will give us better and better approximations to the optimal policy.

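To make the link between Q-table and policy concrete, here is a small sketch (reusing the illustrative `q_table` array from above): once the Q-table is close to optimal, the policy simply takes, in each state, the action with the highest Q-value.

```python
import numpy as np

def greedy_policy(q_table: np.ndarray, state: int) -> int:
    """Return the action with the highest Q-value for this state."""
    return int(np.argmax(q_table[state]))
```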

Q-learning
We see here that, with training, our Q-table gets better since, thanks to it, we can know the value of each state-action pair.
Now that we understand what Q-Learning, Q-function, and Q-table are, let’s dive deeper into the Q-Learning algorithm.


The Q-Learning algorithm


This is the Q-Learning pseudocode; let's study each part and see how it works with a simple example before implementing it. Don't be intimidated by it; it's simpler than it looks! We'll go over each step.


Q-learning


Step 1: We initialize the Q-table


Q-learning
We need to initialize the Q-table for each state-action pair. Most of the time, we initialize with values of 0.

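In code, this step is a one-liner. A sketch, assuming a Gymnasium-style discrete environment (the environment name is just an example, not necessarily the one used later in the course):

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")  # example of a small discrete environment

# Step 1: one row per state, one column per action, everything initialized to 0.
q_table = np.zeros((env.observation_space.n, env.action_space.n))
```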

Step 2: Choose an action using the epsilon-greedy strategy


Q-learning
The epsilon-greedy strategy is a policy that handles the exploration/exploitation trade-off.


The idea is that we define the initial epsilon ɛ = 1.0:


  • With probability 1 − ɛ: we do exploitation (aka our agent selects the action with the highest state-action pair value).
  • With probability ɛ: we do exploration (trying random action).

At the beginning of the training, the probability of doing exploration will be huge since ɛ is very high, so most of the time we'll explore. But as the training goes on, our Q-table gets better and better in its estimations, so we progressively reduce the epsilon value, since we will need less and less exploration and more and more exploitation.

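A minimal sketch of this acting policy (the function name and arguments are illustrative):

```python
import random
import numpy as np

def epsilon_greedy_policy(q_table: np.ndarray, state: int, epsilon: float) -> int:
    """With probability epsilon explore (random action); otherwise exploit (best known action)."""
    if random.random() < epsilon:
        return random.randrange(q_table.shape[1])   # exploration: random action
    return int(np.argmax(q_table[state]))           # exploitation: highest Q-value
```

During training, epsilon is then decayed from 1.0 towards a small minimum value (linearly or exponentially, depending on the implementation) so that the agent exploits more and more.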

Q-learning


Step 3: Perform the action $A_t$, get reward $R_{t+1}$ and next state $S_{t+1}$


Q-learning

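With a Gymnasium-style environment (an assumption about the API, consistent with the earlier sketches), this step is a single call:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1")          # example environment, as in the earlier sketch
state, info = env.reset()
action = env.action_space.sample()       # stand-in for the epsilon-greedy choice A_t

# Perform A_t and observe R_{t+1} and S_{t+1}.
next_state, reward, terminated, truncated, info = env.step(action)
```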

Step 4: Update $Q(S_t, A_t)$


Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) after one step of the interaction.


To produce our TD target, we use the immediate reward $R_{t+1}$ plus the discounted value of the best state-action pair of the next state (we call that bootstrapping).


Q-learning
Therefore, our $Q(S_t, A_t)$ update formula goes like this:


Q-learning
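Written out, the Q-Learning update shown in the figure is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a)$ is the TD target, and the whole bracketed term is the TD error.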
It means that to update our $Q(S_t, A_t)$:


  • We need $S_t$, $A_t$, $R_{t+1}$, $S_{t+1}$.
  • To update our Q-value at a given state-action pair, we use the TD target.

How do we form the TD target?


  1. We obtain the reward $R_{t+1}$ after taking the action $A_t$.
  2. To get the best next-state-action pair value, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy; it always takes the action with the highest state-action value.

Then when the update of this Q-value is done, we start in a new state and select our action using an epsilon-greedy policy again.

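Putting the four steps together, a bare-bones training loop might look like the sketch below (the environment name, hyperparameter values, and variable names are illustrative, not the course's exact settings):

```python
import random
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha, gamma = 0.1, 0.99                          # learning rate and discount factor (illustrative)
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.0005   # illustrative exploration schedule

for episode in range(10_000):
    state, info = env.reset()
    done = False
    while not done:
        # Step 2: choose the action with the epsilon-greedy (acting) policy.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        # Step 3: perform A_t, observe R_{t+1} and S_{t+1}.
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Step 4: the TD target uses the greedy (updating) policy: max over next actions.
        # No bootstrapping from terminal states.
        td_target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
        q_table[state, action] += alpha * (td_target - q_table[state, action])

        state = next_state

    # Reduce epsilon: less exploration as the Q-table improves.
    epsilon = max(eps_min, epsilon - eps_decay)
```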

This is why we say that Q-Learning is an off-policy algorithm.


Off-policy vs On-policy


The difference is subtle:


  • Off-policy: using a different policy for acting (inference) and updating (training).

For instance, with Q-Learning, the epsilon-greedy policy (acting policy) is different from the greedy policy that is used to select the best next-state-action value to update our Q-value (updating policy).


Off-on policy
The acting policy is different from the policy we use during the training part:


Off-on policy
Updating policy


  • On-policy: using the same policy for acting and updating.

For instance, with Sarsa, another value-based algorithm, it is the epsilon-greedy policy that selects the next state-action pair, not a greedy policy.

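To see the difference in code: inside the training-loop sketch above, only the TD target changes (the `epsilon_greedy_policy` helper is the illustrative one from Step 2):

```python
# Q-Learning (off-policy): bootstrap from the greedy action in the next state,
# even though the agent acted with the epsilon-greedy policy.
q_learning_target = reward + gamma * np.max(q_table[next_state])

# Sarsa (on-policy): bootstrap from the action actually selected by the same
# epsilon-greedy policy that is used for acting.
next_action = epsilon_greedy_policy(q_table, next_state, epsilon)
sarsa_target = reward + gamma * q_table[next_state, next_action]

# Either target then drives the same TD update:
# q_table[state, action] += alpha * (target - q_table[state, action])
```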

Off-on policy
Sarsa
Off-on policy
