E4-Unit_2-Introduction_to_Q_Learning-I8-A_Q_Learning_example

Original course link: https://huggingface.co/deep-rl-course/unit5/bonus?fw=pt

A Q-Learning example

To better understand Q-Learning, let’s take a simple example:

Maze-Example

  • You’re a mouse in this tiny maze. You always start at the same starting point.
  • The goal is to eat the big pile of cheese at the bottom right-hand corner and avoid the poison. After all, who doesn’t like cheese?
  • The episode ends if we eat the poison, eat the big pile of cheese, or if we spend more than five steps.
  • The learning rate is 0.1
  • The gamma (discount rate) is 0.99

Maze-Example
The reward function goes like this:

  • +0: Going to a state with no cheese in it.
  • +1: Going to a state with a small cheese in it.
  • +10: Going to the state with the big pile of cheese.
  • -10: Going to the state with the poison and thus dying.
  • +0: If we spend more than five steps.

Maze-Example
To train our agent to reach an optimal policy (that is, a policy that goes right, right, down), we will use the Q-Learning algorithm.

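Before walking through the algorithm, here is a minimal sketch of how this environment could be written in Python. The reward values, the five-step limit, and the terminal conditions come from the lists above; the 2x3 grid layout, the cell coordinates, and the helper names (step, ACTIONS, REWARDS) are assumptions made for illustration, not the course's actual code.

```python
# A minimal sketch of the maze as a 2x3 grid world. The exact layout is an
# assumption inferred from the description: start at the top-left, a small
# cheese to its right, the poison below the small cheese, and the big pile
# of cheese at the bottom-right corner.
N_ROWS, N_COLS = 2, 3
ACTIONS = ["up", "down", "left", "right"]

REWARDS = {
    (0, 1): 1,     # small cheese
    (1, 2): 10,    # big pile of cheese (terminal)
    (1, 1): -10,   # poison (terminal)
}
TERMINAL_STATES = {(1, 2), (1, 1)}
MAX_STEPS = 5      # the episode also ends if we spend more than five steps


def step(state, action):
    """Move one cell in the grid (staying inside the walls) and return
    (next_state, reward, done)."""
    row, col = state
    if action == "up":
        row = max(row - 1, 0)
    elif action == "down":
        row = min(row + 1, N_ROWS - 1)
    elif action == "left":
        col = max(col - 1, 0)
    elif action == "right":
        col = min(col + 1, N_COLS - 1)
    next_state = (row, col)
    reward = REWARDS.get(next_state, 0)   # +0 for any state with no cheese
    done = next_state in TERMINAL_STATES
    return next_state, reward, done
```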

Step 1: We initialize the Q-table

Maze-Example
So, for now, our Q-table is useless; we need to train our Q-function using the Q-Learning algorithm.

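As a sketch of Step 1, the Q-table can be represented as a NumPy array with one row per state and one column per action, initialized to zeros so that every state-action value starts out equal. The grid dimensions and the state_index helper reuse the assumptions from the environment sketch above.

```python
import numpy as np

n_states = N_ROWS * N_COLS      # 6 cells in the assumed 2x3 grid
n_actions = len(ACTIONS)        # up, down, left, right

# All Q-values start at 0: the table tells us nothing until we train it.
Q_table = np.zeros((n_states, n_actions))


def state_index(state):
    """Map a (row, col) cell to the corresponding row of the Q-table."""
    row, col = state
    return row * N_COLS + col
```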

Let’s do it for 2 training timesteps:

Training timestep 1:

Step 2: Choose action using Epsilon Greedy Strategy

Because epsilon is big (= 1.0), I take a random action; in this case, I go right.

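A sketch of Step 2: with probability epsilon we explore (pick a random action), otherwise we exploit (pick the action with the highest Q-value). It reuses the hypothetical Q_table and state_index from above. Since epsilon is 1.0 at this first timestep, the random branch is always taken, which is why the action here is random.

```python
import random

import numpy as np


def epsilon_greedy(Q_table, state_idx, epsilon):
    """Explore with probability epsilon, otherwise exploit the current Q-table."""
    if random.random() < epsilon:
        return random.randrange(Q_table.shape[1])   # explore: random action
    return int(np.argmax(Q_table[state_idx]))       # exploit: best known action


# Training timestep 1: epsilon = 1.0, so this always returns a random action.
action_idx = epsilon_greedy(Q_table, state_index((0, 0)), epsilon=1.0)
```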

Maze-Example

Step 3: Perform action $A_t$, get $R_{t+1}$ and $S_{t+1}$

By going right, I've got a small cheese, so $R_{t+1} = 1$, and I'm in a new state.

Maze-Example

Step 4: Update $Q(S_t, A_t)$

We can now update $Q(S_t, A_t)$ using our formula.

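Concretely, the formula is the Q-Learning update. Plugging in this example's numbers (learning rate $\alpha = 0.1$, discount rate $\gamma = 0.99$, reward $R_{t+1} = 1$, and, assuming the Q-table was initialized to zeros, $Q(S_t, A_t) = 0$ and $\max_a Q(S_{t+1}, a) = 0$) gives:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big] = 0 + 0.1 \big[ 1 + 0.99 \times 0 - 0 \big] = 0.1
$$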

Maze-Example
Maze-Example
Training timestep 2:

Step 2: Choose action using Epsilon Greedy Strategy

I take a random action again, since epsilon is still big (0.99): we decay it a little bit at each step because, as training progresses, we want less and less exploration.

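One common way to get this kind of schedule (a sketch; the exact decay used in the course may differ) is to decay epsilon exponentially toward a minimum value as training progresses:

```python
import numpy as np

max_epsilon = 1.0     # start fully exploratory
min_epsilon = 0.05    # assumed floor: never stop exploring completely
decay_rate = 0.005    # assumed decay speed


def epsilon_at(episode):
    """Exponentially decay epsilon from max_epsilon toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
```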

I take the action down. Not a good action, since it leads me to the poison.

Maze-Example

Step 3: Perform action $A_t$, get $R_{t+1}$ and $S_{t+1}$

Because I go to the poison state, I get $R_{t+1} = -10$, and I die.

Maze-Example

Step 4: Update $Q(S_t, A_t)$

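Using the same formula with $R_{t+1} = -10$ (and, still assuming the zero initialization, both Q-terms are 0, since neither this state-action pair nor any action from the poison state has been updated yet):

$$
Q(S_t, A_t) \leftarrow 0 + 0.1 \big[ -10 + 0.99 \times 0 - 0 \big] = -1
$$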

Maze-Example
Because we’re dead, we start a new episode. But what we see here is that with two explorations steps, my agent became smarter.

As we continue exploring and exploiting the environment and updating the Q-values using the TD target, the Q-table will give us better and better approximations. Thus, at the end of training, we'll get an estimate of the optimal Q-function.

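Putting the pieces together, a full training run could look like the sketch below. It reuses the hypothetical helpers from the earlier snippets (step, state_index, epsilon_greedy, epsilon_at, Q_table, ACTIONS, MAX_STEPS); the learning rate and gamma are the values given at the top of this example, while the number of episodes is an arbitrary choice.

```python
import numpy as np

learning_rate = 0.1
gamma = 0.99
n_episodes = 1000   # assumed; plenty for such a tiny maze

for episode in range(n_episodes):
    epsilon = epsilon_at(episode)
    state = (0, 0)                  # we always start at the same starting point
    for _ in range(MAX_STEPS):      # the episode ends after at most five steps
        s = state_index(state)
        action_idx = epsilon_greedy(Q_table, s, epsilon)
        next_state, reward, done = step(state, ACTIONS[action_idx])
        s_next = state_index(next_state)

        # Q-Learning update: move Q(s, a) toward the TD target
        # R_{t+1} + gamma * max_a Q(S_{t+1}, a).
        td_target = reward + gamma * np.max(Q_table[s_next])
        Q_table[s, action_idx] += learning_rate * (td_target - Q_table[s, action_idx])

        state = next_state
        if done:
            break

# After training, the greedy policy (argmax over each row of Q_table) should
# go right, right, down from the start to the big pile of cheese.
```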