Original course link: https://huggingface.co/deep-rl-course/unit5/bonus?fw=pt

A Q-Learning example

To better understand Q-Learning, let’s take a simple example:

You’re a mouse in this tiny maze. You always start at the same starting point.
The goal is to eat the big pile of cheese at the bottom right-hand corner and avoid the poison. After all, who doesn’t like cheese?
The episode ends if we eat the poison, eat the big pile of cheese, or spend more than five steps.
The learning rate is 0.1.
The gamma (discount rate) is 0.99.

The reward function goes like this:

+0: Going to a state with no cheese in it.
+1: Going to a state with a small cheese in it.
+10: Going to the state with the big pile of cheese.
-10: Going to the state with the poison and thus dying.
+0: If we spend more than five steps.
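
The reward rules and the end-of-episode conditions can be written down in a few lines. This is only a sketch: the cell labels ("empty", "small_cheese", and so on) and the function names are made up for illustration, since the example does not say how the maze is stored.

```python
# Hypothetical cell labels: the example names the rewards, not a data layout.
REWARDS = {
    "empty": 0,          # state with no cheese in it
    "small_cheese": 1,   # state with a small cheese in it
    "big_cheese": 10,    # state with the big pile of cheese
    "poison": -10,       # state with the poison
}
MAX_STEPS = 5  # the episode is cut off once we spend more than five steps


def reward(cell: str) -> int:
    """Reward received for entering a cell of the given kind."""
    return REWARDS[cell]


def episode_done(cell: str, steps_taken: int) -> bool:
    """True if we ate the poison or the big cheese, or ran out of steps."""
    return cell in ("poison", "big_cheese") or steps_taken > MAX_STEPS
```

Running out of steps adds no extra reward, which matches the +0 rule above.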

To train our agent to have an optimal policy (so a policy that goes right, right, down), we will use the Q-Learning algorithm.

Step 1: We initialize the Q-table

So, for now, our Q-table is useless; we need to train our Q-function using the Q-Learning algorithm.
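
As a rough sketch, initializing the Q-table could look like this. The sizes are assumptions: we take the maze to be a 2 x 3 grid (six states) with four moves, and we start every Q-value at 0, which is the usual choice but is not spelled out in the text above.

```python
N_STATES = 6   # assumption: a 2 x 3 maze, one state per cell
N_ACTIONS = 4  # assumption: left, right, up, down

# One row per state, one column per action, every Q-value starting at 0.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

for row in q_table:
    print(row)
```

With every entry at 0, no action looks better than any other, which is why the table is useless before training.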

Let’s do it for two training timesteps:

Training timestep 1:

Step 2: Choose an action using the Epsilon Greedy Strategy

Because epsilon is large (1.0), I take a random action; in this case, I go right.
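
Here is a minimal sketch of epsilon-greedy action selection for one row of the Q-table. The function name and the `rng` parameter are illustrative choices, not something taken from the course code.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q-table.

    With probability epsilon we explore (uniform random action);
    otherwise we exploit (the action with the highest Q-value).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])
```

Since `rng.random()` always returns a value below 1.0, epsilon = 1.0 means every action is random, which is the situation in this first timestep.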

Step 3: Perform action $A_t$, get $R_{t+1}$ and $S_{t+1}$

By going right, I’ve got a small cheese, so $R_{t+1} = 1$, and I’m in a new state.

Step 4: Update $Q(S_t, A_t)$

We can now update $Q(S_t, A_t)$ using our formula.
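
For reference, here is the Q-Learning update rule applied to this step, with the learning rate 0.1 and gamma 0.99 given above. We assume every Q-value is still 0 at this point (a freshly initialized table), so both the old estimate and the best next-state value are 0.

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]

Q(S_t, A_t) = 0 + 0.1 \times \left[ 1 + 0.99 \times 0 - 0 \right] = 0.1
```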

Training timestep 2:

Step 2: Choose an action using the Epsilon Greedy Strategy

I take a random action again, since epsilon is still large (0.99). We decayed it a little because, as training progresses, we want less and less exploration.
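
The example only tells us that epsilon went from 1.0 to 0.99. One simple way to get that is a multiplicative decay, sketched below. The schedule, the `decay_rate` value, and the `min_epsilon` floor are all assumptions for illustration; other schedules (for example, exponential decay per episode) are just as plausible.

```python
def decay_epsilon(epsilon, decay_rate=0.99, min_epsilon=0.05):
    """Shrink epsilon by a constant factor, but never below min_epsilon."""
    return max(min_epsilon, epsilon * decay_rate)


epsilon = 1.0
epsilon = decay_epsilon(epsilon)  # value used for the second timestep
print(epsilon)
```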

I took the action down. Not a good action, since it leads me to the poison.

Step 3: Perform action $A_t$, get $R_{t+1}$ and $S_{t+1}$

Because I go to the poison state, I get $R_{t+1} = -10$, and I die.

Step 4: Update $Q(S_t, A_t)$

Because we’re dead, we start a new episode. But what we see here is that, with two exploration steps, my agent became smarter.
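
To see the numbers behind both updates, here is a small sketch of the Q-Learning update as a function. It assumes the Q-values involved were still 0 before each update, and that a terminal step (eating the poison) does not bootstrap from a next state. The function and argument names are illustrative.

```python
ALPHA = 0.1   # learning rate from the example
GAMMA = 0.99  # discount rate from the example


def q_update(q_sa, reward, max_q_next, done, alpha=ALPHA, gamma=GAMMA):
    """Return the new Q(S_t, A_t) after one Q-Learning step.

    When the episode has ended there is no next state to bootstrap from,
    so the TD target is just the reward.
    """
    td_target = reward if done else reward + gamma * max_q_next
    return q_sa + alpha * (td_target - q_sa)


# Timestep 1: went right and got the small cheese.
q_right = q_update(q_sa=0.0, reward=1, max_q_next=0.0, done=False)
# Timestep 2: went down, ate the poison, episode over.
q_down = q_update(q_sa=0.0, reward=-10, max_q_next=0.0, done=True)
print(q_right, q_down)
```

Under these assumptions, going right from the start now has a slightly positive value and going down into the poison has a negative one, which is the sense in which the agent became smarter.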

As we continue exploring and exploiting the environment and updating Q-values using the TD target, the Q-table will give us better and better approximations. And thus, at the end of the training, we’ll get an estimate of the optimal Q-function.
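
Once we have a Q-table, acting greedily on it gives us a policy. The sketch below does this for the table as it might look after the two timesteps above. The state numbering (row by row from the start cell in a 2 x 3 maze), the action ordering, and the two updated values are assumptions made for illustration.

```python
ACTIONS = ["left", "right", "up", "down"]  # hypothetical ordering


def greedy_action(q_row):
    """Index of the highest-valued action; ties go to the lowest index."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


# Assumed Q-table after two timesteps: everything is 0 except the two
# state-action pairs we updated.
q_table = [[0.0] * len(ACTIONS) for _ in range(6)]
q_table[0][ACTIONS.index("right")] = 0.1   # start cell, went right
q_table[1][ACTIONS.index("down")] = -1.0   # small-cheese cell, went down

print(ACTIONS[greedy_action(q_table[0])])  # right
print(ACTIONS[greedy_action(q_table[1])])  # left: a tie among the zeros, but not down
```

The table is far from optimal after only two updates, but the greedy policy already prefers going right from the start and avoids stepping down into the poison.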
