B1-Unit_1-Introduction_to_Deep_Reinforcement_Learning-E4-The_Exploration_Exploitation_tradeoff
Original course link: https://huggingface.co/deep-rl-course/unit1/summary?fw=pt
The Exploration/Exploitation trade-off
Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.
- Exploration means trying random actions to gather more information about the environment.
- Exploitation means using the information we already have to maximize the reward.
Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.
Let’s take an example:

In this game, our mouse can collect an unlimited number of small pieces of cheese (+1 each). But at the top of the maze, there is a gigantic pile of cheese (+1000).
However, if we focus only on exploitation, our agent will never reach the gigantic pile of cheese. Instead, it will keep exploiting the nearest source of reward, even though that source is small.
But if our agent does a little bit of exploration, it can discover the big reward (the big pile of cheese).
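The trap can be reproduced in a few lines. The sketch below is a deliberately simplified, hypothetical version of the maze: it assumes only two options with fixed rewards (+1 for the nearby small cheese, +1000 for the big pile), and the names `REWARDS`, `run`, and `explore_steps` are made up for illustration. The agent tries each option in turn for the first `explore_steps` steps, then always picks the option it believes is best.

```python
# Hypothetical two-option version of the cheese maze; the +1 / +1000 rewards
# come from the text, everything else is an illustrative assumption.
REWARDS = [1, 1000]  # index 0: nearby small cheese, index 1: far-away big pile


def run(steps, explore_steps):
    """Try each option in turn for `explore_steps` steps, then always exploit."""
    estimates = [0, 0]  # what the agent currently believes each option pays
    total = 0
    for t in range(steps):
        if t < explore_steps:
            action = t % len(REWARDS)  # exploration: try the options in turn
        else:
            # exploitation: best-known option; ties go to the nearest (lowest index)
            action = max(range(len(REWARDS)), key=lambda a: estimates[a])
        reward = REWARDS[action]
        estimates[action] = reward  # rewards are fixed, so one sample is enough
        total += reward
    return total, estimates


greedy_total, greedy_estimates = run(steps=100, explore_steps=0)
curious_total, curious_estimates = run(steps=100, explore_steps=2)
print(greedy_total, greedy_estimates)    # 100 [1, 0]
print(curious_total, curious_estimates)  # 99001 [1, 1000]
```

With no exploration, the agent never learns that the second option exists as anything better than 0, so it collects +1 forever. Two exploratory steps are enough here only because the rewards are fixed and there are just two options; a real environment needs far more exploration than that.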
This is what we call the exploration/exploitation trade-off. We need to balance how much we explore the environment and how much we exploit what we know about the environment.
Therefore, we must define a rule that helps us handle this trade-off. We’ll see different ways to handle it in future units.
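To make "a rule" concrete, here is a minimal sketch of one widely used option, the epsilon-greedy rule: with probability `epsilon` the agent picks a random action (exploration), otherwise it picks the action with the highest estimated value (exploitation). This is only an illustration of what such a rule can look like, not necessarily the rule used in later units; the function name `epsilon_greedy` and the example estimates are assumptions.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: any action, uniformly
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit


rng = random.Random(0)  # fixed seed so the run is repeatable
estimates = [1.0, 0.0]  # hypothetical value estimates for two actions
always_exploit = [epsilon_greedy(estimates, 0.0, rng) for _ in range(5)]
always_explore = [epsilon_greedy(estimates, 1.0, rng) for _ in range(200)]
print(always_exploit)       # [0, 0, 0, 0, 0]
print(set(always_explore))  # the distinct actions chosen while exploring
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; values in between trade one off against the other.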
If it’s still confusing, think of a real-world problem: choosing a restaurant.

Source: Berkeley AI Course
- Exploitation: You go to the same restaurant every day, one you know is good, at the risk of missing a better one.
- Exploration: You try restaurants you have never been to, at the risk of a bad experience but with the chance of a fantastic one.
To recap:

(Figure: the exploration/exploitation trade-off)