B1-Unit_1-Introduction_to_Deep_Reinforcement_Learning-H7-Summary
Original course link: https://huggingface.co/deep-rl-course/unit1/quiz?fw=pt
Summary
That was a lot of information! Let’s summarize:
Reinforcement Learning is a computational approach to learning from actions. We build an agent that learns from its environment by interacting with it through trial and error, receiving rewards (negative or positive) as feedback.
The goal of any RL agent is to maximize its expected cumulative reward (also called expected return) because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected cumulative reward.
The RL process is a loop that outputs a sequence of state, action, reward, and next state.
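This loop can be sketched in a few lines of Python. The environment and policy below are made up for illustration (a toy counter environment and a random policy, not part of the course material); the point is the shape of the loop, which produces the (state, action, reward, next state) sequence described above.

```python
import random

class ToyEnv:
    """Hypothetical stand-in environment: the state is a step counter,
    every step yields a reward of +1, and the episode ends after 5 steps."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1
        reward = 1.0
        done = self.state >= 5
        return self.state, reward, done

def random_policy(state):
    # Placeholder "brain": picks one of two actions at random.
    return random.choice([0, 1])

env = ToyEnv()
state = env.reset()
trajectory = []  # the sequence of (state, action, reward, next_state) tuples
done = False
while not done:
    action = random_policy(state)
    next_state, reward, done = env.step(action)
    trajectory.append((state, action, reward, next_state))
    state = next_state

print(len(trajectory))  # → 5
```

Real environments (e.g. those following the Gymnasium API) have richer observations and termination conditions, but the interaction loop keeps this same structure.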
To calculate the expected cumulative reward (expected return), we discount the rewards: rewards that come sooner (at the beginning of the game) are more probable to happen, since they are more predictable than long-term future rewards.
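As a minimal sketch of discounting, the return is the sum of rewards weighted by increasing powers of a discount rate gamma (the variable names here are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    by folding the reward list from the back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5, later rewards count for less and less:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # → 1.75  (1 + 0.5 + 0.25)
```

A gamma close to 1 makes the agent care about the long-term future; a small gamma makes it focus on immediate rewards.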
To solve an RL problem, you want to find an optimal policy. The policy is the “brain” of your agent: it tells us what action to take given a state. The optimal policy is the one that gives you the actions that maximize the expected return.
There are two ways to find your optimal policy:
- By training your policy directly: policy-based methods.
- By training a value function that tells us the expected return the agent will get at each state, and using this function to define our policy: value-based methods.
Finally, we speak about Deep RL because we introduce deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based), hence the name “deep”.