E4-Unit_2-Introduction_to_Q_Learning-K10-Glossary
Original course link: https://huggingface.co/deep-rl-course/unit4/pg-theorem?fw=pt
Glossary
This is a community-created glossary. Contributions are welcomed!
Strategies to find the optimal policy
- Policy-based methods. The policy is usually trained with a neural network that selects what action to take given a state. In this case, it is the neural network that outputs the action the agent should take, instead of a value function. Based on the experience received from the environment, the neural network is re-adjusted and provides better actions.
- Value-based methods. In this case, a value function is trained to output the value of a state or a state-action pair, and this value function represents our policy. However, the value doesn't define what action the agent should take. Instead, we need to specify the agent's behavior given the output of the value function. For example, we could decide to adopt a policy that always takes the action leading to the biggest reward (Greedy Policy). In summary, the policy is a Greedy Policy (or whatever decision the user takes) that uses the values of the value function to decide which actions to take.
Among the value-based methods, we can find two main strategies:
- The state-value function. For each state, the state-value function is the expected return if the agent starts in that state and follows the policy until the end.
- The action-value function. In contrast to the state-value function, the action-value function calculates, for each state-action pair, the expected return if the agent starts in that state and takes that action, then follows the policy forever after. (Both definitions are written out below.)
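For reference, these are the standard definitions, using $\pi$ for the policy, $\gamma$ for the discount factor, and $G_t$ for the discounted return:

$$V_{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right] \qquad Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$$

where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots$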
Epsilon-greedy strategy:
- Common exploration strategy used in reinforcement learning that involves balancing exploration and exploitation.
- Chooses the action with the highest expected reward with a probability of 1-epsilon.
- Chooses a random action with a probability of epsilon.
- Epsilon is typically decreased over time to shift the focus towards exploitation (a short sketch of this selection rule follows this list).
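As an illustration only (not code from the course), a minimal sketch of epsilon-greedy action selection over a tabular Q-function could look like this; the names `q_table` and `epsilon_greedy_action` are hypothetical:

```python
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, rng=None):
    """Pick an action for `state` from a tabular Q-function.

    q_table: array of shape (n_states, n_actions) with estimated action values.
    epsilon: probability of exploring instead of exploiting.
    """
    rng = rng or np.random.default_rng()
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        # Explore: pick a uniformly random action (probability epsilon).
        return int(rng.integers(n_actions))
    # Exploit: pick the action with the highest estimated value (probability 1 - epsilon).
    return int(np.argmax(q_table[state]))
```

Between episodes, epsilon would then typically be annealed, for example multiplied by a decay factor down to some minimum value, so that early training explores and late training mostly exploits.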
Greedy strategy:
- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (only exploitation)
- Always chooses the action with the highest expected reward.
- Does not include any exploration.
- Can be disadvantageous in environments with uncertainty or unknown optimal actions, as the small bandit sketch after this list illustrates.
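To see why the lack of exploration can hurt, here is a small hypothetical two-armed bandit comparison (illustrative only; the arm means, noise level, and seed are made up for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.5, 0.7])  # hypothetical arms: arm 1 is better on average

def average_reward(epsilon, steps=1000):
    q = np.zeros(2)       # estimated value of each arm
    counts = np.zeros(2)  # number of pulls per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(2))  # explore
        else:
            action = int(np.argmax(q))     # exploit current estimates
        reward = rng.normal(true_means[action], 1.0)  # noisy reward
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]  # incremental mean
        total += reward
    return total / steps

print("greedy (epsilon=0)  :", average_reward(epsilon=0.0))
print("epsilon-greedy (0.1):", average_reward(epsilon=0.1))
```

With a purely greedy policy, one unlucky early reward can lock the agent onto the worse arm for the rest of the run, while even a small epsilon keeps re-checking the alternative; the exact numbers printed depend on the random seed.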
If you want to improve the course, you can open a Pull Request.
This glossary was made possible thanks to: