E4-Unit_2-Introduction_to_Q_Learning-K10-Glossary
Original course link: https://huggingface.co/deep-rl-course/unit4/pg-theorem?fw=pt
Glossary
This is a community-created glossary. Contributions are welcomed!
Strategies to find the optimal policy
- Policy-based methods. The policy is usually trained with a neural network that selects what action to take given a state. In this case, it is the neural network that outputs the action the agent should take, instead of a value function. Based on the experience received from the environment, the neural network is re-adjusted and provides better actions.
- Value-based methods. In this case, a value function is trained to output the value of a state or a state-action pair, and this value function represents our policy. However, the value doesn't define what action the agent should take. Instead, we need to specify the agent's behavior given the output of the value function. For example, we could decide to adopt a policy that always takes the action leading to the biggest reward (Greedy Policy). In summary, the policy is a Greedy Policy (or whatever decision the user takes) that uses the values of the value function to decide which actions to take.
Among the value-based methods, we can find two main strategies:
- The state-value function. For each state, the state-value function is the expected return if the agent starts in that state and follows the policy until the end.
- The action-value function. In contrast to the state-value function, the action-value function calculates, for each state-action pair, the expected return if the agent starts in that state and takes that action, then follows the policy forever after. (Both definitions are written out below.)
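For reference, these are the standard definitions, using $\pi$ for the policy, $\gamma$ for the discount factor, and $G_t$ for the discounted return:

$$V_{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right] \qquad Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$$

where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots$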
Epsilon-greedy strategy:
- Common exploration strategy used in reinforcement learning that involves balancing exploration and exploitation.
- Chooses the action with the highest expected reward with a probability of 1-epsilon.
- Chooses a random action with a probability of epsilon.
- Epsilon is typically decreased over time to shift the focus towards exploitation (a short sketch of this selection rule follows this list).
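As an illustration only (not code from the course), a minimal sketch of epsilon-greedy action selection over a tabular Q-function could look like this; the names `q_table` and `epsilon_greedy_action` are hypothetical:

```python
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, rng=None):
    """Pick an action for `state` from a tabular Q-function.

    q_table: array of shape (n_states, n_actions) with estimated action values.
    epsilon: probability of exploring instead of exploiting.
    """
    rng = rng or np.random.default_rng()
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        # Explore: pick a uniformly random action (probability epsilon).
        return int(rng.integers(n_actions))
    # Exploit: pick the action with the highest estimated value (probability 1 - epsilon).
    return int(np.argmax(q_table[state]))
```

Between episodes, epsilon would then typically be annealed, for example multiplied by a decay factor down to some minimum value, so that early training explores and late training mostly exploits.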
Greedy strategy:
- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (only exploitation)
- Always chooses the action with the highest expected reward.
- Does not include any exploration.
- Can be disadvantageous in environments with uncertainty or unknown optimal actions, as the small bandit sketch after this list illustrates.
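To see why the lack of exploration can hurt, here is a small hypothetical two-armed bandit comparison (illustrative only; the arm means, noise level, and seed are made up for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.5, 0.7])  # hypothetical arms: arm 1 is better on average

def average_reward(epsilon, steps=1000):
    q = np.zeros(2)       # estimated value of each arm
    counts = np.zeros(2)  # number of pulls per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(2))  # explore
        else:
            action = int(np.argmax(q))     # exploit current estimates
        reward = rng.normal(true_means[action], 1.0)  # noisy reward
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]  # incremental mean
        total += reward
    return total / steps

print("greedy (epsilon=0)  :", average_reward(epsilon=0.0))
print("epsilon-greedy (0.1):", average_reward(epsilon=0.1))
```

With a purely greedy policy, one unlucky early reward can lock the agent onto the worse arm for the rest of the run, while even a small epsilon keeps re-checking the alternative; the exact numbers printed depend on the random seed.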
If you want to improve the course, you can open a Pull Request.
This glossary was made possible thanks to: