Original course link: https://huggingface.co/deep-rl-course/unit1/conclusion?fw=pt
Glossary
This is a community-created glossary. Contributions are welcomed!
Markov Property
It means that the action taken by our agent depends only on the present state and is independent of all past states and actions.
Observations/State
- State: Complete description of the state of the world.
- Observation: Partial description of the state of the environment/world.
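As intuition, here is a minimal sketch contrasting the two; the grid world and the `observe` helper are illustrative, not part of the course:

```python
import numpy as np

# Full state: the complete 5x5 grid world, with the agent's cell marked by 1.
state = np.zeros((5, 5), dtype=int)
state[2, 3] = 1  # agent position

def observe(state, agent_pos, radius=1):
    """Partial observation: only the cells within `radius` of the agent."""
    r, c = agent_pos
    return state[max(0, r - radius): r + radius + 1,
                 max(0, c - radius): c + radius + 1]

print(state.shape)                   # (5, 5) -> complete description of the world
print(observe(state, (2, 3)).shape)  # (3, 3) -> partial description around the agent
```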
Actions
- Discrete Actions: Finite number of actions, such as left, right, up, and down.
- Continuous Actions: Infinitely many possible actions; for example, in the case of self-driving cars, the driving scenario offers an infinite range of possible actions (see the sketch below).
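As a minimal sketch of how these two kinds of action space are typically declared, here using Gymnasium's space classes (assuming the `gymnasium` package is installed):

```python
from gymnasium import spaces

# Discrete actions: a finite set, e.g. 0 = left, 1 = right, 2 = up, 3 = down.
discrete_actions = spaces.Discrete(4)

# Continuous actions: e.g. steering and acceleration, each any real value
# in [-1, 1], so there are infinitely many possible actions.
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(2,))

print(discrete_actions.sample())    # e.g. 2
print(continuous_actions.sample())  # e.g. [ 0.13 -0.87]
```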
Rewards and Discounting
- Rewards: Fundamental factor in RL. Tells the agent whether the action taken is good/bad.
- RL algorithms are focused on maximizing the cumulative reward.
- Reward Hypothesis: RL problems can be formulated as a maximization of the (cumulative) return.
- Discounting is performed because rewards obtained early on are more predictable, and therefore more likely to actually be received, than long-term rewards (see the worked sketch below).
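As a worked sketch, the discounted return is G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...; the snippet below computes it for an illustrative list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative return: G = r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    g = 0.0
    # Walk backwards so each reward is discounted by its distance from the start.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```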
Tasks
- Episodic: Has a starting point and an ending point.
- Continuous: Has a starting point but no ending point.
Exploration vs. Exploitation Trade-Off
- Exploration: It’s all about exploring the environment by trying random actions and receiving feedback/returns/rewards from the environment.
- Exploitation: It’s about exploiting what we know about the environment to gain maximum rewards.
- Exploration-Exploitation Trade-Off: It balances how much we want to explore the environment and how much we want to exploit what we know about the environment.
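A common way to implement this balance is an epsilon-greedy rule: act randomly with probability epsilon (exploration), otherwise take the best-known action (exploitation). A minimal sketch, assuming the Q-value estimates for the current state are stored in an array:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # exploration: random action
    return int(np.argmax(q_values))              # exploitation: greedy action

# Illustrative Q-value estimates for three actions in the current state.
print(epsilon_greedy(np.array([0.2, 0.5, 0.1])))  # usually 1, occasionally random
```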
Policy
- Policy: It is called the agent’s brain. It tells us what action to take, given the state.
- Optimal Policy: Policy that maximizes the expected return when an agent acts according to it. It is learned through training.
Policy-based Methods:
- An approach to solving RL problems.
- In this method, the Policy is learned directly.
- The learned policy maps each state either to the best corresponding action at that state, or to a probability distribution over the set of possible actions at that state (see the sketch below).
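A minimal sketch of both flavors of policy; the states, actions, and probabilities below are illustrative placeholders:

```python
import random

# Deterministic policy: each state maps directly to one action.
deterministic_policy = {"s0": "left", "s1": "right"}

def stochastic_policy(state):
    """Stochastic policy: a probability distribution over actions for each state."""
    action_probs = {"s0": {"left": 0.9, "right": 0.1},
                    "s1": {"left": 0.2, "right": 0.8}}[state]
    actions, weights = zip(*action_probs.items())
    return random.choices(actions, weights=weights)[0]

print(deterministic_policy["s0"])  # 'left'
print(stochastic_policy("s1"))     # 'right' about 80% of the time
```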
Value-based Methods:
- Another approach to solving RL problems.
- Here, instead of training a policy, we train a value function that maps each state to the expected value of being in that state.
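A minimal sketch: once a value function has been learned, the agent can act greedily by choosing the action that leads to the highest-value next state; the value estimates and transition table below are illustrative placeholders:

```python
# Illustrative learned state-value estimates V(s).
state_values = {"s0": 0.0, "s1": 1.0, "s2": 5.0}

# Illustrative deterministic transitions: next_state = transitions[state][action].
transitions = {"s0": {"left": "s1", "right": "s2"}}

def greedy_action(state):
    """Pick the action whose successor state has the highest estimated value."""
    return max(transitions[state], key=lambda a: state_values[transitions[state][a]])

print(greedy_action("s0"))  # 'right', because V(s2) > V(s1)
```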
Contributions are welcomed 🤗
If you want to improve the course, you can open a Pull Request.
This glossary was made possible thanks to: