F5-Unit_3-Deep_Q_Learning_with_Atari_Games-E4-Glossary
Original course link: https://huggingface.co/deep-rl-course/unit6/conclusion?fw=pt
Glossary
This is a community-created glossary. Contributions are welcomed!
Tabular Method: a type of problem in which the state and action spaces are small enough for the value functions to be represented as arrays and tables. Q-Learning is an example of a tabular method, since a table is used to represent the values of the different state-action pairs.
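For illustration, here is a minimal NumPy sketch of a single tabular Q-Learning update; the table size, the learning rate alpha, and the discount factor gamma are arbitrary placeholder values, not values from the course.

```python
import numpy as np

# Hypothetical problem small enough that every state-action value fits in one table.
n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))

alpha, gamma = 0.1, 0.99  # illustrative learning rate and discount factor

def q_learning_update(state, action, reward, next_state, done):
    """One tabular Q-Learning step: move Q(s, a) toward the TD target."""
    td_target = reward + gamma * np.max(q_table[next_state]) * (not done)
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```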
Deep Q-Learning: a method that trains a neural network to approximate, given a state, the different Q-values for each possible action at that state. It is used to solve problems where the observation space is too big to apply a tabular Q-Learning approach.
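A minimal PyTorch sketch of the idea, assuming a flat observation vector; the `QNetwork` name and the layer sizes are illustrative and not taken from the course's implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state to one Q-value per possible action, replacing the lookup table."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection: pick the action with the highest predicted Q-value.
q_net = QNetwork(obs_dim=8, n_actions=4)
state = torch.randn(1, 8)            # placeholder observation
action = q_net(state).argmax(dim=1)  # tensor of shape (1,)
```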
Temporal Limitation: a difficulty that arises when the environment state is represented by frames. A frame by itself does not provide temporal information. In order to obtain temporal information, we need to stack a number of frames together.
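A small sketch of frame stacking with a deque; the 84x84 grayscale frames and the stack of 4 are common Atari preprocessing choices used here only as assumptions.

```python
from collections import deque
import numpy as np

NUM_FRAMES = 4  # number of consecutive frames that make up one state
frames = deque(maxlen=NUM_FRAMES)

def stack_frame(new_frame: np.ndarray) -> np.ndarray:
    """Append the newest frame and return the stacked state of shape (NUM_FRAMES, H, W)."""
    if len(frames) == 0:
        # At the start of an episode, fill the stack with copies of the first frame.
        for _ in range(NUM_FRAMES):
            frames.append(new_frame)
    else:
        frames.append(new_frame)
    return np.stack(frames, axis=0)

# Example with a placeholder 84x84 grayscale frame:
state = stack_frame(np.zeros((84, 84), dtype=np.uint8))
print(state.shape)  # (4, 84, 84)
```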
Phases of Deep Q-Learning:
- Sampling: actions are performed, and observed experience tuples are stored in a replay memory.
- Training: batches of tuples are selected randomly and the neural network updates its weights using gradient descent. (Both phases are sketched in the loop below.)
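A compressed, self-contained sketch of how the two phases alternate, assuming the simplest possible setup: the environment is replaced by a random stub (`fake_env_step`), and the network sizes, `epsilon`, and the other hyperparameters are placeholders rather than the course's actual configuration.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# --- Placeholder components (illustrative sizes, not the course's setup) ---
obs_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_memory = deque(maxlen=10_000)
gamma, batch_size, epsilon = 0.99, 32, 0.1

def fake_env_step(action):
    """Stand-in for env.step(action): returns (next_state, reward, done)."""
    return torch.randn(obs_dim), random.random(), random.random() < 0.05

state = torch.randn(obs_dim)
for step in range(1_000):
    # --- Sampling phase: act, observe, store the experience tuple ---
    if random.random() < epsilon:
        action = random.randrange(n_actions)        # explore
    else:
        with torch.no_grad():
            action = q_net(state).argmax().item()   # exploit
    next_state, reward, done = fake_env_step(action)
    replay_memory.append((state, action, reward, next_state, done))
    state = torch.randn(obs_dim) if done else next_state  # "reset" on episode end

    # --- Training phase: sample a random batch, do one gradient descent step ---
    if len(replay_memory) >= batch_size:
        batch = random.sample(replay_memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        states, next_states = torch.stack(states), torch.stack(next_states)
        actions = torch.tensor(actions)
        rewards = torch.tensor(rewards, dtype=torch.float32)
        dones = torch.tensor(dones, dtype=torch.float32)

        q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            targets = rewards + gamma * q_net(next_states).max(dim=1).values * (1 - dones)
        loss = nn.functional.mse_loss(q_values, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```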
Solutions to stabilize Deep Q-Learning:
- Experience Replay: a replay memory is created to save experience samples that can be reused during training. This allows the agent to learn from the same experiences multiple times. It also keeps the agent from forgetting previous experiences as it gathers new ones. Random sampling from the replay buffer removes correlation in the observation sequences and prevents the action values from oscillating or diverging catastrophically. (A minimal replay buffer is sketched after this list.)
- Fixed Q-Target: in order to calculate the Q-Target we need to estimate the discounted optimal Q-value of the next state by using the Bellman equation. The problem is that the same network weights are used to calculate the Q-Target and the Q-value. This means that every time we modify the Q-value, the Q-Target also moves with it. To avoid this issue, a separate network with fixed parameters is used for estimating the Temporal Difference target. The target network is updated by copying the parameters from our Deep Q-Network every C steps.
- Double DQN: a method to handle overestimation of Q-values. This solution uses two networks to decouple the action selection from the target Q-value generation (both tricks are sketched after this list):
  - the DQN network selects the best action to take for the next state (the action with the highest Q-value);
  - the Target network calculates the target Q-value of taking that action at the next state.
  This approach reduces the overestimation of Q-values, helps the agent train faster, and gives more stable learning.
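A minimal sketch of a standalone replay memory with uniform random sampling; the `ReplayBuffer` name and its capacity are illustrative and not taken from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of experience tuples with uniform random sampling."""

    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```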
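And a rough sketch, under the same assumptions, of how Fixed Q-Target and Double DQN change the target computation; `q_net`, `target_net`, the sync interval `C`, and the batch tensors are all placeholders.

```python
import copy

import torch
import torch.nn as nn

# Placeholder online network and a frozen copy used as the fixed Q-Target network.
n_actions = 2
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)
target_net.requires_grad_(False)

gamma = 0.99
C = 1_000  # copy the online weights into the target network every C training steps

def double_dqn_targets(rewards, next_states, dones):
    """TD targets using a fixed target network plus Double DQN action selection."""
    with torch.no_grad():
        # Double DQN: the online network *selects* the best next action...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ...and the target network *evaluates* that action.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * next_q * (1 - dones)

def maybe_sync_target(step):
    """Fixed Q-Target: refresh the frozen network only every C steps."""
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())

# Example call with placeholder batch tensors (batch of 32, observation size 4):
targets = double_dqn_targets(
    rewards=torch.zeros(32),
    next_states=torch.randn(32, 4),
    dones=torch.zeros(32),
)
```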
If you want to improve the course, you can open a Pull Request.
This glossary was made possible thanks to: