E4-Unit_2-Introduction_to_Q_Learning-Mid-way_Recap
Original course link: https://huggingface.co/deep-rl-course/unit5/pyramids?fw=pt
Mid-way Recap
Before diving into Q-Learning, let’s summarize what we just learned.
We have two types of value-based functions:
- State-value function: outputs the expected return if the agent starts at a given state and acts according to the policy forever after.
- Action-value function: outputs the expected return if the agent starts in a given state, takes a given action at that state, and then acts according to the policy forever after.
- In value-based methods, rather than learning the policy, we define the policy by hand and learn a value function. If we have an optimal value function, we will have an optimal policy (a small tabular sketch follows this list).
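To make the distinction concrete, here is a minimal tabular sketch in Python. The two-state, two-action environment, the values stored in `Q`, and the `greedy_policy` helper are all made up for illustration; they are not part of the course code.

```python
from collections import defaultdict

# Hypothetical two-state, two-action environment, tabular representation:
# V maps a state to the expected return when following the policy from it,
# Q maps a (state, action) pair to the expected return when taking that
# action first and following the policy afterwards.
V = defaultdict(float)   # state-value function V(s)
Q = defaultdict(float)   # action-value function Q(s, a)

actions = ["left", "right"]

# Made-up Q-values, for illustration only.
Q[("s0", "left")], Q[("s0", "right")] = 1.0, 2.5
Q[("s1", "left")], Q[("s1", "right")] = 0.5, 0.0

def greedy_policy(state):
    """The hand-defined policy: act greedily with respect to Q."""
    return max(actions, key=lambda a: Q[(state, a)])

# Under this greedy policy, the state value is the best achievable Q-value.
for s in ["s0", "s1"]:
    V[s] = max(Q[(s, a)] for a in actions)

print(greedy_policy("s0"))  # right
print(greedy_policy("s1"))  # left
```

The point of the sketch is that nothing about the policy itself is learned: once the Q-values are good, acting greedily on them already gives us the behavior we want.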
There are two types of methods to learn a value function:
- With the Monte Carlo method, we update the value function from a complete episode, so we use the actual, accurate discounted return of this episode.
- With the TD Learning method, we update the value function from a single step, so we replace \\(G_t\\), which we don't have yet, with an estimated return called the TD target (sketched in code after this list).
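As a rough illustration of the difference, here is a minimal Python sketch of the two update rules for a state-value function `V`. The learning rate `alpha`, the discount `gamma`, and the `mc_update`/`td_update` helpers are hypothetical names chosen for this example, not course code.

```python
# Hypothetical hyperparameters for the sketch.
alpha, gamma = 0.1, 0.99

def mc_update(V, episode):
    """Monte Carlo: wait until the episode is over, then use the actual
    discounted return G_t as the update target for each visited state."""
    G = 0.0
    # episode is a list of (state, reward) pairs, where reward is received
    # after leaving that state; walking the list backwards accumulates the
    # discounted return from each state onward.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] = V.get(state, 0.0) + alpha * (G - V.get(state, 0.0))

def td_update(V, state, reward, next_state):
    """TD(0): update after a single step, replacing G_t (not available yet)
    with the TD target r + gamma * V(next_state)."""
    td_target = reward + gamma * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * (td_target - V.get(state, 0.0))

# Usage: V starts empty and is filled in as updates arrive.
V = {}
td_update(V, "s0", reward=1.0, next_state="s1")
mc_update(V, [("s0", 1.0), ("s1", 0.0), ("s2", 2.0)])
```

Both rules nudge the current estimate toward a target; the only difference is whether that target is the true return of a finished episode or a bootstrapped one-step estimate.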
