N13-Bonus_Unit_3-Advanced_Topics_in_Reinforcement_Learning-B1-Model_Based_Reinforcement_Learning

Original course link: https://huggingface.co/deep-rl-course/unit2/quiz2?fw=pt

Model Based Reinforcement Learning (MBRL)

Model-based reinforcement learning only differs from its model-free counterpart in learning a dynamics model, but that has substantial downstream effects on how the decisions are made.

The dynamics models usually model the environment transition dynamics, $s_{t+1} = f_\theta(s_t, a_t)$, but things like inverse dynamics models (mapping from states to actions) or reward models (predicting rewards) can be used in this framework.

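As a concrete illustration of what such a forward model can look like, the sketch below is a minimal PyTorch regression network mapping $(s_t, a_t)$ to $s_{t+1}$. The network size and the choice to predict the state delta rather than the raw next state are illustrative assumptions, not something prescribed by the course.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Forward dynamics model f_theta: predicts s_{t+1} from (s_t, a_t)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Predicting the change in state and adding it back to s_t is a common
        # trick that often trains more stably than predicting s_{t+1} directly.
        delta = self.net(torch.cat([state, action], dim=-1))
        return state + delta
```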

Simple definition

  • There is an agent that repeatedly tries to solve a problem, accumulating state and action data.
  • With that data, the agent creates a structured learning tool, a dynamics model, to reason about the world.
  • With the dynamics model, the agent decides how to act by predicting the future.
  • With those actions, the agent collects more data, improves said model, and hopefully improves future actions (see the sketch after this list).
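A minimal skeleton of that loop, with the data-collection, model-fitting, and planning steps left as hypothetical callables (none of these names come from the course):

```python
from typing import Callable, List, Tuple

# One transition: (state, action, next_state, reward)
Transition = Tuple[list, list, list, float]

def mbrl_loop(
    collect_rollout: Callable,     # acts in the environment with a policy, returns transitions
    fit_dynamics_model: Callable,  # fits s_{t+1} = f_theta(s_t, a_t) on the dataset
    plan_with_model: Callable,     # turns the learned model into an acting policy
    n_iterations: int = 10,
) -> None:
    """Skeleton of the MBRL loop described above; every helper is a hypothetical stand-in."""
    dataset: List[Transition] = []
    policy = None  # e.g. a random policy for the first round of data collection

    for _ in range(n_iterations):
        # 1. Repeatedly try to solve the problem, accumulating state and action data.
        dataset += collect_rollout(policy)
        # 2. Use that data to fit the dynamics model, the agent's tool for reasoning about the world.
        model = fit_dynamics_model(dataset)
        # 3. Decide how to act by predicting the future with the model;
        #    acting with this policy then collects more data for the next iteration.
        policy = plan_with_model(model)
```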

Academic definition

Model-based reinforcement learning (MBRL) follows the framework of an agent interacting in an environment, learning a model of said environment, and then leveraging the model for control (making decisions).

Specifically, the agent acts in a Markov Decision Process (MDP) governed by a transition function $s_{t+1} = f(s_t, a_t)$ and returns a reward at each step, $r(s_t, a_t)$. With a collected dataset $D := \{ s_i, a_i, s_{i+1}, r_i \}$, the agent learns a model, $s_{t+1} = f_\theta(s_t, a_t)$, to minimize the negative log-likelihood of the transitions.

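One common way to realize "minimize the negative log-likelihood of the transitions" is to have the model output the mean and log-variance of a Gaussian over $s_{t+1}$ and fit it on minibatches drawn from $D$. The sketch below assumes such a probabilistic model interface, `model(s, a) -> (mean, log_var)`; a deterministic model is the special case where this reduces to a mean-squared-error loss. Batch size and step count are illustrative.

```python
import torch

def gaussian_nll(mean, log_var, target):
    # Negative log-likelihood of `target` under N(mean, exp(log_var)), up to an additive constant.
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()

def fit_dynamics_model(model, optimizer, states, actions, next_states,
                       batch_size=256, n_steps=1000):
    """Fit f_theta by minimizing the NLL of the observed transitions (s_i, a_i, s_{i+1})."""
    n = states.shape[0]
    for _ in range(n_steps):
        idx = torch.randint(0, n, (batch_size,))
        mean, log_var = model(states[idx], actions[idx])   # predicted Gaussian over s_{i+1}
        loss = gaussian_nll(mean, log_var, next_states[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```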

We employ sample-based model-predictive control (MPC) using the learned dynamics model, which optimizes the expected reward over a finite, recursively predicted horizon, $\tau$, from a set of actions sampled from a uniform distribution $U(a)$ (see paper or paper or paper).

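Sample-based MPC of this kind is often implemented as random shooting: sample many candidate action sequences from $U(a)$, roll each one forward through the learned model for $\tau$ steps while accumulating predicted reward, execute only the first action of the best sequence, and replan at the next step. Below is a minimal NumPy sketch under those assumptions; the `model` and `reward_fn` callables are hypothetical stand-ins for the learned dynamics and reward.

```python
import numpy as np

def random_shooting_mpc(state, model, reward_fn, action_low, action_high,
                        horizon=15, n_candidates=1000):
    """Choose an action by sampling action sequences from U(a), rolling them out
    with the learned dynamics model, and keeping the first action of the best plan."""
    action_dim = len(action_low)
    # Candidate action sequences sampled from the uniform distribution U(a):
    # shape [n_candidates, horizon, action_dim].
    plans = np.random.uniform(action_low, action_high,
                              size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    states = np.repeat(state[None, :], n_candidates, axis=0)
    for t in range(horizon):
        actions = plans[:, t, :]
        returns += reward_fn(states, actions)  # predicted reward r(s_t, a_t)
        states = model(states, actions)        # predicted next states f_theta(s_t, a_t)
    # MPC executes only the first action of the highest-return sequence, then replans.
    return plans[np.argmax(returns), 0, :]
```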

Further reading

For more information on MBRL, we recommend you check out the following resources:

  • A blog post on debugging MBRL
  • A recent review paper on MBRL

Author

This section was written by Nathan Lambert
