I8-Unit_5-Introduction_to_Unity_ML_Agents-C2-The_SnowballTarget_environment

Original course link: https://huggingface.co/deep-rl-course/unitbonus3/introduction?fw=pt

The SnowballTarget Environment

SnowballTarget
SnowballTarget is an environment we created at Hugging Face using assets from Kay Lousberg. We have an optional section at the end of this Unit if you want to learn to use Unity and create your own environments.

The agent’s Goal

The first agent you’re going to train is called Julien the bear 🐻. Julien is trained to hit targets with snowballs.

The Goal in this environment is for Julien to hit as many targets as possible in the limited time (1000 timesteps). To do that, it needs to position itself correctly relative to the target and shoot.

In addition, to avoid “snowball spamming” (aka shooting a snowball every timestep), Julien has a “cool off” system: it needs to wait 0.5 seconds after each shot before it can shoot again.

Cool Off System
The agent needs to wait 0.5s before being able to shoot a snowball again

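As an illustration, here is a minimal Python sketch of how such a cool-off gate can work. This is not the actual SnowballTarget code (the environment is written in C# with Unity ML-Agents); the fixed step duration is an assumption made only for this example.

```python
# Minimal sketch of a 0.5-second cool-off gate, assuming a hypothetical fixed step duration.
COOL_OFF_SECONDS = 0.5
SECONDS_PER_STEP = 0.02  # assumed decision interval, not taken from the course


class CoolOffGate:
    def __init__(self):
        self.steps_until_ready = 0

    def try_shoot(self, wants_to_shoot: bool) -> bool:
        """Return True only if a snowball is actually fired this step."""
        if self.steps_until_ready > 0:
            self.steps_until_ready -= 1
            return False
        if wants_to_shoot:
            # Start the 0.5 s cool-off after a successful shot.
            self.steps_until_ready = int(COOL_OFF_SECONDS / SECONDS_PER_STEP)
            return True
        return False
```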

The reward function and the reward engineering problem

The reward function is simple. The environment gives a +1 reward every time the agent’s snowball hits a target. Because the agent’s Goal is to maximize the expected cumulative reward, it will try to hit as many targets as possible.

Reward system
We could have a more complex reward function (with a penalty to push the agent to go faster, for example). But when you design an environment, you need to avoid the reward engineering problem: making the reward function so complex that it forces your agent to behave exactly as you want it to.
Why? Because by doing that, you might miss interesting strategies that the agent will find with a simpler reward function.

In terms of code, it looks like this:

Reward

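The course shows this code as an image. As a stand-in, here is a hypothetical Python-style sketch of the same +1-per-hit logic; the real environment is written in C# with Unity ML-Agents, so the names below are illustrative only.

```python
# Hypothetical sketch of the reward logic, not the actual SnowballTarget C# code.
def on_snowball_collision(agent, hit_object) -> None:
    if hit_object.tag == "target":
        agent.add_reward(1.0)  # +1 every time a snowball hits a target
```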

The observation space

Regarding observations, we don’t use normal vision (frames); instead, we use raycasts.

Think of raycasts as lasers that will detect if they pass through an object.

Raycasts
Source: ML-Agents documentation
In this environment, our agent has multiple sets of raycasts:

Raycasts
In addition to raycasts, the agent gets a “can I shoot” bool as an observation.

Obs

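To make the layout concrete, here is a hypothetical sketch of how such an observation could be assembled: the ray sensor readings concatenated with a single “can I shoot” flag. The number of rays and their encoding are assumptions made for illustration, not values taken from the course.

```python
# Hypothetical observation layout: ray readings plus a "can I shoot" flag.
# The number of rays (20) and their encoding are assumptions for illustration.
import numpy as np


def build_observation(ray_readings: np.ndarray, can_shoot: bool) -> np.ndarray:
    shoot_flag = np.array([1.0 if can_shoot else 0.0], dtype=np.float32)
    return np.concatenate([ray_readings.astype(np.float32), shoot_flag])


obs = build_observation(np.zeros(20, dtype=np.float32), can_shoot=True)
print(obs.shape)  # (21,) with the assumed 20 ray readings
```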

The action space

The action space is discrete:

Action Space

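The course shows the exact branches as an image. As an illustration, here is a hypothetical multi-branch discrete action space expressed with Gymnasium; the branch meanings and sizes are assumptions, not the confirmed SnowballTarget layout.

```python
# Hypothetical sketch of a discrete, multi-branch action space using Gymnasium.
# Branch meanings and sizes are assumptions, not the confirmed SnowballTarget layout.
from gymnasium import spaces

action_space = spaces.MultiDiscrete([
    3,  # e.g. move: no-op / forward / backward
    3,  # e.g. rotate: no-op / left / right
    2,  # e.g. shoot: no / yes
])

print(action_space.sample())  # one random choice per branch, e.g. [1 2 0]
```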