1-Transformer_models-3-How_do_Transformers_work

Original course link: https://huggingface.co/course/chapter1/4?fw=pt

How do Transformers work?

In this section, we will take a high-level look at the architecture of Transformer models.

A bit of Transformer history

Here are some reference points in the (short) history of Transformer models:

A brief chronology of Transformers models.
The Transformer architecture was introduced in June 2017. The focus of the original research was on translation tasks. This was followed by the introduction of several influential models, including:

  • June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results
  • October 2018: BERT, another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!)
  • February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
  • October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance
  • October 2019: BART and T5, two large pretrained models using the same architecture as the original Transformer model (the first to do so)
  • May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)

This list is far from comprehensive, and is just meant to highlight a few of the different kinds of Transformer models. Broadly, they can be grouped into three categories:

  • GPT-like (also called auto-regressive Transformer models)
  • BERT-like (also called auto-encoding Transformer models)
  • BART/T5-like (also called sequence-to-sequence Transformer models)

We will dive into these families in more depth later on.

Transformers are language models

All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.

An example of a task is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.

Example of causal language modeling in which the next word from a sentence is predicted.
Another example is masked language modeling, in which the model predicts a masked word in the sentence.

Example of masked language modeling in which a masked word from a sentence is predicted.
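
To make these two objectives concrete, here is a minimal sketch using the pipeline function from 🤗 Transformers (the gpt2 and bert-base-uncased checkpoints are common public examples chosen purely for illustration, not models prescribed by this section):

```python
from transformers import pipeline

# Causal language modeling: continue a sentence by predicting the next words.
generator = pipeline("text-generation", model="gpt2")
print(generator("This course will teach you", max_new_tokens=10))

# Masked language modeling: predict a hidden word anywhere in the sentence.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("This course will teach you all about [MASK] models."))
```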

Transformers are big models

Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is by increasing the models’ sizes as well as the amount of data they are pretrained on.

Number of parameters of recent Transformers models
Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources. It even translates to environmental impact, as can be seen in the following graph.

The carbon footprint of a large language model.

And this is showing a project for a (very big) model led by a team consciously trying to reduce the environmental impact of pretraining. The footprint of running lots of trials to get the best hyperparameters would be even higher.

Imagine if each time a research team, a student organization, or a company wanted to train a model, it did so from scratch. This would lead to huge, unnecessary global costs!

This is why sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community.

By the way, you can evaluate the carbon footprint of your models’ training with several tools, for example ML CO2 Impact or Code Carbon, which is integrated into 🤗 Transformers. To learn more about this, you can read this blog post, which shows how to generate an emissions.csv file with an estimate of the footprint of your training, as well as the 🤗 Transformers documentation addressing this topic.
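
As a minimal sketch of the Code Carbon approach (the training loop below is only a placeholder; by default the tracker writes its estimate to an emissions.csv file):

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # writes emissions.csv when stopped
tracker.start()
# ... your training loop would run here ...
emissions = tracker.stop()    # estimated emissions in kg of CO2-equivalent
print(f"Estimated emissions: {emissions} kg CO2eq")
```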

Transfer Learning

Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.

The pretraining of a language model is costly in both time and money.
This pretraining is usually done on very large amounts of data. Therefore, it requires a very large corpus of data, and training can take up to several weeks.

Fine-tuning, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. Wait — why not simply train directly for the final task? There are a couple of reasons:

  • The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).
  • Since the pretrained model was already trained on lots of data, the fine-tuning requires way less data to get decent results.
  • For the same reason, the amount of time and resources needed to get good results are much lower.

For example, one could leverage a pretrained model trained on the English language and then fine-tune it on an arXiv corpus, resulting in a science/research-based model. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is “transferred,” hence the term transfer learning.
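
As a hedged sketch of that starting point (the bert-base-cased checkpoint and the two-label setup are illustrative assumptions, not part of the course text):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-cased"  # a pretrained checkpoint chosen as an example
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The pretrained body is reused as-is; only the small classification head on top
# starts from random weights, and the whole model is then fine-tuned on task data.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```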

The fine-tuning of a language model is cheaper than pretraining in both time and money.
Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.

This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model — one as close as possible to the task you have at hand — and fine-tune it.

General architecture

In this section, we’ll go over the general architecture of the Transformer model. Don’t worry if you don’t understand some of the concepts; there are detailed sections later covering each of the components.

Introduction

The model is primarily composed of two blocks:

  • Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  • Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

Architecture of a Transformer model.
Each of these parts can be used independently, depending on the task:

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
  • Decoder-only models: Good for generative tasks such as text generation.
  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.

We will dive into those architectures independently in later sections.
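
As a rough sketch of how the three families map onto 🤗 Transformers auto classes (the checkpoint names are well-known public examples used purely for illustration):

```python
from transformers import AutoModelForMaskedLM, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")   # BERT-like
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")                # GPT-like
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")        # BART/T5-like
```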

Attention layers

A key feature of Transformer models is that they are built with special layers called attention layers. In fact, the title of the paper introducing the Transformer architecture was “Attention Is All You Need”! We will explore the details of attention layers later in the course; for now, all you need to know is that this layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.

To put this into context, consider the task of translating text from English to French. Given the input “You like this course”, a translation model will need to also attend to the adjacent word “You” to get the proper translation for the word “like”, because in French the verb “like” is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word. In the same vein, when translating “this” the model will also need to pay attention to the word “course”, because “this” translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of “this”. With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.

The same concept applies to any task associated with natural language: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.
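
The exact computation is covered later in the course, but purely as a preview, here is a toy sketch of the core idea that each word's new representation is a weighted mix of all the words in the sentence (the shapes and random values are illustrative, not how the course builds the model):

```python
import math
import torch

seq_len, dim = 4, 8                         # e.g. the four words "You like this course"
queries = torch.randn(seq_len, dim)
keys = torch.randn(seq_len, dim)
values = torch.randn(seq_len, dim)

scores = queries @ keys.T / math.sqrt(dim)  # how strongly each word attends to every other word
weights = torch.softmax(scores, dim=-1)     # each row sums to 1
output = weights @ values                   # a context-aware representation of each word
```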

Now that you have an idea of what attention layers are all about, let’s take a closer look at the Transformer architecture.

The original architecture

The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can be dependent on what is after as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.
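
As a minimal sketch (not code from the course), this "no peeking at future words" rule is often expressed as a lower-triangular mask, where row i marks which positions the decoder may attend to while processing position i:

```python
import torch

seq_len = 4
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```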

The original Transformer architecture looked like this, with the encoder on the left and the decoder on the right:

Architecture of a Transformer model.
Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.

The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.
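
For example, here is a hedged sketch of what a tokenizer produces when batching two sentences of different lengths (the bert-base-uncased checkpoint is an illustrative choice):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["I love this course", "Thanks"],
    padding=True,               # pad the shorter sentence to the same length
)
print(batch["attention_mask"])  # 1 = real token, 0 = padding the model should ignore
```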

Architectures vs. checkpoints

As we dive into Transformer models in this course, you’ll see mentions of architectures and checkpoints as well as models. These terms all have slightly different meanings:

  • Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
  • Checkpoints: These are the weights that will be loaded in a given architecture.
  • Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.

For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”
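
A small sketch of the distinction in code (using the BERT classes and the checkpoint mentioned above):

```python
from transformers import BertConfig, BertModel

# Architecture only: the BERT skeleton with randomly initialized weights.
config = BertConfig()
model_from_scratch = BertModel(config)

# Architecture + checkpoint: the same skeleton loaded with the bert-base-cased weights.
pretrained_model = BertModel.from_pretrained("bert-base-cased")
```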
