3-Fine-tuning_a_pretrained_model-1-Processing_the_data

Original course link: https://huggingface.co/course/chapter3/2?fw=pt

Processing the data

Continuing with the example from the [previous chapter], here is how we would train a sequence classifier on one batch in PyTorch:

import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Of course, just training the model on two sentences is not going to yield very good results. To get better results, you will need to prepare a bigger dataset.

In this section we will use as an example the MRPC (Microsoft Research Paraphrase Corpus) dataset, introduced in a paper by William B. Dolan and Chris Brockett. The dataset consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing). We’ve selected it for this chapter because it’s a small dataset, so it’s easy to experiment with training on it.

Loading a dataset from the Hub

The Hub doesn’t just contain models; it also has multiple datasets in lots of different languages. You can browse the datasets on the Hub, and we recommend you try to load and process a new dataset once you have gone through this section (see the 🤗 Datasets documentation). But for now, let’s focus on the MRPC dataset! This is one of the 10 datasets composing the GLUE benchmark, which is an academic benchmark that is used to measure the performance of ML models across 10 different text classification tasks.

The 🤗 Datasets library provides a very simple command to download and cache a dataset on the Hub. We can download the MRPC dataset like this:

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

As you can see, we get a DatasetDict object which contains the training set, the validation set, and the test set. Each of those contains several columns (sentence1, sentence2, label, and idx) and a variable number of rows, which are the number of elements in each set (so, there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set).

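Since a DatasetDict behaves like a regular Python dictionary of splits, you can also inspect it programmatically. Here is a quick sketch that prints the size and column names of each split we just loaded:

for split_name, split in raw_datasets.items():
    print(split_name, split.num_rows, split.column_names)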

This command downloads and caches the dataset, by default in ~/.cache/huggingface/datasets. Recall from Chapter 2 that you can customize your cache folder by setting the HF_HOME environment variable.

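For example, a minimal sketch of redirecting the cache from Python (the path below is just a placeholder, and the variable needs to be set before the 🤗 libraries download anything):

import os

os.environ["HF_HOME"] = "/path/to/your/cache"  # placeholder path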

We can access each pair of sentences in our raw_datasets object by indexing, like with a dictionary:

raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]
{'idx': 0,
'label': 1,
'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

We can see the labels are already integers, so we won’t have to do any preprocessing there. To know which integer corresponds to which label, we can inspect the features of our raw_train_dataset. This will tell us the type of each column:

raw_train_dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'idx': Value(dtype='int32', id=None)}

Behind the scenes, label is of type ClassLabel, and the mapping of integers to label names is stored in its names attribute: 0 corresponds to not_equivalent, and 1 corresponds to equivalent.

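If you prefer to look this mapping up in code, the ClassLabel feature provides helper methods for converting between integers and label names. A quick sketch using the feature we just inspected:

label_feature = raw_train_dataset.features["label"]
print(label_feature.names)                       # ['not_equivalent', 'equivalent']
print(label_feature.int2str(1))                  # 'equivalent'
print(label_feature.str2int("not_equivalent"))   # 0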

✏️ Try it out! Look at element 15 of the training set and element 87 of the validation set. What are their labels?

Preprocessing a dataset

To preprocess the dataset, we need to convert the text to numbers the model can make sense of. As you saw in the [previous chapter], this is done with a tokenizer. We can feed the tokenizer one sentence or a list of sentences, so we can directly tokenize all the first sentences and all the second sentences of each pair like this:

from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

However, we can’t just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not. We need to handle the two sequences as a pair, and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects:

inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs
{
  'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}

We discussed the input_ids and attention_mask keys in [Chapter 2], but we put off talking about token_type_ids. In this example, this is what tells the model which part of the input is the first sentence and which is the second sentence.

✏️ Try it out! Take element 15 of the training set and tokenize the two sentences separately and as a pair. What’s the difference between the two results?

If we decode the IDs inside input_ids back to words:

tokenizer.convert_ids_to_tokens(inputs["input_ids"])

we will get:

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']

So we see the model expects the inputs to be of the form [CLS] sentence1 [SEP] sentence2 [SEP] when there are two sentences. Aligning this with the token_type_ids gives us:

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
[ 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

As you can see, the parts of the input corresponding to [CLS] sentence1 [SEP] all have a token type ID of 0, while the other parts, corresponding to sentence2 [SEP], all have a token type ID of 1.

Note that if you select a different checkpoint, you won’t necessarily have the token_type_ids in your tokenized inputs (for instance, they’re not returned if you use a DistilBERT model). They are only returned when the model will know what to do with them, because it has seen them during its pretraining.

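For instance, here is a quick sketch that runs the same pair of sentences through a DistilBERT tokenizer (using the standard distilbert-base-uncased checkpoint); note that token_type_ids is missing from the output:

from transformers import AutoTokenizer

distil_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
distil_inputs = distil_tokenizer("This is the first sentence.", "This is the second one.")
print(distil_inputs.keys())  # dict_keys(['input_ids', 'attention_mask'])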

Here, BERT is pretrained with token type IDs, and on top of the masked language modeling objective we talked about in [Chapter 1], it has an additional objective called next sentence prediction. The goal with this task is to model the relationship between pairs of sentences.

With next sentence prediction, the model is provided pairs of sentences (with randomly masked tokens) and asked to predict whether the second sentence follows the first. To make the task non-trivial, half of the time the sentences follow each other in the original document they were extracted from, and the other half of the time the two sentences come from two different documents.

In general, you don’t need to worry about whether or not there are token_type_ids in your tokenized inputs: as long as you use the same checkpoint for the tokenizer and the model, everything will be fine as the tokenizer knows what to provide to its model.

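In other words, a pattern like the following sketch works whether or not the checkpoint uses token type IDs, because the tokenizer only returns the keys its model expects (here we reuse the tokenizer and model loaded at the start of this section):

batch = tokenizer("This is the first sentence.", "This is the second one.", return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)  # torch.Size([1, 2]) with this checkpoint's default two-label head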

Now that we have seen how our tokenizer can deal with one pair of sentences, we can use it to tokenize our whole dataset: like in the [previous chapter], we can feed the tokenizer a list of pairs of sentences by giving it the list of first sentences, then the list of second sentences. This is also compatible with the padding and truncation options we saw in [Chapter 2]. So, one way to preprocess the training dataset is:

tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

This works well, but it has the disadvantage of returning a dictionary (with our keys, input_ids, attention_mask, and token_type_ids, and values that are lists of lists). It will also only work if you have enough RAM to store your whole dataset during the tokenization (whereas the datasets from the 🤗 Datasets library are Apache Arrow files stored on the disk, so you only keep the samples you ask for loaded in memory).

To keep the data as a dataset, we will use the Dataset.map() method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The map() method works by applying a function on each element of the dataset, so let’s define a function that tokenizes our inputs:

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys input_ids, attention_mask, and token_type_ids. Note that it also works if the example dictionary contains several samples (each key as a list of sentences) since the tokenizer works on lists of pairs of sentences, as seen before. This will allow us to use the option batched=True in our call to map(), which will greatly speed up the tokenization. The tokenizer is backed by a tokenizer written in Rust from the 🤗 Tokenizers library. This tokenizer can be very fast, but only if we give it lots of inputs at once.

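As a quick sketch, you can check this yourself by calling the function on a single example and on a slice of the dataset; both work because the tokenizer accepts either strings or lists of strings:

tokenize_function(raw_datasets["train"][0])   # one pair of sentences
tokenize_function(raw_datasets["train"][:5])  # a batch of five pairs at once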

Note that we’ve left the padding argument out in our tokenization function for now. This is because padding all the samples to the maximum length is not efficient: it’s better to pad the samples when we’re building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!

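To see the difference, here is a small sketch contrasting the two strategies on the two sequences from the beginning of this section (with padding="max_length", the tokenizer pads to the model's maximum length, 512 for bert-base-uncased):

padded_to_max = tokenizer(sequences, padding="max_length", truncation=True)
print(len(padded_to_max["input_ids"][0]))           # 512
unpadded = tokenizer(sequences, truncation=True)
print([len(ids) for ids in unpadded["input_ids"]])  # the true lengths, much shorter than 512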

Here is how we apply the tokenization function on all our datasets at once. We’re using batched=True in our call to map so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing.

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

The way the 🤗 Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function:

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})

You can even use multiprocessing when applying your preprocessing function with map() by passing along a num_proc argument. We didn’t do this here because the 🤗 Tokenizers library already uses multiple threads to tokenize our samples faster, but if you are not using a fast tokenizer backed by this library, this could speed up your preprocessing.

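For reference, enabling multiprocessing is a single extra argument. A sketch (the value 4 is arbitrary, and as noted above this mostly helps with slow tokenizers):

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)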

Our tokenize_function returns a dictionary with the keys input_ids, attention_mask, and token_type_ids, so those three fields are added to all splits of our dataset. Note that we could also have changed existing fields if our preprocessing function returned a new value for an existing key in the dataset to which we applied map().

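For instance, here is a hypothetical preprocessing function that would overwrite the existing sentence1 column rather than add a new one, simply because it returns a value for a key that already exists in the dataset:

def lowercase_sentence1(example):
    return {"sentence1": example["sentence1"].lower()}

lowercased_datasets = raw_datasets.map(lowercase_sentence1)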

The last thing we will need to do is pad all the examples to the length of the longest element when we batch elements together — a technique we refer to as dynamic padding.

Dynamic padding

The function that is responsible for putting together samples inside a batch is called a collate function. It’s an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them (recursively if your elements are lists, tuples, or dictionaries). This won’t be possible in our case since the inputs we have won’t all be of the same size. We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding. This will speed up training by quite a bit, but note that if you’re training on a TPU it can cause problems — TPUs prefer fixed shapes, even when that requires extra padding.

To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the 🤗 Transformers library provides us with such a function via DataCollatorWithPadding. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To test this new toy, let’s grab a few samples from our training set that we would like to batch together. Here, we remove the columns idx, sentence1, and sentence2 as they won’t be needed and contain strings (and we can’t create tensors with strings) and have a look at the lengths of each entry in the batch:

samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]
[50, 59, 47, 67, 59, 50, 62, 32]

No surprise, we get samples of varying length, from 32 to 67. Dynamic padding means the samples in this batch should all be padded to a length of 67, the maximum length inside the batch. Without dynamic padding, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept. Let’s double-check that our data_collator is dynamically padding the batch properly:

batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}
{'attention_mask': torch.Size([8, 67]),
'input_ids': torch.Size([8, 67]),
'token_type_ids': torch.Size([8, 67]),
'labels': torch.Size([8])}

Looking good! Now that we’ve gone from raw text to batches our model can deal with, we’re ready to fine-tune it!

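As a preview of how this plugs into PyTorch (the full training setup is covered in the next sections), here is a hedged sketch of building a DataLoader with our collate function; as in the manual test above, the string columns have to be dropped first:

from torch.utils.data import DataLoader

train_dataset = tokenized_datasets["train"].remove_columns(["idx", "sentence1", "sentence2"])
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=data_collator)

for batch in train_dataloader:
    print({k: v.shape for k, v in batch.items()})  # shapes vary from batch to batch
    break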

✏️ Try it out! Replicate the preprocessing on the GLUE SST-2 dataset. It’s a little bit different since it’s composed of single sentences instead of pairs, but the rest of what we did should look the same. For a harder challenge, try to write a preprocessing function that works on any of the GLUE tasks.

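If you want a nudge on the first part, here is a possible starting point (a sketch only; SST-2 stores its text in a single sentence column):

sst2_datasets = load_dataset("glue", "sst2")

def sst2_tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

sst2_tokenized = sst2_datasets.map(sst2_tokenize_function, batched=True)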