2-Using_Transformers-4-Handling_multiple_sequences

Original course link: https://huggingface.co/course/chapter2/5?fw=pt

Handling multiple sequences

In the previous section, we explored the simplest of use cases: doing inference on a single, short sequence. However, some questions emerge already:

  • How do we handle multiple sequences?
  • How do we handle multiple sequences of different lengths?
  • Are vocabulary indices the only inputs that allow a model to work well?
  • Is there such a thing as too long a sequence?

Let’s see what kinds of problems these questions pose, and how we can solve them using the 🤗 Transformers API.

Models expect a batch of inputs

In the previous exercise you saw how sequences get translated into lists of numbers. Let’s convert this list of numbers to a tensor and send it to the model:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Oh no! Why did this fail? We followed the steps from the pipeline in section 2.

The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default. Here we tried to do everything the tokenizer did behind the scenes when we applied it to a sequence. But if you look closely, you’ll see that the tokenizer didn’t just convert the list of input IDs into a tensor, it added a dimension on top of it:

tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])

Let’s try again and add a new dimension:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

We print the input IDs as well as the resulting logits — here’s the output:

Input IDs: [[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607, 2026,  2878,  2166,  1012]]
Logits: [[-2.7276, 2.8789]]

Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence:

batched_ids = [ids, ids]

This is a batch of two identical sequences!

✏️ Try it out! Convert this batched_ids list into a tensor and pass it through your model. Check that you obtain the same logits as before (but twice)!

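If you want to check your work, here is one possible sketch, reusing the ids, model, and torch objects defined above (batched_input_ids is just a name we chose):

batched_ids = [ids, ids]
batched_input_ids = torch.tensor(batched_ids)

# Running the batch should produce two identical rows of logits,
# each matching the single-sequence logits printed earlier.
output = model(batched_input_ids)
print("Logits:", output.logits)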

Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. There’s a second issue, though. When you’re trying to batch together two (or more) sentences, they might be of different lengths. If you’ve ever worked with tensors before, you know that they need to be of rectangular shape, so you won’t be able to convert the list of input IDs into a tensor directly. To work around this problem, we usually pad the inputs.

Padding the inputs

The following list of lists cannot be converted to a tensor:

batched_ids = [
    [200, 200, 200],
    [200, 200]
]
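If you try it, PyTorch refuses to build the tensor because the inner lists have different lengths. A minimal sketch of the failure (the exact error message may vary between versions):

import torch

# The inner lists have lengths 3 and 2, so no rectangular tensor can be built:
torch.tensor(batched_ids)
# ValueError: expected sequence of length 3 at dim 1 (got 2)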

In order to work around this, we’ll use padding to make our tensors have a rectangular shape. Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this:

padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

The padding token ID can be found in tokenizer.pad_token_id. Let’s use it and send our two sentences through the model individually and batched together:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

There’s something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we’ve got completely different values!

This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.

Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).

Let’s complete the previous example with an attention mask:

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)

Now we get the same logits for the second sentence in the batch.

Notice how the last value of the second sequence is a padding ID, which is a 0 value in the attention mask.

✏️ Try it out! Apply the tokenization manually on the two sentences used in section 2 (“I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Pass them through the model and check that you get the same logits as in section 2. Now batch them together using the padding token, then create the proper attention mask. Check that you obtain the same results when going through the model!

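One possible sketch of this exercise, reusing the tokenizer, model, and torch objects loaded above (the variable names below are our own):

sequence1 = "I've been waiting for a HuggingFace course my whole life."
sequence2 = "I hate this so much!"

# Tokenize each sentence by hand, as in section 2.
ids1 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sequence1))
ids2 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sequence2))

# Individual predictions.
print(model(torch.tensor([ids1])).logits)
print(model(torch.tensor([ids2])).logits)

# Batch them together: pad the shorter sequence and mask the padding tokens.
# (Here sequence1 tokenizes to more tokens than sequence2.)
pad_length = len(ids1) - len(ids2)
batched_ids = [ids1, ids2 + [tokenizer.pad_token_id] * pad_length]
attention_mask = [[1] * len(ids1), [1] * len(ids2) + [0] * pad_length]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)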

Longer sequences

With Transformer models, there is a limit to the lengths of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

  • Use a model with a longer supported sequence length.
  • Truncate your sequences.

Models have different supported sequence lengths, and some specialize in handling very long sequences. Longformer is one example, and another is LED. If you’re working on a task that requires very long sequences, we recommend you take a look at those models.

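If you are not sure what a given checkpoint supports, the tokenizer usually exposes its limit. A minimal sketch (the Longformer checkpoint below is just one example of a long-sequence model):

from transformers import AutoTokenizer

long_tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
print(long_tokenizer.model_max_length)  # e.g. 4096 for this checkpoint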

Otherwise, we recommend you truncate your sequences by specifying the max_sequence_length parameter:

sequence = sequence[:max_sequence_length]
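In practice, you can also let the tokenizer truncate for you by passing truncation=True; a minimal sketch (the max_length value here is just an example):

model_inputs = tokenizer(sequence, truncation=True, max_length=512, return_tensors="pt")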