2-Using_Transformers-5-Putting_it_all_together

Original course link: https://huggingface.co/course/chapter2/6?fw=pt

Putting it all together

In the last few sections, we’ve been trying our best to do most of the work by hand. We’ve explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.

However, as we saw in section 2, the 🤗 Transformers API can handle all of this for us with a high-level function that we’ll dive into here. When you call your tokenizer directly on the sentence, you get back inputs that are ready to pass through your model:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

Here, the model_inputs variable contains everything that’s necessary for a model to operate well. For DistilBERT, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the tokenizer object.

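To see exactly what the tokenizer produced, print the result. The output below is what you should see for this checkpoint, which expects only input IDs and an attention mask:

print(model_inputs)
# A dict-like BatchEncoding; for this checkpoint it should contain:
# {'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}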

As we’ll see in some examples below, this method is very powerful. First, it can tokenize a single sequence:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

It also handles multiple sequences at a time, with no change in the API:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)
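
The only change is the shape of the result: input_ids is now one list of token IDs per sequence. A quick check, reusing the tokenizer loaded above:

# One list of input IDs per sequence in the batch
print(len(model_inputs["input_ids"]))  # 2
print(len(model_inputs["input_ids"][0]), len(model_inputs["input_ids"][1]))  # 16 6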

It can pad according to several objectives:

# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

It can also truncate sequences:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
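
Note that truncation counts the special tokens too: with max_length=8, the first sequence is cut to 8 IDs including [CLS] and [SEP], while the shorter one is left alone. A quick check, assuming the sequences above:

model_inputs = tokenizer(sequences, max_length=8, truncation=True)
print([len(ids) for ids in model_inputs["input_ids"]])  # [8, 6]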

The tokenizer object can handle the conversion to specific framework tensors, which can then be directly sent to the model. For example, in the following code sample we are prompting the tokenizer to return tensors from the different frameworks — "pt" returns PyTorch tensors, "tf" returns TensorFlow tensors, and "np" returns NumPy arrays:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
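
Only the return type changes; the token IDs themselves are identical. With "pt", for instance, you get back a 2D tensor of shape (batch_size, sequence_length). This check assumes PyTorch is installed:

model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
print(type(model_inputs["input_ids"]))  # <class 'torch.Tensor'>
print(model_inputs["input_ids"].shape)  # torch.Size([2, 16])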

Special tokens

If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

One token ID was added at the beginning, and one at the end. Let’s decode the two sequences of IDs above to see what this is about:

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."

The tokenizer added the special word [CLS] at the beginning and the special word [SEP] at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don’t add special words, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.

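If you want to see which special tokens a given tokenizer uses without decoding, you can inspect its attributes. This minimal check reuses the checkpoint loaded above:

print(tokenizer.cls_token, tokenizer.sep_token)        # [CLS] [SEP]
print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # 101 102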

Wrapping up: From tokenizer to model

Now that we’ve seen all the individual steps the tokenizer object uses when applied on texts, let’s see one final time how it can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
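
From here, output.logits holds the raw predictions. Since this checkpoint is a binary sentiment classifier and we passed in two sequences, here is a sketch of what to expect:

print(output.logits.shape)
# torch.Size([2, 2]), one (negative, positive) score pair per input sequence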