6-The_Tokenizers_library-4-Normalization_and_pre-tokenization

Original course link: https://huggingface.co/course/chapter6/5?fw=pt

Normalization and pre-tokenization


Before we dive more deeply into the three most common subword tokenization algorithms used with Transformer models (Byte-Pair Encoding [BPE], WordPiece, and Unigram), we’ll first take a look at the preprocessing that each tokenizer applies to text. Here’s a high-level overview of the steps in the tokenization pipeline:


The tokenization pipeline.
Before splitting a text into subtokens (according to its model), the tokenizer performs two steps: normalization and pre-tokenization.

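To make this concrete, here is a minimal sketch of where the whole pipeline ends up: calling tokenize() runs normalization, pre-tokenization, and the model's subword splitting in one go (the output shown in the comment is what we would expect from this checkpoint, not a guarantee):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# tokenize() runs the full pipeline: normalization -> pre-tokenization -> model
print(tokenizer.tokenize("Hello, how are you?"))
# Expected: ['hello', ',', 'how', 'are', 'you', '?']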

Normalization


The normalization step involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents. If you’re familiar with Unicode normalization (such as NFC or NFKC), this is also something the tokenizer may apply.

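The 🤗 Tokenizers library exposes these operations as composable normalizer objects, so you can experiment with them in isolation. Here is a small sketch; the particular sequence below (NFD decomposition, lowercasing, accent stripping) is an illustrative choice, not what any specific checkpoint uses:

from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# Chain several normalization steps together
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Héllò hôw are ü?"))
# Expected: 'hello how are u?'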

The 🤗 Transformers tokenizer has an attribute called backend_tokenizer that provides access to the underlying tokenizer from the 🤗 Tokenizers library:


from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer.backend_tokenizer))
<class 'tokenizers.Tokenizer'>

The normalizer attribute of the tokenizer object has a normalize_str() method that we can use to see how the normalization is performed:


print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
'hello how are u?'

In this example, since we picked the bert-base-uncased checkpoint, the normalization applied lowercasing and removed the accents.


✏️ Try it out! Load a tokenizer from the bert-base-cased checkpoint and pass the same example to it. What are the main differences you can see between the cased and uncased versions of the tokenizer?

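If you want to check your answer, one way to run the comparison looks like this (a sketch only; the exact behavior depends on the checkpoint's normalizer, but the cased model should keep case and accents):

cased_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(cased_tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# Unlike bert-base-uncased, no lowercasing or accent stripping is expected here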

Pre-tokenization


As we will see in the next sections, a tokenizer cannot be trained on raw text alone. Instead, we first need to split the texts into small entities, like words. That’s where the pre-tokenization step comes in. As we saw in [Chapter 2], a word-based tokenizer can simply split a raw text into words on whitespace and punctuation. Those words will be the boundaries of the subtokens the tokenizer can learn during its training.


To see how a fast tokenizer performs pre-tokenization, we can use the pre_tokenize_str() method of the pre_tokenizer attribute of the tokenizer object:


tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")
[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]

Notice how the tokenizer is already keeping track of the offsets, which is how it can give us the offset mapping we used in the previous section. Here the tokenizer ignores the two spaces and replaces them with just one, but the offset jumps between are and you to account for that.

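To see that the offsets really do point back into the original text (double space included), we can slice the input string with them; a quick sketch:

text = "Hello, how are  you?"  # note the double space
for word, (start, end) in tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text):
    # Each offset pair indexes into the original string, not the cleaned-up version
    print(word, "->", repr(text[start:end]))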

Since we’re using a BERT tokenizer, the pre-tokenization involves splitting on whitespace and punctuation. Other tokenizers can have different rules for this step. For example, if we use the GPT-2 tokenizer:


tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

it will split on whitespace and punctuation as well, but it will keep the spaces and replace them with a Ġ symbol, enabling it to recover the original spaces if we decode the tokens:


[('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)),
('?', (19, 20))]

Also note that unlike the BERT tokenizer, this tokenizer does not ignore the double space.

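Since the spaces are carried inside the tokens, an encode/decode round trip should give back the text with its double space intact. A quick sketch to check (the comment shows the expected result, not a guarantee):

ids = tokenizer.encode("Hello, how are  you?")
print(tokenizer.decode(ids))
# Expected: 'Hello, how are  you?' -- the double space survives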

For a last example, let’s have a look at the T5 tokenizer, which is based on the SentencePiece algorithm:


tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")
[('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]

Like the GPT-2 tokenizer, this one keeps spaces and replaces them with a specific token (▁), but the T5 tokenizer only splits on whitespace, not punctuation. Also note that it added a space by default at the beginning of the sentence (before Hello) and ignored the double space between are and you.


Now that we’ve seen a little of how some different tokenizers process text, we can start to explore the underlying algorithms themselves. We’ll begin with a quick look at the widely applicable SentencePiece; then, over the next three sections, we’ll examine how the three main algorithms used for subword tokenization work.


SentencePiece


SentencePiece is a tokenization algorithm for the preprocessing of text that you can use with any of the models we will see in the next three sections. It considers the text as a sequence of Unicode characters, and replaces spaces with a special character, ▁. Used in conjunction with the Unigram algorithm (see section 7), it doesn’t even require a pre-tokenization step, which is very useful for languages where the space character is not used (like Chinese or Japanese).


The other main feature of SentencePiece is reversible tokenization: since there is no special treatment of spaces, decoding the tokens is done simply by concatenating them and replacing the ▁s with spaces — this results in the normalized text. As we saw earlier, the BERT tokenizer removes repeating spaces, so its tokenization is not reversible.

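As a rough illustration of that reversibility, here is a sketch with the T5 tokenizer: concatenating the tokens and swapping ▁ back to spaces recovers the normalized text (the exact token split shown in the comments is what we would expect, not a guarantee):

t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokens = t5_tokenizer.tokenize("Hello, how are  you?")
print(tokens)
# e.g. ['▁Hello', ',', '▁how', '▁are', '▁you', '?']
# Decoding is essentially concatenation plus replacing ▁ with spaces:
print("".join(tokens).replace("▁", " ").strip())
# 'Hello, how are you?' -- the normalized text (double space collapsed)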

Algorithm overview


In the following sections, we’ll dive into the three main subword tokenization algorithms: BPE (used by GPT-2 and others), WordPiece (used for example by BERT), and Unigram (used by T5 and others). Before we get started, here’s a quick overview of how they each work. Don’t hesitate to come back to this table after reading each of the next sections if it doesn’t make sense to you yet.


| Model | BPE | WordPiece | Unigram |
| --- | --- | --- | --- |
| Training | Starts from a small vocabulary and learns rules to merge tokens | Starts from a small vocabulary and learns rules to merge tokens | Starts from a large vocabulary and learns rules to remove tokens |
| Training step | Merges the tokens corresponding to the most common pair | Merges the tokens corresponding to the pair with the best score based on the frequency of the pair, privileging pairs where each individual token is less frequent | Removes all the tokens in the vocabulary that will minimize the loss computed on the whole corpus |
| Learns | Merge rules and a vocabulary | Just a vocabulary | A vocabulary with a score for each token |
| Encoding | Splits a word into characters and applies the merges learned during training | Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word | Finds the most likely split into tokens, using the scores learned during training |

Now let’s dive into BPE!
