Original course link: https://huggingface.co/course/chapter6/6?fw=pt
Byte-Pair Encoding tokenization
Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model. It’s used by a lot of Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa.
💡 This section covers BPE in depth, going as far as showing a full implementation. You can skip to the end if you just want a general overview of the tokenization algorithm.
Training algorithm
BPE training starts by computing the unique set of words used in the corpus (after the normalization and pre-tokenization steps are completed), then building the vocabulary by taking all the symbols used to write those words. As a very simple example, let’s say our corpus uses these five words:
"hug", "pug", "pun", "bun", "hugs"
The base vocabulary will then be ["b", "g", "h", "n", "p", "s", "u"]. For real-world cases, that base vocabulary will contain all the ASCII characters, at the very least, and probably some Unicode characters as well. If an example you are tokenizing uses a character that is not in the training corpus, that character will be converted to the unknown token. That’s one reason why lots of NLP models are very bad at analyzing content with emojis, for instance.
The GPT-2 and RoBERTa tokenizers (which are pretty similar) have a clever way to deal with this: they don’t look at words as being written with Unicode characters, but with bytes. This way the base vocabulary has a small size (256), but every character you can think of will still be included and not end up being converted to the unknown token. This trick is called byte-level BPE.
After getting this base vocabulary, we add new tokens until the desired vocabulary size is reached by learning merges, which are rules to merge two elements of the existing vocabulary together into a new one. So, at the beginning these merges will create tokens with two characters, and then, as training progresses, longer subwords.
At any step during the tokenizer training, the BPE algorithm will search for the most frequent pair of existing tokens (by “pair,” here we mean two consecutive tokens in a word). That most frequent pair is the one that will be merged, and we rinse and repeat for the next step.
Going back to our previous example, let’s assume the words had the following frequencies:
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
meaning "hug" was present 10 times in the corpus, "pug" 5 times, "pun" 12 times, "bun" 4 times, and "hugs" 5 times. We start the training by splitting each word into characters (the ones that form our initial vocabulary) so we can see each word as a list of tokens:
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
Then we look at pairs. The pair ("h", "u") is present in the words "hug" and "hugs", so 15 times total in the corpus. It’s not the most frequent pair, though: that honor belongs to ("u", "g"), which is present in "hug", "pug", and "hugs", for a grand total of 20 times in the vocabulary.
Thus, the first merge rule learned by the tokenizer is ("u", "g") -> "ug", which means that "ug" will be added to the vocabulary, and the pair should be merged in all the words of the corpus. At the end of this stage, the vocabulary and corpus look like this:
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
Now we have some pairs that result in a token longer than two characters: the pair ("h", "ug"), for instance (present 15 times in the corpus). The most frequent pair at this stage is ("u", "n"), however, present 16 times in the corpus, so the second merge rule learned is ("u", "n") -> "un". Adding that to the vocabulary and merging all existing occurrences leads us to:
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
Now the most frequent pair is ("h", "ug"), so we learn the merge rule ("h", "ug") -> "hug", which gives us our first three-letter token. After the merge, the corpus looks like this:
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
Corpus: ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
And we continue like this until we reach the desired vocabulary size.
✏️ Now your turn! What do you think the next merge rule will be?
Tokenization algorithm
Tokenization follows the training process closely, in the sense that new inputs are tokenized by applying the following steps:
- Normalization
- Pre-tokenization
- Splitting the words into individual characters
- Applying the merge rules learned in order on those splits
Let’s take the example we used during training, with the three merge rules learned:
("u", "g") -> "ug"
("u", "n") -> "un"
("h", "ug") -> "hug"
The word "bug" will be tokenized as ["b", "ug"]. "mug", however, will be tokenized as ["[UNK]", "ug"] since the letter "m" was not in the base vocabulary. Likewise, the word "thug" will be tokenized as ["[UNK]", "hug"]: the letter "t" is not in the base vocabulary, and applying the merge rules results first in "u" and "g" being merged and then "hu" and "g" being merged.
✏️ Now your turn! How do you think the word "unhug" will be tokenized?
Implementing BPE
Now let’s take a look at an implementation of the BPE algorithm. This won’t be an optimized version you can actually use on a big corpus; we just want to show you the code so you can understand the algorithm a little bit better.
First we need a corpus, so let’s create a simple one with a few sentences:
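Any few sentences will do; the ones below are just an illustrative choice:

```python
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
```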
Next, we need to pre-tokenize that corpus into words. Since we are replicating a BPE tokenizer (like GPT-2), we will use the gpt2 tokenizer for the pre-tokenization:
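A sketch of that setup with 🤗 Transformers (the fast, Rust-backed tokenizer exposes its pre-tokenizer through backend_tokenizer):

```python
from transformers import AutoTokenizer

# Load GPT-2's (fast) tokenizer so we can reuse its pre-tokenization step
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```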
Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:
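A sketch of that computation, counting each pre-tokenized word in a defaultdict (the variable names are just illustrative):

```python
from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus:
    # Pre-tokenization returns (word, offsets) pairs; we only need the words
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)
```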
The next step is to compute the base vocabulary, formed by all the characters used in the corpus:
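One straightforward way is to collect every character that appears in those words:

```python
alphabet = []

for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)
alphabet.sort()

print(alphabet)
```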
We also add the special tokens used by the model at the beginning of that vocabulary. In the case of GPT-2, the only special token is "<|endoftext|>":
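Prepending it to the alphabet gives the starting vocabulary:

```python
vocab = ["<|endoftext|>"] + alphabet.copy()
```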
We now need to split each word into individual characters, to be able to start training:
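A dictionary comprehension does the job, mapping each word to the list of its characters:

```python
splits = {word: [c for c in word] for word in word_freqs.keys()}
```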
Now that we are ready for training, let’s write a function that computes the frequency of each pair. We’ll need to use this at each step of the training:
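A minimal version (call it compute_pair_freqs) that weights each pair by the frequency of the word it occurs in:

```python
def compute_pair_freqs(splits):
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            # Each occurrence of the word contributes one occurrence of the pair
            pair_freqs[pair] += freq
    return pair_freqs
```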
Let’s have a look at a part of this dictionary after the initial splits:
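Printing the first few entries, for instance:

```python
pair_freqs = compute_pair_freqs(splits)

for i, key in enumerate(pair_freqs.keys()):
    print(f"{key}: {pair_freqs[key]}")
    if i >= 5:
        break
```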
Now, finding the most frequent pair only takes a quick loop:
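For example (when several pairs are tied, this keeps the first one encountered):

```python
best_pair = ""
max_freq = None

for pair, freq in pair_freqs.items():
    if max_freq is None or max_freq < freq:
        best_pair = pair
        max_freq = freq

print(best_pair, max_freq)
```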
So the first merge to learn is ('Ġ', 't') -> 'Ġt', and we add 'Ġt' to the vocabulary:
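One convenient representation is a dictionary mapping each pair to its merged token:

```python
merges = {("Ġ", "t"): "Ġt"}
vocab.append("Ġt")
```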
To continue, we need to apply that merge in our splits dictionary. Let’s write another function for this:
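A sketch (call it merge_pair) that walks every split and fuses adjacent occurrences of the pair:

```python
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                # Replace the two tokens with their merged form
                split = split[:i] + [a + b] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits
```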
And we can have a look at the result of the first merge:
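For instance, inspecting one word of the corpus (the key "Ġtrained" assumes the word "trained" occurs in the sentences above):

```python
splits = merge_pair("Ġ", "t", splits)
print(splits["Ġtrained"])
```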
Now we have everything we need to loop until we have learned all the merges we want. Let’s aim for a vocab size of 50:
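Putting the pieces together:

```python
vocab_size = 50

while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    best_pair = ""
    max_freq = None
    for pair, freq in pair_freqs.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq
    # Apply the best merge, record the rule, and grow the vocabulary
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])
```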
As a result, we’ve learned 19 merge rules (the initial vocabulary had a size of 31 — 30 characters in the alphabet, plus the special token):
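We can print them to check:

```python
print(merges)
```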
And the vocabulary is composed of the special token, the initial alphabet, and all the results of the merges:
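A quick print shows it:

```python
print(vocab)
```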
💡 Using train_new_from_iterator() on the same corpus won’t result in the exact same vocabulary. This is because when there is a choice of the most frequent pair, we selected the first one encountered, while the 🤗 Tokenizers library selects the first one based on its inner IDs.
To tokenize a new text, we pre-tokenize it, split it, then apply all the merge rules learned:
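A sketch of a tokenize function that reuses the GPT-2 pre-tokenizer and replays the merges in the order they were learned (the insertion order of the merges dictionary):

```python
def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    # Split every word into characters, then apply each merge rule in turn
    splits = [[letter for letter in word] for word in pre_tokenized_text]
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split
    return sum(splits, [])
```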
We can try this on any text composed of characters in the alphabet:
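For example (the exact tokens you get depend on the corpus and on the merges learned from it):

```python
tokenize("This is not a token.")
```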
⚠️ Our implementation will throw an error if there is an unknown character since we didn’t do anything to handle them. GPT-2 doesn’t actually have an unknown token (it’s impossible to get an unknown character when using byte-level BPE), but this could happen here because we did not include all the possible bytes in the initial vocabulary. This aspect of BPE is beyond the scope of this section, so we’ve left the details out.
That’s it for the BPE algorithm! Next, we’ll have a look at WordPiece.
