6-The_Tokenizers_library-6-WordPiece_tokenization
Original course link: https://huggingface.co/course/chapter6/7?fw=pt
WordPiece tokenization
WordPiece is the tokenization algorithm Google developed to pretrain BERT. It has since been reused in quite a few Transformer models based on BERT, such as DistilBERT, MobileBERT, Funnel Transformers, and MPNET. It’s very similar to BPE in terms of the training, but the actual tokenization is done differently.
💡 This section covers WordPiece in depth, going as far as showing a full implementation. You can skip to the end if you just want a general overview of the tokenization algorithm.
Training algorithm
⚠️ Google never open-sourced its implementation of the training algorithm of WordPiece, so what follows is our best guess based on the published literature. It may not be 100% accurate.
Like BPE, WordPiece starts from a small vocabulary including the special tokens used by the model and the initial alphabet. Since it identifies subwords by adding a prefix (like ## for BERT), each word is initially split by adding that prefix to all the characters inside the word. So, for instance, "word" gets split like this:
w ##o ##r ##d
Thus, the initial alphabet contains all the characters present at the beginning of a word and the characters present inside a word preceded by the WordPiece prefix.
Then, again like BPE, WordPiece learns merge rules. The main difference is the way the pair to be merged is selected. Instead of selecting the most frequent pair, WordPiece computes a score for each pair, using the following formula:
score = (freq_of_pair) / (freq_of_first_element × freq_of_second_element)
By dividing the frequency of the pair by the product of the frequencies of each of its parts, the algorithm prioritizes the merging of pairs where the individual parts are less frequent in the vocabulary. For instance, it won’t necessarily merge ("un", "##able") even if that pair occurs very frequently in the vocabulary, because the two pairs "un" and "##able" will likely each appear in a lot of other words and have a high frequency. In contrast, a pair like ("hu", "##gging") will probably be merged faster (assuming the word “hugging” appears often in the vocabulary) since "hu" and "##gging" are likely to be less frequent individually.
Let’s look at the same vocabulary we used in the BPE training example:
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
The splits here will be:
("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)
so the initial vocabulary will be ["b", "h", "p", "##g", "##n", "##s", "##u"] (if we forget about special tokens for now). The most frequent pair is ("##u", "##g") (present 20 times), but the individual frequency of "##u" is very high, so its score is not the highest (it’s 1 / 36). All pairs with a "##u" actually have that same score (1 / 36), so the best score goes to the pair ("##g", "##s") — the only one without a "##u" — at 1 / 20, and the first merge learned is ("##g", "##s") -> ("##gs").
Note that when we merge, we remove the ## between the two tokens, so we add "##gs" to the vocabulary and apply the merge in the words of the corpus:
Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs"]
Corpus: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)
At this point, "##u" is in all the possible pairs, so they all end up with the same score. Let’s say that in this case, the first pair is merged, so ("h", "##u") -> "hu". This takes us to:
Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu"]
Corpus: ("hu" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)
Then the next best score is shared by ("hu", "##g") and ("hu", "##gs") (with 1/15, compared to 1/21 for all the other pairs), so the first pair with the biggest score is merged:
Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"]
Corpus: ("hug", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)
and we continue like this until we reach the desired vocabulary size.
✏️ Now your turn! What will the next merge rule be?
Tokenization algorithm
Tokenization differs in WordPiece and BPE in that WordPiece only saves the final vocabulary, not the merge rules learned. Starting from the word to tokenize, WordPiece finds the longest subword that is in the vocabulary, then splits on it. For instance, if we use the vocabulary learned in the example above, for the word "hugs" the longest subword starting from the beginning that is inside the vocabulary is "hug", so we split there and get ["hug", "##s"]. We then continue with "##s", which is in the vocabulary, so the tokenization of "hugs" is ["hug", "##s"].
With BPE, we would have applied the merges learned in order and tokenized this as ["hu", "##gs"], so the encoding is different.
As another example, let’s see how the word "bugs" would be tokenized. "b" is the longest subword starting at the beginning of the word that is in the vocabulary, so we split there and get ["b", "##ugs"]. Then "##u" is the longest subword starting at the beginning of "##ugs" that is in the vocabulary, so we split there and get ["b", "##u, "##gs"]. Finally, "##gs" is in the vocabulary, so this last list is the tokenization of "bugs".
When the tokenization gets to a stage where it’s not possible to find a subword in the vocabulary, the whole word is tokenized as unknown — so, for instance, "mug" would be tokenized as ["[UNK]"], as would "bum" (even if we can begin with "b" and "##u", "##m" is not in the vocabulary, and the resulting tokenization will just be ["[UNK]"], not ["b", "##u", "[UNK]"]). This is another difference from BPE, which would only classify the individual characters not in the vocabulary as unknown.
✏️ Now your turn! How will the word "pugs" be tokenized?
Implementing WordPiece
Now let’s take a look at an implementation of the WordPiece algorithm. Like with BPE, this is just pedagogical, and you won’t be able to use this on a big corpus.
We will use the same corpus as in the BPE example:
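As a reminder, it is a handful of sentences about the course itself (the outputs shown below assume this exact text):

```python
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
```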
First, we need to pre-tokenize the corpus into words. Since we are replicating a WordPiece tokenizer (like BERT), we will use the bert-base-cased tokenizer for the pre-tokenization:
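Loading it from 🤗 Transformers gives us access to its pre-tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```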
Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:
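A sketch of that step, accumulating the counts in a defaultdict we call word_freqs (a name we will keep reusing below):

```python
from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    # The pre-tokenizer splits on whitespace and punctuation and returns (word, offsets) pairs
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs
```

With the corpus above, the counts come out as:

```
defaultdict(int, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1,
    'about': 1, 'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1,
    'Hopefully': 1, ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1,
    'they': 1, 'are': 1, 'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})
```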
As we saw before, the alphabet is the unique set composed of all the first letters of words, and all the other letters that appear in words prefixed by ##:
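One way to build that alphabet:

```python
alphabet = []
for word in word_freqs.keys():
    # The first letter goes in as-is, every other letter gets the ## prefix
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()
print(alphabet)
```

For our corpus, this prints:

```
['##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r',
 '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's',
 't', 'u', 'w', 'y']
```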
We also add the special tokens used by the model at the beginning of that vocabulary. In the case of BERT, it’s the list ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
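Prepending them to the alphabet gives our starting vocabulary:

```python
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()
```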
Next we need to split each word, with all the letters that are not the first prefixed by ##:
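A dictionary comprehension does the job; we keep the result in a splits dictionary mapping each word to its list of pieces:

```python
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}
```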
Now that we are ready for training, let’s write a function that computes the score of each pair. We’ll need to use this at each step of the training:
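Here is one way to write it (the name compute_pair_scores is our choice): count, in a single pass over the corpus, how often each token and each adjacent pair appear, then apply the formula from the training section:

```python
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        # The last token of the split is not covered by the loop above
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores
```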
Let’s have a look at a part of this dictionary after the initial splits:
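For instance, printing the first few entries:

```python
pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break
```

```
('T', '##h'): 0.125
('##h', '##i'): 0.03409090909090909
('##i', '##s'): 0.02727272727272727
('i', '##s'): 0.1
('t', '##h'): 0.03571428571428571
('##h', '##e'): 0.011904761904761904
```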
Now, finding the pair with the best score only takes a quick loop:
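Something like this:

```python
best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)
```

```
('a', '##b') 0.2
```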
So the first merge to learn is ('a', '##b') -> 'ab', and we add 'ab' to the vocabulary:
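In code:

```python
vocab.append("ab")
```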
To continue, we need to apply that merge in our splits dictionary. Let’s write another function for this:
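A sketch of such a function (here called merge_pair); it walks through every split and fuses consecutive occurrences of the pair:

```python
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                # Drop the ## prefix of the second element when gluing the two tokens together
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits
```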
And we can have a look at the result of the first merge:
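For example, on the word "about":

```python
splits = merge_pair("a", "##b", splits)
splits["about"]
```

```
['ab', '##o', '##u', '##t']
```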
Now we have everything we need to loop until we have learned all the merges we want. Let’s aim for a vocab size of 70:
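A possible training loop, reusing the two helpers above:

```python
vocab_size = 70
while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    splits = merge_pair(*best_pair, splits)
    # The new token is the concatenation of the pair, without the inner ## prefix
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)
```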
We can then look at the generated vocabulary:
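Printing vocab should give something close to the following (ties between pairs with equal scores may be broken differently, so your exact list can vary):

```python
print(vocab)
```

```
['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k',
 '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H',
 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y', 'ab', '##fu', 'Fa', 'Fac', '##ct', '##ful', '##full',
 '##fully', 'Th', 'ch', '##hm', 'cha', 'chap', 'chapt', '##thm', 'Hu', 'Hug', 'Hugg', 'sh', 'th', 'is', '##thms',
 '##za', '##zat', '##ut']
```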
As we can see, compared to BPE, this tokenizer learns parts of words as tokens a bit faster.
💡 Using train_new_from_iterator() on the same corpus won’t result in the exact same vocabulary. This is because the 🤗 Tokenizers library does not implement WordPiece for the training (since we are not completely sure of its internals), but uses BPE instead.
To tokenize a new text, we pre-tokenize it, split it, then apply the tokenization algorithm on each word. That is, we look for the biggest subword starting at the beginning of the first word and split it, then we repeat the process on the second part, and so on for the rest of that word and the following words in the text:
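A sketch of that lookup for a single word (we call it encode_word); note that an unknown character anywhere makes the whole word [UNK]:

```python
def encode_word(word):
    tokens = []
    while len(word) > 0:
        i = len(word)
        # Find the longest prefix of what is left that belongs to the vocabulary
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            # No prefix at all is in the vocabulary: the whole word is unknown
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"
    return tokens
```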
Let’s test it on one word that’s in the vocabulary, and another that isn’t:
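For example, with "Hugging" (whose pieces were learned above) and a made-up word like "HOgging":

```python
print(encode_word("Hugging"))
print(encode_word("HOgging"))
```

```
['Hugg', '##i', '##n', '##g']
['[UNK]']
```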
Now, let’s write a function that tokenizes a text:
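Reusing the model's pre-tokenizer together with encode_word, that can look like:

```python
def tokenize(text):
    # Pre-tokenize into words, then encode each word independently
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    return sum(encoded_words, [])
```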
We can try it on any text:
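For instance (the "!" is not in our alphabet, so it ends up as [UNK]):

```python
tokenize("This is the Hugging Face course!")
```

```
['Th', '##i', '##s', 'is', 'th', '##e', 'Hugg', '##i', '##n', '##g', 'Fac', '##e', 'c', '##o', '##u', '##r', '##s',
 '##e', '[UNK]']
```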
That’s it for the WordPiece algorithm! Now let’s take a look at Unigram.
