6-The_Tokenizers_library-7-Unigram_tokenization
Original course link: https://huggingface.co/course/chapter6/8?fw=pt
Unigram tokenization
单字标记化
The Unigram algorithm is often used in SentencePiece, which is the tokenization algorithm used by models like ALBERT, T5, mBART, Big Bird, and XLNet.
💡 This section covers Unigram in depth, going as far as showing a full implementation. You can skip to the end if you just want a general overview of the tokenization algorithm.
Training algorithm
Compared to BPE and WordPiece, Unigram works in the other direction: it starts from a big vocabulary and removes tokens from it until it reaches the desired vocabulary size. There are several options to use to build that base vocabulary: we can take the most common substrings in pre-tokenized words, for instance, or apply BPE on the initial corpus with a large vocabulary size.
At each step of the training, the Unigram algorithm computes a loss over the corpus given the current vocabulary. Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if the symbol was removed, and looks for the symbols that would increase it the least. Those symbols have a lower effect on the overall loss over the corpus, so in a sense they are “less needed” and are the best candidates for removal.
This is all a very costly operation, so we don’t just remove the single symbol associated with the lowest loss increase, but the p percent of the symbols associated with the lowest loss increase (p being a hyperparameter you can control, usually 10 or 20). This process is then repeated until the vocabulary has reached the desired size.
Note that we never remove the base characters, to make sure any word can be tokenized.
Now, this is still a bit vague: the main part of the algorithm is to compute a loss over the corpus and see how it changes when we remove some tokens from the vocabulary, but we haven’t explained how to do this yet. This step relies on the tokenization algorithm of a Unigram model, so we’ll dive into this next.
We’ll reuse the corpus from the previous examples:
    ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
and for this example, we will take all strict substrings for the initial vocabulary:
    ["h", "u", "g", "hu", "ug", "p", "pu", "n", "un", "b", "bu", "s", "hug", "gs", "ugs"]
Tokenization algorithm
A Unigram model is a type of language model that considers each token to be independent of the tokens before it. It’s the simplest language model, in the sense that the probability of token X given the previous context is just the probability of token X. So, if we used a Unigram language model to generate text, we would always predict the most common token.
The probability of a given token is its frequency (the number of times we find it) in the original corpus, divided by the sum of all frequencies of all tokens in the vocabulary (to make sure the probabilities sum up to 1). For instance, "ug" is present in "hug", "pug", and "hugs", so it has a frequency of 20 in our corpus.
Here are the frequencies of all the possible subwords in the vocabulary:
    ("h", 15) ("u", 36) ("g", 20) ("hu", 15) ("ug", 20) ("p", 17) ("pu", 17) ("n", 16)
    ("un", 16) ("b", 4) ("bu", 4) ("s", 5) ("hug", 15) ("gs", 5) ("ugs", 5)
So, the sum of all frequencies is 210, and the probability of the subword "ug" is thus 20/210.
✏️ Now your turn! Write the code to compute the frequencies above and double-check that the results shown are correct, as well as the total sum.
Now, to tokenize a given word, we look at all the possible segmentations into tokens and compute the probability of each according to the Unigram model. Since all tokens are considered independent, this probability is just the product of the probability of each token. For instance, the tokenization ["p", "u", "g"] of "pug" has the probability:
$$P([\text{``p''}, \text{``u''}, \text{``g''}]) = P(\text{``p''}) \times P(\text{``u''}) \times P(\text{``g''}) = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.001322$$
Comparatively, the tokenization ["pu", "g"] has the probability:
$$P([\text{``pu''}, \text{``g''}]) = P(\text{``pu''}) \times P(\text{``g''}) = \frac{17}{210} \times \frac{20}{210} = 0.007710$$
so that one is way more likely. In general, tokenizations with the least tokens possible will have the highest probability (because of that division by 210 repeated for each token), which corresponds to what we want intuitively: to split a word into the least number of tokens possible.
The tokenization of a word with the Unigram model is then the tokenization with the highest probability. In the example of "pug", here are the probabilities we would get for each possible segmentation:
    ["p", "u", "g"] : 0.001322
    ["p", "ug"] : 0.007710
    ["pu", "g"] : 0.007710
So, "pug" would be tokenized as ["p", "ug"] or ["pu", "g"], depending on which of those segmentations is encountered first (note that in a larger corpus, equality cases like this will be rare).
In this case, it was easy to find all the possible segmentations and compute their probabilities, but in general it’s going to be a bit harder. There is a classic algorithm used for this, called the Viterbi algorithm. Essentially, we can build a graph to detect the possible segmentations of a given word by saying there is a branch from character a to character b if the subword from a to b is in the vocabulary, and attribute to that branch the probability of the subword.
To find the path in that graph that is going to have the best score, the Viterbi algorithm determines, for each position in the word, the segmentation with the best score that ends at that position. Since we go from the beginning to the end, that best score can be found by looping through all subwords ending at the current position and then using the best tokenization score from the position this subword begins at. Then, we just have to unroll the path taken to arrive at the end.
Let’s take a look at an example using our vocabulary and the word "unhug". For each position, the subwords with the best scores ending there are the following:
    Character 0 (u): "u" (score 0.171429)
    Character 1 (n): "un" (score 0.076190)
    Character 2 (h): "un" "h" (score 0.005442)
    Character 3 (u): "un" "hu" (score 0.005442)
    Character 4 (g): "un" "hug" (score 0.005442)
Thus "unhug" would be tokenized as ["un", "hug"].
✏️ Now your turn! Determine the tokenization of the word "huggun", and its score.
Back to training
Now that we have seen how the tokenization works, we can dive a little more deeply into the loss used during training. At any given stage, this loss is computed by tokenizing every word in the corpus, using the current vocabulary and the Unigram model determined by the frequencies of each token in the corpus (as seen before).
Each word in the corpus has a score, and the loss is the negative log likelihood of those scores — that is, the sum for all the words in the corpus of all the -log(P(word)).
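Written out as a formula (this just restates the sentence above, with freq(word) denoting the number of occurrences of a word in the corpus and P(word) the probability the model assigns to the tokenization chosen for it), the loss is:

$$\mathcal{L} = \sum_{\text{word} \in \text{corpus}} \text{freq}(\text{word}) \times \bigl(-\log P(\text{word})\bigr)$$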
Let’s go back to our example with the following corpus:
    ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
The tokenization of each word with their respective scores is:
    "hug": ["hug"] (score 0.071428)
    "pug": ["pu", "g"] (score 0.007710)
    "pun": ["pu", "n"] (score 0.006168)
    "bun": ["bu", "n"] (score 0.001451)
    "hugs": ["hug", "s"] (score 0.001701)
So the loss is:
    10 * (-log(0.071428)) + 5 * (-log(0.007710)) + 12 * (-log(0.006168)) + 4 * (-log(0.001451)) + 5 * (-log(0.001701)) = 169.8
Now we need to compute how removing each token affects the loss. This is rather tedious, so we’ll just do it for two tokens here and save the whole process for when we have code to help us. In this (very) particular case, we had two equivalent tokenizations of all the words: as we saw earlier, for example, "pug" could be tokenized ["p", "ug"] with the same score. Thus, removing the "pu" token from the vocabulary will give the exact same loss.
On the other hand, removing "hug" will make the loss worse, because the tokenization of "hug" and "hugs" will become:
    "hug": ["hu", "g"] (score 0.006803)
    "hugs": ["hu", "gs"] (score 0.001701)
These changes will cause the loss to rise by:
    - 10 * (-log(0.071428)) + 10 * (-log(0.006803)) = 23.5
Therefore, the token "pu" will probably be removed from the vocabulary, but not "hug".
Implementing Unigram
Now let’s implement everything we’ve seen so far in code. Like with BPE and WordPiece, this is not an efficient implementation of the Unigram algorithm (quite the opposite), but it should help you understand it a bit better.
We will use the same corpus as before as an example:
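Concretely, that corpus might look like the following (a sketch assuming the same four example sentences used in the earlier BPE and WordPiece sections; any small corpus containing the words discussed below, such as "This" and "Hopefully", behaves the same way):

```python
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
```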
This time, we will use xlnet-base-cased as our model:
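We only need this tokenizer for its pre-tokenization step; a minimal sketch of loading it:

```python
from transformers import AutoTokenizer

# xlnet-base-cased uses a SentencePiece-style Metaspace pre-tokenizer,
# which replaces spaces with "▁"
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
```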
Like for BPE and WordPiece, we begin by counting the number of occurrences of each word in the corpus:
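One way to do that counting, reusing the fast tokenizer's pre-tokenizer on the `corpus` and `tokenizer` defined above (the name `word_freqs` is just a convenient choice for this sketch):

```python
from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    # Split the text into words with the model's own pre-tokenization rules
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    for word, offsets in words_with_offsets:
        word_freqs[word] += 1
```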
Then, we need to initialize our vocabulary to something larger than the vocab size we will want at the end. We have to include all the basic characters (otherwise we won’t be able to tokenize every word), but for the bigger substrings we’ll only keep the most common ones, so we sort them by frequency:
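A sketch of that step, continuing from the previous snippet: count every single character and every substring of length at least two (weighted by the frequency of the word it appears in), then sort the substrings by frequency:

```python
char_freqs = defaultdict(int)
subwords_freqs = defaultdict(int)
for word, freq in word_freqs.items():
    for i in range(len(word)):
        char_freqs[word[i]] += freq
        # All substrings of length >= 2 starting at position i
        for j in range(i + 2, len(word) + 1):
            subwords_freqs[word[i:j]] += freq

# Sort the substrings by decreasing frequency
sorted_subwords = sorted(subwords_freqs.items(), key=lambda x: x[1], reverse=True)
sorted_subwords[:10]
```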
We group the characters with the best subwords to arrive at an initial vocabulary of size 300:
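For instance, keeping all the characters and topping the vocabulary up to 300 tokens with the most frequent substrings might look like this:

```python
# Keep every single character, then fill up to 300 tokens with the best substrings
token_freqs = list(char_freqs.items()) + sorted_subwords[: 300 - len(char_freqs)]
token_freqs = {token: freq for token, freq in token_freqs}
```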
💡 SentencePiece uses a more efficient algorithm called Enhanced Suffix Array (ESA) to create the initial vocabulary.
Next, we compute the sum of all frequencies, to convert the frequencies into probabilities. For our model we will store the logarithms of the probabilities, because it’s more numerically stable to add logarithms than to multiply small numbers, and this will simplify the computation of the loss of the model:
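A sketch of that model: a dictionary mapping each token to the negative log of its probability, built from the `token_freqs` above:

```python
from math import log

total_sum = sum(freq for token, freq in token_freqs.items())
model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}
```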
Now the main function is the one that tokenizes words using the Viterbi algorithm. As we saw before, that algorithm computes the best segmentation of each substring of the word, which we will store in a variable named best_segmentations. We will store one dictionary per position in the word (from 0 to its total length), with two keys: the index of the start of the last token in the best segmentation, and the score of the best segmentation. With the index of the start of the last token, we will be able to retrieve the full segmentation once the list is completely populated.
Populating the list is done with just two loops: the main loop goes over each start position, and the second loop tries all substrings beginning at that start position. If the substring is in the vocabulary, we have a new segmentation of the word up until that end position, which we compare to what is in best_segmentations.
Once the main loop is finished, we just start from the end and hop from one start position to the next, recording the tokens as we go, until we reach the start of the word:
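Here is a sketch of that `encode_word()` function. Because the model stores negative log probabilities, the best score at each position is the lowest one, and the score of a segmentation is the sum of the scores of its tokens:

```python
def encode_word(word, model):
    # best_segmentations[i] describes the best segmentation of word[:i]:
    # the start index of its last token and its total score (sum of -log probs)
    best_segmentations = [{"start": 0, "score": 0}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]
    for start_idx in range(len(word)):
        best_score_at_start = best_segmentations[start_idx]["score"]
        if best_score_at_start is None:
            # word[:start_idx] cannot be segmented with the current vocabulary
            continue
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token in model:
                score = best_score_at_start + model[token]
                current_score = best_segmentations[end_idx]["score"]
                if current_score is None or current_score > score:
                    best_segmentations[end_idx] = {"start": start_idx, "score": score}

    if best_segmentations[-1]["score"] is None:
        # No segmentation of the whole word was found -> unknown token
        return ["<unk>"], None

    # Walk back from the end of the word to recover the tokens
    score = best_segmentations[-1]["score"]
    start = best_segmentations[-1]["start"]
    end = len(word)
    tokens = []
    while start != 0:
        tokens.insert(0, word[start:end])
        end = start
        start = best_segmentations[start]["start"]
    tokens.insert(0, word[start:end])
    return tokens, score
```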
We can already try our initial model on some words:
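For example (with the `model` built above; the exact outputs depend on the vocabulary, but a rare long word should come back split into several pieces, while frequent words stay mostly whole):

```python
print(encode_word("Hopefully", model))
print(encode_word("This", model))
```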
Now it’s easy to compute the loss of the model on the corpus!
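A sketch of that loss, summing `freq * (-log P(word))` over the word counts gathered earlier (every word is encodable since all single characters are in the vocabulary):

```python
def compute_loss(model):
    loss = 0
    for word, freq in word_freqs.items():
        _, word_loss = encode_word(word, model)
        loss += freq * word_loss
    return loss
```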
We can check it works on the model we have:
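Calling it on the initial model just returns one number, the total negative log likelihood of the corpus under the current vocabulary:

```python
print(compute_loss(model))
```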
Computing the scores for each token is not very hard either; we just have to compute the loss for the models obtained by deleting each token:
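A sketch of that scoring function: for each removable token, copy the model, drop the token, and measure how much the loss grows (single characters are skipped since we never remove them):

```python
import copy


def compute_scores(model):
    scores = {}
    model_loss = compute_loss(model)
    for token, score in model.items():
        # We always keep the single characters so every word stays tokenizable
        if len(token) == 1:
            continue
        model_without_token = copy.deepcopy(model)
        _ = model_without_token.pop(token)
        scores[token] = compute_loss(model_without_token) - model_loss
    return scores
```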
We can try it on a given token:
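For instance, assuming "ll" and "his" made it into the 300-token vocabulary built above:

```python
scores = compute_scores(model)
print(scores["ll"])   # expected to be positive: removing "ll" worsens the tokenization of "Hopefully"
print(scores["his"])  # expected to be 0: "his" only appears inside "This", which is tokenized as itself
```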
Since "ll" is used in the tokenization of "Hopefully", and removing it will probably make us use the token "l" twice instead, we expect it will have a positive loss. "his" is only used inside the word "This", which is tokenized as itself, so we expect it to have a zero loss. Here are the results:
由于希望的标记化中使用了ll‘,删除它可能会导致我们使用标记“l”两次,因此我们预计它会有一个积极的损失。他的‘只用在’This‘这个词里面,这个词被标示为自己,所以我们预计它是零损失的。以下是结果:
💡 This approach is very inefficient, so SentencePiece uses an approximation of the loss of the model without token X: instead of starting from scratch, it just replaces token X by its segmentation in the vocabulary that is left. This way, all the scores can be computed at once at the same time as the model loss.
With all of this in place, the last thing we need to do is add the special tokens used by the model to the vocabulary, then loop until we have pruned enough tokens from the vocabulary to reach our desired size:
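A sketch of that pruning loop, removing 10% of the lowest-scoring tokens per iteration until 100 tokens remain (the target size is arbitrary here, and the handling of special tokens such as "<unk>" is left out of this sketch):

```python
percent_to_remove = 0.1
while len(model) > 100:
    scores = compute_scores(model)
    sorted_scores = sorted(scores.items(), key=lambda x: x[1])
    # Remove the percent_to_remove tokens with the lowest scores
    for i in range(int(len(model) * percent_to_remove)):
        _ = token_freqs.pop(sorted_scores[i][0])

    # Recompute the probabilities over the reduced vocabulary
    total_sum = sum(freq for token, freq in token_freqs.items())
    model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}
```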
Then, to tokenize some text, we just need to apply the pre-tokenization and then use our encode_word() function:
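A sketch of that final function, plus a hypothetical example call:

```python
def tokenize(text, model):
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offsets in words_with_offsets]
    encoded_words = [encode_word(word, model)[0] for word in pre_tokenized_text]
    return sum(encoded_words, [])


print(tokenize("This is the Hugging Face course.", model))
```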
That’s it for Unigram! Hopefully by now you’re feeling like an expert in all things tokenizer. In the next section, we will delve into the building blocks of the 🤗 Tokenizers library, and show you how you can use them to build your own tokenizer.
