2-Using_Transformers-3-Tokenizers
Original course link: https://huggingface.co/course/chapter2/4?fw=pt
Tokenizers
Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we’ll explore exactly what happens in the tokenization pipeline.
In NLP tasks, the data that is generally processed is raw text. Here’s an example of such text:
```
Jim Henson was a puppeteer
```
However, models can only process numbers, so we need to find a way to convert the raw text to numbers. That’s what the tokenizers do, and there are a lot of ways to go about this. The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation.
Let’s take a look at some examples of tokenization algorithms, and try to answer some of the questions you may have about tokenization.
Word-based
The first type of tokenizer that comes to mind is word-based. It’s generally very easy to set up and use with only a few rules, and it often yields decent results. For example, in the image below, the goal is to split the raw text into words and find a numerical representation for each of them:
[Image: an example of word-based tokenization]

There are different ways to split the text. For example, we could use whitespace to tokenize the text into words by applying Python’s split() function:
```python
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
```

```
['Jim', 'Henson', 'was', 'a', 'puppeteer']
```
There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large “vocabularies,” where a vocabulary is defined by the total number of independent tokens that we have in our corpus.
Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.
If we want to completely cover a language with a word-based tokenizer, we’ll need to have an identifier for each word in the language, which will generate a huge amount of tokens. For example, there are over 500,000 words in the English language, so to build a map from each word to an input ID we’d need to keep track of that many IDs. Furthermore, words like “dog” are represented differently from words like “dogs”, and the model will initially have no way of knowing that “dog” and “dogs” are similar: it will identify the two words as unrelated. The same applies to other similar words, like “run” and “running”, which the model will not see as being similar initially.
Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the “unknown” token, often represented as “[UNK]” or “&lt;unk&gt;”. It’s generally a bad sign if you see that the tokenizer is producing a lot of these tokens, as it wasn’t able to retrieve a sensible representation of a word and you’re losing information along the way. The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token.
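To make the word-to-ID mapping concrete, here is a minimal, self-contained sketch (the tiny corpus and the encode helper are made up for illustration; real word-based tokenizers add extra rules for punctuation and casing):

```python
# Build a toy word-level vocabulary, reserving an ID for the unknown token.
corpus = [
    "Jim Henson was a puppeteer",
    "Jim was also a screenwriter",
]

vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)  # IDs run from 0 up to the vocabulary size

def encode(sentence):
    # Words that are not in the vocabulary fall back to the [UNK] ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

print(vocab)
print(encode("Jim was a puppeteer"))
print(encode("Jim was a programmer"))  # "programmer" maps to the [UNK] ID
```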
One way to reduce the number of unknown tokens is to go one level deeper, using a character-based tokenizer.
Character-based
Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:
- The vocabulary is much smaller.
- There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.
But here too some questions arise concerning spaces and punctuation:
[Image: an example of character-based tokenization]

This approach isn’t perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it’s less meaningful: each character doesn’t mean a lot on its own, whereas that is the case with words. However, this again differs according to the language; in Chinese, for example, each character carries more information than a character in a Latin language.
Another thing to consider is that we’ll end up with a very large amount of tokens to be processed by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters.
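As a rough, library-free sketch of that trade-off (plain Python string operations, not a real character-level tokenizer):

```python
text = "Let's do tokenization!"

word_tokens = text.split()  # word-based: a handful of tokens
char_tokens = list(text)    # character-based: one token per character, spaces included here

print(word_tokens)            # ["Let's", 'do', 'tokenization!'] -> 3 tokens
print(len(char_tokens))       # 22 tokens for the same text
print(len(set(char_tokens)))  # but only 14 distinct characters in the "vocabulary"
```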
To get the best of both worlds, we can use a third technique that combines the two approaches: subword tokenization.
Subword tokenization
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
For instance, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”. These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.
Here is an example showing how a subword tokenization algorithm would tokenize the sequence “Let’s do tokenization!”:
[Image: a subword tokenization algorithm]

These subwords end up providing a lot of semantic meaning: for instance, in the example above “tokenization” was split into “token” and “ization”, two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens.
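To see this behaviour with a real subword tokenizer, here is a short sketch using the 🤗 Transformers AutoTokenizer introduced later in this section; bert-base-uncased is just one convenient checkpoint to try, and the exact split depends on that checkpoint's vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Let's do tokenization!"))
# Expected to look something like: ['let', "'", 's', 'do', 'token', '##ization', '!']
```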
This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
And more!
Unsurprisingly, there are many more techniques out there. To name a few:
- Byte-level BPE, as used in GPT-2
- WordPiece, as used in BERT
- SentencePiece or Unigram, as used in several multilingual models
You should now have sufficient knowledge of how tokenizers work to get started with the API.
Loading and saving
Loading and saving tokenizers is as simple as it is with models. Actually, it’s based on the same two methods: from_pretrained() and save_pretrained(). These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).
Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the BertTokenizer class:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
```
Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```
We can now use the tokenizer as shown in the previous section:
```python
tokenizer("Using a Transformer network is simple")
```

```
{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
Saving a tokenizer is identical to saving a model:
```python
tokenizer.save_pretrained("directory_on_my_computer")
```
We’ll talk more about token_type_ids in Chapter 3, and we’ll explain the attention_mask key a little later. First, let’s see how the input_ids are generated. To do this, we’ll need to look at the intermediate methods of the tokenizer.
Encoding
Translating text to numbers is known as encoding. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.
As we’ve seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.
The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method. Again, we need to use the same vocabulary used when the model was pretrained.
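If you are curious, you can peek at that downloaded vocabulary; the exact numbers below depend on the checkpoint (bert-base-cased is assumed here, matching the examples above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

print(tokenizer.vocab_size)  # size of the pretrained vocabulary (about 29,000 for this checkpoint)
print(len(tokenizer))        # same vocabulary plus any tokens added after pretraining
```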
To get a better understanding of the two steps, we’ll explore them separately. Note that we will use some methods that perform parts of the tokenization pipeline separately to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs (as shown in section 2).
Tokenization
The tokenization process is done by the tokenize() method of the tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)
```
The output of this method is a list of strings, or tokens:
```
['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
```
This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with transformer, which is split into two tokens: transform and ##er.
From tokens to input IDs
The conversion to input IDs is handled by the convert_tokens_to_ids() tokenizer method:
```python
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
```

```
[7993, 170, 11303, 1200, 2443, 1110, 3014]
```
These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model as seen earlier in this chapter.
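For instance, sticking with PyTorch as in the earlier sections, a minimal sketch might look like this (the IDs are the ones printed above; note that no special tokens have been added yet):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

ids = [7993, 170, 11303, 1200, 2443, 1110, 3014]  # input IDs obtained above
input_ids = torch.tensor([ids])  # the model expects a batch, hence the extra dimension

outputs = model(input_ids)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 7, 768]) for this checkpoint
```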
✏️ Try it out! Replicate the last two steps (tokenization and conversion to input IDs) on the input sentences we used in section 2 (“I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Check that you get the same input IDs we got earlier!
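If you want to compare your answer, a possible sketch is shown below; it assumes the distilbert-base-uncased-finetuned-sst-2-english checkpoint used in section 2:

```python
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

for sequence in sequences:
    tokens = tokenizer.tokenize(sequence)          # step 1: tokenization
    ids = tokenizer.convert_tokens_to_ids(tokens)  # step 2: conversion to input IDs
    print(ids)

# Calling tokenizer(sequence) directly would additionally insert the special
# tokens used by this checkpoint ([CLS] and [SEP]) around these IDs.
```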
Decoding
Decoding is going the other way around: from vocabulary indices, we want to get a string. This can be done with the decode() method as follows:
```python
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
```

```
'Using a Transformer network is simple'
```
Note that the decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
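As a hedged illustration of that point, here is a small sketch of decoding generated text; gpt2 is assumed here purely as an example of a text-generation checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Tokenizers translate text into", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10)

# decode() turns the predicted IDs back into a readable string.
print(tokenizer.decode(output_ids[0]))
```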
By now you should understand the atomic operations a tokenizer can handle: tokenization, conversion to IDs, and converting IDs back to a string. However, we’ve only scratched the surface. In the following section, we’ll push this approach to its limits and take a look at how to overcome them.
