6-The_Tokenizers_library-1-Training_a_new_tokenizer_from_an_old_one

Original course link: https://huggingface.co/course/chapter6/2?fw=pt

Training a new tokenizer from an old one


If a language model is not available in the language you are interested in, or if your corpus is very different from the one your language model was trained on, you will most likely want to retrain the model from scratch using a tokenizer adapted to your data. That will require training a new tokenizer on your dataset. But what exactly does that mean? When we first looked at tokenizers in [Chapter 2], we saw that most Transformer models use a subword tokenization algorithm. To identify which subwords are of interest and occur most frequently in the corpus at hand, the tokenizer needs to take a hard look at all the texts in the corpus — a process we call training. The exact rules that govern this training depend on the type of tokenizer used, and we’ll go over the three main algorithms later in this chapter.


⚠️ Training a tokenizer is not the same as training a model! Model training uses stochastic gradient descent to make the loss a little bit smaller for each batch. It’s randomized by nature (meaning you have to set some seeds to get the same results when doing the same training twice). Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm. It’s deterministic, meaning you always get the same results when training with the same algorithm on the same corpus.

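To see that determinism for yourself, here is a minimal sketch using the low-level 🤗 Tokenizers library (which we'll meet properly later in this chapter) on a tiny in-memory corpus; the toy corpus and the vocabulary size of 100 are arbitrary choices for illustration:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_toy_bpe(corpus):
    # Train a small BPE tokenizer from scratch on the given iterable of texts.
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["<unk>"])
    tokenizer.train_from_iterator(corpus, trainer=trainer)
    return tokenizer

toy_corpus = ["low lower lowest", "new newer newest"] * 10

# Same algorithm + same corpus -> the exact same vocabulary, every time.
assert train_toy_bpe(toy_corpus).get_vocab() == train_toy_bpe(toy_corpus).get_vocab()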

Assembling a corpus


There’s a very simple API in 🤗 Transformers that you can use to train a new tokenizer with the same characteristics as an existing one: AutoTokenizer.train_new_from_iterator(). To see this in action, let’s say we want to train GPT-2 from scratch, but in a language other than English. Our first task will be to gather lots of data in that language in a training corpus. To provide examples everyone will be able to understand, we won’t use a language like Russian or Chinese here, but rather a specialized English language: Python code.


The 🤗 Datasets library can help us assemble a corpus of Python source code. We’ll use the usual load_dataset() function to download and cache the CodeSearchNet dataset. This dataset was created for the CodeSearchNet challenge and contains millions of functions from open source libraries on GitHub in several programming languages. Here, we will load the Python part of this dataset:


from datasets import load_dataset

# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("code_search_net", "python")

We can have a look at the training split to see which columns we have access to:


raw_datasets["train"]

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language',
               'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens',
               'split_name', 'func_code_url'],
    num_rows: 412178
})

We can see the dataset separates docstrings from code and suggests a tokenization of both. Here, we'll just use the whole_func_string column to train our tokenizer. We can look at an example of one of these functions by indexing into the train split:


print(raw_datasets["train"][123456]["whole_func_string"])

which should print the following:


def handle_simple_responses(
      self, timeout_ms=None, info_cb=DEFAULT_MESSAGE_CALLBACK):
    """Accepts normal responses from the device.

    Args:
      timeout_ms: Timeout in milliseconds to wait for each response.
      info_cb: Optional callback for text sent from the bootloader.

    Returns:
      OKAY packet's message.
    """
    return self._accept_responses('OKAY', info_cb, timeout_ms=timeout_ms)

The first thing we need to do is transform the dataset into an iterator of lists of texts — for instance, a list of lists of texts. Using lists of texts will enable our tokenizer to go faster (training on batches of texts instead of processing individual texts one by one), and it should be an iterator if we want to avoid having everything in memory at once. If your corpus is huge, you will want to take advantage of the fact that 🤗 Datasets does not load everything into RAM but stores the elements of the dataset on disk.

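For a sense of scale, 🤗 Datasets exposes a dataset_size attribute (the size in bytes of the underlying Arrow data), which you can use as an optional quick check:

# Optional: how much data is sitting on disk behind the memory-mapped dataset?
size_gb = raw_datasets["train"].dataset_size / (1024**3)
print(f"Dataset size on disk: {size_gb:.2f} GB")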

Doing the following would create a list of lists of 1,000 texts each, but would load everything in memory:


# Don't uncomment the following line unless your dataset is small!
# training_corpus = [raw_datasets["train"][i: i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000)]

Using a Python generator, we can avoid Python loading anything into memory until it’s actually necessary. To create such a generator, you just need to replace the square brackets with parentheses:


training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)

This line of code doesn’t fetch any elements of the dataset; it just creates an object you can use in a Python for loop. The texts will only be loaded when you need them (that is, when you’re at the step of the for loop that requires them), and only 1,000 texts at a time will be loaded. This way you won’t exhaust all your memory even if you are processing a huge dataset.


The problem with a generator object is that it can only be used once. So, instead of this giving us the list of the first 10 digits twice:


gen = (i for i in range(10))
print(list(gen))
print(list(gen))

we get them once and then an empty list:


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]

That’s why we define a function that returns a generator instead:


def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )


training_corpus = get_training_corpus()

You can also define your generator inside a for loop by using the yield statement:


def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

which will produce the exact same generator as before, but allows you to use more complex logic than you can in a list comprehension.

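As an illustration of that extra flexibility (this variant is not part of the original course code), here is a generator that keeps the same batching but skips very short functions; the min_chars threshold is an arbitrary choice:

def get_training_corpus_filtered(min_chars=50, batch_size=1000):
    # Same idea as get_training_corpus(), but drop functions shorter than min_chars.
    batch = []
    for sample in raw_datasets["train"]:
        text = sample["whole_func_string"]
        if len(text) < min_chars:
            continue
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # don't forget the last, partially filled batch
        yield batch

Note that iterating row by row like this is slower than slicing in chunks of 1,000, so it only makes sense when you actually need per-example logic.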

Training a new tokenizer


Now that we have our corpus in the form of an iterator of batches of texts, we are ready to train a new tokenizer. To do this, we first need to load the tokenizer we want to pair with our model (here, GPT-2):


from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

Even though we are going to train a new tokenizer, it’s a good idea to do this to avoid starting entirely from scratch. This way, we won’t have to specify anything about the tokenization algorithm or the special tokens we want to use; our new tokenizer will be exactly the same as GPT-2, and the only thing that will change is the vocabulary, which will be determined by the training on our corpus.

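If you're curious what exactly will be carried over, a quick optional peek at the old tokenizer shows the pieces the new one will inherit (the values in the comments are what GPT-2 ships with):

print(type(old_tokenizer))               # a GPT2TokenizerFast, i.e. a "fast" tokenizer
print(len(old_tokenizer))                # 50257, the original GPT-2 vocabulary size
print(old_tokenizer.special_tokens_map)  # {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}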

First let’s have a look at how this tokenizer would treat an example function:


example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo',
 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']

This tokenizer has a few special symbols, like Ġ and Ċ, which denote spaces and newlines, respectively. As we can see, this is not too efficient: the tokenizer returns individual tokens for each space, when it could group together indentation levels (since sets of four or eight spaces are very common in code). It also splits the function name a bit oddly, as it is not used to seeing words with the _ character.


Let’s train a new tokenizer and see if it solves those issues. For this, we’ll use the method train_new_from_iterator():


tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

This command might take a bit of time if your corpus is very large, but for this dataset of 1.6 GB of texts it’s blazing fast (1 minute 16 seconds on an AMD Ryzen 9 3900X CPU with 12 cores).

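If you also need extra special tokens in the new vocabulary, train_new_from_iterator() accepts a new_special_tokens argument (check the signature in your installed version of 🤗 Transformers); the token string below is purely illustrative:

# Hypothetical: same training, plus one custom special token.
# Note we build a fresh generator, since the previous one was consumed above.
tokenizer_with_extra = old_tokenizer.train_new_from_iterator(
    get_training_corpus(), 52000, new_special_tokens=["<|docstring|>"]
)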

Note that AutoTokenizer.train_new_from_iterator() only works if the tokenizer you are using is a “fast” tokenizer. As you’ll see in the next section, the 🤗 Transformers library contains two types of tokenizers: some are written purely in Python and others (the fast ones) are backed by the 🤗 Tokenizers library, which is written in the Rust programming language. Python is the language most often used for data science and deep learning applications, but when anything needs to be parallelized to be fast, it has to be written in another language. For instance, the matrix multiplications that are at the core of the model computation are written in CUDA, an optimized C library for GPUs.

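You can confirm that the GPT-2 tokenizer we loaded (and the one we just trained from it) are indeed fast tokenizers with the is_fast attribute:

print(old_tokenizer.is_fast)  # True
print(tokenizer.is_fast)      # True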

Training a brand new tokenizer in pure Python would be excruciatingly slow, which is why we developed the 🤗 Tokenizers library. Note that just as you didn’t have to learn the CUDA language to be able to execute your model on a batch of inputs on a GPU, you won’t need to learn Rust to use a fast tokenizer. The 🤗 Tokenizers library provides Python bindings for many methods that internally call some piece of code in Rust; for example, to parallelize the training of your new tokenizer or, as we saw in [Chapter 3], the tokenization of a batch of inputs.


Most of the Transformer models have a fast tokenizer available (there are some exceptions that you can check here), and the AutoTokenizer API always selects the fast tokenizer for you if it’s available. In the next section we’ll take a look at some of the other special features fast tokenizers have, which will be really useful for tasks like token classification and question answering. Before diving into that, however, let’s try our brand new tokenizer on the previous example:


tokens = tokenizer.tokenize(example)
tokens

['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`',
 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']

Here we again see the special symbols Ġ and Ċ that denote spaces and newlines, but we can also see that our tokenizer learned some tokens that are highly specific to a corpus of Python functions: for example, there is a ĊĠĠĠ token that represents an indentation, and a Ġ""" token that represents the three quotes that start a docstring. The tokenizer also correctly split the function name on _. This is quite a compact representation; comparatively, using the plain English tokenizer on the same example will give us a longer sentence:


print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

27
36
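
A single example may not be representative, so here is an optional rough check over a small sample of the corpus (the sample size of 100 functions is arbitrary); it simply compares the total number of tokens each tokenizer produces:

# Compare total token counts for the two tokenizers on a small sample.
sample = raw_datasets["train"][:100]["whole_func_string"]
new_total = sum(len(ids) for ids in tokenizer(sample)["input_ids"])
old_total = sum(len(ids) for ids in old_tokenizer(sample)["input_ids"])
print(f"New tokenizer needs {new_total / old_total:.0%} of the tokens the old one uses")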

Let’s look at another example:


example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """
tokenizer.tokenize(example)

['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',',
 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_',
 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(',
 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ',
 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weights', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']

In addition to the token corresponding to an indentation, here we can also see a token for a double indentation: ĊĠĠĠĠĠĠĠ. The special Python words like class, init, call, self, and return are each tokenized as one token, and we can see that as well as splitting on _ and ., the tokenizer correctly splits even camel-cased names: LinearLayer is tokenized as ["ĠLinear", "Layer"].

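If you're curious whether these indentation tokens exist in the vocabularies, here is a quick optional check; it simply prints, for each token, whether it is in the new vocabulary and in the original GPT-2 vocabulary:

# For each token: is it in the new vocab? in the old GPT-2 vocab?
for token in ["ĊĠĠĠ", "ĊĠĠĠĠĠĠĠ"]:
    print(token, token in tokenizer.get_vocab(), token in old_tokenizer.get_vocab())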

Saving the tokenizer


To make sure we can use it later, we need to save our new tokenizer. Like for models, this is done with the save_pretrained() method:


tokenizer.save_pretrained("code-search-net-tokenizer")

This will create a new folder named code-search-net-tokenizer, which will contain all the files the tokenizer needs to be reloaded. If you want to share this tokenizer with your colleagues and friends, you can upload it to the Hub by logging into your account. If you’re working in a notebook, there’s a convenience function to help you with this:


from huggingface_hub import notebook_login

notebook_login()

This will display a widget where you can enter your Hugging Face login credentials. If you aren’t working in a notebook, just type the following line in your terminal:


huggingface-cli login

Once you’ve logged in, you can push your tokenizer by executing the following command:


tokenizer.push_to_hub("code-search-net-tokenizer")

This will create a new repository in your namespace with the name code-search-net-tokenizer, containing the tokenizer file. You can then load the tokenizer from anywhere with the from_pretrained() method:


# Replace "huggingface-course" below with your actual namespace to use your own tokenizer
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
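
As a final optional sanity check, a tokenizer reloaded from the local folder we saved earlier should produce exactly the same tokens as the one still in memory:

# Reload from the local folder created by save_pretrained() and compare.
reloaded_tokenizer = AutoTokenizer.from_pretrained("code-search-net-tokenizer")
assert reloaded_tokenizer.tokenize(example) == tokenizer.tokenize(example)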

You’re now all set for training a language model from scratch and fine-tuning it on your task at hand! We’ll get to that in [Chapter 7], but first, in the rest of this chapter we’ll take a closer look at fast tokenizers and explore in detail what actually happens when we call the method train_new_from_iterator().
