7-Main_NLP_tasks-5-Training_a_causal_language_model_from_scratch

Original course link: https://huggingface.co/course/chapter7/6?fw=pt

Training a causal language model from scratch

Up until now, we’ve mostly been using pretrained models and fine-tuning them for new use cases by reusing the weights from pretraining. As we saw in [Chapter 1], this is commonly referred to as transfer learning, and it’s a very successful strategy for applying Transformer models to most real-world use cases where labeled data is sparse. In this chapter, we’ll take a different approach and train a completely new model from scratch. This is a good approach to take if you have a lot of data and it is very different from the pretraining data used for the available models. However, it also requires considerably more compute resources to pretrain a language model than just to fine-tune an existing one. Examples where it can make sense to train a new model include for datasets consisting of musical notes, molecular sequences such as DNA, or programming languages. The latter have recently gained traction thanks to tools such as TabNine and GitHub’s Copilot, powered by OpenAI’s Codex model, that can generate long sequences of code. This task of text generation is best addressed with auto-regressive or causal language models such as GPT-2.

In this section we will build a scaled-down version of a code generation model: we’ll focus on one-line completions instead of full functions or classes, using a subset of Python code. When working with data in Python you are in frequent contact with the Python data science stack, consisting of the matplotlib, seaborn, pandas, and scikit-learn libraries. When using those frameworks it’s common to need to look up specific commands, so it would be nice if we could use a model to complete these calls for us.

In [Chapter 6] we created an efficient tokenizer to process Python source code, but what we still need is a large-scale dataset to pretrain a model on. Here, we’ll apply our tokenizer to a corpus of Python code derived from GitHub repositories. We will then use the Trainer API and 🤗 Accelerate to train the model. Let’s get to it!

The demo on the course page showcases the model that was trained and uploaded to the Hub using the code shown in this section. You can find it here. Note that since there is some randomization happening in the text generation, you will probably get a slightly different result.

Gathering the data

Python code is abundantly available from code repositories such as GitHub, which we can use to create a dataset by scraping for every Python repository. This was the approach taken in the Transformers textbook to pretrain a large GPT-2 model. Using a GitHub dump of about 180 GB containing roughly 20 million Python files called codeparrot, the authors built a dataset that they then shared on the Hugging Face Hub.

However, training on the full corpus is time- and compute-consuming, and we only need the subset of the dataset concerned with the Python data science stack. So, let’s start by filtering the codeparrot dataset for all files that include any of the libraries in this stack. Because of the dataset’s size, we want to avoid downloading it; instead, we’ll use the streaming feature to filter it on the fly. To help us filter the code samples using the libraries we mentioned earlier, we’ll use the following function:

def any_keyword_in_string(string, keywords):
    for keyword in keywords:
        if keyword in string:
            return True
    return False

Let’s test it on two examples:

filters = ["pandas", "sklearn", "matplotlib", "seaborn"]
example_1 = "import numpy as np"
example_2 = "import pandas as pd"

print(
    any_keyword_in_string(example_1, filters), any_keyword_in_string(example_2, filters)
)
False True

We can use this to create a function that will stream the dataset and filter the elements we want:

from collections import defaultdict
from tqdm import tqdm
from datasets import Dataset


def filter_streaming_dataset(dataset, filters):
    filtered_dict = defaultdict(list)
    total = 0
    for sample in tqdm(iter(dataset)):
        total += 1
        if any_keyword_in_string(sample["content"], filters):
            for k, v in sample.items():
                filtered_dict[k].append(v)
    print(f"{len(filtered_dict['content'])/total:.2%} of data after filtering.")
    return Dataset.from_dict(filtered_dict)

Then we can simply apply this function to the streaming dataset:

# This cell will take a very long time to execute, so you should skip it and go to
# the next one!
from datasets import load_dataset

split = "train"  # "valid"
filters = ["pandas", "sklearn", "matplotlib", "seaborn"]

data = load_dataset(f"transformersbook/codeparrot-{split}", split=split, streaming=True)
filtered_data = filter_streaming_dataset(data, filters)
3.26% of data after filtering.

This leaves us with about 3% of the original dataset, which is still quite sizable — the resulting dataset is 6 GB and consists of 600,000 Python scripts!

Filtering the full dataset can take 2-3h depending on your machine and bandwidth. If you don’t want to go through this lengthy process yourself, we provide the filtered dataset on the Hub for you to download:

from datasets import load_dataset, DatasetDict

ds_train = load_dataset("huggingface-course/codeparrot-ds-train", split="train")
ds_valid = load_dataset("huggingface-course/codeparrot-ds-valid", split="validation")

raw_datasets = DatasetDict(
    {
        "train": ds_train,  # .shuffle().select(range(50000)),
        "valid": ds_valid,  # .shuffle().select(range(500))
    }
)

raw_datasets
DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 606720
    })
    valid: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 3322
    })
})

Pretraining the language model will take a while. We suggest that you first run the training loop on a sample of the data by uncommenting the two partial lines above, and make sure that the training successfully completes and the models are stored. Nothing is more frustrating than a training run failing at the last step because you forgot to create a folder or because there’s a typo at the end of the training loop!

Let’s look at an example from the dataset. We’ll just show the first 200 characters of each field:

for key in raw_datasets["train"][0]:
    print(f"{key.upper()}: {raw_datasets['train'][0][key][:200]}")
'REPO_NAME: kmike/scikit-learn'
'PATH: sklearn/utils/__init__.py'
'COPIES: 3'
'SIZE: 10094'
'''CONTENT: """
The :mod:`sklearn.utils` module includes various utilites.
"""

from collections import Sequence

import numpy as np
from scipy.sparse import issparse
import warnings

from .murmurhash import murm
LICENSE: bsd-3-clause'''

We can see that the content field contains the code that we want our model to train on. Now that we have a dataset, we need to prepare the texts so they’re in a format suitable for pretraining.

Preparing the dataset

The first step will be to tokenize the data, so we can use it for training. Since our goal is to mainly autocomplete short function calls, we can keep the context size relatively small. This has the benefit that we can train the model much faster and it requires significantly less memory. If it is important for your application to have more context (for example, if you want the model to write unit tests based on a file with the function definition), make sure you increase that number, but also keep in mind that this comes with a greater GPU memory footprint. For now, let’s fix the context size at 128 tokens, as opposed to the 1,024 or 2,048 used in GPT-2 or GPT-3, respectively.

Most documents contain many more than 128 tokens, so simply truncating the inputs to the maximum length would eliminate a large fraction of our dataset. Instead, we’ll use the return_overflowing_tokens option to tokenize the whole input and split it into several chunks, as we did in Chapter 6. We’ll also use the return_length option to return the length of each created chunk automatically. Often the last chunk will be smaller than the context size, and we’ll get rid of these pieces to avoid padding issues; we don’t really need them as we have plenty of data anyway.

Chunking a large text into several pieces.
Let’s see exactly how this works by looking at the first two examples:

from transformers import AutoTokenizer

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

outputs = tokenizer(
    raw_datasets["train"][:2]["content"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")
Input IDs length: 34
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 117, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 41]
Chunk mapping: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

We can see that we get 34 segments in total from those two examples. Looking at the chunk lengths, we can see that the chunks at the ends of both documents have less than 128 tokens (117 and 41, respectively). These represent just a small fraction of the total chunks that we have, so we can safely throw them away. With the overflow_to_sample_mapping field, we can also reconstruct which chunks belonged to which input samples.

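As a quick illustration (a minimal sketch, reusing the outputs object from the cell above), we can group the chunk lengths by the sample index stored in overflow_to_sample_mapping:

from collections import defaultdict

# Count how many chunks each original sample produced
chunks_per_sample = defaultdict(list)
for length, sample_idx in zip(outputs["length"], outputs["overflow_to_sample_mapping"]):
    chunks_per_sample[sample_idx].append(length)

print({idx: len(lengths) for idx, lengths in chunks_per_sample.items()})  # {0: 20, 1: 14}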

With this operation we’re using a handy feature of the Dataset.map() function in 🤗 Datasets, which is that it does not require one-to-one maps; as we saw in section 3, we can create batches with more or fewer elements than the input batch. This is useful when doing operations like data augmentation or data filtering that change the number of elements. In our case, when tokenizing each element into chunks of the specified context size, we create many samples from each document. We just need to make sure to delete the existing columns, since they have a conflicting size. If we wanted to keep them, we could repeat them appropriately and return them within the Dataset.map() call:

def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets
DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 16702061
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 93164
    })
})

We now have 16.7 million examples with 128 tokens each, which corresponds to about 2.1 billion tokens in total. For reference, OpenAI’s GPT-3 and Codex models are trained on 300 and 100 billion tokens, respectively, where the Codex models are initialized from the GPT-3 checkpoints. Our goal in this section is not to compete with these models, which can generate long, coherent texts, but to create a scaled-down version providing a quick autocomplete function for data scientists.

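As a quick sanity check on that figure (a back-of-the-envelope calculation, not part of the original notebook):

num_examples = tokenized_datasets["train"].num_rows  # 16,702,061
total_tokens = num_examples * context_length  # 128 tokens per example
print(f"{total_tokens / 1e9:.1f}B tokens")  # ~2.1B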

Now that we have the dataset ready, let’s set up the model!

✏️ Try it out! Getting rid of all the chunks that are smaller than the context size wasn’t a big issue here because we’re using small context windows. As you increase the context size (or if you have a corpus of short documents), the fraction of chunks that are thrown away will also grow. A more efficient way to prepare the data is to join all the tokenized samples in a batch with an eos_token_id token in between, and then perform the chunking on the concatenated sequences. As an exercise, modify the tokenize() function to make use of that approach. Note that you’ll want to set truncation=False and remove the other arguments from the tokenizer to get the full sequence of token IDs.

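If you want a starting point for this exercise, here is one possible sketch (the function name is just illustrative, and it assumes the tokenizer and context_length defined above):

def tokenize_concatenate(element):
    # Tokenize without truncation to get the full sequence of token IDs per document
    outputs = tokenizer(element["content"], truncation=False)
    # Join all samples in the batch, separated by the EOS token
    all_ids = []
    for input_ids in outputs["input_ids"]:
        all_ids.extend(input_ids + [tokenizer.eos_token_id])
    # Split the concatenated sequence into fixed-size chunks
    chunks = [
        all_ids[i : i + context_length] for i in range(0, len(all_ids), context_length)
    ]
    # Drop the final chunk if it is shorter than the context size
    if chunks and len(chunks[-1]) < context_length:
        chunks = chunks[:-1]
    return {"input_ids": chunks}

You could then pass this to raw_datasets.map() with batched=True and remove_columns exactly as before; now only the last chunk of each batch, rather than of each document, is thrown away.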

Initializing a new model

Our first step is to freshly initialize a GPT-2 model. We’ll use the same configuration for our model as for the small GPT-2 model, so we load the pretrained configuration, make sure that the tokenizer size matches the model vocabulary size and pass the bos and eos (beginning and end of sequence) token IDs:

from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

With that configuration, we can load a new model. Note that this is the first time we don’t use the from_pretrained() function, since we’re actually initializing a model ourselves:

model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")
GPT-2 size: 124.2M parameters

Our model has 124M parameters that we’ll have to tune. Before we can start training, we need to set up a data collator that will take care of creating the batches. We can use the DataCollatorForLanguageModeling collator, which is designed specifically for language modeling (as the name subtly suggests). Besides stacking and padding batches, it also takes care of creating the language model labels — in causal language modeling the inputs serve as labels too (just shifted by one element), and this data collator creates them on the fly during training so we don’t need to duplicate the input_ids.

Note that DataCollatorForLanguageModeling supports both masked language modeling (MLM) and causal language modeling (CLM). By default it prepares data for MLM, but we can switch to CLM by setting the argument mlm=False:

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

Let’s have a look at an example:

out = data_collator([tokenized_datasets["train"][i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")
input_ids shape: torch.Size([5, 128])
attention_mask shape: torch.Size([5, 128])
labels shape: torch.Size([5, 128])

We can see that the examples have been stacked and all the tensors have the same shape.

⚠️ Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels.

Now we have everything in place to actually train our model — that wasn’t so much work after all! Before we start training we should log in to Hugging Face. If you’re working in a notebook, you can do so with the following utility function:

from huggingface_hub import notebook_login

notebook_login()

This will display a widget where you can enter your Hugging Face login credentials.

If you aren’t working in a notebook, just type the following line in your terminal:

huggingface-cli login

All that’s left to do is configure the training arguments and fire up the Trainer. We’ll use a cosine learning rate schedule with some warmup and an effective batch size of 256 (per_device_train_batch_size * gradient_accumulation_steps). Gradient accumulation is used when a single batch does not fit into memory, and incrementally builds up the gradient through several forward/backward passes. We’ll see this in action when we create the training loop with 🤗 Accelerate.

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="codeparrot-ds",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)

Now we can just start the Trainer and wait for training to finish. Depending on whether you run it on the full or a subset of the training set this will take 20 or 2 hours, respectively, so grab a few coffees and a good book to read!

trainer.train()

After training completes, we can push the model and tokenizer to the Hub:

trainer.push_to_hub()

✏️ Try it out! It only took us about 30 lines of code in addition to the TrainingArguments to get from raw texts to training GPT-2. Try it out with your own dataset and see if you can get good results!

💡 If you have access to a machine with multiple GPUs, try to run the code there. The Trainer automatically manages multiple machines, and this can speed up training tremendously.

Code generation with a pipeline

Now is the moment of truth: let’s see how well the trained model actually works! We can see in the logs that the loss went down steadily, but to put the model to the test let’s take a look at how well it works on some prompts. To do that we’ll wrap the model in a text generation pipeline, and we’ll put it on the GPU for fast generations if there is one available:

import torch
from transformers import pipeline

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
pipe = pipeline(
    "text-generation", model="huggingface-course/codeparrot-ds", device=device
)

Let’s start with the simple task of creating a scatter plot:

txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
plt.scatter(x, y)

# create scatter

The result looks correct. Does it also work for a pandas operation? Let’s see if we can create a DataFrame from two arrays:

txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
df = pd.DataFrame({'x': x, 'y': y})
df.insert(0,'x', x)
for

Nice, that’s the correct answer — although it then inserts the column x again. Since the number of generated tokens is limited, the following for loop is cut off. Let’s see if we can do something a bit more complex and have the model help us use the groupby operation:

txt = """\
# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})

# calculate the mean income per profession
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})

# calculate the mean income per profession
profession = df.groupby(['profession']).mean()

# compute the

Not bad; that’s the right way to do it. Finally, let’s see if we can also use it for scikit-learn and set up a Random Forest model:

txt = """
# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
rf = RandomForestRegressor(n_estimators=300, random_state=random_state, max_depth=3)
rf.fit(X, y)
rf

Looking at these few examples, it seems that the model has learned some of the syntax of the Python data science stack (of course, we would need to evaluate it more thoroughly before deploying the model in the real world). Sometimes it requires more customization of the model training to achieve the necessary performance for a given use case, however. For example, what if we would like to dynamically update the batch size or have a conditional training loop that skips bad examples on the fly? One option would be to subclass the Trainer and add the necessary changes, but sometimes it’s simpler to write the training loop from scratch. That’s where 🤗 Accelerate comes in.

Training with 🤗 Accelerate

We’ve seen how to train a model with the Trainer, which can allow for some customization. However, sometimes we want full control over the training loop, or we want to make some exotic changes. In this case 🤗 Accelerate is a great choice, and in this section we’ll go through the steps to use it to train our model. To make things more interesting, we’ll also add a twist to the training loop.

Since we are mainly interested in sensible autocompletion for the data science libraries, it makes sense to give more weight to training samples that make more use of these libraries. We can easily identify these examples through the use of keywords such as plt, pd, sk, fit, and predict, which are the most frequent import names for matplotlib.pyplot, pandas, and sklearn as well as the fit/predict pattern of the latter. If these are each represented as a single token, we can easily check if they occur in the input sequence. Tokens can have a whitespace prefix, so we’ll also check for those versions in the tokenizer vocabulary. To verify that it works, we’ll add one test token which should be split into multiple tokens:

keytoken_ids = []
for keyword in [
    "plt",
    "pd",
    "sk",
    "fit",
    "predict",
    " plt",
    " pd",
    " sk",
    " fit",
    " predict",
    "testtest",
]:
    ids = tokenizer([keyword]).input_ids[0]
    if len(ids) == 1:
        keytoken_ids.append(ids[0])
    else:
        print(f"Keyword has not single token: {keyword}")
'Keyword has not single token: testtest'

Great, that seems to work nicely! We can now write a custom loss function that takes the input sequence, the logits, and the key tokens we just selected as inputs. First we need to align the logits and inputs: the input sequence shifted by one to the right forms the labels, since the next token is the label for the current token. We can achieve this by starting the labels from the second token of the input sequence, since the model does not make a prediction for the first token anyway. Then we cut off the last logit, as we don’t have a label for the token that follows the full input sequence. With that we can compute the loss per sample and count the occurrences of all keywords in each sample. Finally, we calculate the weighted average over all samples using the occurrences as weights. Since we don’t want to throw away all the samples that have no keywords, we add 1 to the weights:

from torch.nn import CrossEntropyLoss
import torch


def keytoken_weighted_loss(inputs, logits, keytoken_ids, alpha=1.0):
    # Shift so that tokens < n predict n
    shift_labels = inputs[..., 1:].contiguous()
    shift_logits = logits[..., :-1, :].contiguous()
    # Calculate per-token loss
    loss_fct = CrossEntropyLoss(reduce=False)
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    # Resize and average loss per sample
    loss_per_sample = loss.view(shift_logits.size(0), shift_logits.size(1)).mean(axis=1)
    # Calculate and scale weighting
    weights = torch.stack([(inputs == kt).float() for kt in keytoken_ids]).sum(
        axis=[0, 2]
    )
    weights = alpha * (1.0 + weights)
    # Calculate weighted average
    weighted_loss = (loss_per_sample * weights).mean()
    return weighted_loss

Before we can start training with this awesome new loss function, we need to prepare a few things:

  • We need dataloaders to load the data in batches.
  • We need to set up weight decay parameters.
  • From time to time we want to evaluate, so it makes sense to wrap the evaluation code in a function.

Let’s start with the dataloaders. We only need to set the dataset’s format to "torch", and then we can pass it to a PyTorch DataLoader with the appropriate batch size:

from torch.utils.data.dataloader import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=32, shuffle=True)
eval_dataloader = DataLoader(tokenized_datasets["valid"], batch_size=32)

Next, we group the parameters so that the optimizer knows which ones will get an additional weight decay. Usually, all bias and LayerNorm weight terms are exempt from this; here’s how we can do this:

weight_decay = 0.1


def get_grouped_params(model, no_decay=["bias", "LayerNorm.weight"]):
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [
        {"params": params_with_wd, "weight_decay": weight_decay},
        {"params": params_without_wd, "weight_decay": 0.0},
    ]

Since we want to evaluate the model regularly on the validation set during training, let’s write a function for that as well. It just runs through the evaluation dataloader and gathers all the losses across processes:

def evaluate():
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(batch["input_ids"], labels=batch["input_ids"])

        losses.append(accelerator.gather(outputs.loss))
    loss = torch.mean(torch.cat(losses))
    try:
        perplexity = torch.exp(loss)
    except OverflowError:
        perplexity = float("inf")
    return loss.item(), perplexity.item()

With the evaluate() function we can report loss and perplexity at regular intervals. Next, we redefine our model to make sure we train from scratch again:

model = GPT2LMHeadModel(config)

We can then define our optimizer, using the function from before to split the parameters for weight decay:

from torch.optim import AdamW

optimizer = AdamW(get_grouped_params(model), lr=5e-4)

Now let’s prepare the model, optimizer, and dataloaders so we can start training:

from accelerate import Accelerator

accelerator = Accelerator(fp16=True)

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

🚨 If you’re training on a TPU, you’ll need to move all the code starting at the cell above into a dedicated training function. See [Chapter 3] for more details.

Now that we have sent our train_dataloader to accelerator.prepare(), we can use its length to compute the number of training steps. Remember that we should always do this after preparing the dataloader, as that method will change its length. We use a classic linear schedule from the learning rate to 0:

from transformers import get_scheduler

num_train_epochs = 1
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=1_000,
    num_training_steps=num_training_steps,
)

Lastly, to push our model to the Hub, we will need to create a Repository object in a working folder. First log in to the Hugging Face Hub, if you aren’t logged in already. We’ll determine the repository name from the model ID we want to give our model (feel free to replace the repo_name with your own choice; it just needs to contain your username, which is what the function get_full_repo_name() does):

from huggingface_hub import Repository, get_full_repo_name

model_name = "codeparrot-ds-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name
'sgugger/codeparrot-ds-accelerate'

Then we can clone that repository in a local folder. If it already exists, this local folder should be an existing clone of the repository we are working with:

output_dir = "codeparrot-ds-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

We can now upload anything we save in output_dir by calling the repo.push_to_hub() method. This will help us upload the intermediate models at the end of each epoch.

Before we train, let’s run a quick test to see if the evaluation function works properly:

evaluate()
(10.934126853942871, 56057.14453125)

Those are very high values for loss and perplexity, but that’s not surprising as we haven’t trained the model yet. With that, we have everything prepared to write the core part of the training script: the training loop. In the training loop we iterate over the dataloader and pass the batches to the model. With the logits, we can then evaluate our custom loss function. We scale the loss by the number of gradient accumulation steps so as not to create larger losses when aggregating more steps. Before we optimize, we also clip the gradients for better convergence. Finally, every few steps we evaluate the model on the evaluation set with our new evaluate() function:

from tqdm.notebook import tqdm

gradient_accumulation_steps = 8
eval_steps = 5_000

# These two helpers are not defined elsewhere in this section, so minimal versions are
# added here to make the logging below runnable (samples_per_step assumes the per-device
# batch size of 32 used for the dataloaders above).
samples_per_step = accelerator.state.num_processes * 32


def get_lr():
    return optimizer.param_groups[0]["lr"]


model.train()
completed_steps = 0
for epoch in range(num_train_epochs):
    for step, batch in tqdm(
        enumerate(train_dataloader, start=1), total=num_training_steps
    ):
        logits = model(batch["input_ids"]).logits
        loss = keytoken_weighted_loss(batch["input_ids"], logits, keytoken_ids)
        if step % 100 == 0:
            accelerator.print(
                {
                    "lr": get_lr(),
                    "samples": step * samples_per_step,
                    "steps": completed_steps,
                    "loss/train": loss.item() * gradient_accumulation_steps,
                }
            )
        loss = loss / gradient_accumulation_steps
        accelerator.backward(loss)
        if step % gradient_accumulation_steps == 0:
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1
        if (step % (eval_steps * gradient_accumulation_steps)) == 0:
            eval_loss, perplexity = evaluate()
            accelerator.print({"loss/eval": eval_loss, "perplexity": perplexity})
            model.train()
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
            if accelerator.is_main_process:
                tokenizer.save_pretrained(output_dir)
                repo.push_to_hub(
                    commit_message=f"Training in progress step {step}", blocking=False
                )

And that’s it — you now have your own custom training loop for causal language models such as GPT-2 that you can further customize to your needs.

✏️ Try it out! Either create your own custom loss function tailored to your use case, or add another custom step into the training loop.

✏️ Try it out! When running long training experiments it’s a good idea to log important metrics using tools such as TensorBoard or Weights & Biases. Add proper logging to the training loop so you can always check how the training is going.

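As a minimal sketch of what that could look like with TensorBoard (the log directory name is arbitrary, and the writer calls would sit next to the existing accelerator.print() call inside the loop):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/codeparrot-ds")  # pick any log directory

# Inside the training loop, next to the existing accelerator.print() call:
if accelerator.is_main_process and step % 100 == 0:
    writer.add_scalar("lr", get_lr(), completed_steps)
    writer.add_scalar("loss/train", loss.item(), completed_steps)  # raw per-batch loss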