3-Fine-tuning_a_pretrained_model-3-A_full_training

Original course link: https://huggingface.co/course/chapter3/4?fw=pt

A full training

Now we’ll see how to achieve the same results as we did in the last section without using the Trainer class. Again, we assume you have done the data processing in section 2. Here is a short summary covering everything you will need:

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Prepare for training

Before actually writing our training loop, we will need to define a few objects. The first ones are the dataloaders we will use to iterate over batches. But before we can define those dataloaders, we need to apply a bit of postprocessing to our tokenized_datasets, to take care of some things that the Trainer did for us automatically. Specifically, we need to:

  • Remove the columns corresponding to values the model does not expect (like the sentence1 and sentence2 columns).
  • Rename the column label to labels (because the model expects the argument to be named labels).
  • Set the format of the datasets so they return PyTorch tensors instead of lists.

Our tokenized_datasets has one method for each of those steps:

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

We can then check that the result only has columns that our model will accept:

["attention_mask", "input_ids", "labels", "token_type_ids"]

Now that this is done, we can easily define our dataloaders:

from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check there is no mistake in the data processing, we can inspect a batch like this:

for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
{'attention_mask': torch.Size([8, 65]),
 'input_ids': torch.Size([8, 65]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 65])}

Note that the actual shapes will probably be slightly different for you since we set shuffle=True for the training dataloader and we are padding to the maximum length inside the batch.

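If you're curious to see the dynamic padding in action, an optional check like the one below will most likely print a different sequence length for each batch; none of this is required for training.

# Optional check, not part of the course code: each batch is padded only up to the
# longest sequence it contains, so the second dimension changes from batch to batch.
for i, batch in enumerate(train_dataloader):
    print(i, batch["input_ids"].shape)
    if i == 2:
        break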

Now that we’re completely finished with data preprocessing (a satisfying yet elusive goal for any ML practitioner), let’s turn to the model. We instantiate it exactly as we did in the previous section:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

To make sure that everything will go smoothly during training, we pass our batch to this model:

outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)
tensor(0.5441, grad_fn=<NllLossBackward>) torch.Size([8, 2])

All 🤗 Transformers models will return the loss when labels are provided, and we also get the logits (two for each input in our batch, so a tensor of size 8 x 2).

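If you want to convince yourself of where that loss comes from, here is an optional sanity check (not part of the course code): for a sequence classification head, the loss is just the cross-entropy between the logits and the labels, so recomputing it by hand should give (almost exactly) the same value.

import torch.nn.functional as F

# Optional sanity check: recompute the loss from the logits and the labels by hand.
manual_loss = F.cross_entropy(outputs.logits, batch["labels"])
print(manual_loss)  # should match outputs.loss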

We’re almost ready to write our training loop! We’re just missing two things: an optimizer and a learning rate scheduler. Since we are trying to replicate what the Trainer was doing by hand, we will use the same defaults. The optimizer used by the Trainer is AdamW, which is the same as Adam, but with a twist for weight decay regularization (see “Decoupled Weight Decay Regularization” by Ilya Loshchilov and Frank Hutter):

from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5) to 0. To properly define it, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (which is the length of our training dataloader). The Trainer uses three epochs by default, so we will follow that:

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
1377

The training loop

One last thing: we will want to use the GPU if we have access to one (on a CPU, training might take several hours instead of a couple of minutes). To do this, we define a device we will put our model and our batches on:

import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device
device(type='cuda')

We are now ready to train! To get some sense of when training will be finished, we add a progress bar over our number of training steps, using the tqdm library:

from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

You can see that the core of the training loop looks a lot like the one in the introduction. We didn’t ask for any reporting, so this training loop will not tell us anything about how the model fares. We need to add an evaluation loop for that.

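If you do want a minimal form of reporting before we get to evaluation, one hypothetical one-line tweak (not part of the course loop) is to display the current loss next to the progress bar; tqdm's set_postfix is meant for exactly that.

# Hypothetical addition inside the inner training loop, right after progress_bar.update(1):
progress_bar.set_postfix(loss=loss.item())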

The evaluation loop

As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We’ve already seen the metric.compute() method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method add_batch(). Once we have accumulated all the batches, we can get the final result with metric.compute(). Here’s how to implement all of this in an evaluation loop:

import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
{'accuracy': 0.8431372549019608, 'f1': 0.8907849829351535}

Again, your results will be slightly different because of the randomness in the model head initialization and the data shuffling, but they should be in the same ballpark.

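If you need run-to-run reproducibility, one option (not used in this section) is to fix the random seeds before creating the model and the dataloaders, for instance with the set_seed helper from 🤗 Transformers:

from transformers import set_seed

# Fix the Python, NumPy and PyTorch seeds so the head initialization and the
# shuffling are reproducible (call this before building the model and dataloaders).
set_seed(42)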

✏️ Try it out! Modify the previous training loop to fine-tune your model on the SST-2 dataset.

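If you want a nudge to get started: SST-2 is a single-sentence task, so only the data preparation changes while the training loop itself can stay as it is. A possible starting point (column names taken from the GLUE SST-2 dataset) could look like this:

# Sketch of the data preparation for SST-2 (a single sentence per example).
raw_datasets = load_dataset("glue", "sst2")


def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

The evaluation metric would then be loaded with evaluate.load("glue", "sst2").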

Supercharge your training loop with 🤗 Accelerate

The training loop we defined earlier works fine on a single CPU or GPU. But using the 🤗 Accelerate library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs. Starting from the creation of the training and validation dataloaders, here is what our manual training loop looks like:

from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

And here are the changes:

+ from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

+ accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
- model.to(device)

+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
+     train_dataloader, eval_dataloader, model, optimizer
+ )

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
-       batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
-       loss.backward()
+       accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

The first line to add is the import line. The second line instantiates an Accelerator object that will look at the environment and initialize the proper distributed setup. 🤗 Accelerate handles the device placement for you, so you can remove the lines that put the model on the device (or, if you prefer, change them to use accelerator.device instead of device).

Then the main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to accelerator.prepare(). This will wrap those objects in the proper container to make sure your distributed training works as intended. The remaining changes to make are removing the line that puts the batch on the device (again, if you want to keep this you can just change it to use accelerator.device) and replacing loss.backward() with accelerator.backward(loss).

⚠️ In order to benefit from the speed-up offered by Cloud TPUs, we recommend padding your samples to a fixed length with the padding="max_length" and max_length arguments of the tokenizer.

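As a sketch, the tokenization function could be adapted along these lines (the max_length value of 128 is an arbitrary choice for illustration):

# Pad every sample to the same fixed length so the TPU sees identical shapes each step.
def tokenize_function(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )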

If you’d like to copy and paste it to play around, here’s what the complete training loop looks like with 🤗 Accelerate:

from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Putting this in a train.py script will make that script runnable on any kind of distributed setup. To try it out in your distributed setup, run the command:

accelerate config

which will prompt you to answer a few questions and dump your answers in a configuration file used by this command:

accelerate launch train.py

which will launch the distributed training.

If you want to try this in a Notebook (for instance, to test it with TPUs on Colab), just paste the code in a training_function() and run a last cell with:

from accelerate import notebook_launcher

notebook_launcher(training_function)

You can find more examples in the 🤗 Accelerate repo.
