8-How_to_ask_for_help-3-Debugging_the_training_pipeline

Original course link: https://huggingface.co/course/chapter8/4?fw=pt

Debugging the training pipeline

You’ve written a beautiful script to train or fine-tune a model on a given task, dutifully following the advice from Chapter 7. But when you launch the command trainer.train(), something horrible happens: you get an error 😱! Or worse, everything seems to be fine and the training runs without error, but the resulting model is crappy. In this section, we will show you what you can do to debug these kinds of issues.

Debugging the training pipeline

The problem when you encounter an error in trainer.train() is that it could come from multiple sources, as the Trainer usually puts together lots of things. It converts datasets to dataloaders, so the problem could be something wrong in your dataset, or some issue when trying to batch elements of the datasets together. Then it takes a batch of data and feeds it to the model, so the problem could be in the model code. After that, it computes the gradients and performs the optimization step, so the problem could also be in your optimizer. And even if everything goes well for training, something could still go wrong during the evaluation if there is a problem with your metric.

The best way to debug an error that arises in trainer.train() is to manually go through this whole pipeline to see where things went awry. The error is then often very easy to solve.

To demonstrate this, we will use the following script that (tries to) fine-tune a DistilBERT model on the MNLI dataset:

from datasets import load_dataset
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

raw_datasets = load_dataset("glue", "mnli")

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)


tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

args = TrainingArguments(
    "distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model,
    args,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["validation_matched"],
    compute_metrics=compute_metrics,
)
trainer.train()

If you try to execute it, you will be met with a rather cryptic error:

'ValueError: You have to specify either input_ids or inputs_embeds'

Check your data

This goes without saying, but if your data is corrupted, the Trainer is not going to be able to form batches, let alone train your model. So first things first, you need to have a look at what is inside your training set.

To avoid countless hours spent trying to fix something that is not the source of the bug, we recommend you use trainer.train_dataset for your checks and nothing else. So let’s do that here:

trainer.train_dataset[0]
{'hypothesis': 'Product and geography are what make cream skimming work. ',
'idx': 0,
'label': 1,
'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.'}

Do you notice something wrong? This, in conjunction with the error message about input_ids missing, should make you realize those are texts, not numbers the model can make sense of. Here, the original error is very misleading because the Trainer automatically removes the columns that don’t match the model signature (that is, the arguments expected by the model). That means here, everything apart from the labels was discarded. There was thus no issue with creating batches and then sending them to the model, which in turn complained it didn’t receive the proper input.
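
You can see this for yourself by listing the columns of the dataset the Trainer was given; a quick sketch in the same notebook style as the other snippets:

# The raw MNLI columns are "premise", "hypothesis", "idx", and "label": no "input_ids"
# in sight, so everything except "label" gets dropped before batching.
trainer.train_dataset.column_names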

Why wasn’t the data processed? We did use the Dataset.map() method on the datasets to apply the tokenizer on each sample. But if you look closely at the code, you will see that we made a mistake when passing the training and evaluation sets to the Trainer. Instead of using tokenized_datasets here, we used raw_datasets 🤦. So let’s fix this!

from datasets import load_dataset
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

raw_datasets = load_dataset("glue", "mnli")

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)


tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

args = TrainingArguments(
    "distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation_matched"],
    compute_metrics=compute_metrics,
)
trainer.train()

This new code will now give a different error (progress!):

'ValueError: expected sequence of length 43 at dim 1 (got 37)'

Looking at the traceback, we can see the error happens in the data collation step:

~/git/transformers/src/transformers/data/data_collator.py in torch_default_data_collator(features)
105 batch[k] = torch.stack([f[k] for f in features])
106 else:
--> 107 batch[k] = torch.tensor([f[k] for f in features])
108
109 return batch

So, we should move to that. Before we do, however, let’s finish inspecting our data, just to be 100% sure it’s correct.

One thing you should always do when debugging a training session is have a look at the decoded inputs of your model. We can’t make sense of the numbers that we feed it directly, so we should look at what those numbers represent. In computer vision, for example, that means looking at the decoded pictures of the pixels you pass, in speech it means listening to the decoded audio samples, and for our NLP example here it means using our tokenizer to decode the inputs:

tokenizer.decode(trainer.train_dataset[0]["input_ids"])
'[CLS] conceptually cream skimming has two basic dimensions - product and geography. [SEP] product and geography are what make cream skimming work. [SEP]'

So that seems correct. You should do this for all the keys in the inputs:

trainer.train_dataset[0].keys()
dict_keys(['attention_mask', 'hypothesis', 'idx', 'input_ids', 'label', 'premise'])

Note that the keys that don’t correspond to inputs accepted by the model will be automatically discarded, so here we will only keep input_ids, attention_mask, and label (which will be renamed labels). To double-check the model signature, you can print the class of your model, then go check its documentation:

type(trainer.model)
transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification

So in our case, we can check the parameters accepted in the documentation of DistilBertForSequenceClassification. The Trainer will also log the columns it’s discarding.
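
If you prefer checking the signature programmatically instead of in the docs, here is a small sketch using Python's standard inspect module:

import inspect

# These argument names are what the Trainer matches the dataset columns against.
signature = inspect.signature(trainer.model.forward)
print(list(signature.parameters.keys()))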

We have checked that the input IDs are correct by decoding them. Next is the attention_mask:

trainer.train_dataset[0]["attention_mask"]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Since we didn’t apply padding in our preprocessing, this seems perfectly natural. To be sure there is no issue with that attention mask, let’s check it is the same length as our input IDs:

len(trainer.train_dataset[0]["attention_mask"]) == len(
    trainer.train_dataset[0]["input_ids"]
)
True

That’s good! Lastly, let’s check our label:

trainer.train_dataset[0]["label"]
1

Like the input IDs, this is a number that doesn’t really make sense on its own. As we saw before, the map between integers and label names is stored inside the names attribute of the corresponding feature of the dataset:

trainer.train_dataset.features["label"].names
['entailment', 'neutral', 'contradiction']

So 1 means neutral, which means the two sentences we saw above are not in contradiction, and the first one does not imply the second one. That seems correct!

We don’t have token type IDs here, since DistilBERT does not expect them; if you have some in your model, you should also make sure that they properly match where the first and second sentences are in the input.
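
For instance, with a checkpoint like bert-base-uncased whose tokenizer does produce them, a quick way to eyeball that alignment is to print each token next to its segment ID. This is only a sketch and assumes a token_type_ids column is present in your processed dataset:

sample = trainer.train_dataset[0]
tokens = tokenizer.convert_ids_to_tokens(sample["input_ids"])

# Segment 0 should cover the first sentence (and its special tokens), segment 1 the second.
for token, type_id in zip(tokens, sample["token_type_ids"]):
    print(token, type_id)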

✏️ Your turn! Check that everything seems correct with the second element of the training dataset.

We are only doing the check on the training set here, but you should of course double-check the validation and test sets the same way.
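
The same spot checks only require swapping in trainer.eval_dataset, for example:

# Decode the first validation example and look at its label.
print(tokenizer.decode(trainer.eval_dataset[0]["input_ids"]))
print(trainer.eval_dataset[0]["label"])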

Now that we know our datasets look good, it’s time to check the next step of the training pipeline.

From datasets to dataloaders

The next thing that can go wrong in the training pipeline is when the Trainer tries to form batches from the training or validation set. Once you are sure the Trainer’s datasets are correct, you can try to manually form a batch by executing the following (replace train with eval for the validation dataloader):

for batch in trainer.get_train_dataloader():
    break

This code creates the training dataloader, then iterates through it, stopping at the first iteration. If the code executes without error, you have the first training batch that you can inspect, and if the code errors out, you know for sure the problem is in the dataloader, as is the case here:

~/git/transformers/src/transformers/data/data_collator.py in torch_default_data_collator(features)
105 batch[k] = torch.stack([f[k] for f in features])
106 else:
--> 107 batch[k] = torch.tensor([f[k] for f in features])
108
109 return batch

ValueError: expected sequence of length 45 at dim 1 (got 76)

Inspecting the last frame of the traceback should be enough to give you a clue, but let’s do a bit more digging. Most of the problems during batch creation arise because of the collation of examples into a single batch, so the first thing to check when in doubt is what collate_fn your DataLoader is using:

data_collator = trainer.get_train_dataloader().collate_fn
data_collator
<function transformers.data.data_collator.default_data_collator(features: List[InputDataClass], return_tensors='pt') -> Dict[str, Any]>

So this is the default_data_collator, but that’s not what we want in this case. We want to pad our examples to the longest sentence in the batch, which is done by the DataCollatorWithPadding collator. And this data collator is supposed to be used by default by the Trainer, so why is it not used here?

The answer is because we did not pass the tokenizer to the Trainer, so it couldn’t create the DataCollatorWithPadding we want. In practice, you should never hesitate to explicitly pass along the data collator you want to use, to make sure you avoid these kinds of errors. Let’s adapt our code to do exactly that:

from datasets import load_dataset
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

raw_datasets = load_dataset("glue", "mnli")

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)


tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

args = TrainingArguments(
    "distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation_matched"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

The good news? We don’t get the same error as before, which is definitely progress. The bad news? We get an infamous CUDA error instead:

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

This is bad because CUDA errors are extremely hard to debug in general. We will see in a minute how to solve this, but first let’s finish our analysis of batch creation.

If you are sure your data collator is the right one, you should try to apply it on a couple of samples of your dataset:

data_collator = trainer.get_train_dataloader().collate_fn
batch = data_collator([trainer.train_dataset[i] for i in range(4)])

This code will fail because the train_dataset contains string columns, which the Trainer usually removes. You can remove them manually, or if you want to replicate exactly what the Trainer is doing behind the scenes, you can call the private Trainer._remove_unused_columns() method that does that:

data_collator = trainer.get_train_dataloader().collate_fn
actual_train_set = trainer._remove_unused_columns(trainer.train_dataset)
batch = data_collator([actual_train_set[i] for i in range(4)])

You should then be able to manually debug what happens inside the data collator if the error persists.
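
Once the collation succeeds, a quick sanity check is to look at the shapes of what the collator produced; every tensor should share the same sequence length within the batch:

# input_ids and attention_mask should both have shape (batch_size, longest_sequence_in_batch).
for key, value in batch.items():
    print(key, value.shape)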

Now that we’ve debugged the batch creation process, it’s time to pass one through the model!

Going through the model

You should be able to get a batch by executing the following command:

for batch in trainer.get_train_dataloader():
    break

If you’re running this code in a notebook, you may get a CUDA error that’s similar to the one we saw earlier, in which case you need to restart your notebook and reexecute the last snippet without the trainer.train() line. That’s the second most annoying thing about CUDA errors: they irremediably break your kernel. The most annoying thing about them is the fact that they are hard to debug.

Why is that? It has to do with the way GPUs work. They are extremely efficient at executing a lot of operations in parallel, but the drawback is that when one of those instructions results in an error, you don’t know it instantly. It’s only when the program calls a synchronization of the multiple processes on the GPU that it will realize something went wrong, so the error is actually raised at a place that has nothing to do with what created it. For instance, if we look at our previous traceback, the error was raised during the backward pass, but we will see in a minute that it actually stems from something in the forward pass.

So how do we debug those errors? The answer is easy: we don’t. Unless your CUDA error is an out-of-memory error (which means there is not enough memory in your GPU), you should always go back to the CPU to debug it.

To do this in our case, we just have to put the model back on the CPU and call it on our batch — the batch returned by the DataLoader has not been moved to the GPU yet:

outputs = trainer.model.cpu()(**batch)
~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
2386 )
2387 if dim == 2:
-> 2388 ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
2389 elif dim == 4:
2390 ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

IndexError: Target 2 is out of bounds.

So, the picture is getting clearer. Instead of having a CUDA error, we now have an IndexError in the loss computation (so nothing to do with the backward pass, as we said earlier). More precisely, we can see that it’s target 2 that creates the error, so this is a very good moment to check the number of labels of our model:

trainer.model.config.num_labels
2

With two labels, only 0s and 1s are allowed as targets, but according to the error message we got a 2. Getting a 2 is actually normal: if we remember the label names we extracted earlier, there were three, so we have indices 0, 1, and 2 in our dataset. The problem is that we didn’t tell that to our model, which should have been created with three labels. So let’s fix that!

from datasets import load_dataset
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

raw_datasets = load_dataset("glue", "mnli")

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)


tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)

args = TrainingArguments(
    "distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation_matched"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

We aren’t including the trainer.train() line yet, to take the time to check that everything looks good. If we request a batch and pass it to our model, it now works without error!

for batch in trainer.get_train_dataloader():
    break

outputs = trainer.model.cpu()(**batch)

The next step is then to move back to the GPU and check that everything still works:

import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
batch = {k: v.to(device) for k, v in batch.items()}

outputs = trainer.model.to(device)(**batch)

If you still get an error, make sure you restart your notebook and only execute the last version of the script.

Performing one optimization step

Now that we know that we can build batches that actually go through the model, we are ready for the next step of the training pipeline: computing the gradients and performing an optimization step.

The first part is just a matter of calling the backward() method on the loss:

loss = outputs.loss
loss.backward()

It’s pretty rare to get an error at this stage, but if you do get one, make sure to go back to the CPU to get a helpful error message.

To perform the optimization step, we just need to create the optimizer and call its step() method:

trainer.create_optimizer()
trainer.optimizer.step()

Again, if you’re using the default optimizer in the Trainer, you shouldn’t get an error at this stage, but if you have a custom optimizer, there might be some problems to debug here. Don’t forget to go back to the CPU if you get a weird CUDA error at this stage. Speaking of CUDA errors, earlier we mentioned a special case. Let’s have a look at that now.

Dealing with CUDA out-of-memory errors

Whenever you get an error message that starts with RuntimeError: CUDA out of memory, this indicates that you are out of GPU memory. This is not directly linked to your code, and it can happen with a script that runs perfectly fine. This error means that you tried to put too many things in the internal memory of your GPU, and that resulted in an error. Like with other CUDA errors, you will need to restart your kernel to be in a spot where you can run your training again.

To solve this issue, you just need to use less GPU space — something that is often easier said than done. First, make sure you don’t have two models on the GPU at the same time (unless that’s required for your problem, of course). Then, you should probably reduce your batch size, as it directly affects the sizes of all the intermediate outputs of the model and their gradients. If the problem persists, consider using a smaller version of your model.
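
The batch size is controlled by per_device_train_batch_size in TrainingArguments (the default is 8), so a sketch of that change for our script would look like this:

args = TrainingArguments(
    "distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    per_device_train_batch_size=4,  # halve the default of 8 to use less GPU memory
    per_device_eval_batch_size=4,
)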

In the next part of the course, we’ll look at more advanced techniques that can help you reduce your memory footprint and let you fine-tune the biggest models.

Evaluating the model

Now that we’ve solved all the issues with our code, everything is perfect and the training should run smoothly, right? Not so fast! If you run the trainer.train() command, everything will look good at first, but after a while you will get the following:

# This will take a long time and error out, so you shouldn't run this cell
trainer.train()
TypeError: only size-1 arrays can be converted to Python scalars

You will realize this error appears during the evaluation phase, so this is the last thing we will need to debug.

You can run the evaluation loop of the Trainer independently from the training like this:

trainer.evaluate()
TypeError: only size-1 arrays can be converted to Python scalars

💡 You should always make sure you can run trainer.evaluate() before launching trainer.train(), to avoid wasting lots of compute resources before hitting an error.

Before attempting to debug a problem in the evaluation loop, you should first make sure that you’ve had a look at the data, are able to form a batch properly, and can run your model on it. We’ve completed all of those steps, so the following code can be executed without error:

for batch in trainer.get_eval_dataloader():
    break

batch = {k: v.to(device) for k, v in batch.items()}

with torch.no_grad():
    outputs = trainer.model(**batch)

The error comes later, at the end of the evaluation phase, and if we look at the traceback we see this:

~/git/datasets/src/datasets/metric.py in add_batch(self, predictions, references)
431 """
432 batch = {"predictions": predictions, "references": references}
--> 433 batch = self.info.features.encode_batch(batch)
434 if self.writer is None:
435 self._init_writer()

This tells us that the error originates in the datasets/metric.py module — so this is a problem with our compute_metrics() function. It takes a tuple with the logits and the labels as NumPy arrays, so let’s try to feed it that:

predictions = outputs.logits.cpu().numpy()
labels = batch["labels"].cpu().numpy()

compute_metrics((predictions, labels))
TypeError: only size-1 arrays can be converted to Python scalars

We get the same error, so the problem definitely lies with that function. If we look back at its code, we see it’s just forwarding the predictions and the labels to metric.compute(). So is there a problem with that method? Not really. Let’s have a quick look at the shapes:

predictions.shape, labels.shape
((8, 3), (8,))

Our predictions are still logits, not the actual predictions, which is why the metric is returning this (somewhat obscure) error. The fix is pretty easy; we just have to add an argmax in the compute_metrics() function:

import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)


compute_metrics((predictions, labels))
{'accuracy': 0.625}

Now our error is fixed! This was the last one, so our script will now train a model properly.

For reference, here is the completely fixed script:

import numpy as np
from datasets import load_dataset
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

raw_datasets = load_dataset("glue", "mnli")

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)


tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)

args = TrainingArguments(
    "distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation_matched"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

In this instance, there are no more problems, and our script will fine-tune a model that should give reasonable results. But what can we do when the training proceeds without any error, and the model trained does not perform well at all? That’s the hardest part of machine learning, and we’ll show you a few techniques that can help.

💡 If you’re using a manual training loop, the same steps apply to debug your training pipeline, but it’s easier to separate them. Make sure you have not forgotten the model.eval() or model.train() at the right places, or the zero_grad() at each step, however!
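
As a reminder, here is a minimal sketch of where those calls belong, assuming model, optimizer, train_dataloader, and eval_dataloader are already defined:

import torch

model.train()  # enable dropout etc. before the training steps
for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()  # don't forget this at each step

model.eval()  # switch to evaluation mode before computing metrics
with torch.no_grad():
    for batch in eval_dataloader:
        outputs = model(**batch)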

Debugging silent errors during training

What can we do to debug a training that completes without error but doesn’t get good results? We’ll give you some pointers here, but be aware that this kind of debugging is the hardest part of machine learning, and there is no magical answer.

Check your data (again!)

Your model will only learn something if it’s actually possible to learn anything from your data. If there is a bug that corrupts the data or the labels are attributed randomly, it’s very likely you won’t get any model training on your dataset. So always start by double-checking your decoded inputs and labels, and ask yourself the following questions:

  • Is the decoded data understandable?
  • Do you agree with the labels?
  • Is there one label that’s more common than the others?
  • What should the loss/metric be if the model predicted a random answer/always the same answer?

⚠️ If you are doing distributed training, print samples of your dataset in each process and triple-check that you get the same thing. One common bug is to have some source of randomness in the data creation that makes each process have a different version of the dataset.
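
For the third question in that list, a quick sketch that counts how often each label occurs in the training set:

from collections import Counter

# A heavily skewed label distribution is a red flag for classification problems.
Counter(tokenized_datasets["train"]["label"])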

After looking at your data, go through a few of the model’s predictions and decode them too. If the model is always predicting the same thing, it might be because your dataset is biased toward one category (for classification problems); techniques like oversampling rare classes might help.
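
Here is a sketch of that check for our classification example, assuming you already have a batch on the same device as the model (as in the snippets earlier in this section):

import torch

label_names = trainer.train_dataset.features["label"].names

with torch.no_grad():
    outputs = trainer.model(**batch)
preds = outputs.logits.argmax(dim=-1)

# Print each decoded input next to the label name the model predicted for it.
for input_ids, pred in zip(batch["input_ids"], preds):
    print(tokenizer.decode(input_ids, skip_special_tokens=True), "->", label_names[pred.item()])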

If the loss/metric you get on your initial model is very different from the loss/metric you would expect for random predictions, double-check the way your loss or metric is computed, as there is probably a bug there. If you are using several losses that you add at the end, make sure they are of the same scale.
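
For instance, on a balanced three-class problem like MNLI, a model predicting uniformly at random should start out with a cross-entropy loss of about ln(3):

import math

# Expected cross-entropy loss for uniform random predictions over 3 classes
print(-math.log(1 / 3))  # ≈ 1.0986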

When you are sure your data is perfect, you can see if the model is capable of training on it with one simple test.

Overfit your model on one batch

Overfitting is usually something we try to avoid when training, as it means the model is not learning to recognize the general features we want it to but is instead just memorizing the training samples. However, trying to train your model on one batch over and over again is a good test to check if the problem as you framed it can be solved by the model you are attempting to train. It will also help you see if your initial learning rate is too high.

Doing this once you have defined your Trainer is really easy; just grab a batch of training data, then run a small manual training loop only using that batch for something like 20 steps:

for batch in trainer.get_train_dataloader():
    break

batch = {k: v.to(device) for k, v in batch.items()}
trainer.create_optimizer()

for _ in range(20):
    outputs = trainer.model(**batch)
    loss = outputs.loss
    loss.backward()
    trainer.optimizer.step()
    trainer.optimizer.zero_grad()

💡 If your training data is unbalanced, make sure to build a batch of training data containing all the labels.

The resulting model should have close-to-perfect results on the same batch. Let’s compute the metric on the resulting predictions:

with torch.no_grad():
    outputs = trainer.model(**batch)
preds = outputs.logits
labels = batch["labels"]

compute_metrics((preds.cpu().numpy(), labels.cpu().numpy()))
{'accuracy': 1.0}

100% accuracy, now this is a nice example of overfitting (meaning that if you try your model on any other sentence, it will very likely give you a wrong answer)!

If you don’t manage to have your model obtain perfect results like this, it means there is something wrong with the way you framed the problem or your data, so you should fix that. Only when you manage to pass the overfitting test can you be sure that your model can actually learn something.

⚠️ You will have to recreate your model and your Trainer after this test, as the model obtained probably won’t be able to recover and learn something useful on your full dataset.

Don’t tune anything until you have a first baseline

Hyperparameter tuning is always emphasized as being the hardest part of machine learning, but it’s just the last step to help you gain a little bit on the metric. Most of the time, the default hyperparameters of the Trainer will work just fine to give you good results, so don’t launch into a time-consuming and costly hyperparameter search until you have something that beats the baseline you have on your dataset.

Once you have a good enough model, you can start tweaking a bit. Don’t try launching a thousand runs with different hyperparameters, but compare a couple of runs with different values for one hyperparameter to get an idea of which has the greatest impact.

If you are tweaking the model itself, keep it simple and don’t try anything you can’t reasonably justify. Always make sure you go back to the overfitting test to verify that your change hasn’t had any unintended consequences.

Ask for help

Hopefully you will have found some advice in this section that helped you solve your issue, but if that’s not the case, remember you can always ask the community on the forums.

Here are some additional resources that may prove helpful:

  • "Reproducibility as a vehicle for engineering best practices", a talk by Joel Grus
  • "Checklist for debugging neural networks" by Cecelia Shao
  • "How to unit test machine learning code" by Chase Roberts
  • "A Recipe for Training Neural Networks" by Andrej Karpathy

Of course, not every problem you encounter when training neural nets is your own fault! If you encounter something in the 🤗 Transformers or 🤗 Datasets library that does not seem right, you may have encountered a bug. You should definitely tell us all about it, and in the next section we’ll explain exactly how to do that.