Chapter 7: Main NLP tasks
Original course link: https://huggingface.co/course/chapter7/9?fw=pt
End-of-chapter quiz
Let’s test what you learned in this chapter!
- Which of the following tasks can be framed as a token classification problem?
Find the grammatical components in a sentence.
Find whether a sentence is grammatically correct or not.
Find the persons mentioned in a sentence.
Find the chunk of words in a sentence that answers a question.
- What part of the preprocessing for token classification differs from the other preprocessing pipelines?
There is no need to do anything; the texts are already tokenized.
The texts are given as words, so we only need to apply subword tokenization.
We use -100 to label the special tokens.
We need to make sure to truncate or pad the labels to the same size as the inputs, when applying truncation/padding.
- What problem arises when we tokenize the words in a token classification problem and want to label the tokens?
The tokenizer adds special tokens and we have no labels for them.
Each word can produce several tokens, so we end up with more tokens than we have labels (see the sketch below).
The added tokens have no labels, so there is no problem.
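Both issues above (special tokens without labels, and words splitting into several tokens) are usually handled by aligning the labels through the fast tokenizer’s word_ids() mapping, marking with -100 every position the loss should ignore. Here is a minimal sketch, assuming a fast tokenizer; the checkpoint, words, and labels are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["My", "name", "is", "Sylvain"]  # pre-split words
word_labels = [0, 0, 0, 1]               # illustrative tags, e.g. 1 = B-PER

# is_split_into_words=True tells the tokenizer the input is already split into words
encoding = tokenizer(words, is_split_into_words=True)

aligned_labels = []
for word_id in encoding.word_ids():
    if word_id is None:
        aligned_labels.append(-100)  # special tokens: ignored by the loss
    else:
        aligned_labels.append(word_labels[word_id])  # subwords inherit their word's label

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(aligned_labels)  # one label per token, -100 on [CLS] and [SEP]
```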
- What does “domain adaptation” mean?
It’s when we run a model on a dataset and get the predictions for each sample in that dataset.
It’s when we train a model on a dataset.
It’s when we fine-tune a pretrained model on a new dataset, and it gives predictions that are more adapted to that dataset (see the sketch below).
It’s when we add misclassified samples to a dataset to make our model more robust.
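As an illustration, the course performs domain adaptation by continuing masked language model pretraining on in-domain text. A minimal sketch of that idea; the checkpoint, dataset slice, and output directory are illustrative choices:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# A small in-domain corpus; the course uses IMDb movie reviews
dataset = load_dataset("imdb", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments("domain-adapted-distilbert", num_train_epochs=1),
    train_dataset=tokenized,
    # The collator masks tokens on the fly, so no human-written labels are needed
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)
trainer.train()  # predictions are now better adapted to movie reviews
```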
- What are the labels in a masked language modeling problem?
Some of the tokens in the input sentence are randomly masked and the labels are the original input tokens (see the sketch below).
Some of the tokens in the input sentence are randomly masked and the labels are the original input tokens, shifted to the left.
Some of the tokens in the input sentence are randomly masked, and the label is whether the sentence is positive or negative.
Some of the tokens in the two input sentences are randomly masked, and the label is whether the two sentences are similar or not.
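In 🤗 Transformers, DataCollatorForLanguageModeling builds exactly these labels: it masks random tokens and keeps the original input ids as labels, with -100 at the positions the loss should ignore. A quick sketch with an illustrative checkpoint and sentence:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

batch = collator([tokenizer("This is a sentence.")])
print(batch["input_ids"])  # some tokens randomly replaced by [MASK]
print(batch["labels"])     # original token ids at masked positions, -100 everywhere else
```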
- Which of these tasks can be seen as a sequence-to-sequence problem?
Writing short reviews of long documents
Answering questions about a document
Translating a text in Chinese into English
Fixing the messages sent by my nephew/friend so they’re in proper English
- What is the proper way to preprocess the data for a sequence-to-sequence problem?
The inputs and targets have to be sent together to the tokenizer with inputs=... and targets=....
The inputs and the targets both have to be preprocessed, in two separate calls to the tokenizer.
As usual, we just have to tokenize the inputs.
The inputs have to be sent to the tokenizer, and the targets too, but under a special context manager (see the sketch below).
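For reference, recent versions of 🤗 Transformers accept the targets in the same call through text_target=..., while older versions (including earlier editions of the course) use the as_target_tokenizer() context manager. A minimal sketch; the checkpoint and sentence pair are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

inputs = "I love reading."
targets = "J'adore lire."

# Recent API: one call, targets passed via text_target
model_inputs = tokenizer(inputs, text_target=targets, truncation=True)
# model_inputs["labels"] now holds the tokenized target ids

# Older API (used in earlier editions of the course): a special context manager
model_inputs = tokenizer(inputs, truncation=True)
with tokenizer.as_target_tokenizer():
    model_inputs["labels"] = tokenizer(targets, truncation=True)["input_ids"]
```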
- Why is there a specific subclass of Trainer for sequence-to-sequence problems?
Because sequence-to-sequence problems use a custom loss, to ignore the labels set to -100
Because sequence-to-sequence problems require a special evaluation loop (see the sketch below)
Because the targets are texts in sequence-to-sequence problems
Because we use two models in sequence-to-sequence problems
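Concretely, when predict_with_generate=True, the evaluation loop of Seq2SeqTrainer calls model.generate() to produce full predicted sequences instead of doing a single forward pass, which a plain Trainer cannot do. A minimal sketch; the checkpoint, sentence pairs, and output directory are illustrative:

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Tiny illustrative eval set: two English -> French pairs
pairs = [("I love reading.", "J'adore lire."), ("Hello!", "Bonjour !")]
eval_dataset = [tokenizer(src, text_target=tgt, truncation=True) for src, tgt in pairs]

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments("tmp-seq2seq", predict_with_generate=True),
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
# The evaluation loop calls model.generate() rather than a single forward pass
print(trainer.evaluate())
```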
- When should you pretrain a new model?
When there is no pretrained model available for your specific language
When you have lots of data available, even if there is a pretrained model that could work on it
When you have concerns about the bias of the pretrained model you are using
When the pretrained models available are just not good enough
- Why is it easy to pretrain a language model on lots and lots of texts?
Because there are plenty of texts available on the internet
Because the pretraining objective does not require humans to label the data
Because the 🤗 Transformers library only requires a few lines of code to start the training
- What are the main challenges when preprocessing data for a question answering task?
You need to tokenize the inputs.
You need to deal with very long contexts, which give several training features that may or may not have the answer in them.
You need to tokenize the answers to the question as well as the inputs.
From the answer span in the text, you have to find the start and end token in the tokenized input (see the sketch below).
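A minimal sketch of these steps, assuming a fast tokenizer (the checkpoint, question, and context are illustrative): the long context is split into overlapping features, and the offset mapping turns the character-level answer span into start/end token positions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

question = "Where does Sylvain work?"
context = "Sylvain works at Hugging Face in Brooklyn. " * 20  # artificially long
answer_start = context.index("Hugging Face")
answer_end = answer_start + len("Hugging Face")

inputs = tokenizer(
    question,
    context,
    max_length=128,
    truncation="only_second",        # never truncate the question
    stride=32,                       # overlap between consecutive features
    return_overflowing_tokens=True,  # one long example -> several features
    return_offsets_mapping=True,     # character spans for each token
)
print(len(inputs["input_ids"]))      # several features for a single example

# Locate the answer tokens in the first feature via the offsets
offsets = inputs["offset_mapping"][0]
seq_ids = inputs.sequence_ids(0)     # None = special token, 0 = question, 1 = context
start_token = next(i for i, (s, e) in enumerate(offsets)
                   if seq_ids[i] == 1 and s <= answer_start < e)
end_token = next(i for i, (s, e) in enumerate(offsets)
                 if seq_ids[i] == 1 and s < answer_end <= e)
print(start_token, end_token)
```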
- How is post-processing usually done in question answering?
The model gives you the start and end positions of the answer, and you just have to decode the corresponding span of tokens.
The model gives you the start and end positions of the answer for each feature created by one example, and you just have to decode the corresponding span of tokens in the one that has the best score.
The model gives you the start and end positions of the answer for each feature created by one example, and you just have to match them to the span in the context for the one that has the best score (see the sketch below).
The model generates an answer, and you just have to decode it.
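A minimal sketch of that selection for a single feature, using a hypothetical helper best_answer (in the course, the same logic runs over every feature of an example and keeps the highest-scoring candidate; offsets for non-context tokens are assumed to have been set to None beforehand, as the course does):

```python
import numpy as np

def best_answer(start_logits, end_logits, offsets, context, n_best=20, max_len=30):
    """Pick the highest-scoring valid (start, end) span and decode it from the context."""
    start_ids = np.argsort(start_logits)[-n_best:]
    end_ids = np.argsort(end_logits)[-n_best:]
    best = {"score": float("-inf"), "text": ""}
    for s in start_ids:
        for e in end_ids:
            # Skip invalid spans: outside the context, reversed, or too long
            if offsets[s] is None or offsets[e] is None or e < s or e - s + 1 > max_len:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best["score"]:
                best = {"score": score, "text": context[offsets[s][0]:offsets[e][1]]}
    return best

# Toy demo with five "tokens"; offsets are None outside the context
print(best_answer(
    np.array([0.1, 5.0, 0.2, 0.1, 0.1]),     # start logits
    np.array([0.1, 0.2, 0.3, 6.0, 0.1]),     # end logits
    [None, (0, 7), (8, 12), (8, 12), None],  # (char_start, char_end) per token
    "Hugging Face",
))  # best span decodes to 'Hugging Face'
```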
