6-The_Tokenizers_library-10-End-of-chapter_quiz

中英文对照学习,效果更佳!
原课程链接:https://huggingface.co/course/chapter6/3?fw=pt

End-of-chapter quiz

章末测验

Let’s test what you learned in this chapter!

让我们测试一下你在这一章中学到了什么!

  1. When should you train a new tokenizer?

您应该在什么时候训练新的标记器?

When your dataset is similar to that used by an existing pretrained model, and you want to pretrain a new model

当您的数据集与现有预先训练的模型使用的数据集相似,并且您想要预先训练新模型时

When your dataset is similar to that used by an existing pretrained model, and you want to fine-tune a new model using this pretrained model

当您的数据集与现有预训练模型使用的数据集相似,并且您想要使用此预训练模型微调新模型时

When your dataset is different from the one used by an existing pretrained model, and you want to pretrain a new model

当您的数据集不同于现有预先训练的模型所使用的数据集,并且您想要预先训练新模型时

When your dataset is different from the one used by an existing pretrained model, but you want to fine-tune a new model using this pretrained model

当您的数据集不同于现有预训练模型所使用的数据集,但您想要使用此预训练模型微调新模型时
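
For reference, a quick way to judge whether an existing tokenizer fits your corpus is to tokenize a few domain-specific strings: if they explode into many short pieces, the vocabulary was learned on a different kind of text, and pretraining a new model usually goes together with training a new tokenizer. A minimal sketch (the checkpoint and sample string are illustrative):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint: a tokenizer whose vocabulary was learned on English prose.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Domain-specific text splitting into many pieces is a hint that a new tokenizer
# (and a new pretraining run) would serve this corpus better.
print(tokenizer.tokenize("indole-3-acetic acid oxidase"))
```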

  2. What is the advantage of using a generator of lists of texts compared to a list of lists of texts when using train_new_from_iterator()?

使用 train_new_from_iterator() 时,与使用文本列表的列表相比,使用文本列表生成器有什么优势?

That’s the only type the method train_new_from_iterator() accepts.

这是 train_new_from_iterator() 方法唯一接受的类型。

You will avoid loading the whole dataset into memory at once.

您将避免一次将整个数据集加载到内存中。

This will allow the 🤗 Tokenizers library to use multiprocessing.

这将允许 🤗 Tokenizers 库使用多进程。

The tokenizer you train will generate better texts.

您训练的标记器将生成更好的文本。
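
A minimal sketch of the generator pattern this question refers to, assuming a corpus loaded with 🤗 Datasets (the dataset name, batch size, and vocabulary size are illustrative). Because the generator yields batches lazily, the full dataset never has to be loaded into memory at once:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative dataset; any corpus with a "text" column works the same way.
raw_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def get_training_corpus():
    # Yield lists of 1,000 texts lazily instead of building one giant list of lists.
    for start in range(0, len(raw_dataset), 1000):
        yield raw_dataset[start : start + 1000]["text"]

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
new_tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=25000)
```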

  3. What are the advantages of using a “fast” tokenizer?

使用“快速”标记器有什么优势?

It can process inputs faster than a slow tokenizer when you batch lots of inputs together.

当你批量处理大量输入时,它可以比慢速标记器更快地处理输入。

Fast tokenizers always tokenize faster than their slow counterparts.

快速标记器的标记化速度始终快于对应的慢速标记器。

It can apply padding and truncation.

它可以应用填充和截断。

It has some additional features allowing you to map tokens to the span of text that created them.

它还有一些附加功能,允许您将令牌映射到创建它们的文本范围。
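
A small sketch of the extra features a fast tokenizer exposes; the checkpoint and sentence are illustrative:

```python
from transformers import AutoTokenizer

# AutoTokenizer returns the "fast" (Rust-backed) tokenizer when one is available.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("Hugging Face is based in NYC.", return_offsets_mapping=True)

print(encoding.tokens())           # the tokens that were produced
print(encoding.word_ids())         # which word each token came from
print(encoding["offset_mapping"])  # (start, end) character span of each token
```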

  4. How does the token-classification pipeline handle entities that span over several tokens?

令牌分类管道如何处理跨越多个令牌的实体?

The entities with the same label are merged into one entity.

具有相同标签的实体被合并为一个实体。

There is a label for the beginning of an entity and a label for the continuation of an entity.

有一个用于实体开始的标签和一个用于实体继续的标签。

In a given word, as long as the first token has the label of the entity, the whole word is considered labeled with that entity.

在给定的单词中,只要第一个令牌具有该实体的标签,整个单词就被认为带有该实体的标签。

When a token has the label of a given entity, any other following token with the same label is considered part of the same entity, unless it’s labeled as the start of a new entity.

当令牌具有给定实体的标签时,具有相同标签的任何其他后续令牌都被视为同一实体的一部分,除非它被标记为新实体的开始。
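
For reference, a minimal sketch of the pipeline with an aggregation strategy that groups the sub-tokens of an entity back into a single result (the model checkpoint and example sentence are illustrative):

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge tokens belonging to the same entity
)
print(ner("My name is Sylvain and I work at Hugging Face in Brooklyn."))
```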

  5. How does the question-answering pipeline handle long contexts?

“问题-回答”管道如何处理长上下文?

It doesn’t really, as it truncates the long context at the maximum length accepted by the model.

它并没有真正处理,只是在模型接受的最大长度处截断长上下文。

It splits the context into several parts and averages the results obtained.

它将上下文分为几个部分,并对获得的结果进行平均。

It splits the context into several parts (with overlap) and finds the maximum score for an answer in each part.

它将上下文分成几个部分(有重叠),并在每个部分中找到答案的最大分数。

It splits the context into several parts (without overlap, for efficiency) and finds the maximum score for an answer in each part.

它将上下文分成几个部分(没有重叠,以提高效率),并在每个部分中找到答案的最大分数。
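
A minimal sketch of the chunking behaviour tested here; max_seq_len and doc_stride control the chunk size and overlap, and the repeated sentence is only there to simulate a long context:

```python
from transformers import pipeline

question_answerer = pipeline("question-answering")

question = "Which libraries does the course cover?"
long_context = " ".join(
    ["The course covers the Transformers, Datasets, Tokenizers and Accelerate libraries."] * 200
)

# The context is split into overlapping chunks and the answer with the best score
# across all chunks is returned.
result = question_answerer(
    question=question, context=long_context, max_seq_len=384, doc_stride=128
)
print(result)
```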

  6. What is normalization?

什么是标准化?

It’s any cleanup the tokenizer performs on the texts in the initial stages.

它是标记器在初始阶段对文本执行的任何清理操作。

It’s a data augmentation technique that involves making the text more normal by removing rare words.

这是一种数据增强技术,通过删除生僻词使文本变得更“正常”。

It’s the final post-processing step where the tokenizer adds the special tokens.

这是标记器添加特殊令牌的最后一个后处理步骤。

It’s when the embeddings are made with mean 0 and standard deviation 1, by subtracting the mean and dividing by the std.

它是通过减去平均值并除以标准差,使嵌入的均值为 0、标准差为 1。
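
A quick way to see normalization in isolation is to call the backend normalizer of a fast tokenizer directly; a sketch assuming a BERT checkpoint (the input string is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The normalizer performs the initial cleanup (here lowercasing and accent
# stripping) before any splitting or subword modeling happens.
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
```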

  7. What is pre-tokenization for a subword tokenizer?

什么是子词标记器的预标记化?

It’s the step before the tokenization, where data augmentation (like random masking) is applied.

这是标记化之前的步骤,其中应用了数据增强(如随机掩码)。

It’s the step before the tokenization, where the desired cleanup operations are applied to the text.

这是标记化之前的步骤,在此对文本应用所需的清理操作。

It’s the step before the tokenizer model is applied, to split the input into words.

这是应用标记器模型之前的一步,将输入拆分成单词。

It’s the step before the tokenizer model is applied, to split the input into tokens.

这是应用标记器模型之前的一步,将输入拆分成标记。
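
Similarly, the backend pre-tokenizer can be called on its own to see how the input is split into words before the subword model runs; a sketch with GPT-2's byte-level pre-tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Pre-tokenization splits the text into words (with their offsets); GPT-2's
# byte-level rules mark leading spaces with "Ġ".
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?"))
```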

  8. Select the sentences that apply to the BPE model of tokenization.

选择适用于 BPE 标记化模型的句子。

BPE is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules.

BPE 是一种子词标记化算法,它从较小的词汇表开始,并学习合并规则。

BPE is a subword tokenization algorithm that starts with a big vocabulary and progressively removes tokens from it.

BPE 是一种子词标记化算法,它从大词汇表开始,然后逐渐删除其中的标记。

BPE tokenizers learn merge rules by merging the pair of tokens that is the most frequent.

BPE 标记器通过合并最频繁的标记对来学习合并规则。

A BPE tokenizer learns a merge rule by merging the pair of tokens that maximizes a score that privileges frequent pairs with less frequent individual parts.

BPE 标记器通过合并使某个分数最大化的标记对来学习合并规则,该分数偏向于由较不常见的单个部分组成的频繁对。

BPE tokenizes words into subwords by splitting them into characters and then applying the merge rules.

BPE 通过将单词拆分成字符,然后应用合并规则,将单词标记为子词。

BPE tokenizes words into subwords by finding the longest subword starting from the beginning that is in the vocabulary, then repeating the process for the rest of the text.

BPE 通过从单词开头开始查找词汇表中存在的最长子词,然后对文本的其余部分重复该过程,将单词标记为子词。
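
A minimal BPE training sketch with the 🤗 Tokenizers library (the toy corpus and vocabulary size are illustrative): the model starts from the base characters and learns merge rules for frequent pairs, and encoding replays those merges on each word:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Training builds the vocabulary from single characters plus the learned merges.
trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
corpus = ["hug hugs hugging", "pug pugs", "hugging face"] * 50  # toy corpus
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encoding splits each word into characters, then applies the merge rules in order.
print(tokenizer.encode("hugging pugs").tokens)
```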

  9. Select the sentences that apply to the WordPiece model of tokenization.

选择适用于 WordPiece 标记化模型的句子。

WordPiece is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules.

WordPiece 是一种子词标记化算法,它从较小的词汇表开始,并学习合并规则。

WordPiece is a subword tokenization algorithm that starts with a big vocabulary and progressively removes tokens from it.

WordPiece 是一种子词标记化算法,它从大词汇表开始,然后逐渐删除其中的标记。

WordPiece tokenizers learn merge rules by merging the pair of tokens that is the most frequent.

WordPiece 标记器通过合并最频繁的标记对来学习合并规则。

A WordPiece tokenizer learns a merge rule by merging the pair of tokens that maximizes a score that privileges frequent pairs with less frequent individual parts.

WordPiece 标记器通过合并使某个分数最大化的标记对来学习合并规则,该分数偏向于由较不常见的单个部分组成的频繁对。

WordPiece tokenizes words into subwords by finding the most likely segmentation into tokens, according to the model.

根据该模型,WordPiece 通过找到最有可能的标记分割方式,将单词标记为子词。

WordPiece tokenizes words into subwords by finding the longest subword starting from the beginning that is in the vocabulary, then repeating the process for the rest of the text.

WordPiece 通过从单词开头开始查找词汇表中存在的最长子词,然后对文本的其余部分重复该过程,将单词标记为子词。
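
For the tokenization behaviour described above, a quick check with BERT's WordPiece tokenizer (the checkpoint and input are illustrative); pieces after the first one are prefixed with "##":

```python
from transformers import AutoTokenizer

# BERT's WordPiece tokenizer segments each word by repeatedly taking the longest
# prefix that is present in the vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Tokenization algorithms"))
```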

  10. Select the sentences that apply to the Unigram model of tokenization.

选择适用于 Unigram 标记化模型的句子。

Unigram is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules.

Unigram 是一种子词标记化算法,它从较小的词汇表开始,并学习合并规则。

Unigram is a subword tokenization algorithm that starts with a big vocabulary and progressively removes tokens from it.

Unigram 是一种子词标记化算法,它从大词汇表开始,然后逐渐删除其中的标记。

Unigram adapts its vocabulary by minimizing a loss computed over the whole corpus.

Unigram 通过最小化在整个语料库上计算的损失来调整其词汇表。

Unigram adapts its vocabulary by keeping the most frequent subwords.

Unigram 通过保留最频繁的子词来调整其词汇表。

Unigram tokenizes words into subwords by finding the most likely segmentation into tokens, according to the model.

根据该模型,Unigram 通过找到最有可能的标记分割方式,将单词标记为子词。

Unigram tokenizes words into subwords by splitting them into characters, then applying the merge rules.

Unigram 通过将单词拆分成字符,然后应用合并规则,将单词标记为子词。
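
A minimal Unigram training sketch with the 🤗 Tokenizers library (the toy corpus and sizes are illustrative): the trainer starts from a large candidate vocabulary and prunes it by minimizing a loss over the corpus, and encoding keeps the most probable segmentation under the learned model:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# The trainer seeds a large vocabulary from the corpus and progressively removes
# the tokens whose removal increases the corpus loss the least.
trainer = trainers.UnigramTrainer(vocab_size=100, special_tokens=["[UNK]"], unk_token="[UNK]")
corpus = ["hug hugs hugging", "pug pugs", "hugging face"] * 50  # toy corpus
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encoding picks the most likely segmentation of each word under the unigram model.
print(tokenizer.encode("hugging pugs").tokens)
```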