6-The_Tokenizers_library-3-Fast_tokenizers_in_the_QA_pipeline
原课程链接:https://huggingface.co/course/chapter6/4?fw=pt
Fast tokenizers in the QA pipeline
We will now dive into the question-answering pipeline and see how to leverage the offsets to grab the answer to the question at hand from the context, a bit like we did for the grouped entities in the previous section. Then we will see how we can deal with very long contexts that end up being truncated. You can skip this section if you’re not interested in the question answering task.
Using the question-answering pipeline
As we saw in Chapter 1, we can use the question-answering pipeline like this to get the answer to a question:
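A minimal sketch of such a call (the context and question strings here are illustrative placeholders, not the course's exact example):

```python
from transformers import pipeline

question_answerer = pipeline("question-answering")

context = """
🤗 Transformers is backed by the three most popular deep learning libraries:
Jax, PyTorch, and TensorFlow. It is straightforward to train your models with
one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"

# Returns a dict with the answer text, its character span in the context,
# and a confidence score, e.g.:
# {'score': ..., 'start': ..., 'end': ..., 'answer': 'Jax, PyTorch, and TensorFlow'}
question_answerer(question=question, context=context)
```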
Unlike the other pipelines, which can’t truncate and split texts that are longer than the maximum length accepted by the model (and thus may miss information at the end of a document), this pipeline can deal with very long contexts and will return the answer to the question even if it’s at the end:
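With a much longer context (again a placeholder; assume it runs to several hundred tokens, with the answer appearing only at the end), the call looks the same:

```python
# A context long enough to exceed the model's maximum length; the sentence
# containing the answer sits at the very end.
long_context = """
🤗 Transformers: State of the Art NLP
... imagine several hundred tokens of documentation text here ...
🤗 Transformers is backed by the three most popular deep learning libraries:
Jax, PyTorch, and TensorFlow.
"""

# The pipeline chunks the long context internally and still finds the answer
question_answerer(question=question, context=long_context)
```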
Let’s see how it does all of this!
Using a model for question answering
Like with any other pipeline, we start by tokenizing our input and then send it through the model. The checkpoint used by default for the question-answering pipeline is distilbert-base-cased-distilled-squad (the “squad” in the name comes from the dataset on which the model was fine-tuned; we’ll talk more about the SQuAD dataset in Chapter 7):
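A sketch of this step, reusing the question and context defined above:

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# The question comes first, then the context, tokenized together as a pair
inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
```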
Note that we tokenize the question and the context as a pair, with the question first.
[Image: an example of tokenization of the question and context]

Models for question answering work a little differently from the models we’ve seen up to now. Using the picture above as an example, the model has been trained to predict the index of the token starting the answer (here 21) and the index of the token where the answer ends (here 24). This is why those models don’t return one tensor of logits but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer. Since in this case we have only one input containing 66 tokens, we get:
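The start and end logits come back as two separate tensors of shape (batch_size, sequence_length); for the single 66-token input described in the text, that is (1, 66). A sketch:

```python
start_logits = outputs.start_logits
end_logits = outputs.end_logits

print(start_logits.shape, end_logits.shape)
# torch.Size([1, 66]) torch.Size([1, 66])  (the exact length depends on the context used)
```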
To convert those logits into probabilities, we will apply a softmax function — but before that, we need to make sure we mask the indices that are not part of the context. Our input is [CLS] question [SEP] context [SEP], so we need to mask the tokens of the question as well as the [SEP] token. We’ll keep the [CLS] token, however, as some models use it to indicate that the answer is not in the context.
Since we will apply a softmax afterward, we just need to replace the logits we want to mask with a large negative number. Here, we use -10000:
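One way to build that mask is with sequence_ids() from the fast tokenizer, which labels each token as belonging to the question (0), the context (1), or neither (None for special tokens); a sketch:

```python
import torch

sequence_ids = inputs.sequence_ids()
# Mask everything that is not part of the context (sequence id 1)...
mask = [i != 1 for i in sequence_ids]
# ...but keep the [CLS] token unmasked
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000
```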
Now that we have properly masked the logits corresponding to positions we don’t want to predict, we can apply the softmax:
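A sketch, keeping only the first (and only) sample of the batch:

```python
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]
```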
At this stage, we could take the argmax of the start and end probabilities — but we might end up with a start index that is greater than the end index, so we need to take a few more precautions. We will compute the probabilities of each possible start_index and end_index where start_index <= end_index, then take the tuple (start_index, end_index) with the highest probability.
Assuming the events “The answer starts at start_index” and “The answer ends at end_index” to be independent, the probability that the answer starts at start_index and ends at end_index is:
$$\mathrm{start\_probabilities}[\mathrm{start\_index}] \times \mathrm{end\_probabilities}[\mathrm{end\_index}]$$
So, to compute all the scores, we just need to compute all the products start_probabilities[start_index] × end_probabilities[end_index] where start_index <= end_index.
First let’s compute all the possible products:
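Broadcasting does this in one line; a sketch:

```python
# scores[i, j] is the probability that the answer starts at token i and ends at token j
scores = start_probabilities[:, None] * end_probabilities[None, :]
```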
Then we’ll mask the values where start_index > end_index by setting them to 0 (the other probabilities are all positive numbers). The torch.triu() function returns the upper triangular part of the 2D tensor passed as an argument, so it will do that masking for us:
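Applied to our score matrix, that is simply:

```python
# Zero out every entry where start_index > end_index
scores = torch.triu(scores)
```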
Now we just have to get the index of the maximum. Since PyTorch will return the index in the flattened tensor, we need to use the floor division // and modulus % operations to get the start_index and end_index:
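Putting that into code, a sketch:

```python
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])
```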
We’re not quite done yet, but at least we already have the correct score for the answer (you can check this by comparing it to the score returned by the pipeline in the first example).
✏️ Try it out! Compute the start and end indices for the five most likely answers.
We have the start_index and end_index of the answer in terms of tokens, so now we just need to convert to the character indices in the context. This is where the offsets will be super useful. We can grab them and use them like we did in the token classification task:
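A sketch, tokenizing again with return_offsets_mapping=True so that each token carries its (start_char, end_char) span in the original text:

```python
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]
```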
Now we just have to format everything to get our result:
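A sketch of the final formatting, mirroring the dictionary returned by the pipeline:

```python
result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)
```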
Great! That’s the same as in our first example!
✏️ Try it out! Use the best scores you computed earlier to show the five most likely answers. To check your results, go back to the first pipeline and pass in top_k=5 when calling it.
Handling long contexts
If we try to tokenize the question and long context we used as an example previously, we’ll get a number of tokens higher than the maximum length used in the question-answering pipeline (which is 384):
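For instance (with long_context standing in for a genuinely long document):

```python
inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))
# Prints a number larger than 384 for a sufficiently long context
```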
So, we’ll need to truncate our inputs at that maximum length. There are several ways we can do this, but we don’t want to truncate the question, only the context. Since the context is the second sentence, we’ll use the "only_second" truncation strategy. The problem that arises then is that the answer to the question may not be in the truncated context. Here, for instance, we picked a question where the answer is toward the end of the context, and when we truncate it that answer is not present:
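A sketch of that truncation:

```python
inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))
# The decoded text stops before the part of the context that contains the answer
```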
This means the model will have a hard time picking the correct answer. To fix this, the question-answering pipeline allows us to split the context into smaller chunks, specifying the maximum length. To make sure we don’t split the context at exactly the wrong place to make it possible to find the answer, it also includes some overlap between the chunks.
We can have the tokenizer (fast or slow) do this for us by adding return_overflowing_tokens=True, and we can specify the overlap we want with the stride argument. Here is an example, using a smaller sentence:
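A sketch, using a short placeholder sentence and deliberately tiny values for max_length and stride so the chunking is easy to see:

```python
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
# Each decoded chunk holds at most 6 tokens (special tokens included),
# and consecutive chunks share 2 tokens of overlap
```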
As we can see, the sentence has been split into chunks in such a way that each entry in inputs["input_ids"] has at most 6 tokens (we would need to add padding to have the last entry be the same size as the others) and there is an overlap of 2 tokens between each of the entries.
Let’s take a closer look at the result of the tokenization:
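Inspecting the keys of the returned BatchEncoding:

```python
print(inputs.keys())
# dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])
```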
As expected, we get input IDs and an attention mask. The last key, overflow_to_sample_mapping, is a map that tells us which sentence each of the results corresponds to — here we have 7 results that all come from the (only) sentence we passed the tokenizer:
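With the single placeholder sentence above, all seven chunks map back to input 0:

```python
print(inputs["overflow_to_sample_mapping"])
# [0, 0, 0, 0, 0, 0, 0]
```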
This is more useful when we tokenize several sentences together. For instance, this:
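A sketch with a second, shorter placeholder sentence added alongside the first:

```python
sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])
```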
gets us:
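```python
# 7 chunks from the first placeholder sentence, then 4 from the second
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```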
which means the first sentence is split into 7 chunks as before, and the next 4 chunks come from the second sentence.
Now let’s go back to our long context. By default the question-answering pipeline uses a maximum length of 384, as we mentioned earlier, and a stride of 128, which correspond to the way the model was fine-tuned (you can adjust those parameters by passing max_seq_len and stride arguments when calling the pipeline). We will thus use those parameters when tokenizing. We’ll also add padding (to have samples of the same length, so we can build tensors) as well as ask for the offsets:
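A sketch of that tokenization call (padding="longest" is one reasonable choice for getting same-length samples):

```python
inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
```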
Those inputs will contain the input IDs and attention masks the model expects, as well as the offsets and the overflow_to_sample_mapping we just talked about. Since those two are not parameters used by the model, we’ll pop them out of the inputs (and we won’t store the map, since it’s not useful here) before converting it to a tensor:
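A sketch; pop() removes the two extra keys from the BatchEncoding before it is converted to PyTorch tensors:

```python
_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)
# With a context longer than 384 tokens, this gives two rows, e.g. torch.Size([2, 384])
```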
Our long context was split in two, which means that after it goes through our model, we will have two sets of start and end logits:
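Running the model on the batched inputs, a sketch:

```python
outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
# torch.Size([2, 384]) torch.Size([2, 384])
```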
Like before, we first mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens (as flagged by the attention mask):
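A sketch of the masking, extending the earlier version with the attention mask so padding tokens are excluded too:

```python
sequence_ids = inputs.sequence_ids()
# Mask everything that is not part of the context...
mask = [i != 1 for i in sequence_ids]
# ...keep the [CLS] token...
mask[0] = False
# ...and also mask all the padding tokens, as flagged by the attention mask
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000
```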
Then we can use the softmax to convert our logits to probabilities:
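This time we keep the batch dimension, since we have one set of probabilities per chunk:

```python
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)
```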
The next step is similar to what we did for the small context, but we repeat it for each of our two chunks. We attribute a score to all possible spans of answer, then take the span with the best score:
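A sketch that loops over the chunks and keeps the best (start, end, score) triple for each one:

```python
candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)
# One (start_token, end_token, score) triple per chunk; the score of the second
# chunk should be much higher, since that chunk contains the answer
```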
Those two candidates correspond to the best answers the model was able to find in each chunk. The model is way more confident the right answer is in the second part (which is a good sign!). Now we just have to map those two token spans to spans of characters in the context (we only need to map the second one to have our answer, but it’s interesting to see what the model has picked in the first chunk).
✏️ Try it out! Adapt the code above to return the scores and spans for the five most likely answers (in total, not per chunk).
The offsets we grabbed earlier actually form a list of offset lists, with one list per chunk of text:
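A sketch that maps each candidate back to a character span in the original long context:

```python
for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    # offset[i] is the (start_char, end_char) span of token i in the original text
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)
```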
If we ignore the first result, we get the same result as our pipeline for this long context — yay!
✏️ Try it out! Use the best scores you computed before to show the five most likely answers (for the whole context, not each chunk). To check your results, go back to the first pipeline and pass in top_k=5 when calling it.
This concludes our deep dive into the tokenizer’s capabilities. We will put all of this in practice again in the next chapter, when we show you how to fine-tune a model on a range of common NLP tasks.
