Original course link: https://huggingface.co/course/chapter6/1?fw=pt

Introduction

In Chapter 3, we looked at how to fine-tune a model on a given task. When we do that, we use the same tokenizer that the model was pretrained with. But what do we do when we want to train a model from scratch? In these cases, using a tokenizer that was pretrained on a corpus from another domain or language is typically suboptimal. For example, a tokenizer that's trained on an English corpus will perform poorly on a corpus of Japanese texts because the use of spaces and punctuation is very different in the two languages.
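
The point about spaces can be seen with a toy experiment. The sketch below is not how a real tokenizer works; it only assumes a naive splitting step that relies on whitespace, as many English-oriented pipelines do, and applies it to one English and one Japanese sentence. The function name `whitespace_pretokenize` and both sample sentences are illustrative choices, not part of the course.

```python
def whitespace_pretokenize(text: str) -> list[str]:
    """Naive pre-tokenization: split on whitespace only (illustrative assumption)."""
    return text.split()


# Two sample sentences with roughly the same meaning (hypothetical examples).
english = "Tokenizers split text into smaller units."
japanese = "トークナイザーはテキストを小さな単位に分割します。"

english_words = whitespace_pretokenize(english)
japanese_words = whitespace_pretokenize(japanese)

# English is separated into words; the Japanese sentence contains no spaces,
# so the whole sentence comes back as a single chunk.
print(len(english_words), english_words)
print(len(japanese_words), japanese_words)
```

Because the Japanese sentence is never broken up, any vocabulary learned on top of this splitting step from English text has little chance of matching it, which is one reason a tokenizer trained on the target language is usually preferable.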

In this chapter, you will learn how to train a brand new tokenizer on a corpus of texts, so it can then be used to pretrain a language model. This will all be done with the help of the 🤗 Tokenizers library, which provides the “fast” tokenizers in the 🤗 Transformers library. We’ll take a close look at the features that this library provides, and explore how the fast tokenizers differ from the “slow” versions.

Topics we will cover include:

How to train a new tokenizer similar to the one used by a given checkpoint on a new corpus of texts
The special features of fast tokenizers
The differences between the three main subword tokenization algorithms used in NLP today
How to build a tokenizer from scratch with the 🤗 Tokenizers library and train it on some data

The techniques introduced in this chapter will prepare you for the section in Chapter 7 where we look at creating a language model for Python source code. Let’s start by looking at what it means to “train” a tokenizer in the first place.
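
As a rough preview of the idea, "training" a tokenizer amounts to collecting statistics from a corpus and deriving a vocabulary from them. The sketch below is a deliberately simplified, word-level toy written with the standard library only; it is not one of the subword algorithms covered in this chapter and does not use the 🤗 Tokenizers API. The helper names `train_word_vocab` and `encode`, the `[UNK]` token, and the tiny Python-like corpus are all assumptions made for illustration.

```python
from collections import Counter


def train_word_vocab(corpus: list[str], vocab_size: int) -> dict[str, int]:
    """Toy 'training': keep the most frequent whitespace-separated words."""
    counts = Counter(word for text in corpus for word in text.lower().split())
    # Reserve id 0 for words that were never seen during training.
    vocab = {"[UNK]": 0}
    for word, _ in counts.most_common(vocab_size - 1):
        vocab[word] = len(vocab)
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map each word to its id, falling back to the unknown-token id."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


# A tiny, hypothetical corpus of pre-spaced Python-like snippets.
python_corpus = [
    "def add ( a , b ) : return a + b",
    "def sub ( a , b ) : return a - b",
]

vocab = train_word_vocab(python_corpus, vocab_size=50)
known_ids = encode("def add ( a , b )", vocab)
unseen_ids = encode("class Foo", vocab)

print(known_ids)   # every word was seen in the corpus
print(unseen_ids)  # both words fall back to the unknown id
```

Text that resembles the training corpus maps cleanly to known ids, while unseen words collapse to the unknown token. This is the same mismatch, in miniature, that motivates training a new tokenizer when the target corpus differs from the one the original tokenizer saw.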
