5-The_Datasets_library-2-Time_to_slice_and_dice

Original course link: https://huggingface.co/course/chapter5/3?fw=pt

Time to slice and dice

Most of the time, the data you work with won’t be perfectly prepared for training models. In this section we’ll explore the various features that 🤗 Datasets provides to clean up your datasets.

Slicing and dicing our data

Similar to Pandas, 🤗 Datasets provides several functions to manipulate the contents of Dataset and DatasetDict objects. We already encountered the Dataset.map() method in [Chapter 3], and in this section we’ll explore some of the other functions at our disposal.

For this example we’ll use the Drug Review Dataset that’s hosted on the UC Irvine Machine Learning Repository, which contains patient reviews on various drugs, along with the condition being treated and a 10-star rating of the patient’s satisfaction.

First we need to download and extract the data, which can be done with the wget and unzip commands:

!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

Since TSV is just a variant of CSV that uses tabs instead of commas as the separator, we can load these files by using the csv loading script and specifying the delimiter argument in the load_dataset() function as follows:

from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

A good practice when doing any sort of data analysis is to grab a small random sample to get a quick feel for the type of data you’re working with. In 🤗 Datasets, we can create a random sample by chaining the Dataset.shuffle() and Dataset.select() functions together:

drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]
{'Unnamed: 0': [87571, 178045, 80482],
'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
'"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
'"I have been taking Mobic for over a year with no side effects other than an elevated blood pressure. I had severe knee and ankle pain which completely went away after taking Mobic. I attempted to stop the medication however pain returned after a few days."'],
'rating': [9.0, 3.0, 10.0],
'date': ['September 2, 2015', 'November 7, 2011', 'June 5, 2013'],
'usefulCount': [36, 13, 128]}

Note that we’ve fixed the seed in Dataset.shuffle() for reproducibility purposes. Dataset.select() expects an iterable of indices, so we’ve passed range(1000) to grab the first 1,000 examples from the shuffled dataset. From this sample we can already see a few quirks in our dataset:

  • The Unnamed: 0 column looks suspiciously like an anonymized ID for each patient.
  • The condition column includes a mix of uppercase and lowercase labels.
  • The reviews are of varying length and contain a mix of Python line separators (\r\n) as well as HTML character codes like &#039;.

Let’s see how we can use 🤗 Datasets to deal with each of these issues. To test the patient ID hypothesis for the Unnamed: 0 column, we can use the Dataset.unique() function to verify that the number of IDs matches the number of rows in each split:

for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

This seems to confirm our hypothesis, so let’s clean up the dataset a bit by renaming the Unnamed: 0 column to something a bit more interpretable. We can use the DatasetDict.rename_column() function to rename the column across both splits in one go:

drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset
DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

✏️ Try it out! Use the Dataset.unique() function to find the number of unique drugs and conditions in the training and test sets.

Next, let’s normalize all the condition labels using Dataset.map(). As we did with tokenization in [Chapter 3], we can define a simple function that can be applied across all the rows of each split in drug_dataset:

def lowercase_condition(example):
return {"condition": example["condition"].lower()}


drug_dataset.map(lowercase_condition)
AttributeError: 'NoneType' object has no attribute 'lower'

Oh no, we’ve run into a problem with our map function! From the error we can infer that some of the entries in the condition column are None, which cannot be lowercased as they’re not strings. Let’s drop these rows using Dataset.filter(), which works in a similar way to Dataset.map() and expects a function that receives a single example of the dataset. Instead of writing an explicit function like:

def filter_nones(x):
return x["condition"] is not None

and then running drug_dataset.filter(filter_nones), we can do this in one line using a lambda function. In Python, lambda functions are small functions that you can define without explicitly naming them. They take the general form:

lambda <arguments> : <expression>

where lambda is one of Python’s special keywords, <arguments> is a list/set of comma-separated values that define the inputs to the function, and <expression> represents the operations you wish to execute. For example, we can define a simple lambda function that squares a number as follows:

lambda x : x * x

To apply this function to an input, we need to wrap it and the input in parentheses:

(lambda x: x * x)(3)
9

Similarly, we can define lambda functions with multiple arguments by separating them with commas. For example, we can compute the area of a triangle as follows:

(lambda base, height: 0.5 * base * height)(4, 8)
16.0

Lambda functions are handy when you want to define small, single-use functions (for more information about them, we recommend reading the excellent Real Python tutorial by Andre Burgaud). In the 🤗 Datasets context, we can use lambda functions to define simple map and filter operations, so let’s use this trick to eliminate the None entries in our dataset:

drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

With the None entries removed, we can normalize our condition column:

drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]
['left ventricular dysfunction', 'adhd', 'birth control']

It works! Now that we’ve cleaned up the labels, let’s take a look at cleaning up the reviews themselves.

Creating new columns

Whenever you’re dealing with customer reviews, a good practice is to check the number of words in each review. A review might be just a single word like “Great!” or a full-blown essay with thousands of words, and depending on the use case you’ll need to handle these extremes differently. To compute the number of words in each review, we’ll use a rough heuristic based on splitting each text by whitespace.

Let’s define a simple function that counts the number of words in each review:

def compute_review_length(example):
return {"review_length": len(example["review"].split())}

Unlike our lowercase_condition() function, compute_review_length() returns a dictionary whose key does not correspond to one of the column names in the dataset. In this case, when compute_review_length() is passed to Dataset.map(), it will be applied to all the rows in the dataset to create a new review_length column:

drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]
{'patient_id': 206461,
'drugName': 'Valsartan',
'condition': 'left ventricular dysfunction',
'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
'rating': 9.0,
'date': 'May 20, 2012',
'usefulCount': 27,
'review_length': 17}

As expected, we can see a review_length column has been added to our training set. We can sort this new column with Dataset.sort() to see what the extreme values look like:

drug_dataset["train"].sort("review_length")[:3]
{'patient_id': [103488, 23627, 20558],
'drugName': ['Loestrin 21 1 / 20', 'Chlorzoxazone', 'Nucynta'],
'condition': ['birth control', 'muscle spasm', 'pain'],
'review': ['"Excellent."', '"useless"', '"ok"'],
'rating': [10.0, 1.0, 6.0],
'date': ['November 4, 2008', 'March 24, 2017', 'August 20, 2016'],
'usefulCount': [5, 2, 10],
'review_length': [1, 1, 1]}

As we suspected, some reviews contain just a single word, which, although it may be okay for sentiment analysis, would not be informative if we want to predict the condition.

🙋 An alternative way to add new columns to a dataset is with the Dataset.add_column() function. This allows you to provide the column as a Python list or NumPy array and can be handy in situations where Dataset.map() is not well suited for your analysis.

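For instance, here is a minimal sketch of how Dataset.add_column() could be used (the review_upper column is purely illustrative):

# A sketch of Dataset.add_column(); the "review_upper" column is just an illustrative example
upper_reviews = [review.upper() for review in drug_dataset["train"]["review"]]
with_upper = drug_dataset["train"].add_column("review_upper", upper_reviews)
with_upper.column_names
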
Let’s use the Dataset.filter() function to remove reviews that contain fewer than 30 words. Similarly to what we did with the condition column, we can filter out the very short reviews by requiring that the reviews have a length above this threshold:

drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)
{'train': 138514, 'test': 46108}

As you can see, this has removed around 15% of the reviews from our original training and test sets.

✏️ Try it out! Use the Dataset.sort() function to inspect the reviews with the largest numbers of words. See the documentation to see which argument you need to use to sort the reviews by length in descending order.

The last thing we need to deal with is the presence of HTML character codes in our reviews. We can use Python’s html module to unescape these characters, like so:

import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)
"I'm a transformer called BERT"

We’ll use Dataset.map() to unescape all the HTML characters in our corpus:

drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

As you can see, the Dataset.map() method is quite useful for processing data — and we haven’t even scratched the surface of everything it can do!

The map() method’s superpowers

The Dataset.map() method takes a batched argument that, if set to True, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000). For instance, the previous map function that unescaped all the HTML took a bit of time to run (you can read the time taken from the progress bars). We can speed this up by processing several elements at the same time using a list comprehension.

When you specify batched=True the function receives a dictionary with the fields of the dataset, but each value is now a list of values, and not just a single value. The return value of Dataset.map() should be the same: a dictionary with the fields we want to update or add to our dataset, and a list of values. For example, here is another way to unescape all HTML characters, but using batched=True:

new_drug_dataset = drug_dataset.map(
lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

If you’re running this code in a notebook, you’ll see that this command executes way faster than the previous one. And it’s not because our reviews have already been HTML-unescaped — if you re-execute the instruction from the previous section (without batched=True), it will take the same amount of time as before. This is because list comprehensions are usually faster than executing the same code in a for loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.

Using Dataset.map() with batched=True will be essential to unlock the speed of the “fast” tokenizers that we’ll encounter in [Chapter 6], which can quickly tokenize big lists of texts. For instance, to tokenize all the drug reviews with a fast tokenizer, we could use a function like this:

Dataset.map()Batted=True配合使用,对于提高我们将在第6章中遇到的“快速”标记器的速度至关重要,因为它可以快速标记化大的文本列表。例如,要使用快速标记器对所有药品评论进行标记化,我们可以使用如下函数:

1
2
3
4
5
6
7
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

As you saw in [Chapter 3], we can pass one or several examples to the tokenizer, so we can use this function with or without batched=True. Let’s take this opportunity to compare the performance of the different options. In a notebook, you can time a one-line instruction by adding %time before the line of code you wish to measure:

%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

You can also time a whole cell by putting %%time at the beginning of the cell. On the hardware we executed this on, it showed 10.8s for this instruction (it’s the number written after “Wall time”).

✏️ Try it out! Execute the same instruction with and without batched=True, then try it with a slow tokenizer (add use_fast=False in the AutoTokenizer.from_pretrained() method) so you can see what numbers you get on your hardware.

Here are the results we obtained with and without batching, with a fast and a slow tokenizer:

Options        Fast tokenizer  Slow tokenizer
batched=True   10.8s           4min41s
batched=False  59.2s           5min3s

This means that using a fast tokenizer with the batched=True option is 30 times faster than its slow counterpart with no batching — this is truly amazing! That’s the main reason why fast tokenizers are the default when using AutoTokenizer (and why they are called “fast”). They’re able to achieve such a speedup because behind the scenes the tokenization code is executed in Rust, which is a language that makes it easy to parallelize code execution.

Parallelization is also the reason for the nearly 6x speedup the fast tokenizer achieves with batching: you can’t parallelize a single tokenization operation, but when you want to tokenize lots of texts at the same time you can just split the execution across several processes, each responsible for its own texts.

Dataset.map() also has some parallelization capabilities of its own. Since they are not backed by Rust, they won’t let a slow tokenizer catch up with a fast one, but they can still be helpful (especially if you’re using a tokenizer that doesn’t have a fast version). To enable multiprocessing, use the num_proc argument and specify the number of processes to use in your call to Dataset.map():

slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

You can experiment a little with timing to determine the optimal number of processes to use; in our case 8 seemed to produce the best speed gain. Here are the numbers we got with and without multiprocessing:

Options                    Fast tokenizer  Slow tokenizer
batched=True               10.8s           4min41s
batched=False              59.2s           5min3s
batched=True, num_proc=8   6.52s           41.3s
batched=False, num_proc=8  9.49s           45.2s

Those are much more reasonable results for the slow tokenizer, but the performance of the fast tokenizer was also substantially improved. Note, however, that this won’t always be the case — for values of num_proc other than 8, our tests showed that it was faster to use batched=True without that option. In general, we don’t recommend using Python multiprocessing for fast tokenizers with batched=True.

Using num_proc to speed up your processing is usually a great idea, as long as the function you are using is not already doing some kind of multiprocessing of its own.

All of this functionality condensed into a single method is already pretty amazing, but there’s more! With Dataset.map() and batched=True you can change the number of elements in your dataset. This is super useful in many situations where you want to create several training features from one example, and we will need to do this as part of the preprocessing for several of the NLP tasks we’ll undertake in [Chapter 7].

💡 In machine learning, an example is usually defined as the set of features that we feed to the model. In some contexts, these features will be the set of columns in a Dataset, but in others (like here and for question answering), multiple features can be extracted from a single example and belong to a single column.

Let’s have a look at how it works! Here we will tokenize our examples and truncate them to a maximum length of 128, but we will ask the tokenizer to return all the chunks of the texts instead of just the first one. This can be done with return_overflowing_tokens=True:

def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

Let’s test this on one example before using Dataset.map() on the whole dataset:

result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]
[128, 49]

So, our first example in the training set became two features because it was tokenized to more than the maximum number of tokens we specified: the first one of length 128 and the second one of length 49. Now let’s do this for all elements of the dataset!

tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000

Oh no! That didn’t work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you’ve looked at the Dataset.map() documentation, you may recall that it’s the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.

The problem is that we’re trying to mix two different datasets of different sizes: the drug_dataset columns will have a certain number of examples (the 1,000 in our error), but the tokenized_dataset we are building will have more (the 1,463 in the error message). That doesn’t work for a Dataset, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the remove_columns argument:

tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

Now this works without error. We can check that our new dataset has many more elements than the original dataset by comparing the lengths:

len(tokenized_dataset["train"]), len(drug_dataset["train"])
(206772, 138514)

We mentioned that we can also deal with the mismatched length problem by making the old columns the same size as the new ones. To do this, we will need the overflow_to_sample_mapping field the tokenizer returns when we set return_overflowing_tokens=True. It gives us a mapping from a new feature index to the index of the sample it originated from. Using this, we can associate each key present in our original dataset with a list of values of the right size by repeating the values of each example as many times as it generates new features:

def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
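
To build some intuition for what overflow_to_sample_mapping contains before mapping the whole dataset, we can peek at it for a handful of reviews (a sketch; the exact values depend on how long each review is):

# Inspect the mapping for the first few reviews; each entry points back to the sample it came from
sample_batch = drug_dataset["train"][:4]
peek = tokenizer(
    sample_batch["review"],
    truncation=True,
    max_length=128,
    return_overflowing_tokens=True,
)
print(peek["overflow_to_sample_mapping"])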

We can see it works with Dataset.map() without us needing to remove the old columns:

tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'condition', 'date', 'drugName', 'input_ids', 'patient_id', 'rating', 'review', 'review_length', 'token_type_ids', 'usefulCount'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['attention_mask', 'condition', 'date', 'drugName', 'input_ids', 'patient_id', 'rating', 'review', 'review_length', 'token_type_ids', 'usefulCount'],
        num_rows: 68876
    })
})

We get the same number of training features as before, but here we’ve kept all the old fields. If you need them for some post-processing after applying your model, you might want to use this approach.

You’ve now seen how 🤗 Datasets can be used to preprocess a dataset in various ways. Although the processing functions of 🤗 Datasets will cover most of your model training needs, there may be times when you’ll need to switch to Pandas to access more powerful features, like DataFrame.groupby() or high-level APIs for visualization. Fortunately, 🤗 Datasets is designed to be interoperable with libraries such as Pandas, NumPy, PyTorch, TensorFlow, and JAX. Let’s take a look at how this works.

From Datasets to DataFrames and back

To enable the conversion between various third-party libraries, 🤗 Datasets provides a Dataset.set_format() function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To demonstrate, let’s convert our dataset to Pandas:

drug_dataset.set_format("pandas")

Now when we access elements of the dataset we get a pandas.DataFrame instead of a dictionary:

drug_dataset["train"][:3]
   patient_id  drugName    condition      review                                                      rating  date               usefulCount  review_length
0  95260       Guanfacine  adhd           “My son is halfway through his fourth week of Intuniv…”    8.0     April 27, 2010     192          141
1  92703       Lybrel      birth control  “I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects…”  5.0  December 14, 2009  17  134
2  138000      Ortho Evra  birth control  “This is my first time using any form of birth control…”   8.0     November 3, 2015   10           89

Let’s create a pandas.DataFrame for the whole training set by selecting all the elements of drug_dataset["train"]:

train_df = drug_dataset["train"][:]

🚨 Under the hood, Dataset.set_format() changes the return format for the dataset’s __getitem__() dunder method. This means that when we want to create a new object like train_df from a Dataset in the "pandas" format, we need to slice the whole dataset to obtain a pandas.DataFrame. You can verify for yourself that the type of drug_dataset["train"] is Dataset, irrespective of the output format.

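As a quick check along those lines (a minimal sketch):

# The "pandas" output format only changes what indexing returns; the object itself is still a Dataset
type(drug_dataset["train"])
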
From here we can use all the Pandas functionality that we want. For example, we can do fancy chaining to compute the class distribution among the condition entries:

frequencies = (
train_df["condition"]
.value_counts()
.to_frame()
.reset_index()
.rename(columns={"index": "condition", "condition": "frequency"})
)
frequencies.head()
   condition      frequency
0  birth control  27655
1  depression     8023
2  acne           5209
3  anxiety        4991
4  pain           4744

And once we’re done with our Pandas analysis, we can always create a new Dataset object by using the Dataset.from_pandas() function as follows:

from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset
Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

✏️ Try it out! Compute the average rating per drug and store the result in a new Dataset.

This wraps up our tour of the various preprocessing techniques available in 🤗 Datasets. To round out the section, let’s create a validation set to prepare the dataset for training a classifier on. Before doing so, we’ll reset the output format of drug_dataset from "pandas" to "arrow":

drug_dataset.reset_format()

Creating a validation set

Although we have a test set we could use for evaluation, it’s a good practice to leave the test set untouched and create a separate validation set during development. Once you are happy with the performance of your models on the validation set, you can do a final sanity check on the test set. This process helps mitigate the risk that you’ll overfit to the test set and deploy a model that fails on real-world data.

🤗 Datasets provides a Dataset.train_test_split() function that is based on the famous functionality from scikit-learn. Let’s use it to split our training set into train and validation splits (we set the seed argument for reproducibility):

drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean
DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'review_clean'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'review_clean'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'review_clean'],
        num_rows: 46108
    })
})

Great, we’ve now prepared a dataset that’s ready for training some models on! In section 5 we’ll show you how to upload datasets to the Hugging Face Hub, but for now let’s cap off our analysis by looking at a few ways you can save datasets on your local machine.

Saving a dataset

Although 🤗 Datasets will cache every downloaded dataset and the operations performed on it, there are times when you’ll want to save a dataset to disk (e.g., in case the cache gets deleted). As shown in the table below, 🤗 Datasets provides three main functions to save your dataset in different formats:

Data format  Function
Arrow        Dataset.save_to_disk()
CSV          Dataset.to_csv()
JSON         Dataset.to_json()

For example, let’s save our cleaned dataset in the Arrow format:

drug_dataset_clean.save_to_disk("drug-reviews")

This will create a directory with the following structure:

drug-reviews/
├── dataset_dict.json
├── test
│   ├── dataset.arrow
│   ├── dataset_info.json
│   └── state.json
├── train
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── indices.arrow
│   └── state.json
└── validation
    ├── dataset.arrow
    ├── dataset_info.json
    ├── indices.arrow
    └── state.json

where we can see that each split is associated with its own dataset.arrow table, and some metadata in dataset_info.json and state.json. You can think of the Arrow format as a fancy table of columns and rows that is optimized for building high-performance applications that process and transport large datasets.

Once the dataset is saved, we can load it by using the load_from_disk() function as follows:

from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded
DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

For the CSV and JSON formats, we have to store each split as a separate file. One way to do this is by iterating over the keys and values in the DatasetDict object:

for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

This saves each split in JSON Lines format, where each row in the dataset is stored as a single line of JSON. Here’s what the first example looks like:

!head -n 1 drug-reviews-train.jsonl
{"patient_id":141780,"drugName":"Escitalopram","condition":"depression","review":"\"I seemed to experience the regular side effects of LEXAPRO, insomnia, low sex drive, sleepiness during the day. I am taking it at night because my doctor said if it made me tired to take it at night. I assumed it would and started out taking it at night. Strange dreams, some pleasant. I was diagnosed with fibromyalgia. Seems to be helping with the pain. Have had anxiety and depression in my family, and have tried quite a few other medications that haven't worked. Only have been on it for two weeks but feel more positive in my mind, want to accomplish more in my life. Hopefully the side effects will dwindle away, worth it to stick with it from hearing others responses. Great medication.\"","rating":9.0,"date":"May 29, 2011","usefulCount":10,"review_length":125}

We can then use the techniques from section 2 to load the JSON files as follows:

data_files = {
"train": "drug-reviews-train.jsonl",
"validation": "drug-reviews-validation.jsonl",
"test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)
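
If you prefer CSV files, the same pattern works with Dataset.to_csv() (a sketch; the file names here are just illustrative):

# Write each split to its own CSV file instead of JSON Lines
for split, dataset in drug_dataset_clean.items():
    dataset.to_csv(f"drug-reviews-{split}.csv", index=False)

The resulting files can then be reloaded with load_dataset("csv", data_files=...), just as we did for the TSV files at the start of this section.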

And that’s it for our excursion into data wrangling with 🤗 Datasets! Now that we have a cleaned dataset for training a model on, here are a few ideas that you could try out:

  1. Use the techniques from [Chapter 3] to train a classifier that can predict the patient condition based on the drug review.
  2. Use the summarization pipeline from [Chapter 1] to generate summaries of the reviews.

Next, we’ll take a look at how 🤗 Datasets can enable you to work with huge datasets without blowing up your laptop!
