5-The_Datasets_library-3-Big_data_Datasets_to_the_rescue

Original course link: https://huggingface.co/course/chapter5/4?fw=pt

Big data? 🤗 Datasets to the rescue!

Nowadays it is not uncommon to find yourself working with multi-gigabyte datasets, especially if you’re planning to pretrain a transformer like BERT or GPT-2 from scratch. In these cases, even loading the data can be a challenge. For example, the WebText corpus used to pretrain GPT-2 consists of over 8 million documents and 40 GB of text — loading this into your laptop’s RAM is likely to give it a heart attack!

Fortunately, 🤗 Datasets has been designed to overcome these limitations. It frees you from memory management problems by treating datasets as memory-mapped files, and from hard drive limits by streaming the entries in a corpus.

In this section we’ll explore these features of 🤗 Datasets with a huge 825 GB corpus known as the Pile. Let’s get started!

What is the Pile?

The Pile is an English text corpus that was created by EleutherAI for training large-scale language models. It includes a diverse range of datasets, spanning scientific articles, GitHub code repositories, and filtered web text. The training corpus is available in 14 GB chunks, and you can also download several of the individual components. Let’s start by taking a look at the PubMed Abstracts dataset, which is a corpus of abstracts from 15 million biomedical publications on PubMed. The dataset is in JSON Lines format and is compressed using the zstandard library, so first we need to install that:

!pip install zstandard

Next, we can load the dataset using the method for remote files that we learned in section 2:

from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

Dataset({
    features: ['meta', 'text'],
    num_rows: 15518009
})

We can see that there are 15,518,009 rows and 2 columns in our dataset — that’s a lot!

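These numbers are also exposed as attributes on the Dataset object, so you can check them programmatically (a quick sketch; the values in the comments are just the ones shown above):

print(pubmed_dataset.num_rows)      # 15518009
print(pubmed_dataset.column_names)  # ['meta', 'text']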

✎ By default, 🤗 Datasets will decompress the files needed to load a dataset. If you want to preserve hard drive space, you can pass DownloadConfig(delete_extracted=True) to the download_config argument of load_dataset(). See the documentation for more details.

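As a rough sketch of what that looks like (DownloadConfig can be imported from datasets; this simply repeats the load above while cleaning up the decompressed copy afterwards):

from datasets import DownloadConfig, load_dataset

# Keep the original .zst download, but delete the decompressed JSON Lines copy
# once the Arrow cache file has been built, to save hard drive space
pubmed_dataset = load_dataset(
    "json",
    data_files=data_files,
    split="train",
    download_config=DownloadConfig(delete_extracted=True),
)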

Let’s inspect the contents of the first example:

pubmed_dataset[0]

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age ...'}

Okay, this looks like the abstract from a medical article. Now let’s see how much RAM we’ve used to load the dataset!

The magic of memory mapping

A simple way to measure memory usage in Python is with the psutil library, which can be installed with pip as follows:

!pip install psutil

It provides a Process class that allows us to check the memory usage of the current process as follows:

import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 5678.33 MB

Here the rss attribute refers to the resident set size, which is the fraction of memory that a process occupies in RAM. This measurement also includes the memory used by the Python interpreter and the libraries we’ve loaded, so the actual amount of memory used to load the dataset is a bit smaller. For comparison, let’s see how large the dataset is on disk, using the dataset_size attribute. Since the result is expressed in bytes like before, we need to manually convert it to gigabytes:

print(f"Number of files in dataset : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

Dataset size in bytes     : 20979437051
Dataset size (cache file) : 19.54 GB

Nice — despite it being almost 20 GB large, we’re able to load and access the dataset with much less RAM!

✏️ Try it out! Pick one of the subsets from the Pile that is larger than your laptop or desktop’s RAM, load it with 🤗 Datasets, and measure the amount of RAM used. Note that to get an accurate measurement, you’ll want to do this in a new process. You can find the decompressed sizes of each subset in Table 1 of the Pile paper.

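If you’re unsure how to structure that measurement, here is a minimal sketch. It assumes, purely as an illustration, the FreeLaw subset whose URL appears later in this section; save it as a script and run it as a fresh Python process so that objects from earlier work don’t inflate the number:

# measure_ram.py -- run with `python measure_ram.py` so the process starts clean
import psutil
from datasets import load_dataset

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst"
law_dataset = load_dataset("json", data_files=data_files, split="train")
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")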

If you’re familiar with Pandas, this result might come as a surprise because of Wes McKinney’s famous rule of thumb that you typically need 5 to 10 times as much RAM as the size of your dataset. So how does 🤗 Datasets solve this memory management problem? 🤗 Datasets treats each dataset as a memory-mapped file, which provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset without needing to fully load it into memory.

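You can see the memory-mapped Arrow file backing the dataset for yourself via the cache_files attribute (the path below is only illustrative; yours will point into your own Hugging Face cache directory):

pubmed_dataset.cache_files
# e.g. [{'filename': '/home/user/.cache/huggingface/datasets/json/.../json-train.arrow'}]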

Memory-mapped files can also be shared across multiple processes, which enables methods like Dataset.map() to be parallelized without needing to move or copy the dataset. Under the hood, these capabilities are all realized by the Apache Arrow memory format and pyarrow library, which make the data loading and processing lightning fast. (For more details about Apache Arrow and comparisons to Pandas, check out Dejan Simic’s blog post.) To see this in action, let’s run a little speed test by iterating over all the elements in the PubMed Abstracts dataset:

import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""

time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
f"{time:.1f}s, i.e. {size_gb/time:.3f} GB/s"
)

'Iterated over 15518009 examples (about 19.5 GB) in 64.2s, i.e. 0.304 GB/s'

Here we’ve used Python’s timeit module to measure the execution time taken by code_snippet. You’ll typically be able to iterate over a dataset at speeds ranging from a few tenths of a GB/s to several GB/s. This works great for the vast majority of applications, but sometimes you’ll have to work with a dataset that is too large to even store on your laptop’s hard drive. For example, if we tried to download the Pile in its entirety, we’d need 825 GB of free disk space! To handle these cases, 🤗 Datasets provides a streaming feature that allows us to download and access elements on the fly, without needing to download the whole dataset. Let’s take a look at how this works.

💡 In Jupyter notebooks you can also time cells using the %%timeit magic function.

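For example, the speed test above could be written as a notebook cell like this (a sketch; %%timeit must be the first line of the cell, and it reruns the body several times to report an average):

%%timeit
batch_size = 1000
for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx : idx + batch_size]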

Streaming datasets

To enable dataset streaming you just need to pass the streaming=True argument to the load_dataset() function. For example, let’s load the PubMed Abstracts dataset again, but in streaming mode:

pubmed_dataset_streamed = load_dataset(
"json", data_files=data_files, split="train", streaming=True
)

Instead of the familiar Dataset that we’ve encountered elsewhere in this chapter, the object returned with streaming=True is an IterableDataset. As the name suggests, to access the elements of an IterableDataset we need to iterate over it. We can access the first element of our streamed dataset as follows:

next(iter(pubmed_dataset_streamed))

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age ...'}

The elements from a streamed dataset can be processed on the fly using IterableDataset.map(), which is useful during training if you need to tokenize the inputs. The process is exactly the same as the one we used to tokenize our dataset in Chapter 3, with the only difference being that outputs are returned one by one:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))

{'input_ids': [101, 4958, 5178, 4328, 6779, ...], 'attention_mask': [1, 1, 1, 1, 1, ...]}

💡 To speed up tokenization with streaming you can pass batched=True, as we saw in the last section. It will process the examples batch by batch; the default batch size is 1,000 and can be specified with the batch_size argument.

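In code, that would look something like the following sketch (same tokenizer as above; batch_size=1_000 simply makes the default explicit):

tokenized_dataset = pubmed_dataset_streamed.map(
    lambda examples: tokenizer(examples["text"]),
    batched=True,
    batch_size=1_000,
)
next(iter(tokenized_dataset))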

You can also shuffle a streamed dataset using IterableDataset.shuffle(), but unlike Dataset.shuffle() this only shuffles the elements in a predefined buffer_size:

shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

{'meta': {'pmid': 11410799, 'language': 'eng'},
 'text': 'Randomized study of dose or schedule modification of granulocyte colony-stimulating factor in platinum-based chemotherapy for elderly patients with lung cancer ...'}

In this example, we selected a random example from the first 10,000 examples in the buffer. Once an example is accessed, its spot in the buffer is filled with the next example in the corpus (i.e., the 10,001st example in the case above). You can also select elements from a streamed dataset using the IterableDataset.take() and IterableDataset.skip() functions, which act in a similar way to Dataset.select(). For example, to select the first 5 examples in the PubMed Abstracts dataset we can do the following:

dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection ...'},
 {'meta': {'pmid': 11409575, 'language': 'eng'},
  'text': 'Clinical signs of hypoxaemia in children with acute lower respiratory infection: indicators of oxygen therapy ...'},
 {'meta': {'pmid': 11409576, 'language': 'eng'},
  'text': "Hypoxaemia in children with severe pneumonia in Papua New Guinea ..."},
 {'meta': {'pmid': 11409577, 'language': 'eng'},
  'text': 'Oxygen concentrators and cylinders ...'},
 {'meta': {'pmid': 11409578, 'language': 'eng'},
  'text': 'Oxygen supply in rural africa: a personal experience ...'}]

Similarly, you can use the IterableDataset.skip() function to create training and validation splits from a shuffled dataset as follows:

# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)

Let’s round out our exploration of dataset streaming with a common application: combining multiple datasets together to create a single corpus. 🤗 Datasets provides an interleave_datasets() function that converts a list of IterableDataset objects into a single IterableDataset, where the elements of the new dataset are obtained by alternating among the source examples. This function is especially useful when you’re trying to combine large datasets, so as an example let’s stream the FreeLaw subset of the Pile, which is a 51 GB dataset of legal opinions from US courts:

law_dataset_streamed = load_dataset(
"json",
data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
split="train",
streaming=True,
)
next(iter(law_dataset_streamed))

{'meta': {'case_ID': '110921.json',
  'case_jurisdiction': 'scotus.tar.gz',
  'date_created': '2010-04-28T17:12:49Z'},
 'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General...'}

This dataset is large enough to stress the RAM of most laptops, yet we’ve been able to load and access it without breaking a sweat! Let’s now combine the examples from the FreeLaw and PubMed Abstracts datasets with the interleave_datasets() function:

from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection ...'},
 {'meta': {'case_ID': '110921.json',
   'case_jurisdiction': 'scotus.tar.gz',
   'date_created': '2010-04-28T17:12:49Z'},
  'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General...'}]

Here we’ve used the islice() function from Python’s itertools module to select the first two examples from the combined dataset, and we can see that they match the first examples from each of the two source datasets.

Finally, if you want to stream the Pile in its 825 GB entirety, you can grab all the prepared files as follows:

base_url = "https://the-eye.eu/public/AI/pile/"
data_files = {
"train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
"validation": base_url + "val.jsonl.zst",
"test": base_url + "test.jsonl.zst",
}
pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
next(iter(pile_dataset["train"]))

{'meta': {'pile_set_name': 'Pile-CC'},
 'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web...'}

✏️ Try it out! Use one of the large Common Crawl corpora like mc4 or oscar to create a streaming multilingual dataset that represents the spoken proportions of languages in a country of your choice. For example, the four national languages in Switzerland are German, French, Italian, and Romansh, so you could try creating a Swiss corpus by sampling the Oscar subsets according to their spoken proportion.

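One possible starting point is sketched below. The OSCAR configuration names and the language proportions are assumptions for you to verify (configurations on the Hub follow the pattern unshuffled_deduplicated_<lang>, and the probabilities are only rough placeholders for Switzerland’s spoken shares):

from datasets import interleave_datasets, load_dataset

languages = ["de", "fr", "it", "rm"]    # German, French, Italian, Romansh
proportions = [0.62, 0.23, 0.08, 0.07]  # rough spoken proportions -- placeholder values

oscar_streams = [
    load_dataset("oscar", f"unshuffled_deduplicated_{lang}", split="train", streaming=True)
    for lang in languages
]
swiss_corpus = interleave_datasets(oscar_streams, probabilities=proportions, seed=42)
next(iter(swiss_corpus))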

You now have all the tools you need to load and process datasets of all shapes and sizes — but unless you’re exceptionally lucky, there will come a point in your NLP journey where you’ll have to actually create a dataset to solve the problem at hand. That’s the topic of the next section!
