The 🤗 Datasets library

Original course link: https://huggingface.co/course/chapter5/2?fw=pt

What if my dataset isn’t on the Hub?

You know how to use the Hugging Face Hub to download datasets, but you’ll often find yourself working with data that is stored either on your laptop or on a remote server. In this section we’ll show you how 🤗 Datasets can be used to load datasets that aren’t available on the Hugging Face Hub.

Working with local and remote datasets

🤗 Datasets provides loading scripts to handle the loading of local and remote datasets. It supports several common data formats, such as:

Data format          Loading script   Example
CSV & TSV            csv              load_dataset("csv", data_files="my_file.csv")
Text files           text             load_dataset("text", data_files="my_file.txt")
JSON & JSON Lines    json             load_dataset("json", data_files="my_file.jsonl")
Pickled DataFrames   pandas           load_dataset("pandas", data_files="my_dataframe.pkl")

As shown in the table, for each data format we just need to specify the type of loading script in the load_dataset() function, along with a data_files argument that specifies the path to one or more files. Let’s start by loading a dataset from local files; later we’ll see how to do the same with remote files.
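For instance, here is a minimal sketch of loading a tab-separated file (my_file.tsv is a hypothetical name); the csv loading script forwards extra keyword arguments such as sep to pandas.read_csv:

from datasets import load_dataset

# Hypothetical local TSV file; the "csv" script passes sep through to
# pandas.read_csv, so tab-separated files load with the same one-liner.
tsv_dataset = load_dataset("csv", data_files="my_file.tsv", sep="\t")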

Loading a local dataset

For this example we’ll use the SQuAD-it dataset, which is a large-scale dataset for question answering in Italian.

The training and test splits are hosted on GitHub, so we can download them with a simple wget command:

!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

This will download two compressed files called SQuAD_it-train.json.gz and SQuAD_it-test.json.gz, which we can decompress with the Linux gzip command:

!gzip -dkv SQuAD_it-*.json.gz
SQuAD_it-test.json.gz:	   87.4% -- replaced with SQuAD_it-test.json
SQuAD_it-train.json.gz: 82.2% -- replaced with SQuAD_it-train.json

We can see that the compressed files have been replaced with SQuAD_it-train.json and SQuAD_it-test.json, and that the data is stored in the JSON format.
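If you want a quick look at the raw structure before handing the files to 🤗 Datasets, a small sketch with Python's built-in json module (assuming the decompressed files from above) does the trick:

import json

# Peek at the decompressed training file: SQuAD-style JSON is one nested
# object, with the articles stored under a "data" key.
with open("SQuAD_it-train.json", encoding="utf-8") as f:
    raw = json.load(f)

print(list(raw.keys()))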

✎ If you’re wondering why there’s a ! character in the above shell commands, that’s because we’re running them within a Jupyter notebook. Simply remove the prefix if you want to download and unzip the dataset within a terminal.

To load a JSON file with the load_dataset() function, we just need to know if we’re dealing with ordinary JSON (similar to a nested dictionary) or JSON Lines (line-separated JSON). Like many question answering datasets, SQuAD-it uses the nested format, with all the text stored in a data field. This means we can load the dataset by specifying the field argument as follows:

from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")

By default, loading local files creates a DatasetDict object with a train split. We can see this by inspecting the squad_it_dataset object:

squad_it_dataset
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

This shows us the number of rows and the column names associated with the training set. We can view one of the examples by indexing into the train split as follows:

squad_it_dataset["train"][0]
{
    "title": "Terremoto del Sichuan del 2008",
    "paragraphs": [
        {
            "context": "Il terremoto del Sichuan del 2008 o il terremoto...",
            "qas": [
                {
                    "answers": [{"answer_start": 29, "text": "2008"}],
                    "id": "56cdca7862d2951400fa6826",
                    "question": "In quale anno si è verificato il terremoto nel Sichuan?",
                },
                ...
            ],
        },
        ...
    ],
}

Great, we’ve loaded our first local dataset! But while this worked for the training set, what we really want is to include both the train and test splits in a single DatasetDict object so we can apply Dataset.map() functions across both splits at once. To do this, we can provide a dictionary to the data_files argument that maps each split name to a file associated with that split:

data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

This is exactly what we wanted. Now, we can apply various preprocessing techniques to clean up the data, tokenize the reviews, and so on.
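For example, here is a minimal sketch (lowercase_title is just a made-up preprocessing function) showing that a single Dataset.map() call on the DatasetDict is applied to both splits at once:

def lowercase_title(example):
    # Made-up preprocessing step: lowercase the article title of each example.
    return {"title": example["title"].lower()}

# DatasetDict.map() runs over every split, so train and test are both updated.
squad_it_dataset = squad_it_dataset.map(lowercase_title)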

The data_files argument of the load_dataset() function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting data_files="*.json"). See the 🤗 Datasets documentation for more details.
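As a quick sketch of these options (the file names below are hypothetical):

# A list of files forms one split, and a dictionary maps split names to paths.
data_files = {"train": ["part-001.json", "part-002.json"], "test": "part-test.json"}
dataset = load_dataset("json", data_files=data_files)

# A glob pattern gathers every matching file into a single default "train" split.
dataset = load_dataset("json", data_files="data/*.json")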

The loading scripts in 🤗 Datasets actually support automatic decompression of the input files, so we could have skipped the use of gzip by pointing the data_files argument directly to the compressed files:

data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

This can be useful if you don’t want to manually decompress many GZIP files. The automatic decompression also applies to other common formats like ZIP and TAR, so you just need to point data_files to the compressed files and you’re good to go!

Now that you know how to load local files on your laptop or desktop, let’s take a look at loading remote files.

Loading a remote dataset

If you’re working as a data scientist or coder in a company, there’s a good chance the datasets you want to analyze are stored on some remote server. Fortunately, loading remote files is just as simple as loading local ones! Instead of providing a path to local files, we point the data_files argument of load_dataset() to one or more URLs where the remote files are stored. For example, for the SQuAD-it dataset hosted on GitHub, we can just point data_files to the SQuAD_it-*.json.gz URLs as follows:

url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
"train": url + "SQuAD_it-train.json.gz",
"test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

This returns the same DatasetDict object obtained above, but saves us the step of manually downloading and decompressing the SQuAD_it-*.json.gz files. This wraps up our foray into the various ways to load datasets that aren’t hosted on the Hugging Face Hub. Now that we’ve got a dataset to play with, let’s get our hands dirty with various data-wrangling techniques!

✏️ Try it out! Pick another dataset hosted on GitHub or the UCI Machine Learning Repository and try loading it both locally and remotely using the techniques introduced above. For bonus points, try loading a dataset that’s stored in a CSV or text format (see the documentation for more information on these formats).
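For instance, a remote CSV can be loaded in one line; here is a sketch using the classic UCI wine quality file (the URL may have moved since, and that file uses ; as its delimiter, hence sep=";"):

from datasets import load_dataset

# Remote CSV example; sep=";" is forwarded to pandas.read_csv because the
# wine quality file is semicolon-separated.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_dataset = load_dataset("csv", data_files=url, sep=";")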
