5-The_Datasets_library-0-Introduction

Original course link: https://huggingface.co/course/chapter5/1?fw=pt

Introduction

In [Chapter 3] you got your first taste of the 🤗 Datasets library and saw that there were three main steps when it came to fine-tuning a model:

  1. Load a dataset from the Hugging Face Hub.
  2. Preprocess the data with Dataset.map().
  3. Load and compute metrics.

But this is just scratching the surface of what 🤗 Datasets can do! In this chapter, we will take a deep dive into the library. Along the way, we’ll find answers to the following questions:

  • What do you do when your dataset is not on the Hub?
  • How can you slice and dice a dataset? (And what if you really need to use Pandas?)
  • What do you do when your dataset is huge and will melt your laptop’s RAM?
  • What the heck are “memory mapping” and Apache Arrow?
  • How can you create your own dataset and push it to the Hub?

The techniques you learn here will prepare you for the advanced tokenization and fine-tuning tasks in [Chapter 6] and [Chapter 7] — so grab a coffee and let’s get started!
