Original course link: https://huggingface.co/course/chapter4/4?fw=pt

Building a model card

The model card is a file which is arguably as important as the model and tokenizer files in a model repository. It is the central definition of the model, ensuring reusability by fellow community members and reproducibility of results, and providing a platform on which other members may build their artifacts.

Documenting the training and evaluation process helps others understand what to expect of a model — and providing sufficient information regarding the data that was used and the preprocessing and postprocessing that were done ensures that the limitations, biases, and contexts in which the model is and is not useful can be identified and understood.

Therefore, creating a model card that clearly defines your model is a very important step. Here, we provide some tips that will help you with this. Creating the model card is done through the README.md file you saw earlier, which is a Markdown file.

The “model card” concept originates from a research direction from Google, first shared in the paper “Model Cards for Model Reporting” by Margaret Mitchell et al. A lot of information contained here is based on that paper, and we recommend you take a look at it to understand why model cards are so important in a world that values reproducibility, reusability, and fairness.

The model card usually starts with a very brief, high-level overview of what the model is for, followed by additional details in the following sections:

  • Model description
  • Intended uses & limitations
  • How to use
  • Limitations and bias
  • Training data
  • Training procedure
  • Evaluation results
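
As a minimal sketch, the outline above could be turned into a README.md skeleton with a few lines of Python. The function name `build_model_card_skeleton`, the use of `##` headings, and the "TODO" placeholders are assumptions made for illustration; they are not prescribed by the course or by the Hub.

```python
# Section titles taken from the list above.
SECTIONS = [
    "Model description",
    "Intended uses & limitations",
    "How to use",
    "Limitations and bias",
    "Training data",
    "Training procedure",
    "Evaluation results",
]


def build_model_card_skeleton(model_name: str, overview: str) -> str:
    """Return Markdown text: a short overview followed by one empty section per title."""
    lines = [f"# {model_name}", "", overview, ""]
    for title in SECTIONS:
        # Each section starts empty; "TODO" marks what still has to be written.
        lines += [f"## {title}", "", "TODO", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # "my-model" and the overview sentence are hypothetical placeholders.
    print(build_model_card_skeleton("my-model", "A one-sentence summary of what the model is for."))
```

The resulting text can be saved as README.md in the model repository and filled in section by section.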

Let’s take a look at what each of these sections should contain.

Model description

The model description provides basic details about the model. This includes the architecture, the version, whether it was introduced in a paper, whether an original implementation is available, the author, and general information about the model. Any copyright should be attributed here. General information about training procedures, parameters, and important disclaimers can also be mentioned in this section.

Intended uses & limitations

Here you describe the use cases the model is intended for, including the languages, fields, and domains where it can be applied. This section of the model card can also document areas that are known to be out of scope for the model, or where it is likely to perform suboptimally.

How to use

This section should include some examples of how to use the model. This can showcase usage of the pipeline() function, usage of the model and tokenizer classes, and any other code you think might be helpful.
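
For instance, a "How to use" section might contain a snippet along these lines. This is only a sketch: it assumes the `transformers` library (and its tokenizer dependencies) is installed, it uses the `camembert-base` checkpoint with the `fill-mask` task as an example, and it needs network access to download the checkpoint on first use. The helper name `top_predictions` and the example sentence are made up for illustration.

```python
def top_predictions(text: str, model_name: str = "camembert-base", top_k: int = 3):
    """Fill in the masked token of `text` and return (token, score) pairs."""
    # Imported here so that defining the helper does not require the library.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    results = fill_mask(text, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]


# Example call (assumption: the input contains exactly one "<mask>" token,
# which is the mask token used by CamemBERT):
# for token, score in top_predictions("Le camembert est <mask> :)"):
#     print(f"{token!r}: {score:.3f}")
```

A short snippet like this lets readers check within a minute whether the model does what they need.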

Training data

This part should indicate which dataset(s) the model was trained on. A brief description of the dataset(s) is also welcome.

Training procedure

In this section you should describe all the relevant aspects of training that are useful from a reproducibility perspective. This includes any preprocessing and postprocessing that were done on the data, as well as details such as the number of epochs the model was trained for, the batch size, the learning rate, and so on.
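
One way to keep these details from getting lost is to record them in a small structure at training time and render them into the model card afterwards. The sketch below assumes a hypothetical `TrainingSetup` dataclass; the field names and all values are placeholders, not recommendations.

```python
from dataclasses import asdict, dataclass


@dataclass
class TrainingSetup:
    # All default values are hypothetical placeholders.
    num_epochs: int = 3
    batch_size: int = 16
    learning_rate: float = 5e-5
    preprocessing: str = "lowercasing, truncation to 128 tokens"


def to_markdown(setup: TrainingSetup) -> str:
    """Render the recorded settings as a Markdown bullet list."""
    return "\n".join(f"- {name}: {value}" for name, value in asdict(setup).items())


if __name__ == "__main__":
    print(to_markdown(TrainingSetup()))
```

The generated list can be pasted under the "Training procedure" heading of the README.md file.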

Variables and metrics

Here you should describe the metrics you use for evaluation, and the different factors you are measuring. Mentioning which metric(s) were used, on which dataset and which dataset split, makes it easy to compare your model’s performance to that of other models. The choice of metrics and factors should be informed by the previous sections, such as the intended users and use cases.

Evaluation results

Finally, provide an indication of how well the model performs on the evaluation dataset. If the model uses a decision threshold, either provide the decision threshold used in the evaluation, or provide details on evaluation at different thresholds for the intended uses.
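
To illustrate what "evaluation at different thresholds" can look like, here is a small pure-Python sketch for a binary classifier. The function name `precision_recall_at`, the toy scores and labels, and the convention of returning 0.0 when a ratio is undefined are all assumptions made for this example.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    # Convention chosen here: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy model scores and gold labels, invented for illustration only.
    scores = [0.95, 0.80, 0.60, 0.40, 0.20]
    labels = [1, 1, 0, 1, 0]
    for threshold in (0.3, 0.5, 0.7):
        p, r = precision_recall_at(threshold, scores, labels)
        print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Reporting a few such rows in the model card shows readers how the trade-off between precision and recall shifts with the threshold.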

Example

Check out the following for a few examples of well-crafted model cards:

  • bert-base-cased
  • gpt2
  • distilbert

More examples from different organizations and companies are available here.

Note

Model cards are not a requirement when publishing models, and you don’t need to include all of the sections described above when you make one. However, explicit documentation of the model can only benefit future users, so we recommend that you fill in as many of the sections as possible to the best of your knowledge and ability.

Model card metadata

If you have done a little exploring of the Hugging Face Hub, you should have seen that some models belong to certain categories: you can filter them by tasks, languages, libraries, and more. The categories a model belongs to are identified according to the metadata you add in the model card header.

For example, if you take a look at the camembert-base model card, you should see the following lines in the model card header:

---
language: fr
license: mit
datasets:
- oscar
---

This metadata is parsed by the Hugging Face Hub, which then identifies this model as being a French model, with an MIT license, trained on the Oscar dataset.
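
To make the idea of a parsed header concrete, the sketch below reads such a header with the Python standard library only. It is a deliberately simplified stand-in, not the Hub's actual parser: the hypothetical `parse_front_matter` function handles only `key: value` pairs and `- item` lists, and real code would more likely rely on a full YAML parser.

```python
def parse_front_matter(readme_text: str) -> dict:
    """Parse a very small YAML subset found between two '---' lines."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no metadata header at the top of the file
    metadata, current_key = {}, None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter of the header
            break
        if stripped.startswith("- ") and isinstance(metadata.get(current_key), list):
            metadata[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            # An empty value means a list follows on the next lines.
            metadata[current_key] = value.strip() or []
    return metadata


if __name__ == "__main__":
    header = "---\nlanguage: fr\nlicense: mit\ndatasets:\n- oscar\n---\n"
    print(parse_front_matter(header))
```

Applied to the header above, the function yields a dictionary with the language, the license, and a one-element list of datasets, which is the kind of information the Hub uses for filtering.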

The full model card specification allows specifying languages, licenses, tags, datasets, metrics, as well as the evaluation results the model obtained when training.
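
As a rough sketch of how a header with several of these fields could be generated, the hypothetical `build_header` helper below serializes a flat dictionary into the `---` delimited format. The `tags` and `metrics` values are illustrative placeholders, and evaluation results are left out because their nested structure goes beyond this flat sketch.

```python
def build_header(metadata: dict) -> str:
    """Serialize a flat dict of strings and lists of strings into a model card header."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines += [f"- {item}" for item in value]
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values chosen for illustration only.
    print(build_header({
        "language": "fr",
        "license": "mit",
        "tags": ["fill-mask"],
        "datasets": ["oscar"],
        "metrics": ["accuracy"],
    }))
```

The returned text is meant to be placed at the very top of the README.md file, before the first heading.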
