1-Transformer_models-7-Bias_and_limitations

Original course link: https://huggingface.co/course/chapter1/8?fw=pt

Bias and limitations

If your intent is to use a pretrained model or a fine-tuned version in production, please be aware that, while these models are powerful tools, they come with limitations. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet.

To give a quick illustration, let’s go back to the example of a fill-mask pipeline with the BERT model:

from transformers import pipeline

# Load a fill-mask pipeline backed by the original BERT checkpoint
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Top predictions for the masked occupation in each gendered sentence
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

['lawyer', 'carpenter', 'doctor', 'waiter', 'mechanic']
['nurse', 'waitress', 'teacher', 'maid', 'prostitute']

When asked to fill in the missing word in these two sentences, the model gives only one gender-free answer (waiter/waitress). The others are work occupations usually associated with one specific gender — and yes, prostitute ended up in the top 5 possibilities the model associates with “woman” and “work.” This happens even though BERT is one of the rare Transformer models not built by scraping data from all over the internet, but rather using apparently neutral data (it’s trained on the English Wikipedia and BookCorpus datasets).
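
If you want to see how strongly the model favors each of these completions, the same pipeline also returns a probability score for every prediction. The snippet below is a small illustrative extension (not part of the original course code) that prints those scores next to the predicted tokens:

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Print each predicted occupation together with the model's probability score
for template in ("This man works as a [MASK].", "This woman works as a [MASK]."):
    print(template)
    for r in unmasker(template):
        print(f"  {r['token_str']:>12}  score={r['score']:.3f}")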


When you use these tools, you therefore need to keep in the back of your mind that the original model you are using could very easily generate sexist, racist, or homophobic content. Fine-tuning the model on your data won’t make this intrinsic bias disappear.
