# Multiple choice

[[open-in-colab]]

A multiple choice task is similar to question answering, except that in addition to a question, several candidate answers are provided along with the context, and the model's task is to select the correct answer.

This guide will show you how to:

  1. Finetune BERT on the `regular` configuration of the SWAG dataset to select the best answer.
  2. Use your finetuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

ALBERT, BERT, BigBird, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, FlauBERT, FNet, Funnel Transformer, I-BERT, Longformer, LUKE, MEGA, Megatron-BERT, MobileBERT, MPNet, MRA, Nezha, Nyströmformer, QDQBert, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, SqueezeBERT, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

## Load SWAG dataset

Start by loading the `regular` configuration of the SWAG dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset

>>> swag = load_dataset("swag", "regular")

Then take a look at an example:

>>> swag["train"][0]
{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}

While it looks like there are a lot of fields here, it is actually pretty straightforward (see the quick check after this list):

  • `sent1` and `sent2`: these fields show how a sentence starts, and if you join the two, you get the `startphrase` field.
  • `ending0`, `ending1`, `ending2`, `ending3`: each suggests a possible way the sentence can end, but only one of them is correct.
  • `label`: identifies the correct sentence ending.
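
To make these relationships concrete, here is a quick check based on the example shown above (the exact strings are specific to that row and will differ for other examples):

```py
>>> example = swag["train"][0]

>>> # startphrase is just sent1 and sent2 joined by a space
>>> example["startphrase"] == f"{example['sent1']} {example['sent2']}"
True

>>> # label indexes into the ending columns to pick the correct continuation
>>> example[f"ending{example['label']}"]
'passes by walking down the street playing their instruments.'
```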

## Preprocess

The next step is to load a BERT tokenizer to process the sentence starts and the four possible endings:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

The preprocessing function you want to create needs to:

  1. Make four copies of the `sent1` field and combine each of them with `sent2` to recreate how a sentence starts.
  2. Combine `sent2` with each of the four possible sentence endings.
  3. Flatten these two lists so you can tokenize them, and then unflatten them afterward so each example has corresponding `input_ids`, `attention_mask`, and `labels` fields.

>>> ending_names = ["ending0", "ending1", "ending2", "ending3"]


>>> def preprocess_function(examples):
...     first_sentences = [[context] * 4 for context in examples["sent1"]]
...     question_headers = examples["sent2"]
...     second_sentences = [
...         [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
...     ]

...     first_sentences = sum(first_sentences, [])
...     second_sentences = sum(second_sentences, [])

...     tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
...     return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
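
Before applying the function to the whole dataset, you can sanity-check it on a couple of rows. This is only a quick sketch that relies on the `swag` dataset and `tokenizer` loaded above; each example should yield four tokenized sequences, one per candidate ending:

```py
>>> sample = preprocess_function(swag["train"][:2])

>>> # two examples, each with four candidate sequences
>>> len(sample["input_ids"]), len(sample["input_ids"][0])
(2, 4)
```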

To apply the preprocessing function over the entire dataset, use 🤗 Datasets' [~datasets.Dataset.map] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

tokenized_swag = swag.map(preprocess_function, batched=True)

🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt [DataCollatorWithPadding] to create a batch of examples. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

`DataCollatorForMultipleChoice` flattens all the model inputs, applies padding, and then unflattens the results. In PyTorch:

```py
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import torch


>>> @dataclass
... class DataCollatorForMultipleChoice:
...     """
...     Data collator that will dynamically pad the inputs for multiple choice received.
...     """

...     tokenizer: PreTrainedTokenizerBase
...     padding: Union[bool, str, PaddingStrategy] = True
...     max_length: Optional[int] = None
...     pad_to_multiple_of: Optional[int] = None

...     def __call__(self, features):
...         label_name = "label" if "label" in features[0].keys() else "labels"
...         labels = [feature.pop(label_name) for feature in features]
...         batch_size = len(features)
...         num_choices = len(features[0]["input_ids"])
...         flattened_features = [
...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
...         ]
...         flattened_features = sum(flattened_features, [])

...         batch = self.tokenizer.pad(
...             flattened_features,
...             padding=self.padding,
...             max_length=self.max_length,
...             pad_to_multiple_of=self.pad_to_multiple_of,
...             return_tensors="pt",
...         )

...         batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
...         batch["labels"] = torch.tensor(labels, dtype=torch.int64)
...         return batch
```

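As a quick sanity check of the PyTorch collator (a sketch that assumes the `tokenized_swag` dataset created above; the last dimension depends on the longest sequence in the batch), you can collate two examples and confirm the `(batch_size, num_choices, seq_len)` shape:

```py
>>> collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
>>> features = [
...     {k: tokenized_swag["train"][i][k] for k in ("input_ids", "attention_mask", "label")}
...     for i in range(2)
... ]
>>> batch = collator(features)
>>> batch["input_ids"].shape  # torch.Size([2, 4, longest sequence length in this batch])
```
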
The equivalent TensorFlow version:

```py
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import tensorflow as tf


>>> @dataclass
... class DataCollatorForMultipleChoice:
...     """
...     Data collator that will dynamically pad the inputs for multiple choice received.
...     """

...     tokenizer: PreTrainedTokenizerBase
...     padding: Union[bool, str, PaddingStrategy] = True
...     max_length: Optional[int] = None
...     pad_to_multiple_of: Optional[int] = None

...     def __call__(self, features):
...         label_name = "label" if "label" in features[0].keys() else "labels"
...         labels = [feature.pop(label_name) for feature in features]
...         batch_size = len(features)
...         num_choices = len(features[0]["input_ids"])
...         flattened_features = [
...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
...         ]
...         flattened_features = sum(flattened_features, [])

...         batch = self.tokenizer.pad(
...             flattened_features,
...             padding=self.padding,
...             max_length=self.max_length,
...             pad_to_multiple_of=self.pad_to_multiple_of,
...             return_tensors="tf",
...         )

...         batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
...         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
...         return batch
```

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

>>> import evaluate

>>> accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [~evaluate.EvaluationModule.compute] to calculate the accuracy:

>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
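
If you'd like to see what the function returns before wiring it into training, you can call it on some made-up logits (the values below are purely illustrative):

```py
>>> dummy_logits = np.array([[0.1, 2.0, -1.0, 0.3], [1.5, 0.2, 0.1, -0.4]])
>>> dummy_labels = np.array([1, 0])
>>> compute_metrics((dummy_logits, dummy_labels))
{'accuracy': 1.0}
```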

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

If you aren't familiar with finetuning a model with the [Trainer], take a look at the basic tutorial here!

You're ready to start training your model now! Load BERT with [AutoModelForMultipleChoice]:

>>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

>>> model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

At this point, only three steps remain:

  1. Define your training hyperparameters in [TrainingArguments]. The only required parameter is `output_dir`, which specifies where to save your model. You'll push the model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the accuracy and save the training checkpoint.
  2. Pass the training arguments to [Trainer] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
  3. Call [~Trainer.train] to finetune your model.

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_swag_model",
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     load_best_model_at_end=True,
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_swag["train"],
...     eval_dataset=tokenized_swag["validation"],
...     tokenizer=tokenizer,
...     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the [~transformers.Trainer.push_to_hub] method so everyone can use your model:

>>> trainer.push_to_hub()

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:
>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_train_epochs = 2
>>> total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs
>>> optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

Then load BERT with [TFAutoModelForMultipleChoice]:

>>> from transformers import TFAutoModelForMultipleChoice

>>> model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

Convert your datasets to the `tf.data.Dataset` format with [~transformers.TFPreTrainedModel.prepare_tf_dataset]:

>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_swag["train"],
...     shuffle=True,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_swag["validation"],
...     shuffle=False,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

Configure the model for training with `compile`. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

>>> model.compile(optimizer=optimizer)  # No loss argument!

The last two things to set up before you start training are to compute the accuracy from the predictions, and to provide a way to push your model to the Hub. Both are done by using Keras callbacks.

Pass your `compute_metrics` function to [~transformers.KerasMetricCallback]:

>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

Specify where to push your model and tokenizer in [~transformers.PushToHubCallback]:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_model",
...     tokenizer=tokenizer,
... )

Then bundle your callbacks together:

>>> callbacks = [metric_callback, push_to_hub_callback]

Finally, you're ready to start training your model! Call `fit` with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=callbacks)

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for multiple choice, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with some text and two candidate answers:

>>> prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
>>> candidate1 = "The law does not apply to croissants and brioche."
>>> candidate2 = "The law applies to baguettes."

Tokenize each prompt and candidate answer pair and return PyTorch tensors. You should also create some `labels`:

>>> import torch
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True)
>>> labels = torch.tensor(0).unsqueeze(0)

Pass your inputs and labels to the model and return the `logits`:

>>> from transformers import AutoModelForMultipleChoice

>>> model = AutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
>>> outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels)
>>> logits = outputs.logits

Get the class with the highest probability:

>>> predicted_class = logits.argmax().item()
>>> predicted_class
0
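
To turn the predicted index back into a readable answer, you can index into a list of the candidates (assuming the prediction of `0` shown above, this selects `candidate1`):

```py
>>> candidates = [candidate1, candidate2]
>>> candidates[predicted_class]
'The law does not apply to croissants and brioche.'
```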

Tokenize each prompt and candidate answer pair and return TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="tf", padding=True)

Pass your inputs to the model and return the `logits`:

>>> from transformers import TFAutoModelForMultipleChoice

>>> model = TFAutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
>>> inputs = {k: tf.expand_dims(v, 0) for k, v in inputs.items()}
>>> outputs = model(inputs)
>>> logits = outputs.logits

Get the class with the highest probability:

>>> predicted_class = int(tf.math.argmax(logits, axis=-1)[0])
>>> predicted_class
0