[[open-in-colab]]
A multiple choice task is similar to question answering, except several candidate answers are provided along with a context, and the model is trained to select the correct answer.

This guide will show you how to:

1. Finetune BERT on the `regular` configuration of the SWAG dataset to select the best answer given several options and some context.
2. Use your finetuned model for inference.
The task illustrated in this tutorial is supported by the following model architectures: ALBERT, BERT, BigBird, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, FlauBERT, FNet, Funnel Transformer, I-BERT, Longformer, LUKE, MEGA, Megatron-BERT, MobileBERT, MPNet, MRA, Nezha, Nyströmformer, QDQBert, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, SqueezeBERT, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO
Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```
Start by loading the `regular` configuration of the SWAG dataset from the 🤗 Datasets library:

```py
>>> from datasets import load_dataset

>>> swag = load_dataset("swag", "regular")
```
Then take a look at an example:

```py
>>> swag["train"][0]
{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}
```
While it looks like there are a lot of fields here, it is actually pretty straightforward:

- `sent1` and `sent2`: these fields show how a sentence starts, and if you put the two together, you get the `startphrase` field.
- `ending`: suggests a possible ending for the sentence, but only one of them is correct.
- `label`: identifies the correct sentence ending.
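To make the relationship between these fields concrete, here is a minimal plain-Python sketch, using the example record above, of how they combine into four candidate sentences:

```python
# The example record from swag["train"][0], trimmed to the relevant fields.
example = {
    "sent1": "Members of the procession walk down the street holding small horn brass instruments.",
    "sent2": "A drum line",
    "ending0": "passes by walking down the street playing their instruments.",
    "ending1": "has heard approaching them.",
    "ending2": "arrives and they're outside dancing and asleep.",
    "ending3": "turns the lead singer watches the performance.",
    "startphrase": "Members of the procession walk down the street holding small horn brass instruments. A drum line",
    "label": 0,
}

# Joining sent1 and sent2 recreates the startphrase field.
startphrase = f"{example['sent1']} {example['sent2']}"

# Each candidate pairs the start phrase with one of the four endings;
# label indexes the correct one.
candidates = [f"{startphrase} {example[f'ending{i}']}" for i in range(4)]
correct = candidates[example["label"]]
```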
The next step is to load a BERT tokenizer to process the sentence starts and the four possible endings:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```
The preprocessing function you want to create needs to:

1. Make four copies of the `sent1` field and combine each of them with `sent2` to recreate how a sentence starts.
2. Combine `sent2` with each of the four possible sentence endings.
3. Flatten these two lists so you can tokenize them, and then unflatten them afterward so each example has corresponding `input_ids`, `attention_mask`, and `labels` fields.
```py
>>> ending_names = ["ending0", "ending1", "ending2", "ending3"]


>>> def preprocess_function(examples):
...     first_sentences = [[context] * 4 for context in examples["sent1"]]
...     question_headers = examples["sent2"]
...     second_sentences = [
...         [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
...     ]
...     first_sentences = sum(first_sentences, [])
...     second_sentences = sum(second_sentences, [])
...     tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
...     return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
```
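The flatten-then-regroup step inside `preprocess_function` is easy to miss; a sketch on toy data (with tokenization stubbed out, so the names here are hypothetical) shows what it does:

```python
# Two examples, each with four candidate sentences (toy stand-ins for real text).
nested = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]

# Flatten so the tokenizer sees one long list of sentences.
flat = sum(nested, [])

# Stand-in for tokenization: one output per input sentence.
tokenized = [f"tok({s})" for s in flat]

# Regroup into chunks of 4 so each example keeps its four choices together,
# mirroring the dict comprehension returned by preprocess_function.
regrouped = [tokenized[i : i + 4] for i in range(0, len(tokenized), 4)]
```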
To apply the preprocessing function over the entire dataset, use 🤗 Datasets' [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
tokenized_swag = swag.map(preprocess_function, batched=True)
```
🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt [`DataCollatorWithPadding`] to create a batch of examples. During collation, it's more efficient to dynamically pad the sentences to the longest length in a batch, instead of padding the whole dataset to the maximum length. `DataCollatorForMultipleChoice` flattens all the model inputs, applies padding, and then unflattens the results:
```py
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import torch


>>> @dataclass
... class DataCollatorForMultipleChoice:
...     """
...     Data collator that will dynamically pad the inputs for multiple choice received.
...     """

...     tokenizer: PreTrainedTokenizerBase
...     padding: Union[bool, str, PaddingStrategy] = True
...     max_length: Optional[int] = None
...     pad_to_multiple_of: Optional[int] = None

...     def __call__(self, features):
...         label_name = "label" if "label" in features[0].keys() else "labels"
...         labels = [feature.pop(label_name) for feature in features]
...         batch_size = len(features)
...         num_choices = len(features[0]["input_ids"])
...         flattened_features = [
...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
...         ]
...         flattened_features = sum(flattened_features, [])

...         batch = self.tokenizer.pad(
...             flattened_features,
...             padding=self.padding,
...             max_length=self.max_length,
...             pad_to_multiple_of=self.pad_to_multiple_of,
...             return_tensors="pt",
...         )

...         batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
...         batch["labels"] = torch.tensor(labels, dtype=torch.int64)
...         return batch
```
The TensorFlow version of the same collator:
```py
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import tensorflow as tf
>>> @dataclass
... class DataCollatorForMultipleChoice:
...     """
...     Data collator that will dynamically pad the inputs for multiple choice received.
...     """

...     tokenizer: PreTrainedTokenizerBase
...     padding: Union[bool, str, PaddingStrategy] = True
...     max_length: Optional[int] = None
...     pad_to_multiple_of: Optional[int] = None

...     def __call__(self, features):
...         label_name = "label" if "label" in features[0].keys() else "labels"
...         labels = [feature.pop(label_name) for feature in features]
...         batch_size = len(features)
...         num_choices = len(features[0]["input_ids"])
...         flattened_features = [
...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
...         ]
...         flattened_features = sum(flattened_features, [])

...         batch = self.tokenizer.pad(
...             flattened_features,
...             padding=self.padding,
...             max_length=self.max_length,
...             pad_to_multiple_of=self.pad_to_multiple_of,
...             return_tensors="tf",
...         )

...         batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
...         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
...         return batch
```
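The dynamic-padding idea both collators rely on can be sketched without any framework: pad each sequence only to the longest length in its batch. The `pad_batch` helper below is a hypothetical illustration, not part of the collator:

```python
# Pad every sequence in a batch to the batch's longest length (dynamic padding),
# rather than to a global maximum length for the whole dataset.
def pad_batch(sequences, pad_id=0):
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [[101, 7, 8, 102], [101, 5, 102]]
padded = pad_batch(batch)
# padded -> [[101, 7, 8, 102], [101, 5, 102, 0]]
```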
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

```py
>>> import evaluate

>>> accuracy = evaluate.load("accuracy")
```
Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the accuracy:

```py
>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
```
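What `compute_metrics` computes can be checked on toy logits with NumPy alone (the numbers below are made up for illustration):

```python
import numpy as np

# Toy logits for 3 examples over 4 answer choices, plus the true labels.
predictions = np.array(
    [
        [2.0, 0.1, 0.3, 0.1],  # argmax -> 0
        [0.1, 0.2, 3.0, 0.4],  # argmax -> 2
        [0.5, 1.5, 0.2, 0.1],  # argmax -> 1
    ]
)
labels = np.array([0, 2, 3])

# argmax over the choice axis turns logits into predicted choice indices.
pred_ids = np.argmax(predictions, axis=1)

# Accuracy is the fraction of matches; evaluate's "accuracy" metric returns the same value.
acc = float((pred_ids == labels).mean())  # 2 of 3 correct
```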
Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial here!
You're ready to start training your model now! Load BERT with [`AutoModelForMultipleChoice`]:

```py
>>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

>>> model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
```
At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`]. The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.
```py
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_swag_model",
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     load_best_model_at_end=True,
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_swag["train"],
...     eval_dataset=tokenized_swag["validation"],
...     tokenizer=tokenizer,
...     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
```
Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:

```py
>>> trainer.push_to_hub()
```
If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!

To finetune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:

```py
>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_train_epochs = 2
>>> total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs
>>> optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
```
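The `total_train_steps` arithmetic above works out as follows; the split size used here is an assumption for illustration:

```python
# Number of optimizer steps: full batches per epoch, times the number of epochs.
num_examples = 73546  # assumed size of the SWAG "regular" train split
batch_size = 16
num_train_epochs = 2

steps_per_epoch = num_examples // batch_size  # floor division drops the last partial batch
total_train_steps = steps_per_epoch * num_train_epochs
```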
Then load BERT with [`TFAutoModelForMultipleChoice`]:

```py
>>> from transformers import TFAutoModelForMultipleChoice

>>> model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
```
Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPreTrainedModel.prepare_tf_dataset`]:

```py
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_swag["train"],
...     shuffle=True,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_swag["validation"],
...     shuffle=False,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )
```
Configure the model for training with `compile`. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

```py
>>> model.compile(optimizer=optimizer)  # No loss argument!
```
The last two things to set up before you start training are to compute the accuracy from the predictions, and to provide a way to push your model to the Hub. Both are done by using Keras callbacks.

Pass your `compute_metrics` function to [`~transformers.KerasMetricCallback`]:

```py
>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
```
Specify where to push your model and tokenizer in the [`~transformers.PushToHubCallback`]:

```py
>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_model",
...     tokenizer=tokenizer,
... )
```
Then bundle your callbacks together:

```py
>>> callbacks = [metric_callback, push_to_hub_callback]
```

Finally, you're ready to start training your model! Call `fit` with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:

```py
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=callbacks)
```
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to finetune a model for multiple choice, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Great, now that you've finetuned a model, you can use it for inference!

Come up with some text and two candidate answers:
```py
>>> prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
>>> candidate1 = "The law does not apply to croissants and brioche."
>>> candidate2 = "The law applies to baguettes."
```

Tokenize each prompt and candidate answer pair and return PyTorch tensors. You should also create some `labels`:

```py
>>> from transformers import AutoTokenizer
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True)
>>> labels = torch.tensor(0).unsqueeze(0)
```
Pass your inputs and labels to the model and return the `logits`:

```py
>>> from transformers import AutoModelForMultipleChoice

>>> model = AutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
>>> outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels)
>>> logits = outputs.logits
```
Get the class with the highest probability:

```py
>>> predicted_class = logits.argmax().item()
>>> predicted_class
0
```
Tokenize each prompt and candidate answer pair and return TensorFlow tensors:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="tf", padding=True)
```
Pass your inputs to the model and return the `logits`:

```py
>>> from transformers import TFAutoModelForMultipleChoice

>>> model = TFAutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
>>> inputs = {k: tf.expand_dims(v, 0) for k, v in inputs.items()}
>>> outputs = model(inputs)
>>> logits = outputs.logits
```
Get the class with the highest probability:

```py
>>> predicted_class = int(tf.math.argmax(logits, axis=-1)[0])
>>> predicted_class
0
```