Trainer

The Trainer class provides APIs for training with multiple parallel frameworks. It supports distributed training with PyTorch DDP or Deepspeed on multiple GPUs, mixed parallel distributed training with Megatron-LM + Deepspeed, and mixed-precision training via NVIDIA Apex.

Getting Started

Trainer includes basic training loops that support the features above. Using a Trainer takes two steps: initialization and execution. Refer to the code in the directory examples/glm_superglue.

env_type in Trainer

There is a parameter env_type in Trainer that controls whether the training is distributed or not.

The environment type for training. Defaults to 'pytorch'.
env_type: `pytorch`, `pytorchDDP`, `deepspeed`, `deepspeed+mpu`
            pytorch: single node cpu/gpu
            pytorchDDP: single-/multi- node gpu <data parallel>
            deepspeed: single-/multi- node gpu <data/pipeline parallel>
            deepspeed+mpu: single-/multi- node gpu <data parallel + model parallel>

Customize a Trainer

When using a custom model whose inputs and outputs are inconsistent with the behavior of models in the FlagAI framework (refer to the introduction of the model forward function), a custom Trainer is required for training. To quickly support a custom model, you can inherit from Trainer and override the forward_step method. Note: forward_step must return a dict.

from flagai.trainer import Trainer

class MyTrainer(Trainer):

    def forward_step(self, data, model, mems):
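        # data: one batch dict as produced by the dataloader / collate_fn
        # model: the model being trained; mems: optional cached states (unused here)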

        model_outputs = model(**data)
        output = {}
        output['loss'] = model_outputs.loss
        output['logits'] = model_outputs.logits
        output['hidden_states'] = model_outputs.decoder_hidden_states
        return output

Single node cpu/gpu

trainer = MyTrainer(
    env_type='pytorch',
    epochs=1,
    batch_size=4,
    eval_interval=100000,
    log_interval=10,
    experiment_name='t5-11b',
    pytorch_device='cpu',
    load_dir=None,
    lr=1e-4)

Specify the device through the pytorch_device argument, e.g. 'cpu', 'cuda:0', etc.

fp16

When the model parameters are large and GPU memory is tight, memory usage can be reduced by converting the fp32 parameters to fp16. FlagAI enables this conversion by changing a single parameter.

Convert the model parameters to fp16:
trainer = MyTrainer(
    env_type='pytorch',
    epochs=1,
    batch_size=1,
    eval_interval=10,
    log_interval=10,
    experiment_name='t5-11b',
    pytorch_device='cuda:0',
    load_dir=None,
    lr=1e-4,
    fp16=True) # change to `True`

Gradient recomputation

Gradient recomputation (activation checkpointing) does not save intermediate results in the forward stage; instead, they are recomputed during the backward pass. Paper: Training Deep Nets with Sublinear Memory Cost
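
The mechanism can be illustrated in plain PyTorch, independently of FlagAI. This is a minimal sketch using torch.utils.checkpoint; the Block and CheckpointedNet modules are made up purely for illustration.

# A minimal sketch of gradient recomputation (activation checkpointing):
# the checkpointed segment does not keep its intermediate activations in memory;
# they are recomputed during the backward pass.
import torch
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):

    def __init__(self, dim=512):
        super().__init__()
        self.ff = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(),
                                      torch.nn.Linear(dim, dim))

    def forward(self, x):
        return self.ff(x)


class CheckpointedNet(torch.nn.Module):

    def __init__(self, num_blocks=4, dim=512):
        super().__init__()
        self.blocks = torch.nn.ModuleList([Block(dim) for _ in range(num_blocks)])

    def forward(self, x):
        for block in self.blocks:
            x = checkpoint(block, x)  # forward of `block` is re-run during backward
        return x


x = torch.randn(2, 512, requires_grad=True)
loss = CheckpointedNet()(x).sum()
loss.backward()  # each block recomputes its forward pass here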

Below, we give two examples of enabling gradient recomputation for models at the 10-billion-parameter scale: one loads a 10-billion-parameter model from FlagAI, and the other loads the huggingface t5-11b model.

GLM-10b-ch

# Download the model from the model hub and activate gradient recomputation
from flagai.model.glm_model import GLMModel
model = GLMModel.from_pretrain(download_path="./state_dict", model_name="GLM-large-ch", checkpoint_activations=True)

huggingface t5-11b

t5-11b paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

from transformers import T5ForConditionalGeneration, T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-11b')
model = T5ForConditionalGeneration.from_pretrained('t5-11b')
model.gradient_checkpointing = True
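# Note: recent transformers versions also provide model.gradient_checkpointing_enable() for this.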

To demonstrate the scalability of FlagAI's Trainer, let's take training the t5-11b model as an example. Note: the weights of a model at this scale exceed 20 GB, so a single GPU at least as powerful as a V100 is required.

Training the huggingface t5-11b model with Trainer

FlagAI example: examples/t5_huggingface

# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
from flagai.trainer import Trainer
from transformers import T5ForConditionalGeneration, T5Tokenizer
from torch.utils.data import Dataset
import torch


class MyTrainer(Trainer):

    def forward_step(self, data, model, mems):

        model_outputs = model(**data)
        output = {}
        output['loss'] = model_outputs.loss
        output['logits'] = model_outputs.logits
        output['hidden_states'] = model_outputs.decoder_hidden_states
        return output


trainer = MyTrainer(
    env_type='deepspeed',
    epochs=1,
    batch_size=1,
    eval_interval=10,
    log_interval=10,
    experiment_name='t5-11b',
    load_dir=None,
    lr=1e-4,
    # parameters for pytorchDDP
    master_ip='127.0.0.1',
    master_port=17750,
    num_nodes=1,
    num_gpus=1,
    training_script=__file__,
    # deepspeed
    deepspeed_config='deepspeed.json'
)

model_name = 't5-11b'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.gradient_checkpointing = True

print("loading model & tokenizer is done!")
src_dir = './data/train.src'
tgt_dir = './data/train.tgt'
maxlen = 1024


def read_file():
    src = []
    tgt = []

    with open(src_dir, 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            src.append(line.strip('\n').lower())

    with open(tgt_dir, 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            tgt.append(line.strip('\n').lower())
    return src, tgt


class T5Seq2seqDataset(Dataset):

    def __init__(self, sents_src, sents_tgt, tokenizer, maxlen=512):
        super(T5Seq2seqDataset, self).__init__()
        self.sents_src = sents_src
        self.sents_tgt = sents_tgt
        self.tokenizer = tokenizer
        self.maxlen = maxlen

    def __getitem__(self, i):
        src = self.sents_src[i]
        tgt = self.sents_tgt[i]
        # truncate to maxlen so long pairs do not exceed the model limit
        inputs = self.tokenizer(src, truncation=True, max_length=self.maxlen)
        with self.tokenizer.as_target_tokenizer():
            labels = self.tokenizer(tgt, truncation=True, max_length=self.maxlen)
        output = {}
        output['input_ids'] = inputs.input_ids
        output['labels'] = labels.input_ids
        return output

    def __len__(self):
        return len(self.sents_src)


def seq2seq_collate_fn(batch):

    def padding(indice, max_length, pad_idx=0):
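        # right-pad each sequence to max_length with pad_idx (0 is the T5 pad token id)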

        pad_indice = [
            item + [pad_idx] * max(0, max_length - len(item))
            for item in indice
        ]
        return torch.tensor(pad_indice)

    token_ids = [data["input_ids"] for data in batch]
    max_length_tk = max([len(t) for t in token_ids])
    labels = [data["labels"] for data in batch]
    max_length_lb = max([len(t) for t in labels])

    token_ids_padded = padding(token_ids, max_length_tk)
    labels_padded = padding(labels, max_length_lb)

    data = {"input_ids": token_ids_padded, "labels": labels_padded}

    return data


sents_src, sents_tgt = read_file()
data_len = len(sents_tgt)
train_size = int(data_len * 0.8)
train_src = sents_src[:train_size][:200]
train_tgt = sents_tgt[:train_size][:200]

val_src = sents_src[train_size:]
val_tgt = sents_tgt[train_size:]

train_dataset = T5Seq2seqDataset(train_src,
                                 train_tgt,
                                 tokenizer=tokenizer,
                                 maxlen=maxlen)
val_dataset = T5Seq2seqDataset(val_src,
                               val_tgt,
                               tokenizer=tokenizer,
                               maxlen=maxlen)

trainer.train(model,
              train_dataset=train_dataset,
              collate_fn=seq2seq_collate_fn)

Parallel training

To speed up model training, FlagAI supports three types of parallel training; however, in the t5-11b example, only the deepspeed framework can be used.

deepspeed

Deepspeed provides a CPU-offload optimizer, which can greatly reduce GPU memory usage. The deepspeed.json configuration file is as follows:

{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 100,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": false,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e7,
    "allgather_bucket_size": 5e7,
    "cpu_offload": true 
  },
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0004,
      "weight_decay": 0.01,
      "betas": [
        0.9,
        0.98
      ],
      "eps": 1e-6
    }
  },
  "activation_checkpointing": {
    "partition_activations": false,
    "contiguous_memory_optimization": false
  },
  "wall_clock_breakdown": false
}

deepspeed_config can be found in examples/t5_huggingface/deepspeed.json. stage and cpu_offload are the two key parameters: stage selects the ZeRO optimization stage (stage 3 partitions optimizer states, gradients, and parameters across GPUs), and cpu_offload moves optimizer states to CPU memory.

The hostfile can be ignored in a single-node setting.
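
For multi-node runs, a deepspeed-style hostfile lists one worker per line together with the number of GPU slots it contributes; the hostnames below are placeholders.

# hostfile (hostnames are placeholders)
node1 slots=8
node2 slots=8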

pytorchDDP

DistributedDataParallel (DDP) can be used when the model has fewer than roughly 1 billion parameters, e.g., t5-base. We can activate this framework by setting env_type='pytorchDDP'.

trainer = MyTrainer(
    env_type='pytorchDDP',
    epochs=1,
    batch_size=1,
    eval_interval=10,
    log_interval=10,
    experiment_name='t5-base',
    load_dir=None,
    lr=1e-4,
    # parameters for pytorchDDP
    master_ip='127.0.0.1',
    master_port=17750,
    num_nodes=1,
    num_gpus=1,
    hostfile='./hostfile',
    training_script=__file__,
)

deepspeed + megatron-lm

The 10-billion-parameter model GLM-10b-ch adopts the model-parallel technology of Megatron-LM and the data-parallel technology of deepspeed. When a model has more than 10 billion parameters, it is difficult to fit the model and all the intermediate variables produced during training on a single GPU. To this end, Megatron-LM provides a model-parallel method whose main idea is to split weight matrices by rows or columns across GPUs. FlagAI provides Megatron-LM versions of its models (GLM, T5, BERT [including RoBERTa], and GPT2); as long as env_type is changed to deepspeed+mpu in the configuration, the model-parallel version is activated. Megatron-LM model parallelism is not supported for huggingface models in FlagAI.

trainer = MyTrainer(
    env_type="deepspeed+mpu", # env_type
    epochs=1,
    batch_size=8,
    eval_interval=10,
    log_interval=10,
    experiment_name='GLM-10b-ch',
    load_dir=None,
    lr=1e-4,
    # parallel settings
    master_ip='127.0.0.1',
    master_port=17750,
    num_nodes=1,
    num_gpus=4,
    hostfile='hostfile',
    training_script=__file__,
    # deepspeed
    deepspeed_config='deepspeed.json',
    # megatron-lm
    model_parallel_size=2
)
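
In this configuration, num_gpus=4 with model_parallel_size=2 splits each model replica across 2 GPUs, while deepspeed applies data parallelism across the remaining factor of 2; in general, model_parallel_size should evenly divide the total number of GPUs.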

EnvTrainer

To make it easier to pass in parameters, we provide EnvTrainer as a replacement for the original Trainer.

Take the following code as an example:

# train.py
import torch
from flagai.env_args import EnvArgs
from flagai.env_trainer import EnvTrainer

lr = 2e-5
n_epochs = 50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env_args = EnvArgs(
    env_type="pytorch",
    experiment_name="vit-cifar100-single_gpu",
    batch_size=150,
    num_gpus=1,
    gradient_accumulation_steps=1,
    lr=lr,
    weight_decay=1e-5,
    epochs=n_epochs,
    log_interval=100,
    eval_interval=1000,
    load_dir=None,
    pytorch_device=device,
    save_dir="checkpoints_vit_cifar100_single_gpu",
    save_interval=1000,
    num_checkpoints=1,
)

env_args.add_arg(arg_name="test1", default=0, type=int, )
env_args_parse = env_args.parse_args()
trainer = EnvTrainer(env_args)

When you run the train.py file, you can modify the input parameters from the command line.

python train.py --batch_size=8 --epochs=10

If you need to add additional parameters, you can call the function:

env_args.add_arg(arg_name="test1", default=0, type=int, )

Then you can run the train.py file in the following command:

python train.py --test1=1

More examples:

  1. vit-env-trainer

  2. glm-title-generation-env-trainer

Run with pytorchDDP launcher or deepspeed launcher

If you use multiple GPUs to train models, you can run train.py directly; it will call the launcher built into the FlagAI Trainer.

python train.py

In addition, you can also use the pytorchDDP or deepspeed launcher, for example:

pytorchDDP

python -m torch.distributed.launch --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 17750 train_env_trainer.py --not_call_launch

deepspeed

python -m deepspeed.launcher.launch  --master_addr=172.31.125.121 --master_port=17500 train.py --not_call_launch