Commit

env trainer
Signed-off-by: zhaohu xing <[email protected]>
920232796 committed Jul 20, 2022
1 parent fc6c32e commit 9558a47
Showing 7 changed files with 1,411 additions and 172 deletions.
71 changes: 70 additions & 1 deletion doc_zh/TUTORIAL_4_TRAINER.md
@@ -13,7 +13,7 @@
- [deepspeed](#deepspeed)
- [pytorchDDP](#pytorchddp)
- [deepspeed + megatron-lm](#deepspeed--megatron-lm)

- [EnvTrainer](#envtrainer)

The Trainer class provides APIs for training with multiple parallel frameworks. The API supports distributed training with Pytorch DDP/Deepspeed on multiple GPUs, hybrid parallel distributed training with Megatron-LM+Deepspeed, and mixed precision via NVIDIA Apex.
## Getting Started
@@ -335,3 +335,72 @@ trainer = MyTrainer(
)
```

# EnvTrainer

To make passing parameters easier, we provide EnvTrainer as a replacement for the original Trainer.
For example:
```python
# train.py
import torch
from flagai.env_args import EnvArgs
from flagai.env_trainer import EnvTrainer

lr = 2e-5
n_epochs = 50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env_args = EnvArgs(
env_type="pytorch",
experiment_name="vit-cifar100-single_gpu",
batch_size=150,
num_gpus=1,
gradient_accumulation_steps=1,
lr=lr,
weight_decay=1e-5,
epochs=n_epochs,
log_interval=100,
eval_interval=1000,
load_dir=None,
pytorch_device=device,
save_dir="checkpoints_vit_cifar100_single_gpu",
save_interval=1000,
num_checkpoints=1,
)

env_args.add_arg(arg_name="test1", default=0, type=int)
env_args_parse = env_args.parse_args()
trainer = EnvTrainer(env_args)
```
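
Once the trainer is constructed, training is started with `trainer.train`. A minimal sketch, assuming a model and datasets have already been prepared (as in the linked examples at the end of this section):
```python
# minimal sketch: model, train_dataset, val_dataset and my_collate_fn are
# assumed to have been built beforehand (see the linked examples below)
trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=val_dataset,
              collate_fn=my_collate_fn)
```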

When running the train.py file, you can override the input parameters from the command line.
```commandline
python train.py --batch_size=8 --epochs=10
```
If you need to add extra parameters, you can call this function:
```python
env_args.add_arg(arg_name="test1", default=0, type=int)
```
Then you can run train.py with the following command:
```commandline
python train.py --test1=1
```
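
Inside train.py, the registered argument can then be read back from the parsed object. A minimal sketch, assuming `parse_args` returns the populated argument object as in the example above:
```python
# hypothetical read-back of the custom argument (attribute name assumed)
env_args_parse = env_args.parse_args()
print(env_args_parse.test1)  # prints 1 when launched with --test1=1
```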
For more examples, see:

1. [vit-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/vit_cifar100/train_env_trainer.py)

2. [glm-title-generation-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/glm_title_generation/train_env_trainer.py)


# Run with the pytorchDDP launcher or deepspeed launcher
If you use multiple GPUs to train a model, you can run train.py directly; it will invoke the launcher built into the FlagAI Trainer.
```commandline
python train.py
```
Alternatively, you can run with the pytorchDDP or deepspeed launcher, for example (the `--not_call_launch` flag tells the trainer that processes were started by an external launcher, so it must not invoke its own):
### pytorchDDP
```commandline
python -m torch.distributed.launch --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 17750 train_env_trainer.py --not_call_launch
```
### deepspeed
```commandline
python -m deepspeed.launcher.launch --master_addr=172.31.125.121 --master_port=17500 train.py --not_call_launch
```
76 changes: 76 additions & 0 deletions docs/TUTORIAL_4_TRAINER.md
@@ -13,6 +13,9 @@
- [deepspeed](#deepspeed)
- [pytorchDDP](#pytorchddp)
- [deepspeed + megatron-lm](#deepspeed--megatron-lm)
- [EnvTrainer](#envtrainer)


The Trainer class provides APIs for training with multiple parallel frameworks. The API supports distributed training with Pytorch DDP/Deepspeed on multiple GPUs, as well as mixed parallel distributed training with Megatron-LM+Deepspeed, and mixed precision via NVIDIA Apex.

## Getting Started
@@ -341,3 +344,76 @@ trainer = MyTrainer(
model_parallel_size = 2
)
```

# EnvTrainer

To make passing parameters easier, we provide EnvTrainer as a replacement for the original Trainer.

For example:
```python
# train.py
import torch
from flagai.env_args import EnvArgs
from flagai.env_trainer import EnvTrainer

lr = 2e-5
n_epochs = 50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env_args = EnvArgs(
env_type="pytorch",
experiment_name="vit-cifar100-single_gpu",
batch_size=150,
num_gpus=1,
gradient_accumulation_steps=1,
lr=lr,
weight_decay=1e-5,
epochs=n_epochs,
log_interval=100,
eval_interval=1000,
load_dir=None,
pytorch_device=device,
save_dir="checkpoints_vit_cifar100_single_gpu",
save_interval=1000,
num_checkpoints=1,
)

env_args.add_arg(arg_name="test1", default=0, type=int)
env_args_parse = env_args.parse_args()
trainer = EnvTrainer(env_args)
```
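
After constructing the trainer, training is launched with `trainer.train`. A minimal sketch, assuming a model and datasets have already been prepared (as in the linked examples at the end of this section):
```python
# minimal sketch: model, train_dataset, val_dataset and my_collate_fn are
# assumed to have been built beforehand (see the linked examples below)
trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=val_dataset,
              collate_fn=my_collate_fn)
```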

When you run the train.py file, you can override the input parameters from the command line.
```commandline
python train.py --batch_size=8 --epochs=10
```
If you need to add extra parameters, you can call this function:
```python
env_args.add_arg(arg_name="test1", default=0, type=int)
```
Then you can run train.py with the following command:
```commandline
python train.py --test1=1
```
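
Inside train.py, the new argument can then be read back from the parsed object. A minimal sketch, assuming `parse_args` returns the populated argument object as in the example above:
```python
# hypothetical read-back of the custom argument (attribute name assumed)
env_args_parse = env_args.parse_args()
print(env_args_parse.test1)  # prints 1 when launched with --test1=1
```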

For more examples, see:

1. [vit-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/vit_cifar100/train_env_trainer.py)

2. [glm-title-generation-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/glm_title_generation/train_env_trainer.py)


# Run with the pytorchDDP launcher or deepspeed launcher
If you use multiple GPUs to train a model, you can run train.py directly; it will invoke the launcher built into the FlagAI Trainer.
```commandline
python train.py
```
In addition, you can also run with the pytorchDDP or deepspeed launcher, for example (the `--not_call_launch` flag tells the trainer that processes were started by an external launcher, so it must not invoke its own):

### pytorchDDP
```commandline
python -m torch.distributed.launch --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 17750 train_env_trainer.py --not_call_launch
```
### deepspeed
```commandline
python -m deepspeed.launcher.launch --master_addr=172.31.125.121 --master_port=17500 train.py --not_call_launch
```
144 changes: 144 additions & 0 deletions examples/glm_title_generation/train_env_trainer.py
@@ -0,0 +1,144 @@
# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import os
import numpy as np
import torch
from torch.utils.data import Dataset
from flagai.auto_model.auto_loader import AutoLoader
from flagai.env_trainer import EnvTrainer
from flagai.env_args import EnvArgs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# You can input all parameters by the command line.
# For example: python train_env_trainer.py --epochs=300 --batch_size=4 --env_type=pytorch
env_args = EnvArgs()
trainer = EnvTrainer(env_args)

cur_dir = os.path.dirname(os.path.abspath(__file__))
src_dir = cur_dir + '/data/train.src'
tgt_dir = cur_dir + '/data/train.tgt'

maxlen = 256
auto_loader = AutoLoader("lm",
model_name="GLM-large-ch",
model_dir="./state_dict/")
model = auto_loader.get_model()
tokenizer = auto_loader.get_tokenizer()

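# Read the parallel source/target files: one training example per line.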
def read_file():
src = []
tgt = []

with open(src_dir, 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in lines:
src.append(line.strip('\n').lower())

with open(tgt_dir, 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in lines:
tgt.append(line.strip('\n').lower())

return src, tgt


class GLMSeq2seqDataset(Dataset):

def __init__(self,
sents_src,
sents_tgt,
tokenizer,
max_src_length=300,
max_tgt_length=200):
super(GLMSeq2seqDataset, self).__init__()
self.sents_src = sents_src
self.sents_tgt = sents_tgt
self.tokenizer = tokenizer
self.max_src_length = max_src_length
self.max_tgt_length = max_tgt_length
self.no_block_position = False

def __getitem__(self, i):
source_text = self.sents_src[i]
target_text = self.sents_tgt[i]
data = self.tokenizer.encode_plus(source_text, target_text)

return data

def __len__(self):

return len(self.sents_src)


class GLMPoetryDynamicCollateFN():  # pads every sequence in a batch to the batch's max length

def __init__(self, pad_id):
self.pad_id = pad_id

def pad_token(self, tokens, max_length):
pad_len = max_length - len(tokens)
tokens += [self.pad_id] * pad_len
return tokens

def pad_position_ids(self, position_ids, max_length):
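        # GLM uses 2-D position ids: row 0 holds token positions,
        # row 1 holds block positions; both rows are padded to max_length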
pad_len = max_length - len(position_ids[0])
position_ids[0] += [len(position_ids[0]) + x for x in range(pad_len)]
position_ids[1] += [1] * pad_len
return position_ids

def pad_loss_mask(self, loss_mask, max_length):
pad_len = max_length - len(loss_mask)
loss_mask += [0] * pad_len
return loss_mask

def __call__(self, batch):
input_ids = [data["input_ids"] for data in batch]
target_ids = [data["target_ids"] for data in batch]
position_ids = [data["position_ids"] for data in batch]
attention_mask = [data['attention_mask'] for data in batch]
loss_mask = [data['loss_mask'] for data in batch]

max_length = max([len(t) for t in input_ids])
for i in range(len(input_ids)):
input_ids[i] = self.pad_token(input_ids[i], max_length)
target_ids[i] = self.pad_token(target_ids[i], max_length)
position_ids[i] = self.pad_position_ids(position_ids[i],
max_length)
loss_mask[i] = self.pad_loss_mask(loss_mask[i], max_length)
return {
'input_ids': torch.LongTensor(input_ids),
'labels': torch.LongTensor(target_ids),
'position_ids': torch.LongTensor(position_ids),
'attention_mask': torch.LongTensor(attention_mask),
'loss_mask': torch.LongTensor(loss_mask)
}


sents_src, sents_tgt = read_file()
my_collate_fn = GLMPoetryDynamicCollateFN(
pad_id=tokenizer.get_command('pad').Id)

data_len = len(sents_tgt)
train_size = int(data_len * 0.8)
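# keep at most 2,000 of the training pairs so the example runs quickly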
train_src = sents_src[:train_size][:2000]
train_tgt = sents_tgt[:train_size][:2000]

val_src = sents_src[train_size:]
val_tgt = sents_tgt[train_size:]

train_dataset = GLMSeq2seqDataset(train_src,
train_tgt,
tokenizer=tokenizer,
max_src_length=300,
max_tgt_length=200)
val_dataset = GLMSeq2seqDataset(val_src,
val_tgt,
tokenizer=tokenizer,
max_src_length=300,
max_tgt_length=200)

trainer.train(model,
train_dataset=train_dataset,
valid_dataset=val_dataset,
collate_fn=my_collate_fn)