Fix issue#85 (#86)

Merged 124 commits on Sep 20, 2022

Commits (124)
83fceae
saved workd
Anhforth Jun 29, 2022
f113f7c
saved workd
Anhforth Jun 29, 2022
708ce15
saved work on 6.29
Anhforth Jun 30, 2022
0d1b079
transformed tokenizer: progressing
Anhforth Jul 1, 2022
2763b5d
Opt 30b (#16)
920232796 Jul 1, 2022
3e52907
fix bert tokenizer issue (#18)
Anhforth Jul 1, 2022
deb2612
reconstruct the tokenizer structure
ZhaodongYan1 Jul 3, 2022
c2c6e9d
tested the new tokenizer
Anhforth Jul 4, 2022
fc2b5d8
removed some redundant codes and added sp model
Anhforth Jul 4, 2022
7da1757
updated the tokenizer
ZhaodongYan1 Jul 4, 2022
7c8c0b1
saved work
Anhforth Jul 5, 2022
3a0c8cb
Opt 66b (#19)
920232796 Jul 6, 2022
265d35a
saved work on 7.6
Anhforth Jul 6, 2022
4f8d715
updated release version
Anhforth Jul 6, 2022
efc1310
fix tokenizer issue
Anhforth Jul 6, 2022
59531e7
temp save
Anhforth Jul 6, 2022
3b6c16a
tokenizer test passed
Anhforth Jul 6, 2022
a7ff8f3
fixed some errors
Anhforth Jul 7, 2022
f4ff1a8
test of tokenizer transform
Anhforth Jul 7, 2022
811d9e9
fixed conflicts
Anhforth Jul 7, 2022
1406d89
fixed error
Anhforth Jul 7, 2022
b30eefa
add encode_plus
Anhforth Jul 8, 2022
9b81869
fix bug multi_gpu_training
920232796 Jul 8, 2022
7ad38a0
Merge pull request #21 from baai-open-internal/fix_multi_gpu_training
Anhforth Jul 8, 2022
72ffd6a
changed the version
Anhforth Jul 8, 2022
e6f89a6
fix_validation_bug (#24)
920232796 Jul 11, 2022
29ea850
updated the version
Anhforth Jul 11, 2022
4c68936
updated
Anhforth Jul 15, 2022
4834f23
modified encoder_plus
Anhforth Jul 15, 2022
8d44329
add vit and examples
920232796 Jul 15, 2022
81c438d
vit and examples
920232796 Jul 15, 2022
da24628
Update base_model.py
marscrazy Jul 15, 2022
aff728b
Update vit.py
marscrazy Jul 15, 2022
e5a0ddb
modify readme.md
920232796 Jul 15, 2022
fe56b8b
modify readme.md
920232796 Jul 15, 2022
fc6c32e
delete annotating code
920232796 Jul 15, 2022
cd45e5c
Vit xzh (#25)
920232796 Jul 15, 2022
5448084
updated
Anhforth Jul 17, 2022
eb555fc
updated
Anhforth Jul 17, 2022
9649aa4
performing tests on examples
Anhforth Jul 17, 2022
67c1288
finished example testing
Anhforth Jul 18, 2022
faee281
Merge branch 'develop' into vit_xzh
BAAI-OpenPlatform Jul 19, 2022
06f0b69
Merge pull request #28 from baai-open-internal/vit_xzh
BAAI-OpenPlatform Jul 19, 2022
deaa120
Merge pull request #27 from baai-open-internal/develop
marscrazy Jul 20, 2022
9558a47
env trainer
920232796 Jul 20, 2022
c35d4b6
Merge pull request #29 from baai-open-internal/env_args
marscrazy Jul 20, 2022
437caa4
vit-checkpoint-activations
920232796 Jul 21, 2022
dc6fc3d
vit-checkpoint-activations
920232796 Jul 21, 2022
c1cec9f
Merge pull request #33 from baai-open-internal/vit-checkpointing-acti…
marscrazy Jul 21, 2022
d74cf92
update
jongjyh Jul 25, 2022
044bc80
Merge pull request #34 from baai-open-internal/fix_eval_loss
marscrazy Jul 25, 2022
d85f8af
merged the master
Anhforth Jul 26, 2022
1b5ecc6
inference and train
wchh-2000 Jul 29, 2022
1fe6d3e
fix bug bert model
xuanricheng Aug 5, 2022
0c243d6
add autoloader and example training data
wchh-2000 Aug 15, 2022
2c28a7d
updated seq2seq
shunxing1234 Aug 16, 2022
e03247e
update
wchh-2000 Aug 16, 2022
4a4b003
Merge pull request #52 from baai-open-internal/add_clip
marscrazy Aug 17, 2022
ce5fd31
Merge branch 'master' into transform_tokenizer
Anhforth Aug 18, 2022
8353cd3
Update train.py
marscrazy Aug 18, 2022
5d5e135
Delete tst_superglue.py
marscrazy Aug 18, 2022
4c6ba56
updated according to comments
BAAI-OpenPlatform Aug 19, 2022
6076287
Merge pull request #50 from baai-open-internal/bert_model
BAAI-OpenPlatform Aug 19, 2022
c11e232
merged the clip tokenizer
BAAI-OpenPlatform Aug 22, 2022
6e135ef
merged clip tokenizer
BAAI-OpenPlatform Aug 23, 2022
fd06e4d
Update inference_clip.py
marscrazy Aug 25, 2022
b61b708
Update auto_loader.py
marscrazy Aug 25, 2022
25b659b
Update glm_10b_en_tokenizer.py
marscrazy Aug 25, 2022
8cffa38
Merge pull request #20 from baai-open-internal/transform_tokenizer
marscrazy Aug 25, 2022
9117f78
swinv1v2
920232796 Aug 25, 2022
f3186d9
Merge pull request #58 from baai-open-internal/swinv1v2_checkpoint_ac…
marscrazy Aug 25, 2022
4bd211d
updated the version
Anhforth Aug 25, 2022
6ef4190
updated the requirement packages list
Anhforth Aug 25, 2022
036e337
fixed some issues
BAAI-OpenPlatform Aug 26, 2022
edfd518
fixed some issues
BAAI-OpenPlatform Aug 26, 2022
497d709
tried to fix the data directory not found error
BAAI-OpenPlatform Aug 26, 2022
1ac43c0
fixed issues in running glm_seq2seq
BAAI-OpenPlatform Aug 26, 2022
351fba7
Update test_glm_seq2seq.py
marscrazy Aug 26, 2022
35b5d9a
Merge pull request #59 from baai-open-internal/fix_issues
marscrazy Aug 26, 2022
b5a14ed
fix glm tokenizer bug
920232796 Aug 29, 2022
9f786e0
fix a glm tokenizer bug
920232796 Aug 29, 2022
18c95e2
Update tokenizer.py
marscrazy Aug 29, 2022
56c081f
Merge branch 'master' into fix_glm_tokenizer
marscrazy Aug 29, 2022
c3c3569
Merge pull request #60 from baai-open-internal/fix_glm_tokenizer
marscrazy Aug 29, 2022
4ebd057
add news section in Readme
marscrazy Aug 31, 2022
90f410c
add news section in readme
marscrazy Aug 31, 2022
82c0faa
Merge pull request #61 from baai-open-internal/add_news_section
BAAI-OpenPlatform Aug 31, 2022
95d24a7
updated docs for tokenizer
Anhforth Aug 31, 2022
8be6804
Merge pull request #62 from baai-open-internal/update_docs
BAAI-OpenPlatform Sep 1, 2022
0b3340d
update required packages
Anhforth Sep 1, 2022
63ef585
Merge pull request #63 from baai-open-internal/add_section
BAAI-OpenPlatform Sep 1, 2022
40f0ccb
fix_issue_85
920232796 Sep 6, 2022
4bfadc5
fix_issue_85
920232796 Sep 6, 2022
53ee658
Merge pull request #64 from baai-open-internal/fix_issue_85
Anhforth Sep 6, 2022
f10049b
update version
Anhforth Sep 6, 2022
9293b88
fixed tokenizer
Anhforth Sep 7, 2022
0ec4280
Merge pull request #65 from baai-open-internal/fix_issue
BAAI-OpenPlatform Sep 7, 2022
6f53d35
Update README.md
marscrazy Sep 8, 2022
44f29e1
Update README.md
marscrazy Sep 8, 2022
c6b39e6
Update TUTORIAL_4_TRAINER.md
marscrazy Sep 8, 2022
4fe55d8
Update auto_loader.py
marscrazy Sep 8, 2022
c253242
Update bert_model.py
marscrazy Sep 8, 2022
53fc0d5
Update bert_model.py
marscrazy Sep 8, 2022
0ae5e6b
Update setup.py
marscrazy Sep 8, 2022
13491af
enable offline loading
Anhforth Sep 9, 2022
e0b9f90
Merge branch 'fix_issue' of github.com:FlagAI-Open/FlagAI into fix_issue
Anhforth Sep 9, 2022
f19aee8
fixed error in test_files and add robert ch tokenizer
Anhforth Sep 9, 2022
e34028e
updated docs
Anhforth Sep 9, 2022
4f56005
test
Anhforth Sep 9, 2022
cdb7219
fix error in action
Anhforth Sep 12, 2022
c9c5ad0
test runner
Anhforth Sep 12, 2022
b720a13
recovered
Anhforth Sep 12, 2022
2794029
enabled offline loading
Anhforth Sep 13, 2022
90e57aa
solved vit issue
Anhforth Sep 13, 2022
a6839e7
modified according to pr comments
Anhforth Sep 16, 2022
4c7736c
solved Filenotfounderror
Anhforth Sep 19, 2022
8b20fbf
recovered the previous model loading
Anhforth Sep 19, 2022
ce03958
updated the checkpoints
Anhforth Sep 19, 2022
860a3ac
removed loading checkpoints
Anhforth Sep 19, 2022
d1a342e
recovered unzipping checkpoints
Anhforth Sep 19, 2022
ebf792a
test_runner
Anhforth Sep 19, 2022
d68f222
recovered new checkpoints
Anhforth Sep 19, 2022
8b06a94
added comments
Anhforth Sep 20, 2022
1783a5d
Update train_deepspeed.py
marscrazy Sep 20, 2022

Files changed
7 changes: 7 additions & 0 deletions README.md
@@ -20,6 +20,13 @@ FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensibl

The code is partially based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers), [timm](https://github.com/rwightman/pytorch-image-models) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).

## News
- [29 Aug 2022] release v1.3.0, Added CLIP module and redesigned tokenizer apis in [#81](https://github.com/FlagAI-Open/FlagAI/pull/81)
- [21 Jul 2022] release v1.2.0, ViTs are supported in [#71](https://github.com/FlagAI-Open/FlagAI/pull/71)
- [29 Jun 2022] release v1.1.0, support OPTs downloading and inference/finetuning [#63](https://github.com/FlagAI-Open/FlagAI/pull/63)
- [17 May 2022] made our first contribution in [#1](https://github.com/FlagAI-Open/FlagAI/pull/1)

--------------------------------------------------------------------------------

<!-- toc -->

7 changes: 7 additions & 0 deletions README_zh.md
@@ -20,6 +20,13 @@

本项目的部分代码基于 [GLM](https://github.com/THUDM/GLM),[Transformers](https://github.com/huggingface/transformers),[timm](https://github.com/rwightman/pytorch-image-models) 和 [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).

## 动态
- [29 Aug 2022] release v1.3.0, Added CLIP module and redesigned tokenizer apis in [#81](https://github.com/FlagAI-Open/FlagAI/pull/81)
- [21 Jul 2022] release v1.2.0, ViTs are supported in [#71](https://github.com/FlagAI-Open/FlagAI/pull/71)
- [29 Jun 2022] release v1.1.0, support OPTs downloading and inference/finetuning [#63](https://github.com/FlagAI-Open/FlagAI/pull/63)
- [17 May 2022] made our first contribution in [#1](https://github.com/FlagAI-Open/FlagAI/pull/1)

--------------------------------------------------------------------------------
<!-- toc -->

- [安装](#安装)
33 changes: 13 additions & 20 deletions doc_zh/TUTORIAL_11_GLM_BLANK_FILLING_QA.md
@@ -37,17 +37,16 @@ GLM 对下游任务进行微调,并将它们重新定义为空白填充生成
```python
import torch
from flagai.model.glm_model import GLMModel
from flagai.data.tokenizer import GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.model.predictor.predictor import Predictor
if __name__ == "__main__":
"""Main training program."""
print('Generate Samples')
tokenizer = GLMLargeChTokenizer(vocab_path='./checkpoints/glm-large-ch/cog-pretrain.model',
add_block_symbols=True,
add_task_mask=True,
add_decoder_mask=False,
fix_command_token=False)
model = GLMModel.from_pretrain(model_name='glm-large-ch', only_download_config=False)
model_name = 'GLM-large-ch'
model = GLMModel.from_pretrain(model_name=model_name,
download_path="./state_dict/")
tokenizer = Tokenizer.from_pretrained(model_name)
tokenizer = Tokenizer.from_pretrained(model_name, only_download_config=False)
model.cuda(torch.cuda.current_device())
predictor = Predictor(model, tokenizer)
# question-answering
@@ -60,17 +59,14 @@
```python
import torch
from flagai.model.glm_model import GLMModel
from flagai.data.tokenizer import GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.model.predictor.predictor import Predictor
if __name__ == "__main__":
"""Main training program."""
print('Generate Samples')
tokenizer = GLMLargeChTokenizer(vocab_path='./checkpoints/glm-large-ch/cog-pretrain.model',
add_block_symbols=True,
add_task_mask=True,
add_decoder_mask=False,
fix_command_token=False)
model = GLMModel.from_pretrain(model_name='glm-large-ch', only_download_config=False)
model_name = 'GLM-large-ch'
tokenizer = Tokenizer.from_pretrained(model_name)
model = GLMModel.from_pretrain(model_name=model_name, only_download_config=False)
model.cuda(torch.cuda.current_device())
predictor = Predictor(model, tokenizer)
# question-answering
@@ -88,12 +84,9 @@ from flagai.model.predictor.predictor import Predictor
if __name__ == "__main__":
"""Main training program."""
print('Generate Samples')
tokenizer = GLMLargeChTokenizer(vocab_path='./checkpoints/glm-large-ch/cog-pretrain.model',
add_block_symbols=True,
add_task_mask=True,
add_decoder_mask=False,
fix_command_token=False)
model = GLMModel.from_pretrain(model_name='glm-large-ch', only_download_config=False)
model_name = 'GLM-large-ch'
tokenizer = Tokenizer.from_pretrained(model_name)
model = GLMModel.from_pretrain(model_name=model_name, only_download_config=False)
model.cuda(torch.cuda.current_device())
predictor = Predictor(model, tokenizer)
# question-answering
46 changes: 5 additions & 41 deletions doc_zh/TUTORIAL_1_TOKENIZER.md
@@ -9,25 +9,15 @@

值得注意的是,不同的分词器可以有不同的文本分割方式,并且有不同的词表文件, 相关算法的介绍可以在 [这里](tokenization.md) 查看。

目前我们支持下列七个分词器:

| 分词器 | 语言 |
|------------------------------|-----|
| GLMLargeEnWordPieceTokenizer | 英文 |
| GLMLargeChTokenizer | 中文 |
| GLM10bENBPETokenizer | 英文 |
| T5BPETokenizer | 中文 |
| ROBERTATokenizer | 中文 |
| BertWordPieceTokenizer | 中文 |
| CPMTokenizer | 中文 |


## 加载分词器
```python
from flagai.data.tokenizer import GLMLargeEnWordPieceTokenizer
tokenizer = GLMLargeEnWordPieceTokenizer()
from flagai.data.tokenizer import Tokenizer
model_name = "GLM-large-ch"
tokenizer = Tokenizer.from_pretrained(model_name)
```
在这一步里,模型仓库中的词表文件将被自动下载到`cache_dir`参数中指定的路径。默认设置为分词器文件下的 ./vocab 目录。
在这一步里,模型仓库中的词表文件将被自动下载到`cache_dir`参数中指定的路径。默认设置为 `./checkpoints/{model_name}` 目录。


## 应用分词器
让我们使用一个分词器将原始文本编码成数字序列,然后将数字序列恢复成原始文本:
@@ -54,29 +44,3 @@ class T5BPETokenizer(Tokenizer):
cache_dir=cache_dir)
self.text_tokenizer.max_len = int(1e12)
```

### 3. 自定义分词器的接口
如果Hugging Face里的分词器不能满足您的需求,那么需要先准备好一份词表,然后手动实现下列函数的功能:

```python
def EncodeAsIds(self, text: str, process_fn=None):
"""输入文本 => 一个token序号列表"""

def EncodeAsTokens(self, text: str, process_fn=None):
"""输入文本 => 一个token列表"""

def IdToToken(self, Id: int):
"""Token序号 => token"""

def TokenToId(self, token: str):
"""Token => token序号"""
return self.text_tokenizer._convert_token_to_id(token)

def DecodeIds(self, Ids: list[int]):
"""一个token序号列表 => 对应的文本"""
return self.DecodeTokens([self.IdToToken(id) for id in Ids])

def DecodeTokens(self, tokens: list[str]):
"""一个token列表 => 对应的文本"""
return self.text_tokenizer.convert_tokens_to_string(tokens)
```
22 changes: 6 additions & 16 deletions doc_zh/TUTORIAL_2_DATASET.md
@@ -33,7 +33,7 @@
### 分类任务应用代码
```python
import torch
from flagai.data.tokenizer import GLMLargeEnWordPieceTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.data.dataset import SuperGlueDataset
from flagai.test_utils import CollateArguments
from flagai.data.dataset import ConstructSuperglueStrategy
@@ -42,7 +42,7 @@ from flagai.data.dataset import ConstructSuperglueStrategy
cl_args = CollateArguments()

# 创建分词器
tokenizer = GLMLargeEnWordPieceTokenizer()
tokenizer = Tokenizer.from_pretrained("GLM-large-en")

# 初步读取并处理数据集
dataset = SuperGlueDataset(task_name='cb',
@@ -368,22 +368,15 @@ class ExamplePVP(PVP):
```
### 预训练的任务处理实例代码
```python
from flagai.data.tokenizer import GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.data.dataset import BlockDataset
from flagai.data.dataset.block.data_utils import split_ds, get_dataset_lazy, add_args
from flagai.test_utils import PretrainDatasetArguments

tokenizer = GLMLargeChTokenizer(add_block_symbols=True,
add_task_mask=True,
add_decoder_mask=False,
fix_command_token=True)

ds_args = PretrainDatasetArguments()

tokenizer = GLMLargeChTokenizer(fix_command_token=True,
add_block_symbols=True,
add_task_mask=True,
add_decoder_mask=False)
tokenizer = Tokenizer.from_pretrained("GLM-large-ch")

ds_args = add_args(ds_args, tokenizer)

@@ -432,18 +425,15 @@ datasets = create_dataset(tokenizer, should_split=True)
```python
import torch
from flagai.data.dataset import Seq2SeqDataset
from flagai.data.tokenizer import GLMLargeEnWordPieceTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.test_utils import CollateArguments
from flagai.data.dataset import ConstructSeq2seqStrategy

# 得到默认参数
cl_args = Seq2SeqCollateArguments()

# 创建分词器
tokenizer = GLMLargeChTokenizer(add_block_symbols=True,
add_task_mask=False,
add_decoder_mask=False,
fix_command_token=False)
tokenizer = Tokenizer.from_pretrained("GLM-large-ch")

# 初步读取并处理数据集
dataset = Seq2SeqDataset(task_name='cmrc',
```
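
Taken together, the hunks above replace every task-specific tokenizer with the unified `Tokenizer.from_pretrained` call. A minimal sketch of the classification data pipeline they point at follows; the `data_dir`/`dataset_type` arguments and the exact `ConstructSuperglueStrategy` signature are collapsed out of this diff, so treat those names as assumptions to verify against the final TUTORIAL_2_DATASET.md.

```python
import torch
from flagai.data.tokenizer import Tokenizer
from flagai.data.dataset import SuperGlueDataset, ConstructSuperglueStrategy
from flagai.test_utils import CollateArguments

# Default collate arguments, as in the classification example above
cl_args = CollateArguments()

# Unified tokenizer loading (replaces GLMLargeEnWordPieceTokenizer)
tokenizer = Tokenizer.from_pretrained("GLM-large-en")

# Read and preprocess the CB task; data_dir and dataset_type are assumed parameter names
dataset = SuperGlueDataset(task_name='cb',
                           data_dir='./datasets/',
                           dataset_type='train',
                           tokenizer=tokenizer)

# Collate strategy that turns raw examples into cloze-style model inputs
collate_fn = ConstructSuperglueStrategy(cl_args, tokenizer, task_name='cb')

loader = torch.utils.data.DataLoader(dataset,
                                     batch_size=4,
                                     collate_fn=collate_fn)
```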
10 changes: 6 additions & 4 deletions doc_zh/TUTORIAL_4_TRAINER.md
@@ -13,7 +13,7 @@
- [deepspeed](#deepspeed)
- [pytorchDDP](#pytorchddp)
- [deepspeed + megatron-lm](#deepspeed--megatron-lm)
- [EnvTrainer](#EnvTrainer)
- [EnvTrainer](#envtrainer)

Trainer 类提供了API用于多种并行框架的训练。API 支持在多个 GPU上使用Pytorch DDP/Deepspeed进行分布式训练,同时支持Megatron-LM+Deepspeed的混合并行分布式训练,同时也通过 NVIDIA Apex 实现混合精度。
## 入门
@@ -335,6 +335,7 @@ trainer = MyTrainer(
)
```


# EnvTrainer

为了更容易的输入参数,我们提供了EnvTrainer代替原来的Trainer
@@ -385,9 +386,10 @@ python train.py --test1=1
```
更多的例子可以查看 :

1. [vit-env-trainer](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/vit_cifar100/train_env_trainer.py)
1. [vit-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/vit_cifar100/train_env_trainer.py)

2. [glm-title-generation-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/glm_title_generation/train_env_trainer.py)

2. [glm-title-generation-env-trainer](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/glm_title_generation/train_env_trainer.py)

# 使用 pytorchDDP launcher 或 deepspeed launcher 运行
如果你使用多个GPU来训练模型,你可以直接运行train.py来调用FlagAI训练器中的启动器。
@@ -402,4 +404,4 @@ python -m torch.distributed.launch --nproc_per_node 2 --nnodes 1 --node_rank 0 -
### deepspeed
```commandline
python -m deepspeed.launcher.launch --master_addr=172.31.125.121 --master_port=17500 train.py --not_call_launch
```
```
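
The EnvTrainer additions above only show launch commands, so a rough sketch of a matching `train.py` may help; the `flagai.env_args.EnvArgs` / `flagai.env_trainer.EnvTrainer` paths and every argument name below come from the linked env-trainer examples rather than this diff, and should be treated as assumptions.

```python
# Hypothetical train.py for the launcher commands above; module paths and
# argument names are assumptions based on the linked env-trainer examples.
from flagai.env_args import EnvArgs
from flagai.env_trainer import EnvTrainer

env_args = EnvArgs(
    env_type="pytorch",       # assumed to also accept pytorchDDP / deepspeed
    experiment_name="env_trainer_demo",
    batch_size=8,
    lr=1e-4,
    epochs=1,
)
env_args = env_args.parse_args()  # assumed hook that also picks up extra flags like --test1=1
trainer = EnvTrainer(env_args)

# Typical call once a model/dataset/collate_fn are built (see the examples above):
# trainer.train(model, train_dataset=train_dataset, collate_fn=collate_fn)
```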
35 changes: 14 additions & 21 deletions docs/TUTORIAL_11_GLM_BLANK_FILLING_QA.md
@@ -47,17 +47,16 @@ filling task
```python
import torch
from flagai.model.glm_model import GLMModel
from flagai.data.tokenizer import GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.model.predictor.predictor import Predictor
if __name__ == "__main__":
"""Main training program."""
print('Generate Samples')
tokenizer = GLMLargeChTokenizer(vocab_path='./checkpoints/glm-large-ch/cog-pretrain.model',
add_block_symbols=True,
add_task_mask=True,
add_decoder_mask=False,
fix_command_token=False)
model = GLMModel.from_pretrain(model_name='glm-large-ch', only_download_config=False)
model_name = 'GLM-large-ch'
model = GLMModel.from_pretrain(model_name=model_name,
download_path="./state_dict/")
tokenizer = Tokenizer.from_pretrained(model_name)
tokenizer = Tokenizer.from_pretrained("GLM-large-ch", only_download_config=False)
model.cuda(torch.cuda.current_device())
predictor = Predictor(model, tokenizer)
# question-answering
@@ -71,17 +70,14 @@ Similar to BERT, GLM can predict masked tokens as
```python
import torch
from flagai.model.glm_model import GLMModel
from flagai.data.tokenizer import GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.model.predictor.predictor import Predictor
if __name__ == "__main__":
"""Main training program."""
print('Generate Samples')
tokenizer = GLMLargeChTokenizer(vocab_path='./checkpoints/glm-large-ch/cog-pretrain.model',
add_block_symbols=True,
add_task_mask=True,
add_decoder_mask=False,
fix_command_token=False)
model = GLMModel.from_pretrain(model_name='glm-large-ch', only_download_config=False)
model_name = 'GLM-large-ch'
tokenizer = Tokenizer.from_pretrained(model_name)
model = GLMModel.from_pretrain(model_name=model_name, only_download_config=False)
model.cuda(torch.cuda.current_device())
predictor = Predictor(model, tokenizer)
# question-answering
@@ -94,17 +90,14 @@ and predict masked sentences as
```python
import torch
from flagai.model.glm_model import GLMModel
from flagai.data.tokenizer import GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.model.predictor.predictor import Predictor
if __name__ == "__main__":
"""Main training program."""
print('Generate Samples')
tokenizer = GLMLargeChTokenizer(vocab_path='./checkpoints/glm-large-ch/cog-pretrain.model',
add_block_symbols=True,
add_task_mask=True,
add_decoder_mask=False,
fix_command_token=False)
model = GLMModel.from_pretrain(model_name='glm-large-ch', only_download_config=False)
model_name = 'GLM-large-ch'
tokenizer = Tokenizer.from_pretrained(model_name)
model = GLMModel.from_pretrain(model_name=model_name, only_download_config=False)
model.cuda(torch.cuda.current_device())
predictor = Predictor(model, tokenizer)
# question-answering
```
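
All three snippets in this tutorial now share one loading pattern. Collected into a single runnable sketch it looks roughly like the following; the prediction call itself is collapsed in the diff, so `predict_generate_randomsample` and its arguments are assumptions borrowed from FlagAI's generation examples.

```python
import torch
from flagai.model.glm_model import GLMModel
from flagai.data.tokenizer import Tokenizer
from flagai.model.predictor.predictor import Predictor

if __name__ == "__main__":
    model_name = 'GLM-large-ch'
    # Checkpoints are fetched into download_path on first use and reused afterwards
    model = GLMModel.from_pretrain(model_name=model_name,
                                   download_path="./state_dict/")
    tokenizer = Tokenizer.from_pretrained(model_name)
    model.cuda(torch.cuda.current_device())
    predictor = Predictor(model, tokenizer)

    # Blank-filling question answering; the method name and sampling arguments are
    # assumed from FlagAI's generation examples and may differ in your version.
    text = "问题:啤酒伤胃吗?回答:[gMASK]"
    print(predictor.predict_generate_randomsample(text, out_max_length=100))
```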
46 changes: 4 additions & 42 deletions docs/TUTORIAL_1_TOKENIZER.md
@@ -17,29 +17,16 @@ and have different vocabulary files.

[//]: # (An introduction to those algorithms can be viewed [here]&#40;tokenization.md&#41;.)

Our projects currently support six tokenizers
as listed below:

| Tokenizer | Language |
|------------------------------|----------|
| GLMLargeEnWordPieceTokenizer | English |
| GLMLargeChTokenizer | Chinese |
| GLM10bENBPETokenizer | English |
| T5BPETokenizer | Chinese |
| ROBERTATokenizer | Chinese |
| BertWordPieceTokenizer | Chinese |
| CPMTokenizer | Chinese |




## Loading a tokenizer
```python
from flagai.data.tokenizer import GLMLargeEnWordPieceTokenizer

tokenizer = GLMLargeEnWordPieceTokenizer() # Load tokenizer
from flagai.data.tokenizer import Tokenizer
model_name = "GLM-large-en"
tokenizer = Tokenizer.from_pretrained(model_name) # Load tokenizer
```
At this step, the vocab files from Modelhub will be automatically downloaded to the path specified in `cache_dir` parameter. It is set to `./vocab` directory under the tokenizer file in default.
At this step, the vocab files from Modelhub will be automatically downloaded to the path specified in `cache_dir` parameter. It is set to `./checkpoints/{model_name}` directory in default.

## Applying a tokenizer
The tokenizer can be used to encode text to a list of token IDs, as well as decoding the token IDs to the original text.
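
For example, a round trip through the tokenizer might look like the sketch below. The `EncodeAsIds`/`DecodeIds` names come from the interface documented later in this file; the redesigned API may also expose `encode_plus` (mentioned in this PR's commit log), so check the exact method names against the installed version.

```python
from flagai.data.tokenizer import Tokenizer

tokenizer = Tokenizer.from_pretrained("GLM-large-en")

text = "Jack is walking a dog."
ids = tokenizer.EncodeAsIds(text)     # text -> list of token ids
recovered = tokenizer.DecodeIds(ids)  # list of token ids -> text
print(ids)
print(recovered)
```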
@@ -68,28 +55,3 @@ class T5BPETokenizer(Tokenizer):
self.text_tokenizer.max_len = int(1e12)
```

### 3. Define Tokenizer APIs (without huggingface)
If huggingface tokenizers are not used, you need to implement the following class functions by your own.

```python
def EncodeAsIds(self, text: str, process_fn=None):
"""Input text string => a list of token ids"""

def EncodeAsTokens(self, text: str, process_fn=None):
"""Input text string => a list of tokens"""

def IdToToken(self, Id: int):
"""Token id => token"""

def TokenToId(self, token: str):
"""Token => token id"""
return self.text_tokenizer._convert_token_to_id(token)

def DecodeIds(self, Ids: list[int]):
"""A list of token ids => recovered text string"""
return self.DecodeTokens([self.IdToToken(id) for id in Ids])

def DecodeTokens(self, tokens: list[str]):
"""A list of tokens => recovered text string"""
return self.text_tokenizer.convert_tokens_to_string(tokens)
```
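
To illustrate the contract above, a toy tokenizer satisfying it could look like the sketch below; the class and its whitespace vocabulary are hypothetical and only demonstrate how the methods fit together.

```python
class WhitespaceTokenizer:
    """Toy tokenizer implementing the documented interface over a fixed vocab."""

    def __init__(self, vocab):
        self.vocab = vocab                                  # token -> id
        self.inv_vocab = {i: t for t, i in vocab.items()}   # id -> token

    def EncodeAsTokens(self, text, process_fn=None):
        """Input text string => a list of tokens"""
        return text.split()

    def EncodeAsIds(self, text, process_fn=None):
        """Input text string => a list of token ids"""
        return [self.TokenToId(t) for t in self.EncodeAsTokens(text, process_fn)]

    def IdToToken(self, Id):
        """Token id => token"""
        return self.inv_vocab[Id]

    def TokenToId(self, token):
        """Token => token id"""
        return self.vocab[token]

    def DecodeIds(self, Ids):
        """A list of token ids => recovered text string"""
        return self.DecodeTokens([self.IdToToken(i) for i in Ids])

    def DecodeTokens(self, tokens):
        """A list of tokens => recovered text string"""
        return " ".join(tokens)


# Usage with a small hypothetical vocabulary
vocab = {"Jack": 0, "is": 1, "walking": 2, "a": 3, "dog.": 4}
tok = WhitespaceTokenizer(vocab)
assert tok.DecodeIds(tok.EncodeAsIds("Jack is walking a dog.")) == "Jack is walking a dog."
```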