Add electra modules (#1173)
* Add electra modules

* Amend module versions of chinese-bert-wwm, chinese-bert-wwm-ext, rbt3 and rbtl3

* Update demo README.md
KPatr1ck committed Jan 6, 2021
1 parent 790e05d commit b68da8e
Showing 27 changed files with 1,275 additions and 1,206 deletions.
13 changes: 9 additions & 4 deletions demo/sequence_labeling/README.md
@@ -60,10 +60,10 @@ ERNIE, Chinese | `hub.Module(name='ernie')`
ERNIE tiny, Chinese | `hub.Module(name='ernie_tiny')`
ERNIE 2.0 Base, English | `hub.Module(name='ernie_v2_eng_base')`
ERNIE 2.0 Large, English | `hub.Module(name='ernie_v2_eng_large')`
BERT-Base, Cased | `hub.Module(name='bert-base-cased')`
BERT-Base, Uncased | `hub.Module(name='bert-base-uncased')`
BERT-Large, Cased | `hub.Module(name='bert-large-cased')`
BERT-Large, Uncased | `hub.Module(name='bert-large-uncased')`
BERT-Base, English Cased | `hub.Module(name='bert-base-cased')`
BERT-Base, English Uncased | `hub.Module(name='bert-base-uncased')`
BERT-Large, English Cased | `hub.Module(name='bert-large-cased')`
BERT-Large, English Uncased | `hub.Module(name='bert-large-uncased')`
BERT-Base, Multilingual Cased | `hub.Module(name='bert-base-multilingual-cased')`
BERT-Base, Multilingual Uncased | `hub.Module(name='bert-base-multilingual-uncased')`
BERT-Base, Chinese | `hub.Module(name='bert-base-chinese')`
@@ -73,6 +73,11 @@ RoBERTa-wwm-ext, Chinese | `hub.Module(name='roberta-wwm-ext')`
RoBERTa-wwm-ext-large, Chinese | `hub.Module(name='roberta-wwm-ext-large')`
RBT3, Chinese | `hub.Module(name='rbt3')`
RBTL3, Chinese | `hub.Module(name='rbtl3')`
ELECTRA-Small, English | `hub.Module(name='electra-small')`
ELECTRA-Base, English | `hub.Module(name='electra-base')`
ELECTRA-Large, English | `hub.Module(name='electra-large')`
ELECTRA-Base, Chinese | `hub.Module(name='chinese-electra-base')`
ELECTRA-Small, Chinese | `hub.Module(name='chinese-electra-small')`

With the single line of code above, `model` is initialized as a model suited to the sequence labeling task: the pre-trained ERNIE Tiny model with a fully connected network appended on top, whose weights are shared across the output tokens.
![](https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=224484727,3049769188&fm=15&gp=0.jpg)
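
For illustration, here is a minimal sketch (not taken from the demo itself) of swapping one of the modules listed above into the same pattern; the label list is hypothetical and should be replaced with your dataset's labels:

```python
import paddlehub as hub

# Hypothetical BIO-style label set; replace it with the labels of your dataset.
label_list = ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'O']
label_map = {idx: label for idx, label in enumerate(label_list)}

# Any module name from the table above should work here, e.g. 'electra-base'.
model = hub.Module(
    name='ernie_tiny',
    task='token-cls',
    label_map=label_map)
```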
13 changes: 9 additions & 4 deletions demo/text_classification/README.md
@@ -49,10 +49,10 @@ ERNIE, Chinese | `hub.Module(name='ernie')`
ERNIE tiny, Chinese | `hub.Module(name='ernie_tiny')`
ERNIE 2.0 Base, English | `hub.Module(name='ernie_v2_eng_base')`
ERNIE 2.0 Large, English | `hub.Module(name='ernie_v2_eng_large')`
BERT-Base, Cased | `hub.Module(name='bert-base-cased')`
BERT-Base, Uncased | `hub.Module(name='bert-base-uncased')`
BERT-Large, Cased | `hub.Module(name='bert-large-cased')`
BERT-Large, Uncased | `hub.Module(name='bert-large-uncased')`
BERT-Base, English Cased | `hub.Module(name='bert-base-cased')`
BERT-Base, English Uncased | `hub.Module(name='bert-base-uncased')`
BERT-Large, English Cased | `hub.Module(name='bert-large-cased')`
BERT-Large, English Uncased | `hub.Module(name='bert-large-uncased')`
BERT-Base, Multilingual Cased | `hub.Module(name='bert-base-multilingual-cased')`
BERT-Base, Multilingual Uncased | `hub.Module(name='bert-base-multilingual-uncased')`
BERT-Base, Chinese | `hub.Module(name='bert-base-chinese')`
@@ -62,6 +62,11 @@ RoBERTa-wwm-ext, Chinese | `hub.Module(name='roberta-wwm-ext')`
RoBERTa-wwm-ext-large, Chinese | `hub.Module(name='roberta-wwm-ext-large')`
RBT3, Chinese | `hub.Module(name='rbt3')`
RBTL3, Chinese | `hub.Module(name='rbtl3')`
ELECTRA-Small, English | `hub.Module(name='electra-small')`
ELECTRA-Base, English | `hub.Module(name='electra-base')`
ELECTRA-Large, English | `hub.Module(name='electra-large')`
ELECTRA-Base, Chinese | `hub.Module(name='chinese-electra-base')`
ELECTRA-Small, Chinese | `hub.Module(name='chinese-electra-small')`

With the single line of code above, `model` is initialized as a model suited to the text classification task: the pre-trained ERNIE Tiny model with a fully connected network appended on top.
![](https://ai-studio-static-online.cdn.bcebos.com/f9e1bf9d56c6412d939960f2e3767c2f13b93eab30554d738b137ab2b98e328c)
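
For illustration, a minimal sketch (not taken from the demo itself) of loading one of the modules listed above for classification; the two-class sentiment labels are assumed as an example:

```python
import paddlehub as hub

# Hypothetical binary sentiment labels; adjust them to your dataset.
label_map = {0: 'negative', 1: 'positive'}

# Any module name from the table above should work here, e.g. 'chinese-electra-base'.
model = hub.Module(
    name='ernie_tiny',
    task='seq-cls',
    num_classes=2,
    label_map=label_map)
```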
6 changes: 3 additions & 3 deletions modules/text/language_model/chinese_bert_wwm/README.md
@@ -1,5 +1,5 @@
```shell
$ hub install chinese-bert-wwm==2.0.1
$ hub install chinese-bert-wwm==2.0.0
```
<p align="center">
<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png" hspace='10'/> <br />
@@ -82,7 +82,7 @@ label_map = {0: 'negative', 1: 'positive'}

model = hub.Module(
    name='chinese-bert-wwm',
    version='2.0.1',
    version='2.0.0',
    task='seq-cls',
    load_checkpoint='/path/to/parameters',
    label_map=label_map)
@@ -153,6 +153,6 @@ paddlehub >= 2.0.0

Initial release

* 2.0.1
* 2.0.0

Fully upgraded to dynamic graph mode, with some API changes. Task names adjusted; added the sequence labeling task `token-cls`.
2 changes: 1 addition & 1 deletion modules/text/language_model/chinese_bert_wwm/module.py
@@ -29,7 +29,7 @@

@moduleinfo(
    name="chinese-bert-wwm",
    version="2.0.1",
    version="2.0.0",
    summary=
    "chinese-bert-wwm, 12-layer, 768-hidden, 12-heads, 110M parameters. The module is executed as paddle.dygraph.",
    author="ymcui",
6 changes: 3 additions & 3 deletions modules/text/language_model/chinese_bert_wwm_ext/README.md
@@ -1,5 +1,5 @@
```shell
$ hub install chinese-bert-wwm-ext==2.0.1
$ hub install chinese-bert-wwm-ext==2.0.0
```
<p align="center">
<img src="https://bj.bcebos.com/paddlehub/paddlehub-img/bert_network.png" hspace='10'/> <br />
@@ -82,7 +82,7 @@ label_map = {0: 'negative', 1: 'positive'}

model = hub.Module(
    name='chinese-bert-wwm-ext',
    version='2.0.1',
    version='2.0.0',
    task='seq-cls',
    load_checkpoint='/path/to/parameters',
    label_map=label_map)
@@ -153,6 +153,6 @@ paddlehub >= 2.0.0

Initial release

* 2.0.1
* 2.0.0

Fully upgraded to dynamic graph mode, with some API changes. Task names adjusted; added the sequence labeling task `token-cls`.
2 changes: 1 addition & 1 deletion modules/text/language_model/chinese_bert_wwm_ext/module.py
@@ -29,7 +29,7 @@

@moduleinfo(
    name="chinese-bert-wwm-ext",
    version="2.0.1",
    version="2.0.0",
    summary=
    "chinese-bert-wwm-ext, 12-layer, 768-hidden, 12-heads, 110M parameters. The module is executed as paddle.dygraph.",
    author="ymcui",
150 changes: 94 additions & 56 deletions modules/text/language_model/chinese_electra_base/README.md
@@ -1,119 +1,157 @@
```shell
$ hub install chinese-electra-base==1.0.0
$ hub install chinese-electra-base==2.0.0
```

<p align="center">
<img src="https://github.com/ymcui/Chinese-ELECTRA/blob/master/pics/model.png" hspace='10'/> <br />
<img src="http://bj.bcebos.com/ibox-thumbnail98/1a5578bfbe1ad629035f7ad1eb3d0bce?authorization=bce-auth-v1%2Ffbe74140929444858491fbf2b6bc0935%2F2020-03-31T06%3A45%3A51Z%2F1800%2F%2F02b8749292f8ba1c606410d0e4e5dbabdf1d367d80da395887775d36424ac13e" hspace='10'/> <br />
</p>

For more details, please refer to the [ELECTRA paper](https://openreview.net/pdf?id=r1xMH1BtvB)

## API
```python
def context(
    trainable=True,
    max_seq_len=128
)
def __init__(
    task=None,
    load_checkpoint=None,
    label_map=None,
    num_classes=2,
    **kwargs,
)
```
Gets the Module's context information: its inputs, outputs, and a copy of the pre-trained Paddle Program.

**Parameters**

> trainable: if set to True, the parameters in the Module are also updated during fine-tuning; otherwise they stay fixed.
> max_seq_len: the model's maximum sequence length. Sequences shorter than **max_seq_len** are padded to that length, and longer sequences are truncated to **max_seq_len**; valid values range from 0 to 512.
Creates the Module object (dynamic graph version).

**Returns**
> inputs: dict with the following fields:
> > **input_ids**: for each token of the tokenized input text, its word id in the vocabulary; shape \[batch_size, max_seq_len\], dtype int64;
> > **position_ids**: the position of each token within its text; shape \[batch_size, max_seq_len\], dtype int64;
> > **segment_ids**: a flag marking which text each token belongs to (text 1 or text 2); shape \[batch_size, max_seq_len\], dtype int64;
> > **input_mask**: a flag marking whether each token is padding; shape \[batch_size, max_seq_len\], dtype int64;
>
> outputs: dict of the Module's output features, with the following fields:
> > **pooled_output**: sentence-level features, usable for tasks such as text classification; shape \[batch_size, 768\];
> > **sequence_output**: token-level features, usable for tasks such as sequence labeling; shape \[batch_size, seq_len, 768\];
>
> program: the Program that contains the Module's computation graph.
**Parameters**

* `task`: task name, either `seq-cls` (text classification; the old name `sequence_classification` will be deprecated) or `token-cls` (sequence labeling; see the construction sketch after this list).
* `load_checkpoint`: path to model parameters saved with the PaddleHub fine-tune API.
* `label_map`: mapping from class ids to labels used at prediction time.
* `num_classes`: number of classes for the classification task; may be omitted when `label_map` is given. Defaults to 2 classes.
* `**kwargs`: additional user-supplied keyword arguments, passed as a dict.
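
As a hedged sketch complementing the classification example later in this README, the constructor might be used for sequence labeling roughly as follows (the label map and checkpoint path are placeholders):

```python
import paddlehub as hub

# Hypothetical label map for a sequence labeling model; replace with your own labels.
label_map = {0: 'B-PER', 1: 'I-PER', 2: 'O'}

model = hub.Module(
    name='chinese-electra-base',
    task='token-cls',
    load_checkpoint='/path/to/parameters',  # parameters saved with the PaddleHub fine-tune API
    label_map=label_map)
```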

```python
def get_embedding(
    texts,
    use_gpu=False,
    batch_size=1
)
def predict(
    data,
    max_seq_len=128,
    batch_size=1,
    use_gpu=False
)
```

Gets sentence-level and token-level features for the input texts.

**Parameters**

> texts: list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\], …\], where each element is one sample and each sample may contain a text\_a and a text\_b.
> use_gpu: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.
* `data`: the data to predict on, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\], …\], where each element is one sample and each sample may contain a text\_a and a text\_b. The number of texts per sample (1 or 2) must match what was used during training.
* `max_seq_len`: the maximum text length the model processes.
* `batch_size`: the model's batch size.
* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.

**Returns**

> results: list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\], …\], where each element is the feature output for the corresponding sample; every sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.
>
* `results`: list; the returned results differ by task type:
  * Text classification: the list contains the predicted label of each sentence, in the format \[label\_1, label\_2, …\].
  * Sequence labeling: the list contains the predicted label of every token in each sentence, in the format \[\[token\_1, token\_2, …\], \[token\_1, token\_2, …\], …\] (see the sketch after this list).
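
For illustration only, a hedged sketch of consuming the sequence labeling output described above; the tokens and predictions are entirely hypothetical:

```python
# Hypothetical tokenized inputs and hypothetical token-cls predictions;
# `results` follows the nested-list format described above.
tokens = [['华', '为', '总', '部'], ['今', '天', '下', '雨']]
results = [['B-ORG', 'I-ORG', 'O', 'O'], ['O', 'O', 'O', 'O']]

for sent_tokens, sent_labels in zip(tokens, results):
    for token, label in zip(sent_tokens, sent_labels):
        print('{}\t{}'.format(token, label))
```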

```python
def get_params_layer()
def get_embedding(
    data,
    use_gpu=False
)
```

Gets per-layer parameter information; used together with ULMFiTStrategy, it allows layer-wise learning rates and gradual unfreezing to be set strictly by layer.
Gets sentence-level and token-level features for the input texts.

**Parameters**

>
* `data`: list of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\], …\], where each element is one sample and each sample may contain a text\_a and a text\_b.
* `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu.

**Returns**

> params_layer: dict whose keys are parameter names and whose values are the layer index each parameter belongs to.
* `results`: list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\], …\], where each element is the feature output for the corresponding sample; every sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature (a usage sketch follows below).
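
A minimal usage sketch, assuming the module is loaded without a task head (task=None, as documented above); the input texts are arbitrary placeholders:

```python
import paddlehub as hub

# Load the module for feature extraction only (no task head).
module = hub.Module(name='chinese-electra-base')

# Each inner list is one sample; a sample may contain one or two texts.
data = [['今天天气真好'], ['天气预报说今天要下雨']]
results = module.get_embedding(data, use_gpu=False)

for pooled_feature, seq_feature in results:
    # pooled_feature: sentence-level feature; seq_feature: token-level features.
    print(len(pooled_feature), len(seq_feature))
```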


**Code example**

```python
import paddlehub as hub

# Load the chinese-electra-base pretrained model
module = hub.Module(name="chinese-electra-base")
inputs, outputs, program = module.context(trainable=True, max_seq_len=128)

# Must feed all the tensors that chinese-electra-base's module needs
input_ids = inputs["input_ids"]
position_ids = inputs["position_ids"]
segment_ids = inputs["segment_ids"]
input_mask = inputs["input_mask"]

# Use "pooled_output" for sentence-level output.
pooled_output = outputs["pooled_output"]

# Use "sequence_output" for token-level output.
sequence_output = outputs["sequence_output"]

# Use "get_embedding" to get the embedding result.
embedding_result = module.get_embedding(texts=[["Sample1_text_a"], ["Sample2_text_a", "Sample2_text_b"]], use_gpu=True)

# Use "get_params_layer" to get the params layer, to be used with ULMFiTStrategy.
params_layer = module.get_params_layer()
strategy = hub.finetune.strategy.ULMFiTStrategy(frz_params_layer=params_layer, dis_params_layer=params_layer)

data = [
    ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'],
    ['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'],
    ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'],
]
label_map = {0: 'negative', 1: 'positive'}

model = hub.Module(
    name='chinese-electra-base',
    version='2.0.0',
    task='seq-cls',
    load_checkpoint='/path/to/parameters',
    label_map=label_map)
results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
```

For details, see the PaddleHub demos:
- [Text classification](https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/text_classification)
- [Sequence labeling](https://github.com/PaddlePaddle/PaddleHub/tree/release/v2.0.0-beta/demo/sequence_labeling)

## Service deployment

PaddleHub Serving can deploy an online service for obtaining pre-trained word embeddings.

### Step 1: Start PaddleHub Serving

Run the start command:

```shell
$ hub serving start -m chinese-electra-base
```

This completes the deployment of a serving API for obtaining pre-trained word embeddings; the default port is 8866.

**NOTE:** If you predict on GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise there is no need to set it.

### Step 2: Send a prediction request

With the server configured, the following few lines of code send a prediction request and retrieve the prediction results.

```python
import requests
import json

# The texts to get embeddings for, in the format [[text_1], [text_2], ...]
text = [["今天是个好日子"], ["天气预报说今天要下雨"]]
# Pass the texts under the parameter name expected by the prediction method, "data" in this example
# (for local deployment this corresponds to module.get_embedding(data=text))
data = {"data": text}
# Send a POST request; the content type must be JSON, and the IP in the URL must be changed to that of the target machine
url = "http://10.12.121.132:8866/predict/chinese-electra-base"
# Set the POST request headers to application/json
headers = {"Content-Type": "application/json"}

r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())
```

## View the code

https://github.com/ymcui/Chinese-ELECTRA



## Dependencies

paddlepaddle >= 1.6.2
paddlepaddle >= 2.0.0

paddlehub >= 1.6.0
paddlehub >= 2.0.0

## Release history

* 1.0.0

Initial release

* 2.0.0

Fully upgraded to dynamic graph mode, with some API changes. Task names adjusted; added the sequence labeling task `token-cls`.