This repo provides the code and models for Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise.
In this paper, we introduce a novel dIffusion language modEl pre-training framework for text GENeration, which we call GENIE. GENIE is a large-scale pretrained diffusion language model that consists of an encoder and a diffusion-based decoder, which can generate text by gradually transforming a random noise sequence into a coherent text sequence.
To pre-train GENIE on a large-scale language corpus, we design a novel pre-training method called continuous paragraph denoise (CPD), which encourages the diffusion-decoder to reconstruct a clean text paragraph from a corrupted version, while preserving the semantic and syntactic coherence.
You can find more details in the paper.
**Dependencies:**
- python>=3.6
- torch>=1.7.1
- datasets>=1.12.1
- transformers>=4.9.2 (Huggingface)
- pyrouge==0.1.3
**Downstream Task Dataset:**
The text generation benchmarks we use are well known and widely used: XSum, CNN/DailyMail, and GigaWord. You can find more details about the datasets and how to obtain them here.
**Model:**
We have released the GENIE checkpoint pre-trained on a 160GB corpus (6-layer encoder and 6-layer decoder):
- GENIE V1 [link]
You can also download the GENIE checkpoints fine-tuned on XSum, CNN/DailyMail, and GigaWord here:
We will continue to update and optimize this repo in the future.
For pre-training, we use 160GB of data consisting of news, books, stories, and web text. We trained GENIE V1 on 8 × 40GB A100 GPUs for 50 days. If you are interested in our pre-training process, please refer to `Genie_Pretrain.py`. The pre-training script is provided here for reference:
OUT_DIR="/Your/output/path"
DATA_PATH="/Your/pretrain/data/path"
python -u -m torch.distributed.launch --nproc_per_node=8 --master_port=9489 \
./GENIE_main/Genie_Pretrain.py \
--checkpoint_path=$OUT_DIR \
--model_channels 128 --in_channel 128 --out_channel 128 --vocab_size 30522 \
--config_name="bert-base-uncased" --token_emb_type="random" --model_arch="s2s_CAT" \
--diffusion_steps 2000 --predict_xstart --noise_schedule="sqrt" --training_mode="s2s" \
--schedule_sampler="uniform" --pre_max_len 512 --mask_pro 0.3 --seed 2023 \
--data_path=$DATA_PATH \
--batch_size 64 --lr 1e-04 --warmup_steps 300000 --train_type="S2S_Diffusion" \
--eval_interval 2000 --log_interval 2000 --save_interval 20000
Pre-training a diffusion model requires careful hyperparameter tuning and a reasonable training configuration, especially for the vocabulary size and the input/output channel dimensions, which are worth further exploration.
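As a rough illustration of the `--noise_schedule="sqrt"` and `--diffusion_steps 2000` settings, the sketch below computes the square-root noise schedule as defined in Diffusion-LM. It assumes GENIE's `sqrt` option follows that definition; the function names and the offset `s=1e-4` come from that assumption, not from this repo.

```python
import math

def sqrt_alpha_bar(t, s=1e-4):
    """Signal level at normalized time t in [0, 1] (Diffusion-LM style, assumed)."""
    return 1.0 - math.sqrt(t + s)

def sqrt_schedule_betas(num_steps, max_beta=0.999):
    """Discretize alpha_bar into per-step betas: beta_i = 1 - abar(t_{i+1}) / abar(t_i)."""
    betas = []
    for i in range(num_steps):
        t1 = i / num_steps
        t2 = (i + 1) / num_steps
        # Clip to max_beta so the last step stays numerically stable.
        betas.append(min(1.0 - sqrt_alpha_bar(t2) / sqrt_alpha_bar(t1), max_beta))
    return betas

betas = sqrt_schedule_betas(2000)  # mirrors --diffusion_steps 2000

# The cumulative product of (1 - beta) gives the fraction of signal left after each step.
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

print(len(betas), betas[0], alphas_cumprod[-1])
```

Under this schedule the noise level rises quickly in the first steps and more slowly afterwards, which is one reason it is often chosen for text embeddings.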
In this section, we use the XSum dataset as an example to demonstrate how to fine-tune GENIE on downstream tasks. The fine-tuning script is as follows:
OUT_DIR="/Your/output/path"
DATA_PATH="/Your/data/path"
DATA_NAME="xsum_data"
PRETRAIN_CKPT_PATH="/Your/pretrain_ckpt/path"
python -u -m torch.distributed.launch --nproc_per_node=4 --master_port=9421 \
./GENIE_main/Genie_Finetune.py \
--checkpoint_path=$OUT_DIR \
--model_channels 128 --in_channel 128 --out_channel 128 --vocab_size 30522 \
--config_name="bert-base-uncased" --token_emb_type="random" --model_arch="s2s_CAT" \
--diffusion_steps 2000 --predict_xstart --noise_schedule="sqrt" --training_mode="s2s" \
--schedule_sampler="loss-second-moment" --tgt_max_len 64 --src_max_len 512 --data_name=$DATA_NAME \
--data_path=$DATA_PATH \
--lr_anneal_steps 120000 --batch_size 64 --lr 5e-05 --warmup_steps 7200 --train_type="S2S_Diffusion" \
--eval_interval 200 --log_interval 200 --save_interval 20000 \
--pretrain_model_path=$PRETRAIN_CKPT_PATH
Important parameter settings:
- `--checkpoint_path`: Location where model checkpoints and log files are written during fine-tuning.
- `--data_path`: Root directory of the downstream task datasets.
- `--data_name`: Name of the downstream task dataset. Make sure your data is in the directory formed by "data_path + data_name". The directory needs to contain the data files `train.src`, `train.tgt`, `dev.src`, `dev.tgt`, `test.src`, and `test.tgt`.
- `--pretrain_model_path`: Path to the pre-trained GENIE checkpoint.
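Before launching a long fine-tuning run, it can help to confirm that the dataset directory is complete. The following is a minimal sketch, not part of the repo: `missing_data_files` is a hypothetical helper, and it assumes that "data_path + data_name" means an ordinary path join.

```python
import os

# File names taken from the parameter description above.
REQUIRED_FILES = ["train.src", "train.tgt", "dev.src", "dev.tgt", "test.src", "test.tgt"]

def missing_data_files(data_path, data_name):
    """Return the required files that are absent from data_path/data_name."""
    data_dir = os.path.join(data_path, data_name)  # assumed layout
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]

if __name__ == "__main__":
    missing = missing_data_files("/Your/data/path", "xsum_data")
    print("missing:", missing if missing else "none")
```

An empty result means all six files are present. The check does not validate the file contents.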
To switch to a different fine-tuning task, such as CNN/DailyMail or GigaWord, organize the data in the format described above and change `DATA_NAME` to `cnndm_data` or `gigaword_data`.
If you need to train from scratch (w/o pre-training), just remove the parameter `--pretrain_model_path`.
In this section, we show how to generate text in batches with a trained GENIE model. We sample Gaussian noise and iteratively denoise it with GENIE to recover the text. The generation script is as follows:
OUT_DIR="/Your/output/path"
MODEL_DIR="/Your/model/ckpt/path"
DATA_PATH="/Your/data/path"
DATA_NAME="xsum_data"
python -u -m torch.distributed.launch --nproc_per_node=8 --master_port=9498 \
./GENIE_main/Genie_Generate.py \
--generate_path=$OUT_DIR \
--eval_model_path=$MODEL_DIR \
--data_path=$DATA_PATH \
--model_channels 128 --in_channel 128 --out_channel 128 --vocab_size 30522 \
--config_name="bert-base-uncased" --token_emb_type="random" \
--diffusion_steps 2000 --predict_xstart --noise_schedule="sqrt" \
--num_samples 5 --model_arch="s2s_CAT" --data_name=$DATA_NAME \
--training_mode="s2s" --tgt_max_len 64 --src_max_len 512 --batch_size=200 \
--interval_step 1 --seed 2023
Important parameter settings:
- `--generate_path`: Output location of the generated text.
- `--eval_model_path`: Path to the trained model checkpoint used for generation.
- `--data_path`: Root directory of the downstream task datasets.
- `--data_name`: Name of the downstream task dataset. Make sure your data is in the directory formed by "data_path + data_name". The directory needs to contain the data files `train.src`, `train.tgt`, `dev.src`, `dev.tgt`, `test.src`, and `test.tgt`.
- `--num_samples`: Number of Gaussian noise samples drawn per input example, i.e., the number of texts generated per example.
- `--interval_step`: Interval between denoising steps; the default is 1.
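The sampling idea behind generation (draw Gaussian noise, then denoise it step by step with a model that predicts the clean signal, cf. `--predict_xstart`) can be sketched on a toy problem. This is a generic ancestral sampler under stated assumptions, not GENIE's implementation: `predict_x0` is a hypothetical stand-in for the encoder-conditioned diffusion decoder, and the constant schedule is chosen only for brevity.

```python
import math
import random

def sample(predict_x0, betas, dim, rng):
    """Ancestral sampling with a model that predicts the clean vector x0."""
    # Cumulative products of alpha_t = 1 - beta_t.
    abar = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        abar.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        x0_hat = predict_x0(x, t)
        abar_prev = abar[t - 1] if t > 0 else 1.0
        # Mean coefficients and variance of the posterior q(x_{t-1} | x_t, x0_hat).
        coef_x0 = betas[t] * math.sqrt(abar_prev) / (1.0 - abar[t])
        coef_xt = (1.0 - abar_prev) * math.sqrt(1.0 - betas[t]) / (1.0 - abar[t])
        var = betas[t] * (1.0 - abar_prev) / (1.0 - abar[t])
        noise_scale = math.sqrt(var) if t > 0 else 0.0  # no noise on the final step
        x = [coef_x0 * h + coef_xt * xi + noise_scale * rng.gauss(0.0, 1.0)
             for h, xi in zip(x0_hat, x)]
    return x

target = [0.5, -1.0, 2.0]  # stands in for a clean text embedding
betas = [0.02] * 50        # toy schedule, far shorter than the 2000 steps used above
result = sample(lambda x, t: target, betas, dim=3, rng=random.Random(2023))
print(result)
```

With an oracle predictor that always returns `target`, the final step reduces to the prediction itself, so the sampler lands on the target. A real model's predictions change from step to step, which is why many steps are used.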
You can adjust `--batch_size` and the number of parallel GPUs (`--nproc_per_node`) to suit your hardware. Each generated text file is named `rank[gpu_id]_gen_seed_[seed]_num[num_samples]_epoch[sample epoch].txt`.
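If you need to inspect the per-GPU output files yourself, the naming template above can be parsed with a regular expression. This is a sketch only: the pattern is inferred from the template, so the exact separators and the all-numeric fields are assumptions, and `parse_generated_filename` is a hypothetical helper rather than part of the repo.

```python
import re

# Pattern inferred from the documented template; the separators are an assumption.
FILENAME_RE = re.compile(r"^rank(\d+)_gen_seed_(\d+)_num(\d+)_epoch(\d+)\.txt$")

def parse_generated_filename(name):
    """Return rank, seed, num_samples, and epoch as a dict, or None if no match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    rank, seed, num_samples, epoch = (int(g) for g in match.groups())
    return {"rank": rank, "seed": seed, "num_samples": num_samples, "epoch": epoch}

print(parse_generated_filename("rank0_gen_seed_2023_num5_epoch1.txt"))
```

Grouping files by `seed` and `num_samples` is one way to confirm that every rank finished before you run the integration step.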
Finally, integrate the generated text by running the following script:
OUT_DIR="/Your/output/path"
DATA_PATH="/Your/data/path"
DATA_NAME="xsum_data"
python ./GENIE_main/integration/eval_split.py \
--generate_path=$OUT_DIR \
--data_path=$DATA_PATH \
--num_samples 5 --data_name=$DATA_NAME --n_gpu 8 --seed 2023
Note that these parameter settings must match those used for generation. The optimal results will be saved in `--generate_path`. If you want to reproduce the results in the paper, please use the official GLGE evaluation method here.
Please cite our paper if you use GENIE in your work:
@article{lin2022genie,
  title = {Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise},
  author = {Zhenghao Lin and Yeyun Gong and Yelong Shen and Tong Wu and Zhihao Fan and Chen Lin and Nan Duan and Weizhu Chen},
  journal = {{arXiv}},
  year = {2022}
}