GENIE

This repo provides the code and models for Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise.

🚀 Overview

In this paper, we introduce a novel dIffusion language modEl pre-training framework for text GENeration, which we call GENIE. GENIE is a large-scale pre-trained diffusion language model consisting of an encoder and a diffusion-based decoder; it generates text by gradually transforming a random noise sequence into a coherent text sequence.
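
For intuition, the generation process can be sketched as the following reverse-diffusion loop (a simplified sketch with assumed function names, not the repo's actual code, which lives in ./GENIE_main/Genie_Generate.py):

# Minimal sketch of the generation process described above (assumed interface).
import torch

def sketch_generate(encoder, decoder, src_ids, tgt_len, emb_dim, diffusion_steps=2000):
    src_hidden = encoder(src_ids)                # encode the source sequence once
    x_t = torch.randn(tgt_len, emb_dim)          # start from pure Gaussian noise
    for t in reversed(range(diffusion_steps)):   # iteratively denoise
        x0_hat = decoder(x_t, t, src_hidden)     # predict clean embeddings (cf. --predict_xstart)
        x_t = x0_hat                             # posterior re-noising to step t-1 omitted in this sketch
    return x_t                                   # mapped back to tokens via nearest word embeddings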

To pre-train GENIE on a large-scale language corpus, we design a novel pre-training method called continuous paragraph denoise (CPD), which encourages the diffusion decoder to reconstruct a clean text paragraph from a corrupted version while preserving semantic and syntactic coherence.
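
As a rough illustration of CPD, each training pair can be thought of as cutting a contiguous paragraph out of a document as the denoising target and keeping the rest as the encoder's source context. The sketch below is our own simplification, not the repo's pre-processing code; the paragraph length is controlled by a masking ratio (cf. --mask_pro in the pre-training script):

# Simplified sketch of building a CPD training pair (assumed logic, not the repo's code).
import random

def make_cpd_pair(sentences, mask_pro=0.3):
    # Pick a contiguous paragraph covering roughly `mask_pro` of the document.
    n_target = max(1, int(len(sentences) * mask_pro))
    start = random.randint(0, len(sentences) - n_target)
    target = sentences[start:start + n_target]                  # paragraph the decoder must reconstruct
    source = sentences[:start] + sentences[start + n_target:]   # remaining context fed to the encoder
    return " ".join(source), " ".join(target)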

You can find more details in the paper.

⚙️ Experiment Preparation

Dependencies:

  • python>=3.6
  • torch>=1.7.1
  • datasets>=1.12.1
  • transformers>=4.9.2 (Hugging Face)
  • pyrouge==0.1.3

Downstream Task Datasets:

The text generation benchmarks we use are well known and widely used, including XSum, CNN/DailyMail, and GigaWord. You can find more detailed information about the datasets and how to obtain them here.

Model

We have released the checkpoint of GENIE after pre-training on a 160GB corpus (6-layer encoder and 6-layer decoder):

You can also quickly get the GENIE checkpoints fine-tuned on XSum, CNN/DailyMail, and GigaWord here:

  • GENIE XSum [link]
  • GENIE CNN/DailyMail [link]
  • GENIE GigaWord [link]

We will continue to update and optimize this repo in the future.

💡 Pre-training

For pre-training, we use data consisting of 160GB of news, books, stories, and web text. We trained GENIE V1 on 8 × 40GB A100 GPUs for 50 days. If you are interested in our pre-training process, please refer to Genie_Pretrain.py. Here we provide the pre-training running script for reference:

OUT_DIR="/Your/output/path"
DATA_PATH="/Your/pretrain/data/path"

python -u -m torch.distributed.launch --nproc_per_node=8 --master_port=9489 \
./GENIE_main/Genie_Pretrain.py \
--checkpoint_path=$OUT_DIR \
--model_channels 128 --in_channel 128 --out_channel 128 --vocab_size 30522 \
--config_name="bert-base-uncased" --token_emb_type="random" --model_arch="s2s_CAT" \
--diffusion_steps 2000 --predict_xstart --noise_schedule="sqrt" --training_mode="s2s" \
--schedule_sampler="uniform" --pre_max_len 512 --mask_pro 0.3 --seed 2023 \
--data_path=$DATA_PATH \
--batch_size 64 --lr 1e-04 --warmup_steps 300000 --train_type="S2S_Diffusion" \
--eval_interval 2000 --log_interval 2000 --save_interval 20000

Pre-training a diffusion model requires careful hyperparameter tuning and a reasonable training configuration, especially the vocabulary size and the input/output channel dimensions, which are worth continued exploration.
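
For example, --vocab_size has to match the tokenizer implied by --config_name, while --in_channel and --out_channel set the dimensionality of the learned word embeddings. A quick sanity check (our own sketch, not part of the repo):

# Quick sanity check (our sketch): the pre-training vocab size must match the tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # matches --config_name
assert tokenizer.vocab_size == 30522                            # matches --vocab_size above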

⚽ Fine-tuning

In this section, we use the XSum dataset as an example to demonstrate fine-tuning GENIE on downstream tasks. The running script for fine-tuning is as follows:

OUT_DIR="/Your/output/path"
DATA_PATH="/Your/data/path"
DATA_NAME="xsum_data"
PRETRAIN_CKPT_PATH="/Your/pretrain_ckpt/path"


python -u -m torch.distributed.launch --nproc_per_node=4 --master_port=9421 \
./GENIE_main/Genie_Finetune.py \
--checkpoint_path=$OUT_DIR \
--model_channels 128 --in_channel 128 --out_channel 128 --vocab_size 30522 \
--config_name="bert-base-uncased" --token_emb_type="random" --model_arch="s2s_CAT" \
--diffusion_steps 2000 --predict_xstart --noise_schedule="sqrt" --training_mode="s2s" \
--schedule_sampler="loss-second-moment" --tgt_max_len 64 --src_max_len 512 --data_name=$DATA_NAME \
--data_path=$DATA_PATH \
--lr_anneal_steps 120000 --batch_size 64 --lr 5e-05 --warmup_steps 7200 --train_type="S2S_Diffusion" \
--eval_interval 200 --log_interval 200 --save_interval 20000 \
--pretrain_model_path=$PRETRAIN_CKPT_PATH

Important parameter settings:

  • --checkpoint_path: Output location for model checkpoints and log files during fine-tuning.
  • --data_path: Root directory of the downstream task datasets.
  • --data_name: Name of the downstream task dataset. Make sure your data is in the directory formed by "data_path + data_name", and that it contains the files train.src, train.tgt, dev.src, dev.tgt, test.src, and test.tgt.
  • --pretrain_model_path: Path to the pre-trained GENIE checkpoint.

If you need to switch the fine-tuning task, just organize the data into the required standard format (e.g., for CNN/DailyMail or GigaWord) and change DATA_NAME to cnndm_data or gigaword_data.

If you want to train from scratch (i.e., without pre-training), simply remove the --pretrain_model_path parameter.
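
Whatever the task, the directory layout described above must hold. The small helper below (our own sketch, not part of the repo) checks that the expected files are present before launching a run:

# Our own helper (not part of the repo): verify that data_path + data_name
# contains the six files GENIE expects for fine-tuning and generation.
import os

def check_dataset(data_path, data_name):
    data_dir = os.path.join(data_path, data_name)   # e.g. /Your/data/path/xsum_data
    expected = [f"{split}.{ext}" for split in ("train", "dev", "test") for ext in ("src", "tgt")]
    missing = [f for f in expected if not os.path.isfile(os.path.join(data_dir, f))]
    if missing:
        raise FileNotFoundError(f"Missing files in {data_dir}: {missing}")

check_dataset("/Your/data/path", "xsum_data")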

💬 Generate

In this section, we show how to batch-generate text with a trained GENIE model: we sample Gaussian noise and iteratively denoise it with GENIE to recover the text. The running script for generation is as follows:

OUT_DIR="/Your/output/path"
MODEL_DIR="/Your/model/ckpt/path"
DATA_PATH="/Your/data/path"
DATA_NAME="xsum_data"

python -u -m torch.distributed.launch --nproc_per_node=8 --master_port=9498 \
./GENIE_main/Genie_Generate.py \
--generate_path=$OUT_DIR \
--eval_model_path=$MODEL_DIR \
--data_path=$DATA_PATH \
--model_channels 128 --in_channel 128 --out_channel 128 --vocab_size 30522 \
--config_name="bert-base-uncased" --token_emb_type="random" \
--diffusion_steps 2000 --predict_xstart --noise_schedule="sqrt" \
--num_samples 5 --model_arch="s2s_CAT" --data_name=$DATA_NAME \
--training_mode="s2s" --tgt_max_len 64 --src_max_len 512 --batch_size=200 \
--interval_step 1 --seed 2023

Important parameter settings:

  • --generate_path: Output location of the generated text.
  • --eval_model_path: Path to the trained model checkpoint used for generation.
  • --data_path: Root directory of the downstream task datasets.
  • --data_name: Name of the downstream task dataset. Make sure your data is in the directory formed by "data_path + data_name", and that it contains the files train.src, train.tgt, dev.src, dev.tgt, test.src, and test.tgt.
  • --num_samples: Number of Gaussian noise samples drawn per input, i.e., the number of texts generated per source example.
  • --interval_step: Interval of denoising steps; defaults to 1.

You can adjust --batch_size and the number of parallel GPUs (--nproc_per_node) based on the capabilities of your device. The resulting text files are named rank[gpu_id]_gen_seed_[seed]_num[num_samples]_epoch[sample epoch].txt.

Finally, we need to merge the generated text by running the following script:

OUT_DIR="/Your/output/path"
DATA_PATH="/Your/data/path"
DATA_NAME="xsum_data"

python ./GENIE_main/integration/eval_split.py \
--generate_path=$OUT_DIR \
--data_path=$DATA_PATH \
--num_samples 5 --data_name=$DATA_NAME --n_gpu 8 --seed 2023

Note that the parameter settings above must be consistent with those used during generation; the best results will be saved under --generate_path. If you want to reproduce the results in the paper, please use the official GLGE evaluation method here.

📜 Citation

Please cite our paper if you use GENIE in your work:

@article{lin2022genie,
  title   = {Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise},
  author  = {Lin, Zhenghao and Gong, Yeyun and Shen, Yelong and Wu, Tong and Fan, Zhihao and Lin, Chen and Duan, Nan and Chen, Weizhu},
  journal = {arXiv},
  year    = {2022}
}