DAMS

PyTorch implementation of the EMNLP-2021 paper: Low-Resource Dialogue Summarization with Domain-Agnostic Multi-Source Pretraining.

Requirements

  • Python 3.7.10

  • pytorch 1.7.0+cu11.0

  • py-rouge 1.1

  • transformers 4.0.0

  • multiprocess 0.70.11.1

  • tensorboardX 2.1

  • torchtext 0.4.0

  • nltk 3.6.2

Environment

  • RTX 3090 GPU

  • CUDA 11.1
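
Before running anything below, a quick sanity check (not part of the repository) can confirm that the pinned PyTorch and transformers versions are installed and that the GPU is visible to CUDA:

    import torch
    import transformers

    # Versions should roughly match the requirements above (1.7.0 / 4.0.0).
    print("torch:", torch.__version__)
    print("transformers:", transformers.__version__)

    # The training commands below assume at least one visible CUDA device.
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))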

Data

All the datasets used in our work are available at Google Drive or Baidu Pan (extraction code: wwsd), including the multi-source pretraining data and the dialogue summarization data.

Usage

  • Download BERT checkpoints here and put them into the directory bert like this (a short load check follows the layout):

     --- bert
       |
       |--- bert_base_uncased
          |
          |--- config.json
          |
          |--- pytorch_model.bin
          |
          |--- vocab.txt
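
    A minimal load check with the transformers library can confirm that the checkpoint files are in place; this snippet is only an illustration, not part of the training pipeline, and the path follows the layout above:

    from transformers import BertModel, BertTokenizer

    # Fails fast if config.json, pytorch_model.bin, or vocab.txt is missing or misplaced.
    tokenizer = BertTokenizer.from_pretrained("bert/bert_base_uncased")
    model = BertModel.from_pretrained("bert/bert_base_uncased")
    print(model.config.hidden_size)  # 768 for bert-base-uncased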
    
  • Download the JSON files from the above data links and put them into the directory json_data like this (a quick parse check follows the layout):

     --- json_data
       |
       |--- samsum
       |
       |--- adsc
       |
       ...
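
    As a quick check that the downloaded JSON files parse before preprocessing, a hedged sketch (the exact filenames and fields depend on the released data, so nothing beyond standard JSON is assumed):

    import glob
    import json

    # These are the files that src/preprocess.py will read from -raw_path.
    files = sorted(glob.glob("json_data/samsum/*.json"))
    print(len(files), "JSON files found")

    # Parse one file to confirm it is well-formed and to inspect its top-level structure.
    with open(files[0], encoding="utf-8") as f:
        data = json.load(f)
    print(type(data).__name__)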
    
  • Pre-process dialogue summary datasets (e.g., the SAMSum training data).

    PYTHONPATH=. python ./src/preprocess.py -type train -raw_path json_data/samsum -save_path torch_data/samsum -log_file logs/json_to_data_samsum.log -truncated -n_cpus 4
    
  • Pre-process multi-source pretraining datasets and mix them up.

    PYTHONPATH=. python ./src/preprocess.py -raw_path json_data -save_path torch_data/all -log_file logs/json_to_data.log -truncated -n_cpus 40 -mix_up
    
  • Pretrain DAMS on the multi-source datasets.

    PYTHONPATH=. python ./src/main.py -mode train -data_path torch_data/all/data -model_path models/pretrain -log_file logs/pretrain.log -sep_optim -pretrain -visible_gpus 0,1 -pretrain_steps 250000 -port 10000
    
  • Fine-tune DAMS on the SAMSum training set.

    PYTHONPATH=. python ./src/main.py -mode train -data_path torch_data/samsum/samsum -model_path models/samsum -log_file logs/samsum.train.log -visible_gpus 0 -warmup_steps 1000 -lr 0.001 -train_from models/pretrain/model_step_250000.pt -train_from_ignore_optim -train_steps 50000
    
  • Validate DAMS on the SAMSum validation set.

    PYTHONPATH=. python ./src/main.py -mode validate -data_path torch_data/samsum/samsum -log_file logs/samsum.val.log -val_all -alpha 0.95 -model_path models/samsum -result_path results/samsum/samsum -visible_gpus 0 -min_length 15 -beam_size 3 -test_batch_ex_size 50
    
  • Test DAMS.

    Zero-shot test on the SAMSum test set using the pretrained model.

    PYTHONPATH=. python ./src/main.py -mode test -data_path torch_data/samsum/samsum -log_file logs/samsum.test.log -alpha 0.95 -test_from models/pretrain/model_step_250000.pt -result_path results/samsum/samsum -visible_gpus 0 -min_length 15 -beam_size 3 -test_batch_ex_size 50
    

    Regular test on the SAMSum test set using the best validated model.

    PYTHONPATH=. python ./src/main.py -mode test -data_path torch_data/samsum/samsum -log_file logs/samsum.test.log -alpha 0.95 -test_from models/samsum/model_step_xxx.pt -result_path results/samsum/samsum -visible_gpus 0 -min_length 15 -beam_size 3 -test_batch_ex_size 50
    

    Transfer to the ADSC test set.

    PYTHONPATH=. python ./src/main.py -mode test -data_path torch_data/adsc/adsc -log_file logs/adsc.test.log -alpha 0.95 -test_from models/samsum/model_step_xxx.pt -result_path results/adsc/adsc -visible_gpus 0 -min_length 100 -beam_size 3 -test_batch_ex_size 50
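
    The test commands above write generated summaries under results/. If you want to score outputs outside the provided pipeline, here is a hedged sketch with py-rouge 1.1 (the file names candidate.txt and reference.txt are hypothetical placeholders with one summary per line; they are not files the repository is guaranteed to produce):

    import rouge

    # Hypothetical plain-text files with one summary per line.
    with open("results/samsum/candidate.txt", encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    with open("results/samsum/reference.txt", encoding="utf-8") as f:
        refs = [line.strip() for line in f]

    # Average ROUGE-1/2 and ROUGE-L F-scores over the whole set.
    evaluator = rouge.Rouge(metrics=["rouge-n", "rouge-l"], max_n=2,
                            limit_length=False, apply_avg=True)
    scores = evaluator.get_scores(hyps, refs)
    print(scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"])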
    

Citation

@inproceedings{zou-etal-2021-low,
	title = "Low-Resource Dialogue Summarization with Domain-Agnostic Multi-Source Pretraining",
	author = "Zou, Yicheng  and Zhu, Bolin  and Hu, Xingwu  and Gui, Tao  and Zhang, Qi",
	booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
	month = nov,
	year = "2021",
	address = "Online and Punta Cana, Dominican Republic",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2021.emnlp-main.7",
	pages = "80--91"
}
