A summarization dataset consisting of over 17k open access business journal articles.

amanpreet692/Open4Business

Open4Business (O4B)

Code for the paper Open4Business(O4B): An Open Access Dataset for Summarizing Business Documents, accepted at the Workshop for Dataset Curation and Security at NeurIPS 2020.

A summarization dataset consisting of over 17k GOLD Open Access business journal articles.

The current version of the dataset can be downloaded from: O4B Download.

Steps to use the dataset:

  1. Download the zip from the URL given above and extract it.
  2. The extracted directory will contain 7 files: 1 source file and 1 target file for each of the three splits (train, dev and test), plus refs.bib. For instance, the training set files are named train.source and train.target. The additional file refs.bib contains the BibTeX references for the articles used to create O4B.
  3. In both the source and target files, each line represents 1 record.
  4. These files can be used for training new summarization models directly!
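The file layout described in the steps above can be read with a few lines of Python. This is a minimal sketch, not code from this repository: the function names `read_lines` and `load_split` are hypothetical, and it assumes the files are UTF-8 encoded with one record per line, so that line i of the source file pairs with line i of the target file.

```python
from pathlib import Path
from typing import List, Tuple


def read_lines(path: Path) -> List[str]:
    """Return the lines of a text file without their trailing newline."""
    # Assumption: the files are UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_split(data_dir: str, split: str) -> List[Tuple[str, str]]:
    """Pair each line of <split>.source with the same line of <split>.target."""
    base = Path(data_dir)
    sources = read_lines(base / f"{split}.source")
    targets = read_lines(base / f"{split}.target")
    # Each line is one record, so both files must have the same length.
    if len(sources) != len(targets):
        raise ValueError(
            f"{split}: {len(sources)} source lines but {len(targets)} target lines"
        )
    return list(zip(sources, targets))
```

Calling `load_split(data_dir, "train")`, where `data_dir` is the directory the zip was extracted to, returns a list of (source, target) pairs for the training split. The length check is there to catch a truncated or misaligned download early.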

The benchmarking experiments used the following resources:

  1. Models from Hugging Face: T5-base and DistilBART
  2. To benchmark the above models, follow these steps. Please refer to the Hugging Face documentation for any issues with fine-tuning the models.
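Some fine-tuning scripts expect one JSON object per line rather than parallel .source and .target files. The sketch below shows one way to convert a split into that form. It is an illustration under assumptions, not the procedure used for the benchmarks: the function name `write_jsonl` and the field names `text` and `summary` are hypothetical and should be changed to whatever your training script expects.

```python
import json
from pathlib import Path


def write_jsonl(data_dir: str, split: str, out_path: str,
                text_key: str = "text", summary_key: str = "summary") -> int:
    """Write <split>.source/<split>.target as JSON Lines; return the record count."""
    base = Path(data_dir)
    count = 0
    with open(base / f"{split}.source", encoding="utf-8") as src, \
         open(base / f"{split}.target", encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        # zip stops at the shorter file, so check the line counts
        # beforehand if the two files might be misaligned.
        for source_line, target_line in zip(src, tgt):
            record = {
                text_key: source_line.rstrip("\n"),
                summary_key: target_line.rstrip("\n"),
            }
            # ensure_ascii=False keeps non-ASCII characters readable in the output.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The files are streamed line by line, so the conversion does not need to hold a whole split in memory.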

For code reuse, refer to this.

If you use the dataset or code, please cite it as follows:

@misc{singh2020open4businesso4b,
      title={Open4Business(O4B): An Open Access Dataset for Summarizing Business Documents}, 
      author={Amanpreet Singh and Niranjan Balasubramanian},
      year={2020},
      eprint={2011.07636},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}
