
Adding Masked Language Modelling #1030

Merged
merged 154 commits into master on Apr 10, 2020

Conversation


pruksmhc (Contributor) commented Mar 8, 2020

Adding Masked Language Modeling Task for RoBERTa and ALBERT
What this version of MLM supports: RoBERTa and ALBERT embedders.
Additionally, we fix the get_pretrained_lm_head function for Transformer-based embedders.
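For context, the sketch below shows what reusing a pretrained masked-LM head looks like when calling the Hugging Face transformers library directly. It illustrates the idea only; it is not jiant's get_pretrained_lm_head implementation, and the model and tokenizer names are just examples.

```python
# Illustration only: reusing a pretrained masked-LM head via Hugging Face
# transformers. This is not jiant's get_pretrained_lm_head code.
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

model = RobertaForMaskedLM.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# The pretrained head maps encoder hidden states to vocabulary logits
# (dense -> layer norm -> decoder tied to the input embeddings).
lm_head = model.lm_head

inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model.roberta(**inputs).last_hidden_state
    logits = lm_head(hidden_states)  # shape: (batch, seq_len, vocab_size)
```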

Performance Tests:
We tested this on CCG and QQP, making sure that both RoBERTa MLM + QQP -> target task and RoBERTa MLM + CCG -> target task produced reasonable numbers (meaning that MLM perplexity decreased and QQP/CCG performance stayed close to RoBERTa without MLM training).

| Task | Task performance (best RoBERTa single finetuned on task) | Task performance (multitask trained with MLM with RoBERTa) | Perplexity after training |
| --- | --- | --- | --- |
| CCG | 0.96 | 0.953 | 2.90563 |
| QQP | 0.898 | 0.853 | 2.45982 |

| Target task | Task performance | CCG w/o MLM performance |
| --- | --- | --- |
| WIC | 0.760 | 0.716 |
| COPA | 0.860 | 0.55 |
| CB | 0.820 | 0.791 |
| RTE | 0.834 | 0.794 |
| CSenseQA | 0.739 | |
| BoolQ | 0.837 | 0.84 |
| MultiRC | 0.655 | 0.42 |
| ReCoRD | 0.82976 | 0.838 |
| Cosmos | 0.81 | 0.774 |

Comment on lines 184 to 194
@register_task("wikipedia_corpus_mlm", rel_path="wikipedia_corpus_small/")
class MaskedLanguageModelingTask(Task):
"""
Masked language modeling task on Wikipedia dataset
Attributes:
max_seq_len: (int) maximum sequence length
min_seq_len: (int) minimum sequence length
files_by_split: (dict) files for three data split (train, val, test)
We are currently using an unpreprocessed version of the Wikipedia corpus
that consists of 5% of the data. Please reach out to jiant admin if you
would like access to this dataset.
Contributor:

Is the task registered as wikipedia_corpus_mlm (using dataset under relative path /wikipedia_corpus_small) an experiment with some research significance that users would want to reproduce (or is it perhaps a toy dataset being used to demo MaskedLanguageModelingTask)?

Member:

It's not toy data. The full Wikipedia is very large and jiant currently does not have the ability to load the entire thing. So I tried to extract a subset with the same size as WikiText-103, which is around 5% of full Wikipedia.

Contributor:

Thanks for clarifying, @phu-pmh. Maybe relevant: there was a recent discussion and change to another LM task to reduce its memory footprint. It may be relatively easy to make a similar change here to allow you to use the full dataset (if memory, not time, is the concern).

But if you want to introduce a modified dataset with this task, please also submit the code you used to construct the dataset. There's an example of a data preprocessing script under jiant/scripts. But all your script would need to do is document/link the data it takes as input, perform your preprocessing, and save the output as it's expected by your task.

Contributor (author):

The preprocessing code we used was a slight modification of what is here (@phu-pmh can speak more to this): https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT/data. So it might not be the best idea to copy-paste it into scripts. Perhaps we could put instructions about how to generate it instead (maybe in the documentation in the Task class?).

Member:

For English Wikipedia, I think the nvidia code is ready to use. I only added modifications so that we can do cross-lingual experiments.

"val": os.path.join(path, "valid.txt"),
"test": os.path.join(path, "test.txt"),
}
self.examples_by_split = {}
Contributor (author):

I'll revert this in the next commit.


pyeres commented Apr 7, 2020

Thanks for your recent changes, @pruksmhc.

It looks like there are only a few open comments remaining:

  1. Providing data/script/instructions for the new task.
  2. Is correct_sent_indexing necessary with the final configuration of the task/data?
  3. I think we left our discussion of the token masking functionality with a plan to extract the masking logic into a single-purpose/documented/testable function/method.
  4. It looks like there are also a few smaller open comment threads (here and here).

Finally, as you suggested, we'll want to re-run your validation experiments after the changes are in.


pruksmhc commented Apr 7, 2020

@pyeres, for 3, I realized that (1) the masking code is a little hard to test because it masks randomly, which makes asserts hard, and (2) the code comes from the well-maintained Transformers library. I therefore decided to hold off on unit tests for that part of the code, but I'll still extract it out into its own function; just a heads up (the masking recipe in question is sketched below).
For 2, yes, it is still necessary.
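For reference, the dynamic masking recipe in the Transformers examples that this code draws on looks roughly like the sketch below. This is a paraphrase for illustration, not the exact function merged in this PR; because the 80/10/10 choices are random, asserting exact outputs is awkward, though seeding torch would make a test deterministic.

```python
import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    """Rough sketch of Transformers-style dynamic masking for a LongTensor of
    token ids with shape (batch, seq_len): select ~15% of tokens, replace 80%
    of those with the mask token, 10% with random tokens, and leave 10%
    unchanged. Loss is computed only on the selected positions."""
    labels = inputs.clone()

    # Sample which positions to mask, never touching special tokens.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special_tokens_mask = [
        tokenizer.get_special_tokens_mask(row, already_has_special_tokens=True)
        for row in labels.tolist()
    ]
    probability_matrix.masked_fill_(
        torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0
    )
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # ignore index: no loss on unmasked tokens

    # 80% of selected positions become the mask token.
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.mask_token_id

    # Half of the remainder (10% overall) become random tokens.
    indices_random = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    )
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # The remaining 10% keep their original token ids.
    return inputs, labels
```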

jiant/models.py (outdated)
```python
            device=inputs.device, dtype=torch.uint8
        )
        tokenizer_name = self.sent_encoder._text_field_embedder.tokenizer_required
        labels, _ = self.sent_encoder._text_field_embedder.correct_sent_indexing(
```
pyeres commented Apr 7, 2020

It looks like the labels will be modified (as a result of correct_sent_indexing()), but the inputs aren't getting the same adjustment here. Is that intentional/correct?

pyeres left a comment

@pruksmhc — thanks for these changes, and thanks especially for the creative testing work.

There are only a few open items to tie up before this is mergeable into master:

  1. A few open comments (most or all are on this most recent round of changes).
  2. Providing data/script/instructions for the new task
  3. Re-running your validation experiments after the final changes are in.

After that I think it's ready for approval.

```python
        files_by_split: (dict) files for three data split (train, val, test)
    We are currently using an unpreprocessed version of the Wikipedia corpus
    that consists of 5% of the data. You can generate the data using code from
    https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT/data.
```
Contributor (author):

@pyeres here are the instructions for data generation (point 2 you raised).

pyeres commented Apr 9, 2020

I think my request got buried a ways back in the thread — resurfacing here:

We need to make this Task reproducible, and to do that I don't know of a way to get around making the task's data dependencies reproducible. This can be done w/ a script or instructions (if the instructions are involved they should be in a script).

A script doesn't need to copy the functionality of other open source scripts involved in generating the data — our script can simply document that the other steps/scripts are used at some step. The goal is that using our script/instructions the user should be able to exactly reproduce the task's data dependencies. (@phu-pmh for visibility).


pruksmhc commented Apr 9, 2020

Here are the current validation checks. We use uniform mixing between MLM and the intermediate task, which means the MLM perplexity here will be higher than what was reported in the description (which did not use uniform mixing). We used uniform mixing so that the runs would finish faster (a sketch of uniform vs. examples-proportional mixing follows the table).

| Performance metric | CCG w/ RoBERTa | CCG w/ ALBERT | QQP w/ RoBERTa | QQP w/ ALBERT |
| --- | --- | --- | --- | --- |
| Perplexity | 4.060865 | 5.1527 | 10.98 | 8.529 |
| Performance of intermediate task | 0.9530 | 0.84 | 0.8227 | 0.76 |
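For readers unfamiliar with the mixing terminology above, here is a hypothetical sketch of how uniform and examples-proportional sampling differ when choosing which task to draw the next training batch from. The names and dataset sizes below are illustrative, not jiant's actual config keys or the real corpus sizes.

```python
import random

# Illustrative task sizes, not the real dataset sizes used in this PR.
task_sizes = {"mlm": 1_000_000, "ccg": 40_000}

def sampling_weights(sizes, method):
    """Return per-task probabilities for picking the next training batch."""
    if method == "uniform":
        # Every task is equally likely, regardless of dataset size.
        return {task: 1.0 / len(sizes) for task in sizes}
    if method == "examples_proportional":
        # Larger datasets are sampled proportionally more often.
        total = sum(sizes.values())
        return {task: n / total for task, n in sizes.items()}
    raise ValueError(f"unknown mixing method: {method}")

weights = sampling_weights(task_sizes, "uniform")
next_task = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
print(weights, next_task)
```

Under uniform mixing the (much larger) MLM dataset contributes a smaller share of training batches than it would under examples-proportional mixing, which is one plausible reading of why the perplexities here are higher than those in the PR description.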

pyeres left a comment

Thanks @pruksmhc and @phu-pmh!

pruksmhc merged commit de3c44a into master Apr 10, 2020
phu-pmh added a commit that referenced this pull request Apr 17, 2020
* misc run scripts

* sbatch

* sweep scripts

* update

* qa

* update

* update

* update

* update

* update

* sb file

* moving update_metrics to outside scope of dataparallel

* fixing micro_avg calculation

* undo debugging

* Fixing tests, moving update_metrics out of other tasks

* remove extraneous change

* MLM task

* Added MLM task

* update

* fix multiple choice dataparallel forward

* update

* add _mask_id to transformers

* Update

* MLM update

* adding update_metrics abstraction

* delete update_metrics_ notation

* fixed wrong index problem

* removed unrelated files

* removed unrelated files

* removed unrelated files

* fix PEP8

* Fixed get_pretrained_lm_head for BERT and ALBERT

* spelling check

* black formatting

* fixing tests

* bug fix

* Adding batch_size constraints to multi-GPU setting

* adding documentation

* adding batch size test

* black correct version

* Fixing batch size assertion

* generalize batch size assertion for more than 2 GPU setting

* reducing label loops in code

* fixing span forward

* Fixing span prediction forward for multi-GPU

* fix commonsenseQA forward

* MLM

* adding function documentation

* resolving nits, fixing seq_gen forward

* remove nit

* fixing batch_size assert and SpanPrediction task

* Remove debugging

* Fix batch size mismatch multi-GPU test

* Fix order of assert checking for batch size mismatch

* mlm training

* update

* sbatch

* update

* data parallel

* update data parallel stuffs

* using sequencelabel, using 1 paragraph per example

* update label mapping

* adding examples-proportion-mixing

* changing dataloader to work with wikitext103

* weight sampling

* add early stopping only on one task

* commit

* Cleaning up code

* Removing unnecessarily tracked git folders

* Removing unnecessary changes

* revert README

* revert README.md again

* Making more general for Transformer-based embedders

* torch.uint8 -> torch.bool

* Fixing indexing issues

* get rid of unnecessary changes

* black cleanup

* update

* Prevent updating update_metrics twice in one step

* update

* update

* add base_roberta

* update

* reverting CCG edit added for debugging

* refactor defaults.conf

* black formatting

* merge

* removed SOP task and mlm_manual_scaling

* Fixing label namespace vocabulary creation, merging from master

* Deleting MLM weight

* black formatting

* Adding early_stopping_method to defaults.conf

* Fixing MLM with preprocessed wikitext103

* Deleting intermediate class hierarchy for MLM

* Correcting black

* LanguageModelingTask -> AutoregressiveModelingTask

* code style

* fixing MaskedLanguageModelTask

* Fixing typo

* Fixing label namespace

* extracting out masking portion

* Revert "extracting out masking portion"

This reverts commit f21165c.

* Code cleanup

* Adding tests for early_stopping_method

* Adding pretrain_stop_metric

* Reverting get_data_iter

* Reverting to get_data_iter

* Fixing get_pretrained_lm_head for all embedder types

* Extracting out MLM probability masking

* Move dynamic masking function to Task for easier testing

* Adding unit tests for MLM

* Adding change to MLM forward function to expose more intermediate steps for testing

* Fixing code style

* Adding more detailed instructions of how to generate Wikipedia data

* Adding rest of MLM data generation code

* Black style and remove comment

* black style

* updating repro code for MLM data

Co-authored-by: phu-pmh <[email protected]>
Co-authored-by: Haokun Liu <[email protected]>
Co-authored-by: pruksmhc <[email protected]>
Co-authored-by: DeepLearning VM <[email protected]>
jeswan added the jiant-v1-legacy (Relevant to versions <= v1.3.2) label Sep 17, 2020