Adding Masked Language Modelling #1030

Merged 154 commits on Apr 10, 2020
154 commits
430f942
misc run scripts
phu-pmh Oct 30, 2019
39603c3
sbatch
phu-pmh Oct 31, 2019
9b324f9
sweep scripts
phu-pmh Nov 4, 2019
d3cc769
Merge branch 'master' of https://github.com/nyu-mll/jiant
phu-pmh Nov 5, 2019
00bc40c
Merge branch 'master' of https://github.com/nyu-mll/jiant
phu-pmh Nov 9, 2019
4e297b1
update
phu-pmh Nov 9, 2019
b75d0f5
qa
phu-pmh Nov 10, 2019
1aadf48
update
phu-pmh Nov 10, 2019
8993b9e
Merge branch 'master' of https://github.com/nyu-mll/jiant
phu-pmh Nov 10, 2019
a3f10e2
update
phu-pmh Nov 13, 2019
aa0d8b4
update
phu-pmh Nov 13, 2019
275d7a3
Merge branch 'master' of https://github.com/nyu-mll/jiant
phu-pmh Nov 13, 2019
4b6b939
Merge branch 'master' of https://github.com/nyu-mll/jiant
phu-pmh Nov 16, 2019
7252ea5
update
phu-pmh Nov 16, 2019
f0d9c56
update
phu-pmh Nov 20, 2019
00223c6
Merge branch 'master' of https://github.com/nyu-mll/jiant
phu-pmh Nov 27, 2019
b0a8ec3
sb file
phu-pmh Dec 12, 2019
c4d2601
moving update_metrics to outside scope of dataparallel
Jan 14, 2020
acb9d24
fixing micro_avg calculation
Jan 16, 2020
8bdec95
undo debugging
Jan 16, 2020
0d879b1
Merge branch 'master' of https://github.com/nyu-mll/jiant
phu-pmh Jan 17, 2020
4f0a169
Merge branch 'master' into fix_dataparallel_metric_calculation
Jan 17, 2020
5bb8389
Fixing tests, moving update_metrics out of other tasks
Jan 17, 2020
fb59ecc
Merge branch 'master' of https://github.com/nyu-mll/jiant into fix_da…
Jan 17, 2020
04dbbda
Merge branch 'fix_dataparallel_metric_calculation' of https://github.…
Jan 17, 2020
3ddf564
remove extraneous change
Jan 17, 2020
e588909
MLM task
phu-pmh Jan 21, 2020
dfa9fd9
Added MLM task
phu-pmh Jan 21, 2020
46182a9
update
phu-pmh Jan 24, 2020
607bcd2
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
phu-pmh Jan 24, 2020
d1daf23
fix multiple choice dataparallel forward
Jan 25, 2020
9539302
Merge branch 'master' into fix_dataparallel_metric_calculation
Jan 25, 2020
fc5f026
update
phu-pmh Jan 27, 2020
ce7f5c2
add _mask_id to transformers
HaokunLiu Jan 28, 2020
ffc7354
Update
phu-pmh Jan 30, 2020
c50d75b
Merge branch 'master' of https://github.com/nyu-mll/jiant into MLM
phu-pmh Jan 30, 2020
9649224
Merge branch 'master' into fix_dataparallel_metric_calculation
Jan 30, 2020
69a9364
MLM update
phu-pmh Jan 30, 2020
697d62c
Merge branch 'add-_mask_id-to-transformers' into MLM
HaokunLiu Jan 30, 2020
a4666da
adding update_metrics abstraction
Jan 30, 2020
fa13f6f
delete update_metrics_ notation
Jan 30, 2020
6b61e8b
fixed wrong index problem
phu-pmh Jan 30, 2020
3e10e3b
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
phu-pmh Jan 30, 2020
afc0938
removed unrelated files
phu-pmh Jan 31, 2020
dcff7e7
removed unrelated files
phu-pmh Jan 31, 2020
1c1e6fb
removed unrelated files
phu-pmh Jan 31, 2020
f25ee99
fix PEP8
phu-pmh Jan 31, 2020
3f35212
Fixed get_pretained_lm_head for BERT and ALBERT
phu-pmh Jan 31, 2020
fc85270
spelling check
Feb 1, 2020
321bda8
black formatting
Feb 1, 2020
ae92b78
fixing tests
Feb 2, 2020
4f36878
bug fix
phu-pmh Feb 3, 2020
0467871
Adding batch_size constraints to multi-GPU setting
Feb 5, 2020
e3c5c79
adding documentation
Feb 5, 2020
6e96fd0
adding batch size test
Feb 5, 2020
845bf4f
Merge branch 'master' of https://github.com/nyu-mll/jiant into fix_da…
Feb 5, 2020
b41c268
black correct version
Feb 5, 2020
6f82412
Fixing batch size assertion
Feb 5, 2020
c749ea7
generalize batch size assertion for more than 2 GPU setting
Feb 5, 2020
73222a5
reducing label loops in code
Feb 6, 2020
fe39525
fixing span forward
Feb 8, 2020
745836d
Fixing span prediction forward for multi-GPU
invalid-email-address Feb 8, 2020
14caaab
fix commonsenseQA forward
invalid-email-address Feb 8, 2020
4271a7a
Merge branch 'master' of https://github.com/nyu-mll/jiant into MLM
phu-pmh Feb 10, 2020
918c0df
MLM
phu-pmh Feb 10, 2020
5ed0691
adding function documentation
Feb 11, 2020
ffac8bf
Merge branch 'master' into fix_dataparallel_metric_calculation
Feb 11, 2020
fe86d96
resolving nits, fixing seq_gen forward
Feb 11, 2020
eee439f
Merge branch 'fix_dataparallel_metric_calculation' of https://github.…
Feb 11, 2020
b61fa7c
remove nit
Feb 11, 2020
55312e8
fixing batch_size assert and SpanPrediction task
Feb 12, 2020
7d165cf
Remove debugging
Feb 12, 2020
52f66c7
Fix batch size mismatch multi-GPU test
Feb 12, 2020
a0220f8
Fix order of assert checking for batch size mismatch
Feb 12, 2020
fe89674
mlm training
phu-pmh Feb 12, 2020
2218e5b
update
phu-pmh Feb 14, 2020
cd75715
Merge branch 'fix_dataparallel_metric_calculation' of https://github.…
phu-pmh Feb 14, 2020
58b2914
sbatch
phu-pmh Feb 16, 2020
052b1c0
update
phu-pmh Feb 17, 2020
b26927a
data parallel
phu-pmh Feb 17, 2020
cd4b5a6
update data parallel stuffs
phu-pmh Feb 19, 2020
0d6d691
update MLM
phu-pmh Feb 20, 2020
b3617fa
using sequencelabel, using 1 paragraph per example
Feb 23, 2020
0af6476
update label mapping
phu-pmh Feb 24, 2020
e9f863c
adding exmaples-porportion-mixing
Feb 24, 2020
89e44c5
changing dataloader to work with wikitext103
Feb 24, 2020
0752771
weight sampling
Feb 24, 2020
5482ac2
add early stopping only onb one task
Mar 5, 2020
6d85b27
commit
phu-pmh Mar 6, 2020
d67e195
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
phu-pmh Mar 6, 2020
921e717
Merge branch 'master' of https://github.com/nyu-mll/jiant into MLM
Mar 8, 2020
05d5750
Cleaning up code
Mar 8, 2020
ddcd357
Removing unecessarily tracked git folders
Mar 8, 2020
9e4e3a7
Removing unnecesary changes
Mar 8, 2020
b9b5f57
revert README
Mar 8, 2020
6b4c9d5
revert README.md again
Mar 8, 2020
35130ca
Making more general for Transformer-based embedders
Mar 8, 2020
20de779
torch.uint8 -> torch.bool
Mar 8, 2020
4020c81
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
Mar 8, 2020
09f5903
Fixing indexing issues
Mar 8, 2020
4f45826
get rid of unecessary changes
Mar 8, 2020
8ac8c70
black cleanup
Mar 8, 2020
6cee66e
update
phu-pmh Mar 8, 2020
3709696
Prevent updating update_metrics twice in one step
Mar 10, 2020
3fb4e3e
ALBERT SOP update
phu-pmh Mar 16, 2020
a56b7c7
update
phu-pmh Mar 18, 2020
b84da1d
update
phu-pmh Mar 18, 2020
2a19c2c
update
phu-pmh Mar 18, 2020
e7acb76
add base_roberta
Mar 20, 2020
b1ac702
update
phu-pmh Mar 20, 2020
c5fddf0
reverting CCG edit added for debugging
phu-pmh Mar 20, 2020
9774b61
refactor defaults.conf
phu-pmh Mar 20, 2020
194c2d4
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
Mar 22, 2020
4be35b3
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
Mar 22, 2020
429be9a
black formatting
Mar 22, 2020
a9555b1
merge
Mar 22, 2020
13002f6
removed SOP task and mlm_manual_scaling
phu-pmh Mar 22, 2020
9e6bc5d
Fixing label namespace vocabulary creation, mergeing from master
Mar 22, 2020
a0aad25
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
Mar 22, 2020
85db63e
Deleting MLM weight
Mar 22, 2020
4536433
Merge branch 'master' into MLM
Mar 22, 2020
85b081b
black formatting
Mar 22, 2020
eabe292
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
Mar 22, 2020
09caf0f
Adding early_stopping_method to defaults.conf
Mar 22, 2020
a7f8f16
Fixing MLM with preprocessed wikitext103
Mar 24, 2020
94c32ae
Deleting intermediate class hierarchy for MLM
Mar 24, 2020
74d474b
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
Mar 24, 2020
1d20684
Correcting black
Mar 24, 2020
12c0da1
LanguageModelingTask -> AutoregressiveModelingTask
Mar 24, 2020
960cf63
code style
Mar 24, 2020
f0d3b6d
fixing MaskedLanguageModelTask
Mar 25, 2020
cd2042e
Fixing typo
Mar 25, 2020
cf7612a
Fixing label namespace
Mar 27, 2020
f21165c
extracting out masking portion
Mar 28, 2020
1f25078
Revert "extracting out masking portion"
Apr 2, 2020
fec67cb
Code cleanup
Apr 2, 2020
c766706
Adding tests for early_stpping_method
Apr 2, 2020
1a5c06b
Merge branch 'master' of https://github.com/nyu-mll/jiant into MLM
Apr 2, 2020
5c3ff7b
Adding pretrain_stop_metric
Apr 3, 2020
8ca1eba
Reverting get_data_iter
Apr 3, 2020
9b377ab
Reverting to get_data_iter
Apr 6, 2020
bf841e9
Fixing get_pretrained_lm_head for all embedder types
Apr 6, 2020
2349464
Extracting out MLM probability masking
Apr 7, 2020
cf223a4
Merge branch 'MLM' of https://github.com/nyu-mll/jiant into MLM
Apr 7, 2020
a3465c1
Move dynamic masking function to Task for easier testing
Apr 8, 2020
0f5b849
Adding unit tests for MLM
Apr 8, 2020
fb9ce83
Adding change to MLM forward function to expose more intermediate ste…
Apr 8, 2020
a59c762
Fixing code style
Apr 9, 2020
e9eb5f0
Adding more detailed instructions of how to generate Wikipedia data
Apr 10, 2020
1a76df0
Adding rest of MLM data generation code
Apr 10, 2020
34c924b
Black style and remove comment
Apr 10, 2020
da5fe19
black style
Apr 10, 2020
9446cb7
updating repro code for MLM data
phu-pmh Apr 10, 2020
3f6eb92
updating repro code for MLM data
phu-pmh Apr 10, 2020
updating repro code for MLM data
phu-pmh committed Apr 10, 2020
commit 9446cb78060eef0e5048b98832e1e1af93d9783b
14 changes: 7 additions & 7 deletions scripts/mlm/README.md
@@ -1,21 +1,21 @@
# Downloading Wikipedia Corpus
We use the preprocessing code from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#getting-the-data.
The bash scripts provided here streamline data generation with the NVIDIA repository.

First, run `git clone https://github.com/NVIDIA/DeepLearningExamples.git`.
Then, move `create_wiki_data.sh` and `get_small_english_wiki.sh` into `DeepLearningExamples/PyTorch/LanguageModeling/BERT/data`.

Set `BERT_PREP_WORKING_DIR` as an environment variable to specify the directory where the
Wikipedia data will be saved.

Then, follow the instructions below:

Run `bash create_wiki_data.sh $lang $save_directory`.
The NVIDIA code supports English (`en`) and Chinese (`zh`) Wikipedia.

For example, to download and process English Wikipedia and save it in the `~/Download` directory, run
`bash create_wiki_data.sh en ~/Download`

The above command will download the entire English Wikipedia.

In our experiments, we only use a small subset (around 5%) of the entire English Wikipedia, which has the same number of sentences as Wikitext103.
To get this subset, run `bash get_small_english_wiki.sh $path_to_wikicorpus_en`, where `$path_to_wikicorpus_en` is the directory where you saved the full processed `wikicorpus_en` corpus.
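
As a rough illustration of the constraints described above (two positional arguments, and only `en` or `zh` as languages), a wrapper could validate its input before starting the long-running download. The function name `check_wiki_args` is hypothetical and not part of this PR; this is only a sketch of one possible pre-flight check.

```shell
#!/usr/bin/env bash

# Hypothetical pre-flight check (not part of the PR): validate the two
# arguments before handing them to create_wiki_data.sh.
check_wiki_args() {
    local lang=${1:-} save_dir=${2:-}

    # The README states that only English and Chinese Wikipedia are supported.
    case "$lang" in
        en|zh) ;;
        *)
            echo "unsupported language: '$lang' (expected en or zh)" >&2
            return 1
            ;;
    esac

    # The second argument is the directory the corpus is written to.
    if [ -z "$save_dir" ]; then
        echo "missing save directory" >&2
        return 1
    fi
    return 0
}

# Usage (commented out because it would start a full Wikipedia download):
# check_wiki_args en ~/Download && bash create_wiki_data.sh en ~/Download
```

Failing early on a bad language code or a missing directory avoids discovering the mistake only after the download step has started.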

11 changes: 6 additions & 5 deletions scripts/mlm/create_wiki_data.sh
@@ -14,7 +14,7 @@
 # limitations under the License.

 lang=$1 #the language, 'en' for English wikipedia
-save_dir=$2
+export BERT_PREP_WORKING_DIR=$2

 # clone wikiextractor if it doesn't exist
 if [ ! -d "wikiextractor" ]; then
@@ -23,19 +23,19 @@ fi

 echo "Downloading $lang wikpedia in directory $save_dir"
 # Download
-python3 bertPrep.py --action download --dataset wikicorpus_$lang --save_dir $save_dir
+python3 bertPrep.py --action download --dataset wikicorpus_$lang


 # Properly format the text files
-python3 bertPrep.py --action text_formatting --dataset wikicorpus_$lang --save_dir $save_dir
+python3 bertPrep.py --action text_formatting --dataset wikicorpus_$lang


 # Shard the text files (group wiki+books then shard)
-python3 bertPrep.py --action sharding --dataset wikicorpus_$lang --save_dir $save_dir
+python3 bertPrep.py --action sharding --dataset wikicorpus_$lang


 # Combine sharded files into one
-save_dir=$save_dir/sharded_training_shards_256_test_shards_256_fraction_0.2/wikicorpus_$lang
+save_dir=$BERT_PREP_WORKING_DIR/sharded_training_shards_256_test_shards_256_fraction_0.2/wikicorpus_$lang
 cat $save_dir/*training*.txt > $save_dir/train_$lang.txt
 cat $save_dir/*test*.txt > $save_dir/test_$lang.txt
 rm -rf $save_dir/wiki*training*.txt
@@ -46,3 +46,4 @@ sed -i 's/<[^>]*>//g' $save_dir/train_$lang.txt
 sed -i 's/<[^>]*>//g' $save_dir/test_$lang.txt

 echo "Your corpus is saved in $save_dir"
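
The final steps of the script (concatenating shards into one file and stripping markup tags with `sed`) can be tried on throwaway data without downloading anything. The sketch below is an assumption-laden illustration, not the PR's code: `combine_split` and the toy file names are hypothetical, paths are quoted, and the `sed` substitution is written without `-i` so it behaves the same on GNU and BSD `sed`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the PR): merge the shards of one split
# into a single file and strip anything that looks like an XML/HTML tag.
combine_split() {
    local shard_dir=$1 pattern=$2 out_file=$3
    local tmp
    tmp=$(mktemp)
    # Quote the directory but leave the glob unquoted so it still expands.
    cat "$shard_dir"/*"$pattern"*.txt > "$tmp"
    # Same substitution as in the script: delete every <...> sequence.
    sed 's/<[^>]*>//g' "$tmp" > "$out_file"
    rm -f "$tmp"
}

# Toy demonstration; the shard names and contents are made up.
demo_dir=$(mktemp -d)
printf '<doc id="1">First article.</doc>\n' > "$demo_dir/wiki_training_0.txt"
printf 'Second <b>article</b>.\n' > "$demo_dir/wiki_training_1.txt"
combine_split "$demo_dir" training "$demo_dir/train_en.txt"
cat "$demo_dir/train_en.txt"
```

Writing to a temporary file first keeps the output file out of the input glob, which matters for patterns such as `*test*.txt` that would also match an existing `test_en.txt` from an earlier run.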