2024-05-07-alibi-mlm #70

Merged
merged 10 commits on Mar 22, 2024
Changes from 1 commit
dataset & benchmark citations, conclusion expansion
EIFY committed Dec 4, 2023
commit b8e3983eb8ad8d7bfb6677f9aabf2d6c59e51c91
10 changes: 6 additions & 4 deletions _posts/2024-05-07-distill-ALiBi-MLM.md
@@ -116,8 +116,8 @@ We removed the `embed_dim x embed_dim` fully-connected layer, activation function …
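To make the simplification concrete, the sketch below contrasts a standard RoBERTa-style MLM prediction head with a stripped-down head that drops the `embed_dim x embed_dim` fully-connected layer and activation in favor of a direct weight-tied output projection. This is a minimal illustration of the change described above, not the actual fairseq implementation; the module and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class StandardMLMHead(nn.Module):
    """RoBERTa-style prediction head: dense -> activation -> layer norm -> tied output projection."""
    def __init__(self, embed_weight: torch.Tensor):
        super().__init__()
        vocab_size, embed_dim = embed_weight.shape
        self.dense = nn.Linear(embed_dim, embed_dim)  # the embed_dim x embed_dim layer removed above
        self.activation = nn.GELU()
        self.layer_norm = nn.LayerNorm(embed_dim)
        self.embed_weight = embed_weight              # tied to the input token embedding
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = self.layer_norm(self.activation(self.dense(features)))
        return x @ self.embed_weight.T + self.bias    # logits over the vocabulary

class SimplifiedHead(nn.Module):
    """Stripped-down head: project hidden states straight to vocabulary logits with the tied embedding."""
    def __init__(self, embed_weight: torch.Tensor):
        super().__init__()
        self.embed_weight = embed_weight

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return features @ self.embed_weight.T
```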

## Experiments

### [WikiText-103](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
We first tested the changes on the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/) with a GeForce RTX 3080 16 GB Laptop GPU, using the validation set MLM perplexity as the metric. We tested the baseline, learned-clap (baseline + CLAP head), ALiBi (baseline + ALiBi), and zero-clap (baseline + CLAP head + ALiBi), as well as a variant of the baseline with sinusoidal positional encoding:
### WikiText-103
We first tested the changes on the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/) <d-cite key="DBLP:conf/iclr/MerityX0S17"></d-cite> with a GeForce RTX 3080 16 GB Laptop GPU, using the validation set MLM perplexity as the metric. We tested the baseline, learned-clap (baseline + CLAP head), ALiBi (baseline + ALiBi), and zero-clap (baseline + CLAP head + ALiBi), as well as a variant of the baseline with sinusoidal positional encoding:

{% include figure.html path="assets/img/2024-05-07-distill-ALiBi-MLM/valid_ppl_cleaned.png" class="img-fluid" %}

@@ -131,8 +131,8 @@ where solid lines are what's considered "canonical" setup and dotted lines are e…

As we can see, the dotted lines are almost on top of the solid lines. Notably, sinusoidal positional encoding underperforms significantly compared to the baseline.

### [The Pile](https://arxiv.org/abs/2101.00027)
As the next step, we scaled up our experiments to train on [the Pile](https://arxiv.org/abs/2101.00027) for one epoch. About half of the examples in the Pile have sequence length > 1024, so we set the sequence length to 2048. Even so, ~1/7 of the examples have sequence length > 2048 and had to be discarded. In the end, one epoch consists of 133082 updates, and [we employ a cosine learning rate schedule while "overestimating" the number of training steps by 10%](https://github.com/EIFY/fairseq/blob/33fb2c306851f104cc567b7fe865b1e3fd1e6fe7/examples/roberta/config/pretraining/baseline_pile.yaml#L31-L36), as inspired by the Chinchilla paper <d-cite key="hoffmann2022training"></d-cite>. In addition to the validation MLM perplexity, we also fine-tuned the models on the [GLUE](https://gluebenchmark.com/) benchmark. As in the original RoBERTa paper, we tested both `roberta.base` with 125M parameters and `roberta.large` with 355M parameters. These experiments were performed on 8 x A100 40GB SXM4 GPUs; the `roberta.base` experiments took ~3 days and the `roberta.large` experiments took ~9 days. In the table below, `PPL` is the final validation MLM perplexity, `STS-B` is the best validation loss, and all the others are the best validation accuracies over 10 epochs of fine-tuning.
### The Pile
As the next step, we scaled up our experiments to train on the Pile <d-cite key="DBLP:journals/corr/abs-2101-00027"></d-cite> for one epoch. About half of the examples in the Pile have sequence length > 1024, so we set the sequence length to 2048. Even so, ~1/7 of the examples have sequence length > 2048 and had to be discarded. In the end, one epoch consists of 133082 updates, and [we employ a cosine learning rate schedule while "overestimating" the number of training steps by 10%](https://github.com/EIFY/fairseq/blob/33fb2c306851f104cc567b7fe865b1e3fd1e6fe7/examples/roberta/config/pretraining/baseline_pile.yaml#L31-L36), as inspired by the Chinchilla paper <d-cite key="hoffmann2022training"></d-cite>. In addition to the validation MLM perplexity, we also fine-tuned the models on the [GLUE](https://gluebenchmark.com/) benchmark <d-cite key="wang-etal-2018-glue"></d-cite>. As in the original RoBERTa paper, we tested both `roberta.base` with 125M parameters and `roberta.large` with 355M parameters. These experiments were performed on 8 x A100 40GB SXM4 GPUs; the `roberta.base` experiments took ~3 days and the `roberta.large` experiments took ~9 days. In the table below, `PPL` is the final validation MLM perplexity, `STS-B` is the best validation loss, and all the others are the best validation accuracies over 10 epochs of fine-tuning.
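For reference, here is a minimal sketch of the learning-rate schedule described above: the cosine decay is scheduled over ~10% more steps than are actually run, so training stops before the learning rate fully decays. The warmup length and peak learning rate below are placeholder assumptions; the actual values live in the linked fairseq config.

```python
import math

TOTAL_UPDATES = 133082                           # one epoch over the Pile at sequence length 2048
SCHEDULED_UPDATES = round(TOTAL_UPDATES * 1.1)   # "overestimate" the number of training steps by 10%
WARMUP_UPDATES = 10000                           # placeholder assumption
PEAK_LR = 6e-4                                   # placeholder assumption

def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay toward zero at SCHEDULED_UPDATES.
    Training actually stops at TOTAL_UPDATES, i.e. before the decay completes."""
    if step < WARMUP_UPDATES:
        return PEAK_LR * step / WARMUP_UPDATES
    progress = (step - WARMUP_UPDATES) / (SCHEDULED_UPDATES - WARMUP_UPDATES)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

With these numbers, the learning rate at the final update is still slightly above zero, which is the intended effect of the overestimate.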

#### `roberta.base`
```
@@ -159,6 +159,8 @@ We found that ALiBi no longer helps lower the validation MLM perplexity. Furthermore …
## Conclusions
This seems to be another case where models with lower perplexity do not necessarily yield higher accuracies on downstream tasks, and where architectural changes that benefit models at smaller scales do not necessarily benefit models at larger scales <d-cite key="tay2022scaling"></d-cite>. The CLAP head, however, is simpler than the standard prediction head for MLMs, requires minimal changes, and may be worth trying, especially at larger scales.

In the broader context, MosaicBERT <d-cite key="portes2023mosaicbert"></d-cite> and LittleBird <d-cite key="lee-etal-2022-littlebird"></d-cite> are the most similar to our experiments. In the MosaicBERT paper, Portes et al. also evaluate BERT-style masked LMs with symmetric (option 1) encoder-attention ALiBi on the GLUE benchmark and find performance exceeding the BERT baseline within a limited training budget. However, these MosaicBERT models were trained with a much shorter sequence length (128) and so may have avoided the sequence-length regime in which perplexity and performance on certain downstream tasks start to deteriorate. The LittleBird architecture is designed for question answering and built with BiALiBi (Bidirectional ALiBi), a variation of option 3 (nonsymmetric with different slopes) in which the model learns not only the forward and backward slopes $$m_{fwd}$$ and $$m_{bwd}$$, but also a special bias value for the attention weight of the global `[CLS]` token. Lee et al. evaluate LittleBird models on a collection of QA benchmarks for both English and Korean and report favorable performance, but leave open the question of whether these models work well for other NLP tasks. Notably, we also found our ALiBi models capable of matching the baseline performance on the question answering task `QNLI`, so the reported performance is consistent with our experiments even without appealing to the other differences in architecture or pretraining task.
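To make the distinction between these variants concrete, the sketch below builds the two kinds of encoder-ALiBi bias matrices discussed here: the symmetric option 1 bias and a nonsymmetric option 3 bias with separate forward and backward slopes in the spirit of BiALiBi. The learned slopes and the special `[CLS]` bias of LittleBird, as well as per-head slope assignment, are omitted; this is an illustrative sketch rather than either paper's implementation. In all cases the bias is simply added to the attention logits before the softmax.

```python
import torch

def symmetric_alibi_bias(seq_len: int, slope: float) -> torch.Tensor:
    """Option 1 (symmetric): the penalty depends only on the distance |i - j|."""
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()
    return -slope * dist                       # shape (seq_len, seq_len)

def nonsymmetric_alibi_bias(seq_len: int, m_fwd: float, m_bwd: float) -> torch.Tensor:
    """Option 3 (nonsymmetric): different slopes for positions ahead of (j > i)
    and behind (j < i) the query position i."""
    pos = torch.arange(seq_len)
    offset = pos[None, :] - pos[:, None]       # offset[i, j] = j - i
    fwd = -m_fwd * offset.clamp(min=0)         # penalty for attending forward
    bwd = m_bwd * offset.clamp(max=0)          # penalty for attending backward (offset <= 0 here)
    return fwd + bwd
```

For example, `nonsymmetric_alibi_bias(4, 0.5, 0.25)` penalizes attention to later tokens twice as steeply as attention to earlier ones.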

## Model checkpoints
Final checkpoints for models trained on the Pile:

90 changes: 90 additions & 0 deletions assets/bibliography/2024-05-07-distill-ALiBi-MLM.bib
@@ -115,6 +115,47 @@ @inproceedings{press-wolf-2017-using
abstract = "We study the topmost weight matrix of neural network language models. We show that this matrix constitutes a valid word embedding. When training language models, we recommend tying the input embedding and this output embedding. We analyze the resulting update rules and show that the tied embedding evolves in a more similar way to the output embedding than to the input embedding in the untied model. We also offer a new method of regularizing the output embedding. Our methods lead to a significant reduction in perplexity, as we are able to show on a variety of neural network language models. Finally, we show that weight tying can reduce the size of neural translation models to less than half of their original size without harming their performance.",
}

@inproceedings{DBLP:conf/iclr/MerityX0S17,
author = {Stephen Merity and
Caiming Xiong and
James Bradbury and
Richard Socher},
title = {Pointer Sentinel Mixture Models},
booktitle = {5th International Conference on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
publisher = {OpenReview.net},
year = {2017},
url = {https://openreview.net/forum?id=Byj72udxe},
timestamp = {Thu, 25 Jul 2019 14:25:57 +0200},
biburl = {https://dblp.org/rec/conf/iclr/MerityX0S17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{DBLP:journals/corr/abs-2101-00027,
author = {Leo Gao and
Stella Biderman and
Sid Black and
Laurence Golding and
Travis Hoppe and
Charles Foster and
Jason Phang and
Horace He and
Anish Thite and
Noa Nabeshima and
Shawn Presser and
Connor Leahy},
title = {The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
journal = {CoRR},
volume = {abs/2101.00027},
year = {2021},
url = {https://arxiv.org/abs/2101.00027},
eprinttype = {arXiv},
eprint = {2101.00027},
timestamp = {Thu, 14 Oct 2021 09:16:12 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2101-00027.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@misc{hoffmann2022training,
title={Training Compute-Optimal Large Language Models},
author={Jordan Hoffmann and Sebastian Borgeaud and Arthur Mensch and Elena Buchatskaya and Trevor Cai and Eliza Rutherford and Diego de Las Casas and Lisa Anne Hendricks and Johannes Welbl and Aidan Clark and Tom Hennigan and Eric Noland and Katie Millican and George van den Driessche and Bogdan Damoc and Aurelia Guy and Simon Osindero and Karen Simonyan and Erich Elsen and Jack W. Rae and Oriol Vinyals and Laurent Sifre},
@@ -125,6 +166,28 @@ @misc{hoffmann2022training
primaryClass={cs.CL}
}

@inproceedings{wang-etal-2018-glue,
title = "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding",
author = "Wang, Alex and
Singh, Amanpreet and
Michael, Julian and
Hill, Felix and
Levy, Omer and
Bowman, Samuel",
editor = "Linzen, Tal and
Chrupa{\l}a, Grzegorz and
Alishahi, Afra",
booktitle = "Proceedings of the 2018 {EMNLP} Workshop {B}lackbox{NLP}: Analyzing and Interpreting Neural Networks for {NLP}",
month = nov,
year = "2018",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/W18-5446",
doi = "10.18653/v1/W18-5446",
pages = "353--355",
abstract = "Human ability to understand language is \textit{general, flexible, and robust}. In contrast, most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. If we aspire to develop models with understanding beyond the detection of superficial correspondences between inputs and outputs, then it is critical to develop a unified model that can execute a range of linguistic tasks across different domains. To facilitate research in this direction, we present the General Language Understanding Evaluation (GLUE, gluebenchmark.com): a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models. For some benchmark tasks, training data is plentiful, but for others it is limited or does not match the genre of the test set. GLUE thus favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks. While none of the datasets in GLUE were created from scratch for the benchmark, four of them feature privately-held test data, which is used to ensure that the benchmark is used fairly. We evaluate baselines that use ELMo (Peters et al., 2018), a powerful transfer learning technique, as well as state-of-the-art sentence representation models. The best models still achieve fairly low absolute scores. Analysis with our diagnostic dataset yields similarly weak performance over all phenomena tested, with some exceptions.",
}

@misc{chowdhery2022palm,
title={PaLM: Scaling Language Modeling with Pathways},
author={Aakanksha Chowdhery and Sharan Narang and Jacob Devlin and Maarten Bosma and Gaurav Mishra and Adam Roberts and Paul Barham and Hyung Won Chung and Charles Sutton and Sebastian Gehrmann and Parker Schuh and Kensen Shi and Sasha Tsvyashchenko and Joshua Maynez and Abhishek Rao and Parker Barnes and Yi Tay and Noam Shazeer and Vinodkumar Prabhakaran and Emily Reif and Nan Du and Ben Hutchinson and Reiner Pope and James Bradbury and Jacob Austin and Michael Isard and Guy Gur-Ari and Pengcheng Yin and Toju Duke and Anselm Levskaya and Sanjay Ghemawat and Sunipa Dev and Henryk Michalewski and Xavier Garcia and Vedant Misra and Kevin Robinson and Liam Fedus and Denny Zhou and Daphne Ippolito and David Luan and Hyeontaek Lim and Barret Zoph and Alexander Spiridonov and Ryan Sepassi and David Dohan and Shivani Agrawal and Mark Omernick and Andrew M. Dai and Thanumalayan Sankaranarayana Pillai and Marie Pellat and Aitor Lewkowycz and Erica Moreira and Rewon Child and Oleksandr Polozov and Katherine Lee and Zongwei Zhou and Xuezhi Wang and Brennan Saeta and Mark Diaz and Orhan Firat and Michele Catasta and Jason Wei and Kathy Meier-Hellstern and Douglas Eck and Jeff Dean and Slav Petrov and Noah Fiedel},
@@ -144,3 +207,30 @@ @misc{tay2022scaling
archivePrefix={arXiv},
primaryClass={cs.LG}
}

@inproceedings{portes2023mosaicbert,
title={MosaicBERT: How to Train BERT with a Lunch Money Budget},
author={Jacob Portes and Alexander R Trott and Sam Havens and Daniel King and Abhinav Venigalla and Moin Nadeem and Nikhil Sardana and Daya Khudia and Jonathan Frankle},
booktitle={Workshop on Efficient Systems for Foundation Models @ ICML2023},
year={2023},
url={https://openreview.net/forum?id=WH1S0gonzR}
}

@inproceedings{lee-etal-2022-littlebird,
title = "LittleBird: Efficient Faster & Longer Transformer for Question Answering",
author = "Lee, Minchul and
Han, Kijong and
Shin, Myeong Cheol",
editor = "Goldberg, Yoav and
Kozareva, Zornitsa and
Zhang, Yue",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.352",
doi = "10.18653/v1/2022.emnlp-main.352",
pages = "5261--5277",
abstract = "BERT has shown a lot of sucess in a wide variety of NLP tasks. But it has a limitation dealing with long inputs due to its attention mechanism. Longformer, ETC and BigBird addressed this issue and effectively solved the quadratic dependency problem. However we find that these models are not sufficient, and propose LittleBird, a novel model based on BigBird with improved speed and memory footprint while maintaining accuracy. In particular, we devise a more flexible and efficient position representation method based on Attention with Linear Biases(ALiBi). We also show that replacing the method of global information represented in the BigBird with pack and unpack attention is more effective. The proposed model can work on long inputs even after being pre-trained on short inputs, and can be trained efficiently reusing existing pre-trained language model for short inputs. This is a significant benefit for low-resource languages where large amounts of long text data are difficult to obtain. As a result, our experiments show that LittleBird works very well in a variety of languages, achieving high performance in question answering tasks, particularly in KorQuAD2.0, Korean Question Answering Dataset for long paragraphs.",
}