Megatron-LM

Readme

Original Megatron-LM readme

Installation

cd ./data
make

Data Preprocessing

See the data preprocessing docs from Megatron-LM.

Changes in preprocess_data.py:

  • The preprocess_data.py script has been moved to the megatron folder.
  • It supports tokenizers from HuggingFace Transformers.
  • The input can be a folder containing multiple json/jsonl files.
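
To illustrate the folder input, here is a minimal sketch of how a path could be expanded into a list of json/jsonl files. The helper name `collect_input_files` is hypothetical and is not taken from preprocess_data.py; the actual script may scan folders differently (for example, recursively).

```python
from pathlib import Path

# Hypothetical helper, not part of preprocess_data.py.
def collect_input_files(input_path):
    """Return the .json/.jsonl files for a single file or a folder path."""
    path = Path(input_path)
    if path.is_file():
        # A single file is passed through unchanged.
        return [path]
    # A folder: take every .json / .jsonl file directly inside it
    # (assumption: no recursion into subfolders), in sorted order.
    return sorted(
        p for p in path.iterdir()
        if p.is_file() and p.suffix in {".json", ".jsonl"}
    )
```

Calling `collect_input_files("./train")` would then return the data files to be tokenized, in a deterministic order.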

Example usage with an HF tokenizer:

python preprocess_data.py \
       --input ./train \
       --output-prefix ./train \
       --dataset-impl mmap \
       --tokenizer-type HFTokenizer \
       --tokenizer-name-or-path bert-base-uncased \
       --split-sentences --workers 8
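
For a quick trial of the command above, the input folder needs at least one jsonl file. The sketch below writes one, assuming each line is a JSON object with a `text` field, which is the default key in upstream Megatron-LM; this fork may expect something else. The helper name `write_jsonl` and the file name `part-0.jsonl` are made up for illustration.

```python
import json
from pathlib import Path

# Hypothetical helper for building a small test corpus.
def write_jsonl(records, path):
    """Write one JSON object per line and return the output path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path

if __name__ == "__main__":
    # Assumption: documents live under the "text" key.
    docs = [
        {"text": "First document. It has two sentences."},
        {"text": "Second document."},
    ]
    write_jsonl(docs, "./train/part-0.jsonl")
```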