- Install PyTorch env:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
conda install -c conda-forge cupy nccl cudatoolkit=11.6
pip install transformers==4.21.1
pip install datasets
pip install netifaces
pip install zstandard
pip install wandb
```
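As a quick sanity check (a suggestion, not part of the original setup), you can verify that the CUDA build of PyTorch is visible before proceeding:

```bash
# Optional sanity check: confirm PyTorch imports and sees a GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```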
As we use wandb to manage experiments, please configure wandb before running the code:

```bash
wandb login
```

Optionally, install bitsandbytes to enable 8bit-adam:

```bash
pip install bitsandbytes  # optional, to use 8bit-adam
```
- Install flash attention (https://github.com/HazyResearch/flash-attention):

```bash
pip install flash-attn
```

In case of a compilation error, try setting `CUDA_HOME`, e.g. `export CUDA_HOME=/usr/local/cuda-11.6`.

We now integrate flash attention into OPT models. Please set `--model-type flash_opt` to use it.
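For example (a hypothetical invocation; every flag other than `--model-type` is a placeholder, with the full flag set documented in the fine-tuning section below):

```bash
# Hypothetical example: enable flash attention for an OPT checkpoint
python dist_lm_train.py --model-type flash_opt --model-name /path/to/opt-ckpt ...
```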
- Install apex:

```bash
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
```

In case of a compilation error, try setting `CUDA_HOME`, e.g. `export CUDA_HOME=/usr/local/cuda-11.6`.
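A quick way to confirm the apex CUDA extensions built correctly (a suggested check, not part of the original instructions):

```bash
# Should import without errors if the CUDA extensions compiled
python -c "from apex.optimizers import FusedAdam; print('apex OK')"
```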
We provide pretrained model checkpoints that are sharded by layers:

Please download and unzip the above ckpts in order to fine-tune them. The path of the unzipped model should be passed to `--model-name` and `--tokenizer-name` for fine-tuning.
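For instance (the checkpoint path below is a placeholder; both flags usually point at the same unzipped directory):

```bash
# Hypothetical paths; point both flags at the unzipped, layer-sharded checkpoint
python dist_lm_train.py \
    --model-name /path/to/unzipped-opt-1.3b \
    --tokenizer-name /path/to/unzipped-opt-1.3b \
    ...
```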
Please refer to `example_scripts/finetune_opt1.3b.sh`, which shows an example of fine-tuning OPT-1.3B on mmlu-cot data. The script launches 8 processes with a data parallel degree of 4 and a pipeline parallel degree of 2.
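As a rough sketch of what that launch could look like on a single 8-GPU machine (an assumed layout with an abbreviated flag set; consult the actual script for the full command):

```bash
# 8 workers = pipeline-group-size (2) * data-group-size (4)
for i in $(seq 0 7); do
    python dist_lm_train.py ... \
        --world-size 8 --pipeline-group-size 2 --data-group-size 4 \
        --cuda-id $i --rank $i &
done
wait
```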
In case of geo-distributed training, please first make sure the network interface is correctly set and that the master's (rank 0 worker's) IP and port are accessible by all the workers. After that, run the corresponding process on each GPU node:
```bash
# set environment vars
...

# run on each GPU node
python dist_lm_train.py ... --cuda-id 0 --rank ${GLOBAL_RANK}
```
Environment vars that should be set:

```bash
export GLOO_SOCKET_IFNAME=lo   # the correct network interface
export NCCL_SOCKET_IFNAME=lo   # the correct network interface
export WANDB_NAME=opt-test     # wandb run name
export RANDOMP_RATIO=0.1       # CocktailSGD: random sparsity ratio
export TOPK_RATIO=0.2          # CocktailSGD: top-k sparsity ratio
export QUANT_BITS=4            # CocktailSGD: quantization bits
```
The following arguments should be carefully set (a worked example follows the list):

- `--model-name`: The path of the model ckpt sharded by layers.
- `--tokenizer-name`: Usually the same as `--model-name`. You can also use HF's model name.
- `--model-type`: Indicates the model type. {opt, flash_opt, gptj, gptneox}. The `flash_` prefix uses flash attention to accelerate training.
- `--num-layers`: Number of Transformer layers for each GPU. E.g. OPT-1.3B has 24 layers, so if we use two GPUs to form a pipeline, `--num-layers` should be 12.
- `--embedding-dim`: The hidden size of the model. OPT-1.3B is 2048, GPT-J-6B is 4096, GPT-NeoX-20B is 6144. This is used to create buffers.
- `--dist-url`: URL of the rank 0 worker (master). It is the same for all workers, and it must be accessible by all workers. For local training (single machine, multiple GPUs), this can be something like `--dist-url tcp://127.0.0.1:7033`.
- `--world-size`: The total number of workers. `world-size == pipeline-group-size * data-group-size`.
- `--pipeline-group-size`: Number of GPU workers in each pipeline.
- `--data-group-size`: Number of data parallel workers. Also the number of pipelines.
- `--net-interface`: Network interface. Should be consistent with `GLOO_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME`.
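Putting these together, here is a minimal sketch for the OPT-1.3B layout described above (two pipeline stages, four pipelines; the checkpoint paths and port are placeholders):

```bash
# world-size 8 = pipeline-group-size (2) * data-group-size (4)
# num-layers 12 = 24 OPT-1.3B layers / 2 pipeline stages
python dist_lm_train.py \
    --model-name /path/to/opt-1.3b-ckpt \
    --tokenizer-name /path/to/opt-1.3b-ckpt \
    --model-type flash_opt \
    --num-layers 12 --embedding-dim 2048 \
    --dist-url tcp://127.0.0.1:7033 \
    --world-size 8 --pipeline-group-size 2 --data-group-size 4 \
    --net-interface lo \
    --cuda-id 0 --rank 0 ...
```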
The following arguments can be tuned / changed (a batch-size arithmetic sketch follows the list):

- `--optimizer`: Optimizer type. {adam, 8bit-adam} (8bit-adam requires `pip install bitsandbytes`).
- `--load-pretrained-model`: Whether to load model weights. Usually `true`.
- `--task-name`: The task name or the path of a `jsonl` file. For multi-task training, separate task names with `,`. An optional sampling weight can follow each task name, separated by `:` (default is 1.0). Sampling weights will be normalized. E.g. it should look like `--task-name cot:0.1,/path_task0.jsonl:1.0,/path_task1.jsonl:1.0,/path_task2.jsonl:1.0`.
- `--checkpoint-path`: Path to save fine-tuned checkpoints.
- `--checkpoint-steps`: Save a ckpt every `checkpoint-steps` steps.
- `--total-steps`: Total number of steps for training. (This counts all `gradient-accumulate-step`s.)
- `--warmup-steps`: LR warmup steps.
- `--lr`: Learning rate.
- `--seq-length`: Sequence length.
- `--batch-size`: Batch size for each GPU device (of each gradient accumulation step).
- `--micro-batch-size`: Micro batch size for pipeline parallelism. 1 works fine.
- `--gradient-accumulate-step`: Accumulate gradients for several steps before updating parameters. This is another way to achieve large batch sizes when GPU memory is not enough.
- `--dp-backend`: {gloo, nccl}.
- `--dp-mode`: {allreduce, cocktail_sgd}. `cocktail_sgd` should always be used with `--dp-backend gloo`; `allreduce` performs better with `nccl`.
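For intuition on how these flags interact, the sketch below works out the effective batch under the standard data-parallel accounting (an assumption on our part, not stated explicitly in this README; the numeric values are arbitrary examples):

```bash
# Assumed relationship; verify against your setup:
# samples per optimizer update = batch-size * gradient-accumulate-step * data-group-size
BATCH_SIZE=32; GRAD_ACC=4; DP_DEGREE=4; SEQ_LEN=2048
echo "samples/update: $(( BATCH_SIZE * GRAD_ACC * DP_DEGREE ))"
echo "tokens/update:  $(( BATCH_SIZE * GRAD_ACC * DP_DEGREE * SEQ_LEN ))"
```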
The following arguments usually do not change:

- `--fp16`: Flag to enable FP16 mixed precision training. It should always be added for the current implementation.
- `--pp-mode`: Always `gpipe`.
- `--profiling`: {no-profiling, tidy_profiling}. `tidy_profiling` will generate profiling JSONs.