- Install PyTorch env:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
conda install -c conda-forge cupy nccl cudatoolkit=11.6
pip install transformers==4.21.1
pip install datasets
pip install netifaces
pip install zstandard
pip install wandb
```
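As a quick sanity check (a suggestion, not part of the original setup), you can verify that the CUDA build of PyTorch is visible before proceeding:

```bash
# Optional sanity check: confirm PyTorch imports and sees a GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```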
As we use wandb to manage experiments, please configure wandb before running the code:

```bash
wandb login
```

Optionally, install bitsandbytes to enable 8bit-adam:

```bash
pip install bitsandbytes  # optional, to use 8bit-adam
```
- Install flash attention (https://github.com/HazyResearch/flash-attention):

```bash
pip install flash-attn
```

In case of a compilation error, try setting `CUDA_HOME`, e.g. `export CUDA_HOME=/usr/local/cuda-11.6`.

We now integrate flash attention into OPT models. Please set `--model-type flash_opt` to use it.
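For example (a hypothetical invocation; every flag other than `--model-type` is a placeholder, with the full flag set documented in the fine-tuning section below):

```bash
# Hypothetical example: enable flash attention for an OPT checkpoint
python dist_lm_train.py --model-type flash_opt --model-name /path/to/opt-ckpt ...
```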
- Install apex:

```bash
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
```

In case of a compilation error, try setting `CUDA_HOME`, e.g. `export CUDA_HOME=/usr/local/cuda-11.6`.
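A quick way to confirm the apex CUDA extensions built correctly (a suggested check, not part of the original instructions):

```bash
# Should import without errors if the CUDA extensions compiled
python -c "from apex.optimizers import FusedAdam; print('apex OK')"
```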
We provide pretrained model checkpoints that are sharded by layers:

Please download and unzip the above ckpts in order to fine-tune them. The path of the unzipped model should be passed to `--model-name` and `--tokenizer-name` for fine-tuning.
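For instance (the checkpoint path below is a placeholder; both flags usually point at the same unzipped directory):

```bash
# Hypothetical paths; point both flags at the unzipped, layer-sharded checkpoint
python dist_lm_train.py \
    --model-name /path/to/unzipped-opt-1.3b \
    --tokenizer-name /path/to/unzipped-opt-1.3b \
    ...
```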
Please refer to `example_scripts/finetune_opt1.3b.sh`, which shows an example of fine-tuning OPT-1.3B on mmlu-cot data. The script launches 8 processes with a data parallel degree of 4 and a pipeline parallel degree of 2.
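As a rough sketch of what that launch could look like on a single 8-GPU machine (an assumed layout with an abbreviated flag set; consult the actual script for the full command):

```bash
# 8 workers = pipeline-group-size (2) * data-group-size (4)
for i in $(seq 0 7); do
    python dist_lm_train.py ... \
        --world-size 8 --pipeline-group-size 2 --data-group-size 4 \
        --cuda-id $i --rank $i &
done
wait
```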
In case of geo-distributed training, please first make sure the network interface is correctly set and that the master's (rank 0 worker's) IP and port are accessible by all the workers. After that, run the corresponding process on each GPU node:
```bash
# set environment vars
...

# run on each GPU node
python dist_lm_train.py ... --cuda-id 0 --rank ${GLOBAL_RANK}
```
Environment vars that should be set:

```bash
export GLOO_SOCKET_IFNAME=lo   # the correct network interface
export NCCL_SOCKET_IFNAME=lo   # the correct network interface
export WANDB_NAME=opt-test     # wandb run name
export RANDOMP_RATIO=0.1       # CocktailSGD: random sparsity ratio
export TOPK_RATIO=0.2          # CocktailSGD: top-k sparsity ratio
export QUANT_BITS=4            # CocktailSGD: quantization bits
```
The following arguments should be carefully set (a worked example follows the list):

- `--model-name`: The path of the model ckpt sharded by layers.
- `--tokenizer-name`: Usually the same as `--model-name`. You can also use HF's model name.
- `--model-type`: Indicates the model type. {opt, flash_opt, gptj, gptneox}. The `flash_` prefix uses flash attention to accelerate training.
- `--num-layers`: Number of Transformer layers for each GPU. E.g. OPT-1.3B has 24 layers, so if we use two GPUs to form a pipeline, `--num-layers` should be 12.
- `--embedding-dim`: The hidden size of the model. OPT-1.3B is 2048, GPT-J-6B is 4096, GPT-NeoX-20B is 6144. This is used to create buffers.
- `--dist-url`: URL of the rank 0 worker (master). It is the same for all workers, and it must be accessible by all workers. For local training (single machine, multiple GPUs), this can be something like `--dist-url tcp://127.0.0.1:7033`.
- `--world-size`: The total number of workers. `world-size == pipeline-group-size * data-group-size`.
- `--pipeline-group-size`: Number of GPU workers in each pipeline.
- `--data-group-size`: Number of data parallel workers. Also the number of pipelines.
- `--net-interface`: Network interface. Should be consistent with `GLOO_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME`.
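Putting these together, here is a minimal sketch for the OPT-1.3B layout described above (two pipeline stages, four pipelines; the checkpoint paths and port are placeholders):

```bash
# world-size 8 = pipeline-group-size (2) * data-group-size (4)
# num-layers 12 = 24 OPT-1.3B layers / 2 pipeline stages
python dist_lm_train.py \
    --model-name /path/to/opt-1.3b-ckpt \
    --tokenizer-name /path/to/opt-1.3b-ckpt \
    --model-type flash_opt \
    --num-layers 12 --embedding-dim 2048 \
    --dist-url tcp://127.0.0.1:7033 \
    --world-size 8 --pipeline-group-size 2 --data-group-size 4 \
    --net-interface lo \
    --cuda-id 0 --rank 0 ...
```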
The following arguments can be tuned / changed (a batch-size arithmetic sketch follows the list):

- `--optimizer`: Optimizer type. {adam, 8bit-adam} (8bit-adam requires `pip install bitsandbytes`).
- `--load-pretrained-model`: Whether to load model weights. Usually `true`.
- `--task-name`: The task name or the path of a `jsonl` file. For multi-task training, separate task names with `,`. An optional sampling weight can follow each task name, separated by `:` (default is 1.0). Sampling weights will be normalized. E.g. it should look like `--task-name cot:0.1,/path_task0.jsonl:1.0,/path_task1.jsonl:1.0,/path_task2.jsonl:1.0`.
- `--checkpoint-path`: Path to save fine-tuned checkpoints.
- `--checkpoint-steps`: Save a ckpt every `checkpoint-steps` steps.
- `--total-steps`: Total number of steps for training. (This counts all `gradient-accumulate-step`s.)
- `--warmup-steps`: LR warmup steps.
- `--lr`: Learning rate.
- `--seq-length`: Sequence length.
- `--batch-size`: Batch size for each GPU device (of each gradient accumulation step).
- `--micro-batch-size`: Micro batch size for pipeline parallelism. 1 works fine.
- `--gradient-accumulate-step`: Accumulate gradients for several steps before updating parameters. This is another way to achieve large batch sizes when GPU memory is not enough.
- `--dp-backend`: {gloo, nccl}.
- `--dp-mode`: {allreduce, cocktail_sgd}. `cocktail_sgd` should always be used with `--dp-backend gloo`; `allreduce` performs better with `nccl`.
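For intuition on how these flags interact, the sketch below works out the effective batch under the standard data-parallel accounting (an assumption on our part, not stated explicitly in this README; the numeric values are arbitrary examples):

```bash
# Assumed relationship; verify against your setup:
# samples per optimizer update = batch-size * gradient-accumulate-step * data-group-size
BATCH_SIZE=32; GRAD_ACC=4; DP_DEGREE=4; SEQ_LEN=2048
echo "samples/update: $(( BATCH_SIZE * GRAD_ACC * DP_DEGREE ))"
echo "tokens/update:  $(( BATCH_SIZE * GRAD_ACC * DP_DEGREE * SEQ_LEN ))"
```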
The following arguments usually do not change:

- `--fp16`: Flag to enable FP16 mixed precision training. It should always be added for the current implementation.
- `--pp-mode`: Always `gpipe`.
- `--profiling`: {no-profiling, tidy_profiling}. `tidy_profiling` will generate profiling JSONs.