CUDA OOM during ckpt saving for Llama2-70b #142
I have the same problem. I think everything is gathered back onto the first rank before saving, and that causes the CUDA OOM. I was able to save one .distcp file per GPU, but I'm not sure how to get just the LoRA adapter file from there...
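One possible route for the .distcp shards: recent PyTorch ships a helper to merge them back into a single torch.save file, after which the LoRA tensors can be filtered out by name. A minimal sketch, assuming PyTorch >= 2.2, hypothetical paths, and that the shards nest the weights under a "model" key (adjust to your actual layout):

```python
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Hypothetical paths: point these at your actual shard folder / output files.
dcp_dir = "path/to/dist_checkpoint_folder"   # directory holding the .distcp shards
merged_path = "merged_full_state.pt"

# Consolidate the per-rank shards into one torch.save file (runs on CPU, no GPU needed).
dcp_to_torch_save(dcp_dir, merged_path)

state = torch.load(merged_path, map_location="cpu")
model_sd = state.get("model", state)  # assumption: weights may be nested under a "model" key

# Keep only the LoRA adapter tensors; everything else is the frozen base model.
lora_sd = {k: v for k, v in model_sd.items() if "lora_" in k}
torch.save(lora_sd, "lora_adapter.pt")
```

Note the resulting file is not in PEFT's `adapter_model` layout; loading it back onto a freshly built PEFT model via `set_peft_model_state_dict` may be simpler than renaming keys by hand.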
I'm also running into this (albeit with 4 x A100 80GB). Wondering if there is a way we can work around it; happy to make a contribution if the direction is clear. Seems like a shame to have this bug.
Any updates on this issue? I'm encountering the same problem with 8 x A100 (80GB) for LoRA 70B.
I'm facing the same issue. Any workaround?
Also facing the same issue here. I noticed another thread hitting a similar issue with LoRA fine-tuning (although with another model): philschmid/deep-learning-pytorch-huggingface#16
I found a workaround that allows CPU offloading during the state-dict saving phase. I verified that end-to-end 70B training with checkpointing works on this repo. I will try to find time to merge the changes in soon, but you can find them here: https://github.com/modal-labs/llama-recipes
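The core of that workaround is the standard FSDP full-state-dict config with CPU offload. A minimal sketch of the pattern (not the exact code from the linked fork), where `model`, `output_dir`, and the rank check stand in for the objects in the training script:

```python
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

# Gather the full (unsharded) state dict, but stream it to host memory and only
# materialize it on rank 0, so no single GPU has to hold the whole 70B model.
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    cpu_state = model.state_dict()

if dist.get_rank() == 0:
    # PEFT's save_pretrained accepts an explicit state_dict, so the adapter can be
    # written from the CPU copy instead of triggering another gather on GPU.
    model.save_pretrained(output_dir, state_dict=cpu_state)
```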
I'm encountering the same problem with 6 x H800 (80GB) for LoRA 70B.
I'm encountering the same problem with 8 x H100 (80GB) for LoRA 70B.
Sorry for the late reply @yuanzhedong and everyone. Is this happening only with the alpaca dataset? I believe some of the issues from September should be resolved, and we have CPU offload now if that helps. I don't have H100s on hand at the moment, but I'm looking to get access and repro the issue.
@yuanzhedong It seems like an issue with transformers; I could repro this issue with a specific transformers version.
Can you please give it a try?
@HamidShojanazeri Thank you very much for your reply. I tried with a new install of |
Sure, it shouldn't have anything to do with your batch size; the version conflict was the only way I could repro and bypass it. Please let me know how it went.
I tried with the following package versions and a fresh install of llama-recipes, but I'm still getting the same error.
Below is the command that I used to start the fine-tuning job.
Training Epoch: 1/3, step 46/47 completed (loss: 0.15002813935279846): 100%|████████████████████████████████████████████████| 47/47 [34:52<00:00, 44.52s/it]
And the following error in another iteration:
evaluating Epoch: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:41<00:00, 3.75s/it]
@Mugheeera I wonder if you are installing llama-recipes from src? This seems to be working on my end, running on H100s, but regardless it should work on both A100 and H100 (see logs). I could use the latest transformers as well, so it is not from src anymore.
This sounds like a stale issue, so I will close it for now, but feel free to re-open if you see the same issues.
Hi @HamidShojanazeri, I also got the same OOM error when using 8 x H100s. See llama-recipes/src/llama_recipes/utils/train_utils.py, lines 228 to 259 at c1f8de2.
Could the reason for the OOM be that the model needs to gather weights across ranks before model.save_pretrained?
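That gather is the most likely culprit: without CPU offload, the gathered full-precision weights land on the GPUs. Besides the FullStateDictConfig approach sketched above, newer PyTorch (roughly 2.2+) exposes the same idea through the distributed-checkpoint state-dict helpers. A sketch, with `model` and `output_dir` as placeholders:

```python
import torch.distributed as dist
from torch.distributed.checkpoint.state_dict import StateDictOptions, get_model_state_dict

# full_state_dict=True gathers the sharded weights across ranks;
# cpu_offload=True moves them to host RAM as they arrive, avoiding the GPU spike.
options = StateDictOptions(full_state_dict=True, cpu_offload=True)
cpu_state = get_model_state_dict(model, options=options)

if dist.get_rank() == 0:
    model.save_pretrained(output_dir, state_dict=cpu_state)
```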
Hi @HamidShojanazeri, I think you were testing with the 7B model; most of the people here are seeing the issue with the 70B model. I also had the same issue with the 70B model on the alpaca dataset. I have installed llama-recipes from src, but it is still not working.
Hi, I am using 8 x A100 (80GB) to LoRA-finetune Llama2-70b. The training and evaluation during epoch 1 went well, but it went OOM when saving the PEFT model. The nightly version of PyTorch is used.
The following command is used:
torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --low_cpu_fsdp --model_name ../Llama-2-70b-chat-hf --micro_batch_size 1 --batch_size_training 1 --dist_checkpoint_root_folder ../Llama-2-70b-chat-hf/ --dist_checkpoint_folder fine-tuned --use_peft --peft_method lora --lr 3e-4 --epoch 2 --pure_bf16 --alpaca_dataset --output_dir llama-70b-lorawallsft
"we are about to save the PEFT modules", it went CUDA OOM after this log is printed.