RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM: size mismatch when I load using adapter path but not checkpoint #2071
Comments
Could you please post the full error message? Also, the more code you can share, the better I can help. Are you loading on the same multi-GPU setup?
I don't quite see what the difference is supposed to be. In one case you pass a variable that contains a string, and in the other you pass the string directly; is that it?
Yes, that is really strange. You mean that if you train a model with these settings, the same code works for loading that model's adapter?
I will share the complete error message later on. But these are not the same scenario:
So in this case, after the training is over, the model and tokenizer are saved. In the second case, where there is no error, I am loading the PEFT model not from the adapter path but from the checkpoint location, which is saved during training by setting save_steps and output_dir. I have seen similar issues where people report it is caused by DeepSpeed. During the DeepSpeed training I get:
Thanks for your response.
Okay, just to ensure that I understand correctly: when you load from the checkpoint that the trainer automatically created for you, loading works, but not if you try to load the files that you saved using save_pretrained?

Could you please compare the sizes of the files created by both methods? It could be the case that the automatic checkpoint saves the full model, not only the PEFT adapter, in which case the checkpoint would be much larger.

One thing you could try, to see if it fixes the PEFT checkpoint for you, is to gather the parameters before calling save_pretrained:

with deepspeed.zero.GatheredParameters(trainer.model.parameters()):
    trainer.model.save_pretrained(<path>)

This is really strange, since it does not appear that parameters are missing, but instead that they have the wrong shape (transposed). I have never seen this.
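To compare the two saves, listing the file sizes in both directories is enough. A minimal sketch (the paths and file names below are assumptions, not taken from this thread; recent PEFT versions write adapter_model.safetensors, older ones adapter_model.bin):

import os

def report_sizes(path):
    # Print the size of every file in a checkpoint directory in MB.
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            print(f"{name}: {os.path.getsize(full) / 1e6:.1f} MB")

report_sizes("output_dir/checkpoint-500")  # hypothetical Trainer checkpoint directory
report_sizes("saved_adapter")              # hypothetical save_pretrained output directory

If the Trainer checkpoint is orders of magnitude larger, it likely contains more than just the adapter (full model weights, optimizer state, and so on).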
Yes, that is correct.
Indeed this is the case, and the checkpoint model is larger. I wonder why this happens when the target modules include o_proj and k_proj? Apparently lots of people are experiencing the same problem, so I think there should be a permanent fix for this:
I would like to investigate this further, but with the given information I can't. So far, when I tried to reproduce this, the checkpoint created by save_pretrained was fine. Did you try what I suggested above using deepspeed.zero.GatheredParameters?

Update: I managed to create a situation that resulted in a broken checkpoint from save_pretrained. Using this context manager allowed me to save an intact checkpoint:

import deepspeed

with deepspeed.zero.GatheredParameters((p for n, p in trainer.model.named_parameters() if "lora" in n)):
    if trainer.accelerator.is_main_process:
        model.save_pretrained(<checkpoint_path>)
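One way to check whether a saved adapter is intact is to inspect the shapes of the stored LoRA tensors. A minimal sketch (assuming the adapter was written as safetensors; the path is hypothetical):

from safetensors.torch import load_file

adapter_state = load_file("checkpoint_path/adapter_model.safetensors")
for name, tensor in adapter_state.items():
    # For a healthy LoRA adapter, lora_A weights have shape (r, in_features)
    # and lora_B weights have shape (out_features, r); empty or otherwise
    # unexpected shapes point to a corrupted save under ZeRO-3.
    print(name, tuple(tensor.shape))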
I have not tried that, but I will try it the next time I am training and will update you here. I see you are changing the command a little bit: initially you gathered all parameters with deepspeed.zero.GatheredParameters(trainer.model.parameters()), and now you are gathering and saving only the parameters that have "lora" in their name.

It is also interesting that in the case where the model loaded correctly without using the checkpoint, r was 8, while the issue appeared with r=16, which is strange since 16 is not that large either. Can you elaborate on why this happens? Is it because of DeepSpeed?
Yes, when calling save_pretrained on the PEFT model, only the LoRA parameters are saved, so it is enough to gather just those.

I don't know enough about what DeepSpeed does under the hood to determine how models are sharded. But from what we can observe, this appears to be enough to make a difference. Same with what you said earlier about r=8 vs. r=16: this also should not make a big difference in the grand scheme of things, but apparently it does.

Thanks, hopefully it helps.
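As a side note on why gathering only the LoRA parameters is sufficient: the adapter checkpoint written by save_pretrained contains only the PEFT weights. A small sketch illustrating this (the base model and config values are placeholders, not taken from this thread):

from peft import LoraConfig, get_peft_model, get_peft_model_state_dict
from transformers import AutoModelForCausalLM

# Tiny base model purely for illustration; any causal LM behaves the same way.
base = AutoModelForCausalLM.from_pretrained("gpt2")
peft_model = get_peft_model(base, LoraConfig(r=8, target_modules=["c_attn"]))

# The adapter state dict holds only the LoRA tensors, so those are the only
# parameters that need to be gathered before saving under ZeRO-3.
adapter_state = get_peft_model_state_dict(peft_model)
print(all("lora" in name for name in adapter_state))  # True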
Thanks @BenjaminBossan. Using the GatheredParameters context manager above, the saved model works just fine, as before.
System Info
Multi-GPU setting with 2 A100 GPUs.
Each GPU has 80 GB of memory.
Training is done using Accelerate and DeepSpeed.
Who can help?
No response
Reproduction
I have pretrained my LLM using DeepSpeed and Accelerate on 2 GPUs. I have the following in my LoraConfig:
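The exact config did not survive in this thread. Purely as an illustration of what such a config looks like (all values below are placeholders, informed only by details mentioned later in the discussion, such as r and target_modules including k_proj and o_proj):

from peft import LoraConfig

# Placeholder values, not the author's actual configuration.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)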
When I use the following code to load my model with the PEFT config, I get an error:
However, when I do this, everything works, but why?
model = PeftModel.from_pretrained(model, 'path to checkpoint')
To make it even stranger, the above code works under the following setting with no error:

Can anyone explain what is going on? Literally everything is the same; I only added two more layers to target_modules.
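For context, a hedged sketch of the load path being compared here (the model name, dtype, and paths are placeholders, not values from this issue):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder base model; the issue does not state which model was used.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# from_pretrained reads adapter_config.json next to the weights and loads the
# stored adapter tensors into freshly initialized LoRA layers, so any mismatch
# between the saved tensor shapes and the config produces the "size mismatch"
# RuntimeError described above.
model = PeftModel.from_pretrained(base_model, "path/to/adapter_or_checkpoint")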
Expected behavior
The expected behavior is that this command runs with no problem, since neither the tokenizer nor the base model has changed since the pretraining phase. I expect this code to work without issues: