Fix save pretrained for TPUs #105

Merged · 2 commits · May 17, 2021

Conversation

@kaushikb11 (Contributor) commented Mar 19, 2021

Related to #97

Hello @minimaxir! 👋🏽

The code hangs on TPUs when it calls xm.save for the save_pretrained method; with this PR, xm.save is passed to the save_function param for TPUs.

There's one more update on the Lightning side for TPUs. The update will be available in the next patch release on Tuesday next week. Thank you!
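
For context, a minimal sketch of the idea (not the exact PR diff; the helper name and flag below are illustrative), assuming transformers' save_pretrained and torch_xla's xm.save:

```python
# Sketch only: route the checkpoint write through torch_xla's xm.save on TPU
# hosts, instead of the default torch.save used by save_pretrained.
def save_model(model, output_dir: str, on_tpu: bool = False) -> None:
    if on_tpu:
        import torch_xla.core.xla_model as xm  # importable only on TPU hosts
        model.save_pretrained(output_dir, save_function=xm.save)
    else:
        model.save_pretrained(output_dir)
```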

@@ -172,10 +176,19 @@ def on_batch_end(self, trainer, pl_module):
desc += f" — GPU Mem: {gpu_memory} MB"
self.main_progress_bar.update(self.progress_bar_refresh_rate)
self.main_progress_bar.set_description(desc)


if _TPU_AVAILABLE and self.save_every_check:
@SeanNaren (Contributor) commented Mar 19, 2021

Could we get some more information about this check and why it's necessary? I think we can iterate on this to get something cleaner; the idea is that the boilerplate can live within Lightning, outside of the user code.

@kaushikb11 (Contributor, Author) commented Mar 19, 2021

Converting the PR to a draft, as I found an issue with generate_sample_text on TPUs. Will update soon.

kaushikb11 marked this pull request as a draft on March 19, 2021, 15:08
@minimaxir (Owner) commented Mar 19, 2021

Hi Kaushik and Sean! Thanks for the PR!

A few notes/questions:

  • Is there a way to test this? The issue I had in #97 (DeepSpeed + TPU support via transformers' Trainer) was that on Colab, TPU training was not even starting, let alone getting to a generate callback. A test Colab would be helpful, as I know pytorch-lightning has changed how it interacts with Colab over time. (I suspect Colab will be the primary use case for this implementation since the TPU there is free.)
  • Thanks for the xm.rendezvous additions; that was a bottleneck I was not sure how to fix!
  • If generate_sample_text() becomes too much of an issue to fix (i.e., if the rendezvous takes too long to move back to the host system), I am OK with forcing it off for TPU training.

@kaushikb11 (Contributor, Author)

Hi @minimaxir,

> Is there a way to test this?

I understand the use case, as most users will train on Colab for TPUs. I can work on a test Colab and will add it here. It would be a good test to check that everything works as expected.

> If generate_sample_text() becomes too much of an issue to fix (i.e., if the rendezvous takes too long to move back to the host system), I am OK with forcing it off for TPU training.

Noted! Let me spend some time on it and try to debug it. I will get back to you regarding it early next week.

@minimaxir (Owner) commented Mar 24, 2021

The pytorch-lightning update appears to be live now, so I can test this.

Also:

> Fixed a bug where all_gather would not work correctly with tpu_cores=8

That may have been the issue I was hitting earlier, so good timing!

@kaushikb11 (Contributor, Author)

@minimaxir Awesome! We added a few more fixes for TPUs.
Could you give PyTorch Lightning master a try?

@minimaxir (Owner) commented Mar 29, 2021

Created a draft Colab notebook: https://colab.research.google.com/drive/1MF-3hMdLpxzF7uH-MWJVaxX1u03WnVU_?usp=sharing

However, it's running into issues with the __init__ changes:

/usr/local/lib/python3.7/dist-packages/aitextgen/train.py in __init__(self, model, dataset, hparams, tokenizer)
     23             dataset,
     24             hparams,
---> 25             tokenizer,
     26         )
     27 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in __setattr__(self, name, value)
    993                     buffers[name] = value
    994                 else:
--> 995                     object.__setattr__(self, name, value)
    996 
    997     def __delattr__(self, name):

AttributeError: can't set attribute

@kaushikb11 (Contributor, Author)

@minimaxir I have pushed a fix in this PR: https://github.com/kaushikb11/aitextgen/blob/tpu_save/fix/aitextgen/train.py#L29

We would have to use self.save_hyperparameters() to set hparams.
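
A hedged sketch of that change (the signature is taken from the traceback above; the class name and other details are assumptions, not the exact PR code):

```python
import pytorch_lightning as pl

class ATGTransformer(pl.LightningModule):
    def __init__(self, model, dataset, hparams, tokenizer):
        super().__init__()
        self.model, self.dataset, self.tokenizer = model, dataset, tokenizer
        # `self.hparams = hparams` now raises "AttributeError: can't set
        # attribute" because hparams is a read-only property in recent
        # Lightning releases; save_hyperparameters() is the supported way.
        self.save_hyperparameters(hparams)
```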

kaushikb11 marked this pull request as ready for review on March 31, 2021, 10:26
@minimaxir (Owner)

I have updated the Colab notebook to pull correctly from your PR branch, made it self-sufficient, and added some test cases.

The good news is that the model now runs on the Colab TPU!

Unfortunately, there's a lot of bad news that prevents merging:

  • Training speed with 8 cores and with 1 core appears to be roughly the same. (I'm not expecting a perfectly linear speedup, but there should be some speedup.)
  • Saving the model appears to hang on the TPU rendezvous when training is complete.
  • Interrupting training does not attempt to save the model back to the CPU.

Feel free to play with the notebook in case I am missing anything; I'm OK with merging once these issues are addressed.

@minimaxir (Owner)

Same issue/overall performance with the alternate Colab TPU installation via Lightning-AI/pytorch-lightning#6794

@kaushikb11 (Contributor, Author)

@minimaxir Noted! Thanks for linking the notebook. There's an issue regarding the hang on Colab (Lightning-AI/pytorch-lightning#6390); we have tagged it as a priority bug, and I will personally look into getting it resolved by next week. We did the testing on a GCP TPU + VM, and it works as expected.

Will also investigate the training speed for 1 vs. 8 cores. Thanks, appreciate the feedback!

kaushikb11 marked this pull request as a draft on April 3, 2021, 20:36
@kaushikb11 (Contributor, Author)

@minimaxir There's been some progress on the reported issues:

> Saving the model appears to hang on the TPU rendezvous when training is complete.
> Interrupting training does not attempt to save the model back to the CPU.

I believe the above TPU issues have been resolved in this PR. It's available in the latest release (v1.2.7); would love for you to try it out. But there seems to be a tricky issue during ai.generate, where it hangs on 1 TPU core. That needs to be debugged.

> Training speed with 8 cores and with 1 core appears to be roughly the same. (I'm not expecting a perfectly linear speedup, but there should be some speedup.)

Regarding this, I am afraid you are benchmarking the same number of global steps (i.e. optimizer calls), whereas training with 8 TPU cores has a larger effective batch size.
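
To make that concrete, a back-of-the-envelope illustration (the numbers are assumed, not taken from the notebook):

```python
# Same number of optimizer steps, different effective batch sizes,
# hence different amounts of training data consumed.
per_core_batch_size = 1   # assumed for illustration
global_steps = 1000       # optimizer calls in both runs

for tpu_cores in (1, 8):
    effective_batch_size = per_core_batch_size * tpu_cores
    samples_seen = global_steps * effective_batch_size
    print(f"{tpu_cores} core(s): effective batch {effective_batch_size}, "
          f"samples seen {samples_seen}")
# Equal wall-clock time per step on 1 vs. 8 cores therefore still means
# roughly 8x the data throughput on 8 cores.
```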

@minimaxir (Owner)

Thanks for the updates @kaushikb11 ! I just tested the notebook with 1.3.0rc0 and things definitely seem to be improved (can confirm no more hangs in the 8 TPU cores case).

There's still one blocker issue: the trained weights don't seem to make it back to the CPU from the TPU correctly, which can be tested by calling ai.generate() in a subsequent cell after training the model. This is still the case when saving with the save_pretrained variant and reloading the model.

There are still a few other issues:

> But there seems to be a tricky issue during ai.generate, where it hangs on 1 TPU core. That needs to be debugged.

It appears ending training also hangs on 1 TPU core but not on 8 TPU cores, oddly.

> Regarding this, I am afraid you are benchmarking the same number of global steps (i.e. optimizer calls), whereas training with 8 TPU cores has a larger effective batch size.

That would make sense; however, I'm not sure that's happening in practice. I changed the notebook to fine-tune the 124M GPT-2 with a batch size of 1. On paper, an effective batch size of 8 should result in a substantially faster loss decrease than a batch size of 1, but that does not appear to be the case in the test notebook, where the moving-average losses are about the same. I may need to set up better logging to help visualize this.

It may be OK to merge this PR soon (with a note that this behavior is beta); I'll increase the minimum pytorch-lightning requirement to 1.3.0 and add conditions to avoid generation with TPU training in another commit.
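
A hypothetical sketch of such a guard (the flag and helper names are assumptions, not the actual follow-up commit):

```python
# Hypothetical guard: skip sample generation while training on TPU,
# since it currently hangs or returns incorrect output there.
def maybe_generate_sample_text(ai, on_tpu: bool) -> None:
    if on_tpu:
        return  # generation during TPU training is disabled for now
    ai.generate(n=1)
```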

kaushikb11 marked this pull request as ready for review on April 11, 2021, 19:38
@minimaxir (Owner)

Rerunning the notebook with the all_gather fix on PL master results in an interesting error:

[Screenshot of the error traceback, 2021-04-17]

I wonder if that explains the core discrepancy above? Either way, 8 TPU cores does not run at all.

@kaushikb11 (Contributor, Author)

Hi @minimaxir, there was a bit of refactoring for DDP Spawn last week, and as TPU Spawn is a subclass of it, it was affected. This has been resolved in Lightning-AI/pytorch-lightning#7074.

@minimaxir (Owner)

As of pytorch-lightning 1.3, training now works and completes without hangs! However, the generation is still not correct, implying the model does not go back to the CPU. That should hopefully be debuggable on my end.

Given that, plus the fact that this PR has some fixes for #129, I'll merge this now, but I won't advertise TPU support until it's more hardened and there's a more up-to-date test notebook.

Thanks, and sorry for the late update!

@kaushikb11 (Contributor, Author)

Thanks for the update!!

@swcrazyfan commented Jun 11, 2021

> As of pytorch-lightning 1.3, training now works and completes without hangs! However, the generation is still not correct, implying the model does not go back to the CPU. That should hopefully be debuggable on my end.
>
> Given that, plus the fact that this PR has some fixes for #129, I'll merge this now, but I won't advertise TPU support until it's more hardened and there's a more up-to-date test notebook.
>
> Thanks, and sorry for the late update!

Do you have any idea when TPU training might be ready? I'd love to use aitextgen with my TRC TPUs.
