Fix save pretrained for TPUs #105

Merged · 2 commits · May 17, 2021

Conversation

@kaushikb11 (Contributor) commented Mar 19, 2021

Related to #97

Hello @minimaxir! 👋🏽

The code hangs on TPUs when it calls xm.save for the save_pretrained method; with this PR, xm.save is passed to the save_function param for TPUs.

There's one more update on the Lightning side for TPUs. The update will be available in the next patch release on Tuesday next week. Thank you!
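
For context, a minimal sketch of the idea (not the exact PR diff; the helper name and flag below are illustrative), assuming transformers' save_pretrained and torch_xla's xm.save:

```python
# Sketch only: route the checkpoint write through torch_xla's xm.save on TPU
# hosts, instead of the default torch.save used by save_pretrained.
def save_model(model, output_dir: str, on_tpu: bool = False) -> None:
    if on_tpu:
        import torch_xla.core.xla_model as xm  # importable only on TPU hosts
        model.save_pretrained(output_dir, save_function=xm.save)
    else:
        model.save_pretrained(output_dir)
```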

@@ -172,10 +176,19 @@ def on_batch_end(self, trainer, pl_module):
desc += f" — GPU Mem: {gpu_memory} MB"
self.main_progress_bar.update(self.progress_bar_refresh_rate)
self.main_progress_bar.set_description(desc)


if _TPU_AVAILABLE and self.save_every_check:
@SeanNaren (Contributor) commented Mar 19, 2021

Could we get some more information about this check and why it's necessary? I think we can iterate on this to get something cleaner; the idea is that the boilerplate can live within Lightning, outside of the user code.

@kaushikb11 (Contributor, Author) commented Mar 19, 2021

Converting the PR to a draft, as I found an issue with generate_sample_text on TPUs. Will update soon.

kaushikb11 marked this pull request as a draft on March 19, 2021, 15:08
@minimaxir (Owner) commented Mar 19, 2021

Hi Kaushik and Sean! Thanks for the PR!

A few notes/questions:

  • Is there a way to test this? The issue I had in #97 (DeepSpeed + TPU support via transformers' Trainer) was that on Colab, TPU training was not even starting, let alone getting to a generate callback. A test Colab would be helpful, as I know pytorch-lightning has changed how it interacts with Colab over time. (I suspect Colab will be the primary use case for this implementation since the TPU there is free.)
  • Thanks for the xm.rendezvous additions; that was a bottleneck I was not sure how to fix!
  • If generate_sample_text() becomes too much of an issue to fix (i.e., if the rendezvous takes too long to move back to the host system), I am OK with forcing it off for TPU training.

@kaushikb11 (Contributor, Author)

Hi @minimaxir,

> Is there a way to test this?

I understand the use case, as most users will train on Colab for TPUs. I can work on a test Colab and will add it here. It would be a good test to check that everything works as expected.

> If generate_sample_text() becomes too much of an issue to fix (i.e., if the rendezvous takes too long to move back to the host system), I am OK with forcing it off for TPU training.

Noted! Let me spend some time on it and try to debug it. I will get back to you regarding it early next week.

@minimaxir (Owner) commented Mar 24, 2021

The pytorch-lightning update appears to be live now, so I can test this.

Also:

> Fixed a bug where all_gather would not work correctly with tpu_cores=8

That may have been the issue I was hitting earlier, so good timing!

@kaushikb11 (Contributor, Author)

@minimaxir Awesome! We added a few more fixes for TPUs.
Could you give PyTorch Lightning master a try?

@minimaxir (Owner) commented Mar 29, 2021

Created a draft Colab notebook: https://colab.research.google.com/drive/1MF-3hMdLpxzF7uH-MWJVaxX1u03WnVU_?usp=sharing

However, it's running into issues with the __init__ changes:

/usr/local/lib/python3.7/dist-packages/aitextgen/train.py in __init__(self, model, dataset, hparams, tokenizer)
     23             dataset,
     24             hparams,
---> 25             tokenizer,
     26         )
     27 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in __setattr__(self, name, value)
    993                     buffers[name] = value
    994                 else:
--> 995                     object.__setattr__(self, name, value)
    996 
    997     def __delattr__(self, name):

AttributeError: can't set attribute

@kaushikb11 (Contributor, Author)

@minimaxir I have pushed a fix in this PR: https://github.com/kaushikb11/aitextgen/blob/tpu_save/fix/aitextgen/train.py#L29

We would have to use self.save_hyperparameters() to set hparams.
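
A hedged sketch of that change (the signature is taken from the traceback above; the class name and other details are assumptions, not the exact PR code):

```python
import pytorch_lightning as pl

class ATGTransformer(pl.LightningModule):
    def __init__(self, model, dataset, hparams, tokenizer):
        super().__init__()
        self.model, self.dataset, self.tokenizer = model, dataset, tokenizer
        # `self.hparams = hparams` now raises "AttributeError: can't set
        # attribute" because hparams is a read-only property in recent
        # Lightning releases; save_hyperparameters() is the supported way.
        self.save_hyperparameters(hparams)
```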

kaushikb11 marked this pull request as ready for review on March 31, 2021, 10:26
@minimaxir (Owner)

I have updated the Colab notebook to pull correctly from your PR branch, made it self-sufficient, and added some test cases.

The good news is that the model now runs on the Colab TPU!

Unfortunately, there's a lot of bad news that prevents merging:

  • Training speed with 8 cores and with 1 core appears to be roughly the same. (I'm not expecting a perfectly linear speedup, but there should be some speedup.)
  • Saving the model appears to hang on the TPU rendezvous when training is complete.
  • Interrupting training does not attempt to save the model back to the CPU.

Feel free to play with the notebook in case I am missing anything; I'm OK with merging once these issues are addressed.

@minimaxir (Owner)

Same issue/overall performance with the alternate Colab TPU installation via Lightning-AI/pytorch-lightning#6794

@kaushikb11 (Contributor, Author)

@minimaxir Noted! Thanks for linking the notebook. There's an issue regarding the hang on Colab (Lightning-AI/pytorch-lightning#6390); we have tagged it as a priority bug, and I will personally look into getting it resolved by next week. We did the testing on a GCP TPU + VM, and it works as expected.

Will also investigate the training speed for 1 vs. 8 cores. Thanks, appreciate the feedback!

kaushikb11 marked this pull request as a draft on April 3, 2021, 20:36
@kaushikb11 (Contributor, Author)

@minimaxir There's been some progress on the reported issues:

> Saving the model appears to hang on the TPU rendezvous when training is complete.
> Interrupting training does not attempt to save the model back to the CPU.

I believe the above TPU issues have been resolved in this PR. It's available in the latest release (v1.2.7); would love for you to try it out. But there seems to be a tricky issue during ai.generate, where it hangs on 1 TPU core. That needs to be debugged.

> Training speed with 8 cores and with 1 core appears to be roughly the same. (I'm not expecting a perfectly linear speedup, but there should be some speedup.)

Regarding this, I am afraid you are benchmarking the same number of global steps (i.e. optimizer calls), whereas training with 8 TPU cores has a larger effective batch size.
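
To make that concrete, a back-of-the-envelope illustration (the numbers are assumed, not taken from the notebook):

```python
# Same number of optimizer steps, different effective batch sizes,
# hence different amounts of training data consumed.
per_core_batch_size = 1   # assumed for illustration
global_steps = 1000       # optimizer calls in both runs

for tpu_cores in (1, 8):
    effective_batch_size = per_core_batch_size * tpu_cores
    samples_seen = global_steps * effective_batch_size
    print(f"{tpu_cores} core(s): effective batch {effective_batch_size}, "
          f"samples seen {samples_seen}")
# Equal wall-clock time per step on 1 vs. 8 cores therefore still means
# roughly 8x the data throughput on 8 cores.
```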

@minimaxir (Owner)

Thanks for the updates @kaushikb11 ! I just tested the notebook with 1.3.0rc0 and things definitely seem to be improved (can confirm no more hangs in the 8 TPU cores case).

There's still one blocker issue: the trained weights don't seem to make it back to the CPU from the TPU correctly, which can be tested by calling ai.generate() in a subsequent cell after training the model. This is still the case when saving with the save_pretrained variant and reloading the model.

There are still a few other issues:

> But there seems to be a tricky issue during ai.generate, where it hangs on 1 TPU core. That needs to be debugged.

It appears ending training also hangs on 1 TPU core but not on 8 TPU cores, oddly.

> Regarding this, I am afraid you are benchmarking the same number of global steps (i.e. optimizer calls), whereas training with 8 TPU cores has a larger effective batch size.

That would make sense; however, I'm not sure that's happening in practice. I changed the notebook to fine-tune the 124M GPT-2 with a batch size of 1. On paper, an effective batch size of 8 should result in a substantially faster loss decrease than a batch size of 1, but that does not appear to be the case in the test notebook, where the moving-average losses are about the same. I may need to set up better logging to help visualize this.

It may be OK to merge this PR soon (with a note that this behavior is beta); I'll increase the minimum pytorch-lightning requirement to 1.3.0 and add conditions to avoid generation with TPU training in another commit.
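
A hypothetical sketch of such a guard (the flag and helper names are assumptions, not the actual follow-up commit):

```python
# Hypothetical guard: skip sample generation while training on TPU,
# since it currently hangs or returns incorrect output there.
def maybe_generate_sample_text(ai, on_tpu: bool) -> None:
    if on_tpu:
        return  # generation during TPU training is disabled for now
    ai.generate(n=1)
```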

kaushikb11 marked this pull request as ready for review on April 11, 2021, 19:38
@minimaxir (Owner)

Rerunning the notebook with the all_gather fix on PL master results in an interesting error:

[Screenshot of the error traceback, 2021-04-17]

I wonder if that explains the core discrepancy above? Either way, 8 TPU cores does not run at all.

@kaushikb11 (Contributor, Author)

Hi @minimaxir, there was a bit of refactoring for DDP Spawn last week, and as TPU Spawn is a subclass of it, it was affected. This has been resolved in Lightning-AI/pytorch-lightning#7074.

@minimaxir (Owner)

As of pytorch-lightning 1.3, training now works and completes without hangs! However, the generation is still not correct, implying the model does not go back to the CPU. That should hopefully be debuggable on my end.

Given that, plus the fact that this PR has some fixes for #129, I'll merge this now, but I won't advertise TPU support until it's more hardened and there's a more up-to-date test notebook.

Thanks, and sorry for the late update!

@kaushikb11 (Contributor, Author)

Thanks for the update!!

@swcrazyfan commented Jun 11, 2021

> As of pytorch-lightning 1.3, training now works and completes without hangs! However, the generation is still not correct, implying the model does not go back to the CPU. That should hopefully be debuggable on my end.
>
> Given that, plus the fact that this PR has some fixes for #129, I'll merge this now, but I won't advertise TPU support until it's more hardened and there's a more up-to-date test notebook.
>
> Thanks, and sorry for the late update!

Do you have any idea when TPU training might be ready? I'd love to use aitextgen with my TRC TPUs.
