support layer decay and different lr for text/visual encoder #268

Open
wants to merge 7 commits into main

Conversation

Quan-Sun
Contributor

Signed-off-by: quansun [email protected]
Add new features:

  • layer-wise learning rate decay, with the decay value depending on layer depth
  • support for different learning rates for the text and visual encoders
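For readers unfamiliar with the idea, here is a minimal sketch of what these two features amount to in plain PyTorch. The helper names and hyperparameter values below are illustrative, not the actual flags or code added in this PR:

```python
import torch
import torch.nn as nn

def layer_decay_scales(num_layers: int, decay: float):
    # deepest block keeps scale 1.0; shallower blocks get geometrically smaller lrs
    return [decay ** (num_layers - 1 - i) for i in range(num_layers)]

def build_param_groups(blocks: nn.ModuleList, base_lr: float, decay: float):
    # one optimizer param group per transformer block, lr scaled by depth
    scales = layer_decay_scales(len(blocks), decay)
    return [
        {"params": block.parameters(), "lr": base_lr * scale}
        for block, scale in zip(blocks, scales)
    ]

# toy stand-ins for the two towers; a different base lr and decay per tower
visual_blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(4))
text_blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(2))
param_groups = (
    build_param_groups(visual_blocks, base_lr=4e-4, decay=0.75)
    + build_param_groups(text_blocks, base_lr=4e-5, decay=0.65)
)
optimizer = torch.optim.AdamW(param_groups)
```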

@Quan-Sun
Contributor Author

Quan-Sun commented Dec 7, 2022

Just a follow-up. Is anyone taking a look?

@gabrielilharco
Collaborator

Hi @Quan-Sun, thanks for the PR! Can you give more context on why this is helpful? Have you observed better results with either of these changes?

@Quan-Sun
Contributor Author

Hello Gabriel, thanks for your reply! This PR adds layer decay and separate learning rates for the text/visual encoders. Layer-wise learning rate decay is a common trick when training a big model initialized from pre-trained weights. The text and visual encoders are inherently different because image and text data differ, so different learning rates and different layer decay values should be applied to each.
We use this to train EVA-CLIP, the largest performant open-sourced CLIP model as measured by zero-shot classification, which achieves state-of-the-art top-1 accuracy on ImageNet-1K among self-supervised learning approaches.
I also created a PR for the EVA-CLIP model (#265).
In #264, I add DeepSpeed and ZeRO stage 1 optimization. We can train with a 41k batch size on only 256 GPUs (A100-40G) using gradient checkpointing, DeepSpeed fp16, and ZeRO stage 1.
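For context, the ZeRO stage 1 + fp16 part of that setup looks roughly like the DeepSpeed config below. This is an illustrative sketch, not the exact config from #264: the per-GPU batch size is just ~41k split over 256 GPUs, and the toy model stands in for the real CLIP model.

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # stand-in for the actual CLIP model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

ds_config = {
    "train_micro_batch_size_per_gpu": 160,   # roughly 41k global batch over 256 GPUs
    "fp16": {"enabled": True},               # deepspeed fp16
    "zero_optimization": {"stage": 1},       # ZeRO stage 1: shard optimizer states
    "gradient_clipping": 1.0,
}

# requires launching under the deepspeed launcher / an initialized distributed env
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config,
)
```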

@Quan-Sun
Contributor Author

P.S. the batch size can reach 57k when using gradient checkpointing, DeepSpeed fp16, ZeRO stage 1, and local loss.

@rwightman
Collaborator

rwightman commented Dec 10, 2022

@Quan-Sun @gabrielilharco I think the changes are reasonable. I use layer decay extensively in timm fine-tuning these days, and I feel it'd be useful here, especially when initializing one or both of the towers w/ pretrained weights.

Also, I was thinking of pushing optimizer creation into a factory method as well, since I wanted to try some other optimizers at some point and that'd make it a bit cleaner.

Question (for Quan-Sun): why are the assigners in main? They're just created and passed to the optim factory; wouldn't it be cleaner to create them in the factory, since they're not needed in main?
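To make that concrete, a rough sketch of the kind of factory being suggested, reusing the PR's own helpers (LayerDecayValueAssigner, get_all_parameters). The function name create_optimizer, the text-tower handling, and the exact args fields are hypothetical:

```python
import torch

def create_optimizer(args, model):
    # build the lr assigners here instead of in main(), then hand everything to the optimizer
    assigner_visual = None
    if args.visual_ld != 1.0:
        num_layers = model.visual.get_num_layers()
        assigner_visual = LayerDecayValueAssigner(
            [args.visual_ld ** (num_layers + 1 - i) for i in range(num_layers + 2)]
        )
    assigner_text = None  # symmetric logic for the text tower would go here

    parameters = get_all_parameters(args, model, assigner_visual, assigner_text)
    return torch.optim.AdamW(
        parameters,
        lr=args.lr,
        betas=(args.beta1, args.beta2),
        eps=args.eps,
        weight_decay=args.wd,
    )
```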


if visual_ld < 1.0:
    visual_num_layers = model_without_ddp.visual.get_num_layers()
    assigner_visual = LayerDecayValueAssigner(
        list(visual_ld ** (visual_num_layers + 1 - i) for i in range(visual_num_layers + 2))
    )
Collaborator

Consider using a more descriptive variable name here, explaining what is being assigned (e.g. lr_assigner_visual). Also, is there any reason why the exponent logic is not inside the Assigner class? I.e. the constructor could take in the layer decay and number of layers and compute the values accordingly.
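E.g. something like the following (a hypothetical constructor, just to illustrate moving the exponent logic inside the class):

```python
class LayerDecayValueAssigner:
    def __init__(self, layer_decay: float, num_layers: int):
        # compute the per-layer lr scales here instead of at every call site
        self.values = [
            layer_decay ** (num_layers + 1 - i) for i in range(num_layers + 2)
        ]

    def get_scale(self, layer_id: int) -> float:
        return self.values[layer_id]

# the call site then shrinks to:
# lr_assigner_visual = LayerDecayValueAssigner(visual_ld, visual_num_layers)
```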

Collaborator

+1 on that, plus getting this out of main

else:
    assigner_visual = None

if text_ld < 1.0:
Collaborator

should this be != 1.0?

Collaborator

same for visual_ld above

"text.token_embedding",
"text.transformer.embeddings.word_embeddings",
"text.transformer.embeddings.position_embeddings",
"text.transformer.embeddings.token_type_embeddings"
Collaborator

will this work with all architectures we support? E.g. text models from HF

Collaborator

@gabrielilharco No, that's something else that needs to be addressed. As implemented, it only works for the built-in ViT + text transformer; it needs to at least detect whether it will work and warn that it cannot be used for other models...

It'd be very useful for pretrained timm and HF text models. timm has functions that can calculate the layer decay, but they need to be called if a timm model is used (that can be a different PR), and I'm not sure if HF has any built-in support for calculating layer decay (discriminative LR) in a general way...

parameters = get_all_parameters(args, model, assigner_visual, assigner_text)

optimizer_args = dict(
    betas=(args.beta1, args.beta2),
Collaborator

nit: extra tab?

@gabrielilharco
Collaborator

Thanks for the context @Quan-Sun @rwightman. I agree with @rwightman re. the assigners, and also left some other minor comments there. I'll test it from my side after the changes. Thanks!

@rwightman
Collaborator

FWIW, timm's LD impl is here: https://github.com/rwightman/pytorch-image-models/blob/e98c93264cde1657b188f974dc928b9d73303b18/timm/optim/optim_factory.py#L92-L153 ... all models have a fn that returns the group metadata, and the grouper fn can be used, so that would be the basis for applying LD to the vision tower if a timm tower is used.
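For reference, using that grouper on a timm-based vision tower would look roughly like the snippet below. The function name and signature are taken from the linked optim_factory.py and may differ across timm versions; the model name and hyperparameters are illustrative.

```python
import timm
import torch
from timm.optim.optim_factory import param_groups_layer_decay

visual = timm.create_model("vit_base_patch16_224", pretrained=False)
base_lr = 5e-4

# the groups carry an "lr_scale" per depth; fold it into an explicit lr per group here
# (timm's own scheduler applies lr_scale automatically, so this step is just for clarity)
groups = param_groups_layer_decay(visual, weight_decay=0.05, layer_decay=0.75)
for g in groups:
    g["lr"] = base_lr * g["lr_scale"]

optimizer = torch.optim.AdamW(groups, lr=base_lr)
```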

@Quan-Sun
Contributor Author

@gabrielilharco @rwightman Thanks for your comments. I will work on these changes ASAP.

@gabrielilharco
Collaborator

Thanks for the update @Quan-Sun. IIUC this would still only work for the built-in ViT + text transformers, is that right? As Ross pointed out, we should at least detect when this is not the case and warn users whose models are not supported.

@Quan-Sun
Contributor Author

Hi @gabrielilharco. You are right, get_num_layer_for_transformer(...) is not flexible. It should warn users if the model is not supported. Do you think we can have a white list here? For example, white_list = ["visual.blocks", "visual.transformer.resblocks", "text.transformer.resblocks", "text.transformer.encoder.layer"], and then detect whether the model_param_name is in this white_list.
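A sketch of that check (purely illustrative; the prefixes are the ones listed above):

```python
white_list = (
    "visual.blocks",
    "visual.transformer.resblocks",
    "text.transformer.resblocks",
    "text.transformer.encoder.layer",
)

def supports_layer_decay(model_param_name: str) -> bool:
    # only parameter names under a recognized block stack get a per-layer lr scale
    return any(model_param_name.startswith(prefix) for prefix in white_list)

# e.g. built-in ViT blocks match, while an unlisted tower (say a convnet trunk) does not
assert supports_layer_decay("visual.transformer.resblocks.3.mlp.c_fc.weight")
assert not supports_layer_decay("visual.trunk.stages.0.blocks.0.conv_dw.weight")
```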

@gabrielilharco
Collaborator

Yes, that should work. If we're being very conservative we could also whitelist specific args.model values; the downside is we'd need to update the list whenever we add new models.

Quan-Sun and others added 2 commits December 22, 2022 23:04
add a white_list then detecting if the model_param_name is in this white_list
@gabrielilharco
Collaborator

Thanks for the update @Quan-Sun! Could you check whether “Allow edits from maintainers.” is checked on your side? I want to make some small changes before merging.

@Quan-Sun
Contributor Author

Hi @gabrielilharco. I have checked "Allow edits from maintainers." on my side. Please let me know if anything was missed.
