Describe the Bug
We are trying to train a T5 model on A100 GPUs. With FP16 disabled, training works fine. However, we want to enable FP16 in order to train large models on the A100. After enabling FP16, we observe the following:
With the T5 model with 770M parameters, we see overflow:
[2023-04-13 20:12:05,941] [INFO] [fused_optimizer.py:383:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0
[2023-04-13 20:12:05,941] [INFO] [logging.py:68:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 256.0, reducing to 128.0
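The log above shows dynamic loss scaling in action: when an overflow is detected, the step is skipped and the scale is halved. A minimal sketch of that rule in plain Python follows. It is an illustration under stated assumptions, not the optimizer's actual code: `update_loss_scale`, `factor`, and `min_scale` are hypothetical names, and the usual step that grows the scale again after a run of clean steps is omitted.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def update_loss_scale(scale, grads, factor=2.0, min_scale=1.0):
    """Return (new_scale, skip_step) for one optimizer step.

    Hypothetical helper: if any gradient is inf/NaN or exceeds the fp16
    range, the step is skipped and the scale is divided by `factor`.
    """
    overflow = any(
        math.isinf(g) or math.isnan(g) or abs(g) > FP16_MAX for g in grads
    )
    if overflow:
        return max(scale / factor, min_scale), True
    return scale, False


# One overflowing gradient halves the scale, as in the log (256.0 -> 128.0).
scale, skipped = update_loss_scale(256.0, [0.5, float("inf")])
print(scale, skipped)
```

If overflow is reported on every step, the scale keeps shrinking toward `min_scale`, which usually indicates that the values themselves do not fit in fp16 rather than that the scale is too high.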
With the T5 model with 3B parameters, we get a SIGSEGV. Debugging the core dump with gdb shows that it crashes at
#0 0x00007f8f742dac4b in void c10::function_ref<void (char**, long const*, long, long)>::callback_fn<at::native::AVX2::VectorizedLoop2d<at::native::AVX2::direct_copy_kernel(at::TensorIteratorBase&)::{lambda()#3}::operator()() const::{lambda()#8}::operator()() const::{lambda(long)#1}, at::native::AVX2::direct_copy_kernel(at::TensorIteratorBase&)::{lambda()#3}::operator()() const::{lambda()#3}::operator()() const::{lambda(at::vec::AVX2::Vectorized<long>)#2}> >(long, char**, long const*, long, long) () from /opt/bb/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
The pre-trained model was trained with bf16, so its floating-point values might overflow when fp16 is used for fine-tuning. We fine-tuned the model with bf16 and did not see any issues.
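To illustrate why a bf16 value may not survive in fp16, here is a small standard-library sketch. bf16 keeps the 8-bit exponent of float32, while fp16 has a 5-bit exponent and a maximum finite value of 65504. The helpers `to_bf16` and `fits_in_fp16` are hypothetical, and the value 100000.0 is an arbitrary example, not a number taken from the actual checkpoint.

```python
import struct


def to_bf16(x):
    """Truncate a float to bfloat16 precision (keeps the top 16 bits of float32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_in_fp16(x):
    """Return True if x can be packed as IEEE half precision without overflow."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


value = to_bf16(100000.0)   # representable in bf16 (as 99840.0)
print(value, fits_in_fp16(value))
print(fits_in_fp16(1.0))
```

A value such as this one is finite in bf16 but exceeds the fp16 range, which is consistent with the overflow messages seen only when FP16 is enabled.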
Minimal Steps/Code to Reproduce the Bug
Expected Behavior
Environment
Our dependencies are
Our CUDA version is 11.7.1, and the CUDA driver version is 50.47.03.
The GPU product type is NVIDIA-A100-SXM4-80GB.