
How to avoid overflow and SIGSEGV with FP16 on A100 #1640

Closed
wengyao04 opened this issue Apr 13, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@wengyao04

Describe the Bug
We are trying to train T5 models on A100 GPUs. If we disable FP16, training works fine. However, we want to enable FP16 to train larger models on the A100. After enabling FP16, we observe the following:

  • With the T5 model with 770M parameters, we see overflow:
    [2023-04-13 20:12:05,941] [INFO] [fused_optimizer.py:383:_update_scale] Reducing dynamic loss scale from 256.0 to 128.0
    [2023-04-13 20:12:05,941] [INFO] [logging.py:68:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 256.0, reducing to 128.0
    
  • With the T5 model with 3B parameters, we get a SIGSEGV. Debugging the core dump with gdb, it crashes at:
    #0  0x00007f8f742dac4b in void c10::function_ref<void (char**, long const*, long, long)>::callback_fn<at::native::AVX2::VectorizedLoop2d<at::native::AVX2::direct_copy_kernel(at::TensorIteratorBase&)::{lambda()#3}::operator()() const::{lambda()#8}::operator()() const::{lambda(long)#1}, at::native::AVX2::direct_copy_kernel(at::TensorIteratorBase&)::{lambda()#3}::operator()() const::{lambda()#3}::operator()() const::{lambda(at::vec::AVX2::Vectorized<long>)#2}> >(long, char**, long const*, long, long) () from /opt/bb/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
    

Minimal Steps/Code to Reproduce the Bug
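The full training script is too large to include here. As a rough sketch of the setup (model name, batch size, and config values below are illustrative assumptions, not our exact code):

```python
# Hedged sketch of the training setup; values and names are illustrative, not our production code.
import deepspeed
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large")  # ~770M-parameter variant (assumed)

ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # assumed value
    "fp16": {
        "enabled": True,                  # enabling fp16 is what triggers the problem
        "loss_scale": 0,                  # 0 = dynamic loss scaling (see the log lines above)
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}

# Training loop with engine.backward(loss) / engine.step() omitted for brevity.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)
```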

Expected Behavior

Environment

Our dependencies are:

"torch==1.13.0",
"transformers==4.19.2",
"deepspeed[autotuning]==0.8.0",
"sentencepiece",
"protobuf==3.20.1",
"mpi4py",

Our CUDA version is 11.7.1 and the CUDA driver version is 50.47.03.

GPU product type is NVIDIA-A100-SXM4-80GB.
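As a sanity check (not part of the original report), bf16 support on the A100 can be confirmed with a short PyTorch snippet:

```python
import torch

# The A100 (compute capability 8.0) supports bf16 natively,
# so is_bf16_supported() should return True on this hardware.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA A100-SXM4-80GB"
print(torch.cuda.is_bf16_supported())  # True on Ampere and newer GPUs
```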

@wengyao04 wengyao04 added the bug Something isn't working label Apr 13, 2023
@wengyao04
Author

The pre-trained model was trained with bf16, so the floating-point values in the pre-trained weights can overflow when fine-tuning in fp16. We fine-tuned the model with bf16 and did not see any issues.
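For anyone hitting the same issue, a minimal sketch of the config change that worked for us, assuming a standard DeepSpeed config (only the relevant section shown; other values are illustrative):

```python
# Hedged sketch: swap the fp16 section for bf16 in the DeepSpeed config.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # illustrative value
    "bf16": {
        "enabled": True  # bf16 has the same exponent range as fp32, so no loss scaling is needed
    }
    # fp16 section removed: fp16's narrow range overflows with bf16-pretrained weights
}
```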
