compiler bug: broken random number generator with cce on Derecho. #495

Closed
mjs2369 opened this issue Jun 20, 2023 · 10 comments
Labels: Bug (Something isn't working), Derecho (issues related to running on NCAR's new supercomputer)

Comments

@mjs2369
Contributor

mjs2369 commented Jun 20, 2023

🐛 Your bug may already be reported!

Describe the bug

The random number generator code in DART will not compile with cce on Derecho.
Edit: the code compiles, but gives incorrect results
More specifically, the subroutines init_ran, ran_unif, ran_gauss, and ran_gamma are all incompatible with cce. I believe this is because they all make use of code from the GNU Scientific Library:

[two images not preserved]

  1. List the steps someone needs to take to reproduce the bug.
    Run ./filter with any model with "perturb_from_single_instance = .true." in the namelist OR run ./test_gaussian or ./test_gamma in DART/developer_tests/random_seq/work.

  2. What was the expected outcome?
    The executables run successfully.

  3. What actually happened?
    A run-time error halts the execution.

Error Message

Please provide any error messages.

ERROR FROM:
source : random_seq_mod.f90
routine: ran_gauss
message: if both x and y are -1, random number generator probably not initialized
message: ... x, y = -3510081565.7593699, -295496494.86667526

[three images not preserved]

actual mean should be close to .50
[image not preserved]

Which model(s) are you working with?

All models, also the test_gaussian, test_random, and test_gamma developer tests in DART/developer_tests/random_seq/work.

Version of DART

Which version of DART are you using?
You can find the version using git describe --tags

v10.7.3

Have you modified the DART code?

No

Build information

Please describe:

  1. The machine you are running on (e.g. windows laptop, NCAR supercomputer Cheyenne).
  2. The compiler you are using (e.g. gnu, intel).

Derecho, cce

@mjs2369 mjs2369 added the Bug Something isn't working label Jun 20, 2023
@NCAR NCAR deleted a comment from nancycollins Jun 21, 2023
@hkershaw-brown
Member

@mjs2369 just chatting to Jeff about this.
A better test to look at what is going on is to generate the sequence of random numbers from a given seed. So k is what we are interested in:

! at this point we have an integer value for k
! this routine returns 0.0 <= real < 1.0, so do
! the divide here. return range: [0,1).
ran_unif = real(real(k, digits12) / 4294967296.0_digits12, r8)

and this should be the same across compilers.

hkershaw-brown added a commit that referenced this issue Jun 21, 2023
hkershaw-brown added a commit that referenced this issue Jun 21, 2023
@hkershaw-brown
Member

For the curious: gfortran, intel on my mac and derecho give this for k. 11 numbers seeded with 13:

k= 3340206418
k= 2608511152
k= 1020231754
k= 3691240976
k= 3540249318
k= 3835331426
k= 4147861236
k= 769458329
k= 4177289964
k= 3258093498
k= 1947549667

cce on derecho gives:
k= -5939786187531372199
k= -7603175559541411156
k= 2499092022097743661
k= -6392185873013553955
k= 1418358412448069790
k= 1601992904522816967
k= 4918056359950545492
k= -7859870468495140367
k= -5366954201424499693
k= 4633693547982415675
k= -5357398119243707470
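
A quick way to read the two lists above: the generator works on 32-bit values, so every valid k should lie in [0, 2**32) before the divide in ran_unif. The sketch below is not DART code. It takes the first k from each list and applies that range check, using int64 and real64 from iso_fortran_env as stand-ins for DART's integer and real kinds (an assumption for illustration).

```fortran
program check_k_range
use iso_fortran_env, only : int64, real64
implicit none

! 2**32, the divisor used in ran_unif
integer(int64), parameter :: two32 = 4294967296_int64

! first k from the gfortran/intel list and first k from the cce list
integer(int64), parameter :: k(2) = [ 3340206418_int64, &
                                     -5939786187531372199_int64 ]
integer :: i

do i = 1, size(k)
   if (k(i) >= 0_int64 .and. k(i) < two32) then
      ! in range: the divide gives a value in [0,1)
      write(*, '(a, i20, a, f12.10)') 'k=', k(i), ' in range, k/2**32 = ', &
         real(k(i), real64) / 4294967296.0_real64
   else
      ! out of range: the divide cannot give a value in [0,1)
      write(*, '(a, i20, a)') 'k=', k(i), ' outside [0, 2**32)'
   end if
end do

end program check_k_range
```

A k outside that range would also explain the huge x, y values in the ran_gauss error message, since they are built from ran_unif results that are no longer in [0,1).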

@hkershaw-brown
Copy link
Member

Chatting to Marlee, we think this might be a compiler bug:

hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ module load intel
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ cat boz_dart.f90 
program boz_dart

implicit none

integer, parameter :: i8 = SELECTED_INT_KIND(13)

! hexadecimal constants
integer(i8), parameter :: UPPER_MASK  = int(z'0000000080000000', i8) 
integer(i8), parameter :: LOWER_MASK  = int(z'000000007FFFFFFF', i8) 
integer(i8), parameter :: FULL32_MASK = int(z'00000000FFFFFFFF', i8) 
integer(i8), parameter :: magic       = int(z'000000009908B0DF', i8) 
integer(i8), parameter :: C1          = int(z'000000009D2C5680', i8) 
integer(i8), parameter :: C2          = int(z'00000000EFC60000', i8) 


write(*, '(a, i20, 1x, z16)') "UPPER_MASK  =", UPPER_MASK, UPPER_MASK
write(*, '(a, i20, 1x, z16)') "LOWER_MASK  =", LOWER_MASK, LOWER_MASK
write(*, '(a, i20, 1x, z16)') "FULL32_MASK =", FULL32_MASK, FULL32_MASK
write(*, '(a, i20, 1x, z16)') "magic       =", magic, magic
write(*, '(a, i20, 1x, z16)') "C1          =", C1, C1
write(*, '(a, i20, 1x, z16)') "C2          =", C2, C2

end program boz_dart
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ module load intel
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ ftn boz_dart.f90 
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ ./a.out 
UPPER_MASK  =          2147483648         80000000
LOWER_MASK  =          2147483647         7FFFFFFF
FULL32_MASK =          4294967295         FFFFFFFF
magic       =          2567483615         9908B0DF
C1          =          2636928640         9D2C5680
C2          =          4022730752         EFC60000
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ module load cce

Lmod is automatically replacing "intel/2023.0.0" with "cce/15.0.1".


Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.25     2) hdf5/1.12.2     3) ncarcompilers/1.0.0     4) netcdf/4.9.2

hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ ftn boz_dart.f90 
hkershaw@derecho6:/glade/derecho/scratch/hkershaw/test_code$ ./a.out 
UPPER_MASK  =         -2147483648 FFFFFFFF80000000
LOWER_MASK  =          2147483647         7FFFFFFF
FULL32_MASK =                  -1 FFFFFFFFFFFFFFFF
magic       =         -1727483681 FFFFFFFF9908B0DF
C1          =         -1658038656 FFFFFFFF9D2C5680
C2          =          -272236544 FFFFFFFFEFC60000
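
The cce values look like what you would get if each constant were treated as a signed 32-bit integer and then widened to 64 bits: every constant with bit 31 set comes out exactly 2**32 too small, and LOWER_MASK (bit 31 clear) is unaffected. Below is a minimal sketch that checks this reading, starting from the decimal values in the Intel output above. The program and variable names are made up for illustration, and this is an interpretation of the output, not a confirmed diagnosis of the compiler.

```fortran
program boz_sign_extension
use iso_fortran_env, only : int64
implicit none

! expected values, copied from the Intel output above
integer(int64), parameter :: expected(6) = [ &
   2147483648_int64, 2147483647_int64, 4294967295_int64, &
   2567483615_int64, 2636928640_int64, 4022730752_int64 ]
character(len=11), parameter :: names(6) = [ character(len=11) :: &
   'UPPER_MASK', 'LOWER_MASK', 'FULL32_MASK', 'magic', 'C1', 'C2' ]

integer(int64), parameter :: two31 = 2147483648_int64   ! 2**31
integer(int64), parameter :: two32 = 4294967296_int64   ! 2**32
integer(int64) :: as_signed32
integer :: i

do i = 1, size(expected)
   ! reinterpret the low 32 bits as a signed 32-bit integer, then widen
   if (expected(i) >= two31) then
      as_signed32 = expected(i) - two32
   else
      as_signed32 = expected(i)
   end if
   write(*, '(a11, a, i20, 1x, z16)') names(i), ' =', as_signed32, as_signed32
end do

end program boz_sign_extension
```

The decimal column this prints should match the cce output above line for line. Writing the constants as decimal literals with a kind suffix, as in `expected` here, avoids calling int() on a BOZ literal altogether and might serve as a workaround on cce/15, though that is untested in this thread.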

@hkershaw-brown hkershaw-brown added the Derecho issues related to running on NCAR's new supercomputer label Jun 26, 2023
@hkershaw-brown
Member

@mjs2369 Hi Marlee, did a bug report for this get sent to cray (by you or CISL help)?

@mjs2369
Contributor Author

mjs2369 commented Jul 10, 2023

@hkershaw-brown I don't believe so. I have a request in with CISL help, currently under "Support wait", where they said they were going to reach out to their contact for any input/fixes, but I haven't heard back from them. I just added another comment to the request to see if there are any updates.

@hkershaw-brown hkershaw-brown changed the title bug: broken random number generator with cce on Derecho. compiler bug: broken random number generator with cce on Derecho. Jul 18, 2023
@mjs2369
Contributor Author

mjs2369 commented Aug 16, 2023

@hkershaw-brown

Update on this issue - CISL Support responded to my request after contacting HPE/Cray

This bug was patched in the latest release of CCE. CISL IT is working to get this installed once HPE has fixed more bugs that others have reported as well. Once a 16.x.x version has been added to the stack on Derecho, I will revisit this issue to test and hopefully close it.

In the meantime, we will need to keep using Intel to use the random number generator code, and therefore perturb_from_single_instance, on Derecho.

@hkershaw-brown
Member

No new version of CCE on Derecho as of Jan 2024.
Closing as this is a CCE bug rather than a DART bug.

@hkershaw-brown hkershaw-brown closed this as not planned Jan 4, 2024
@hkershaw-brown
Member

@c-merchant A new 🎉 cce compiler version, cce/16.0.1, is now available on Derecho. Can you give your ran_unif test a spin on Derecho with this new compiler?

For reference, here's your pull request with the random number test: #549
Let's see if cce/16 has the bug fixed.

@hkershaw-brown
Member

This bug is fixed in cce/16.0.1, now available on Derecho.
