
Out of Dynamic GPU memory in EnsembleGPUKernel for higher number of threads when using ContinuousCallback #219

martin-abrudsky opened this issue Jan 9, 2023 · 6 comments

@martin-abrudsky

Hello, I was testing the new updates to terminate! with EnsembleGPUKernel. It works fine with DiscreteCallback; however, when using ContinuousCallback I still get the problem of running out of dynamic GPU memory in EnsembleGPUKernel for a higher number of threads. I attach the code used:

using StaticArrays
using CUDA
using DiffEqGPU
using NPZ
using OrdinaryDiffEq
using Plots

"""
     pot_central(u,p,t)
     u=[x,dx,y,dy]
     p=[k,m]
"""
function pot_central(u,p,t)
      r3 = ( u[1]^2 + u[3]^2 )^(3/2)
     du1 = u[2]                           # u[2]= dx
     du2 =  -( p[1]*u[1] ) / ( p[2]*r3 )    
     du3 = u[4]                           # u[4]= dy
     du4 =  -( p[1]*u[3] ) / ( p[2]*r3 ) 

     return SVector{4}(du1,du2,du3,du4)
end

T = 100.0
k = 1.0
m = 1.0
trajectories = 5_000
u_rand = convert(Array{Float64}, npzread("IO_GPU/IO_u0.npy"))

u0    = @SVector [2.0, 2.0, 1.0, 1.5]
p     = @SVector [k, m]
tspan = (0.0, T)

prob = ODEProblem{false}(pot_central, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, u0 = SVector{4}(u_rand[i, :]) .* u0 + @SVector [1.0, 1.0, 1.0, 1.0])
Ensemble_Problem = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)


function condition(u, t, integrator)
    R2 = @SVector [4.5, 5_000.0]          # R2 = [Rmin^2, Rmax^2]
    r2 = u[1] * u[1] + u[3] * u[3]
    (R2[2] - r2) * (r2 - R2[1])           # < 0.0
end

affect!(integrator) = terminate!(integrator)
gpu_cb = ContinuousCallback(condition, affect!;
                            save_positions = (false, false), rootfind = true,
                            interp_points = 0, abstol = 1e-7, reltol = 0)
# gpu_cb = DiscreteCallback(condition, affect!; save_positions = (false, false))

CUDA.@time sol = solve(Ensemble_Problem,
                       GPUTsit5(),
                       # GPUVern7(),
                       # GPUVern9(),
                       EnsembleGPUKernel(),
                       trajectories = trajectories,
                       batch_size = 10_000,
                       adaptive = false,
                       dt = 0.01,
                       save_everystep = false,
                       callback = gpu_cb,
                       merge_callbacks = true)
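Note: IO_GPU/IO_u0.npy is not attached to the issue. As a hypothetical stand-in (an assumption, only to make the snippet above self-contained), the npzread line can be replaced with random perturbation factors of the same shape:

# Hypothetical stand-in for IO_GPU/IO_u0.npy (not part of the original report):
# prob_func above indexes u_rand[i, :] for i = 1:trajectories, so any
# trajectories x 4 Float64 matrix works.
u_rand = rand(trajectories, 4)
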
@ChrisRackauckas
Member

What GPU? A100? Is it just the memory scaling? Is it fine with a higher dt?

@martin-abrudsky
Author

martin-abrudsky commented Jan 9, 2023

The GPU is an A30. This is the error that comes out for trajectories = 50_000 and dt = 0.1:

ERROR: Out of dynamic GPU memory (trying to allocate 912 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 912 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 912 bytes)
...
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
...
ERROR: a (null) was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
...
Excessive output truncated after 542774 bytes.

KernelException: exception thrown during kernel execution on device NVIDIA A30

Stacktrace:
  [1] check_exceptions()
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/exceptions.jl:34
  [2] synchronize(stream::CuStream; blocking::Nothing)
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/lib/cudadrv/stream.jl:134
  [3] synchronize
    @ ~/.julia/packages/CUDA/Ey3w2/lib/cudadrv/stream.jl:121 [inlined]
  [4] (::CUDA.var"#185#186"{SVector{4, Float64}, Matrix{SVector{4, Float64}}, Int64, CuArray{SVector{4, Float64}, 2, CUDA.Mem.DeviceBuffer}, Int64, Int64})()
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/array.jl:420
  [5] #context!#63
    @ ~/.julia/packages/CUDA/Ey3w2/lib/cudadrv/state.jl:164 [inlined]
  [6] context!
    @ ~/.julia/packages/CUDA/Ey3w2/lib/cudadrv/state.jl:159 [inlined]
  [7] unsafe_copyto!(dest::Matrix{SVector{4, Float64}}, doffs::Int64, src::CuArray{SVector{4, Float64}, 2, CUDA.Mem.DeviceBuffer}, soffs::Int64, n::Int64)
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/array.jl:406
  [8] copyto!
    @ ~/.julia/packages/CUDA/Ey3w2/src/array.jl:360 [inlined]
  [9] copyto!
    @ ~/.julia/packages/CUDA/Ey3w2/src/array.jl:364 [inlined]
 [10] copyto_axcheck!(dest::Matrix{SVector{4, Float64}}, src::CuArray{SVector{4, Float64}, 2, CUDA.Mem.DeviceBuffer})
    @ Base ./abstractarray.jl:1127
 [11] Array
    @ ./array.jl:626 [inlined]
...
    @ ~/.julia/packages/CUDA/Ey3w2/src/utilities.jl:25 [inlined]
 [18] top-level scope
    @ ~/.julia/packages/CUDA/Ey3w2/src/pool.jl:490 [inlined]
 [19] top-level scope
    @ ~/FAMAF/Beca_CIN_Trabajo_Final/skymap/GPU_Julia/pot_central_GPU_Float64.ipynb:0

@ChrisRackauckas
Member

Smaller batches or higher dt? Did you calculate out the batch memory size requirement?
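For reference, a rough sketch of that estimate for the saved-solution buffers (assuming SVector{4,Float64} states and save_everystep = false, as in the script above); note it does not account for the per-thread dynamic allocations that the error message points at:

# Back-of-the-envelope batch memory estimate (a sketch, not DiffEqGPU's exact accounting).
state_bytes  = 4 * sizeof(Float64)        # one SVector{4,Float64} state = 32 bytes
saved_states = 2                          # save_everystep = false keeps only start and end
batch_size   = 10_000
batch_MiB    = batch_size * saved_states * state_bytes / 1024^2
println("≈ ", batch_MiB, " MiB of saved states per batch")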

@martin-abrudsky
Author

For trajectories = 5_000 and dt = 0.1, the first time I ran the code it worked, but the second time I got the error.

Using DiscreteCallback, I tested it with trajectories = 10_000_000 and dt = 0.01 and it works fine. In version 1.24 of the library I had the same error.
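For comparison, a minimal sketch of the DiscreteCallback setup being referred to (using the "< 0.0" check that is commented out in the condition above; discrete_condition and gpu_cb_discrete are illustrative names):

# DiscreteCallback variant: the condition must return a Bool, so the zero-crossing
# expression from the ContinuousCallback condition is compared against 0.0 here.
function discrete_condition(u, t, integrator)
    R2 = @SVector [4.5, 5_000.0]
    r2 = u[1] * u[1] + u[3] * u[3]
    (R2[2] - r2) * (r2 - R2[1]) < 0.0
end
gpu_cb_discrete = DiscreteCallback(discrete_condition, affect!; save_positions = (false, false))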

@martin-abrudsky
Author

It also fails for trajectories = 5_000, dt = 0.1, and batch_size = 1_000.

@maleadt
Contributor

maleadt commented Jan 9, 2023

This happens due to an allocation within a kernel (in the case of StaticArrays code, typically due to escape analysis going wrong). You can spot it by prefixing code that launches kernels with @device_code_llvm dump_module=true and looking for calls to @gpu_gc_pool_alloc or @gpu_malloc:

julia> @device_code_llvm dump_module=true solve(Ensemble_Problem,
                                       GPUTsit5(),
                                       #GPUVern7(),
                                       #GPUVern9(),
                                       EnsembleGPUKernel(),
                                       trajectories = trajectories,
                                       batch_size = 10_000,
                                       adaptive = false,
                                       dt = 0.01,
                                       save_everystep = false,
                                       callback = gpu_cb,
                                       merge_callbacks = true
                                       )
;  @ /home/tim/Julia/depot/packages/DiffEqGPU/JlHvl/src/perform_step/gpu_tsit5_perform_step.jl:85 within `tsit5_kernel`
; ┌ @ /home/tim/Julia/depot/packages/DiffEqGPU/JlHvl/src/integrators/types.jl:320 within `gputsit5_init`
; │┌ @ /home/tim/Julia/depot/packages/DiffEqGPU/JlHvl/src/integrators/types.jl:13 within `GPUTsit5Integrator`
    %31 = call fastcc {}* @gpu_gc_pool_alloc([1 x i64] %state, i64 912)
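
A minimal sketch for capturing that (very large) dump to a file so it can be searched afterwards; the file name kernel_dump.ll is arbitrary, and whether redirect_stdout is needed depends on how the dump is being viewed:

# Sketch: write the IR dump to a file so it can be grepped for device-side allocations.
open("kernel_dump.ll", "w") do io
    redirect_stdout(io) do
        @device_code_llvm dump_module=true solve(Ensemble_Problem, GPUTsit5(),
            EnsembleGPUKernel(); trajectories = trajectories, dt = 0.01,
            adaptive = false, save_everystep = false,
            callback = gpu_cb, merge_callbacks = true)
    end
end
# then e.g. from a shell: grep -n "gpu_gc_pool_alloc\|gpu_malloc" kernel_dump.ll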
