BUG: Illegal memory access error when using coarsening in X2 direction on GPU #226

Closed
Anto6453 opened this issue Feb 2, 2024 · 3 comments

@Anto6453
Contributor

Anto6453 commented Feb 2, 2024

Describe the issue:

When using coarsening in the X2 direction (corresponding to $\theta$ in spherical coordinates), I get an illegal memory access error when running Idefix on GPU (seen on A100 and RTX 2080 SUPER). The same configuration works on CPU.

In my case, I use a 2D axisymmetric spherical grid, with the radial grid split into a uniform grid in the inner domain and a logarithmic grid in the outer domain. To prevent the cells of the uniform $r$ grid from becoming too thin in the $\theta$ direction close to the center of the domain, I need to coarsen the grid in the $\theta$ direction.

My coarsening function:

void CoarsenFunction(DataBlock &data) {
  // Coarsening levels in the X2 (theta) direction, indexed by (k,i)
  IdefixArray2D<int> coarseningLevel = data.coarseningLevel[JDIR];
  IdefixArray1D<real> r = data.x[IDIR];
  // Local copies, so that the device lambda captures them by value
  real radiusCoars = radiusCoarsGlob;
  int coarsIncrement = coarsIncrementGlob;

  idefix_for("set_coarsening", 0, data.np_tot[KDIR], 0, data.np_tot[IDIR],
    KOKKOS_LAMBDA(int k, int i) {
      if (r(i) <= radiusCoars) {
        // The real-valued result is truncated when stored as an int
        coarseningLevel(k,i) = 1 + log2(radiusCoars/r(i)) + coarsIncrement;
      } else {
        coarseningLevel(k,i) = 1;
      }
    });
  data.coarseningLevel[JDIR] = coarseningLevel;
}

where radiusCoars is the radius that separates the uniform and log grids.
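
To make the levels produced by this function easier to check by hand, here is a minimal host-side sketch of the same formula in plain C++ (no Kokkos). The helper names `coarseningLevelAt` and `cellsPerCoarseCell` are hypothetical and not part of Idefix. The second helper assumes that a level n merges 2^(n-1) fine cells, level 1 meaning no coarsening; treat that as an assumption to verify against the Idefix documentation.

```cpp
#include <cmath>

// Hypothetical host-side copy of the level formula used in the lambda.
// The floating-point result is truncated toward zero when converted to int,
// which is what the implicit conversion in the original assignment does.
int coarseningLevelAt(double r, double radiusCoars, int coarsIncrement) {
  if (r <= radiusCoars) {
    return static_cast<int>(1 + std::log2(radiusCoars / r) + coarsIncrement);
  }
  return 1;  // outside the coarsened region
}

// Assumption (not confirmed in this thread): level n merges 2^(n-1) cells.
int cellsPerCoarseCell(int level) {
  return 1 << (level - 1);
}
```

With the values from the input file (radiusCoars = 1e-2, coars_increment = 1) and the inner radius 2e-4, this gives 1 + log2(50) + 1 ≈ 7.64, truncated to level 7. Note that the level jumps from 1 just outside radiusCoars to 2 at r = radiusCoars as soon as coarsIncrement is 1.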

The error seems to occur when the number of points becomes too large.
For example:
- with one process on one RTX 2080 SUPER, the error appears when the total number of points reaches 2^22 (it works for 2^21 points but not for 2^22 points).
- with two MPI processes (on two RTX 2080 SUPER), the limit also increases by a factor of 2.
- increasing the number of points only in the log grid (which is not coarsened) also seems to trigger it.
- if I remove the coarsening, I don't get this error, whatever the number of points.

It also seems to be triggered more easily when I increase the maximum coarsening level (via coarsIncrement in the coarsening function).

Looking at the log in debug mode, the error seems to appear when Idefix coarsens the magnetic field while keeping divB equal to zero (in the second FLUID_CoarsenFlow_BXsn idefix_for loop in the src/fluid/coarsenFlow.hpp file).

Thanks in advance for your help.

Error message:

----> Profiler::Init...
----> ...returned
                                  .:HMMMMHn:.  ..:n..
                                .H*'``     `'%HM'''''!x.
         :x                    x*`           .(MH:    `#h.
        x.`M                   M>        :nMMMMMMMh.     `n.
         *kXk..                XL  nnx:.XMMMMMMMMMMML   .. 4X.
          )MMMMMx              'M   `^?M*MMMMMMMMMMMM:HMMMHHMM.
          MMMMMMMX              ?k    'X ..'*MMMMMMM.#MMMMMMMMMx
         XMMMMMMMX               4:    M:MhHxxHHHx`MMx`MMMMMMMMM>
         XM!`   ?M                `x   4MM'`''``HHhMMX  'MMMMMMMM
         4M      M                 `:   *>     `` .('MX   '*MMMM'
          MX     `X.nnx..                        ..XMx`     'M*X
           ?h.    ''^'*!Hx.     :Mf     xHMh  M**MMM      4L`
            `*Mx           `'*n.x. 4M>   :M` `` 'M    `       %
             '%                ``*MHMX   X>      !
            :!                    `#MM>  X>      `   :x
           :M                        ?M  `X     .  ..'M
           XX                       .!*X  `x   XM( MMx`h
          'M>::                        `M: `+  MMX XMM `:
          'M> M                         'X    'MMX ?MMk.Xx..
          'M> ?L                     ...:!     MMX.H**'MMMM*h
           M>  #L                  :!'`MM.    . X*`.xHMMMMMnMk.
           `!   #h.      :L           XM'*hxHMM*MhHMMMMMMMMMM'#h
           +     XMh:    4!      x   :f   MM'   `*MMMMMMMMMM%  `X
           M     Mf``tHhxHM      M>  4k xxX'      `#MMMMMMMf    `M .>
          :f     M   `MMMMM:     M>   M!MMM:         '*MMf'     'MH*
          !     Xf   'MMMMMX     `X   X>'h.`          :P*Mx.   .d*~..
        :M      X     4MMMMM>     !   X~ `Mh.      .nHL..M#'%nnMhH!'`
       XM      d>     'X`'**h     'h  M   ^'MMHH+*'`  ''''   `'**'
    %nxM>      *x+x.:. XL.. `k     `::X
:nMMHMMM:.  X>  Mn`*MMMMMHM: `:     ?MMn.
    `'**MML M>  'MMhMMMMMMMM  #      `M:^*x
         ^*MMttnnMMMMMMMMMMMH>.        M:.4X
                        `MMMM>X   (   .MMM:MM!   .
                          `'''4x.dX  +^ `''MMMMHM?L..
                                ``'           `'`'`'`

              Idefix version 2.0.04-00e3db93
              Built against Kokkos 30500


Main: Initialization stage.
----> Grid::Grid(Input)...
----> ...returned
----> GridHost::GridHost(Grid)...
----> ...returned
----> GridHost::MakeGrid...
----> ...returned
----> GridHost::SyncToDevice...
----> ...returned
----> DataBlock::DataBlock...
--------> GridHost::GridHost(Grid)...
--------> ...returned
--------> GridHost::SyncFromDevice...
--------> ...returned
--------> DataBlock::ExtractSubdomain...
------------> idefix_for(coordinates)...
------------> ...returned
------------> idefix_for(coordinates)...
------------> ...returned
------------> idefix_for(coordinates)...
------------> ...returned
--------> ...returned
--------> DataBlock::MakeGeometry()...
------------> idefix_for(init_coarsening)...
------------> ...returned
------------> idefix_for(Volumes)...
------------> ...returned
------------> idefix_for(GeometricalCentersX1)...
------------> ...returned
------------> idefix_for(GeometricalCentersX2)...
------------> ...returned
------------> idefix_for(GeometricalCentersX3)...
------------> ...returned
------------> idefix_for(AreaX1)...
------------> ...returned
------------> idefix_for(AreaX2)...
------------> ...returned
------------> idefix_for(AreaX3)...
------------> ...returned
--------> ...returned
--------> Dump::Init...
--------> ...returned
--------> GridHost::GridHost(Grid)...
--------> ...returned
--------> GridHost::SyncFromDevice...
--------> ...returned
--------> Fluid::Fluid...
------------> StateContainer::PushArray...
------------> ...returned
------------> StateContainer::PushArray...
------------> ...returned
------------> idefix_for(ComputePLMweights)...
------------> ...returned
------------> idefix_for(ComputePLMweights)...
------------> ...returned
------------> ConstrainedTransport::Init...
------------> ...returned
------------> Boundary::Boundary...
Phys MHD
------------> ...returned
--------> ...returned
--------> Gravity::Gravity...
------------> SelfGravity::Init...
----------------> Laplacian::Laplacian...
--------------------> Laplacian::InitInternalGrid...
------------------------> GridHost::GridHost(Grid)...
------------------------> ...returned
------------------------> GridHost::SyncFromDevice...
------------------------> ...returned
------------------------> idefix_for(InternalGridCopy)...
------------------------> ...returned
------------------------> idefix_for(InternalGridFill)...
------------------------> ...returned
------------------------> idefix_for(Volumes)...
------------------------> ...returned
------------------------> idefix_for(AreaX1)...
------------------------> ...returned
------------------------> idefix_for(AreaX2)...
------------------------> ...returned
------------------------> idefix_for(AreaX3)...
------------------------> ...returned
--------------------> ...returned
--------------------> Laplacian::InitPreconditioner...
------------------------> idefix_for(ResetPrecond)...
------------------------> ...returned
------------------------> idefix_for(InitPrecond)...
------------------------> ...returned
------------------------> idefix_for(InitPrecond)...
------------------------> ...returned
--------------------> ...returned
--------------------> Laplacian::PreComputeLaplacian...
------------------------> idefix_for(L_Factor)...
------------------------> ...returned
--------------------> ...returned
----------------> ...returned
----------------> idefix_for(InitDensity)...
----------------> ...returned
------------> ...returned
--------> ...returned
----> ...returned
----> TimeIntegrator::TimeIntegrator(Input...)...
--------> StateContainer::AllocateAs...
--------> ...returned
----> ...returned
----> Output::Output...
----> ...returned
Setup:: Coarsening is on.
----> Output::EnrollUserDefVariable...
----> ...returned
Main: initialisation finished.
-----------------------------------------------------------------------------
Input Parameters using input file idefix.ini:
-----------------------------------------------------------------------------
[Boundary]
        X1-beg          userdef
        X1-end          outflow
        X2-beg          axis
        X2-end          reflective
        X3-beg          periodic
        X3-end          periodic
[Gravity]
        gravCst         39.476926408897626
        potential               selfgravity
        skip            1
[Grid]
        ;               X1-grid 2       2e-4    196     u       1e-2    2640    l       5.e3    X2-grid 1       0.0     512       u       1.570796326794896
        X1-grid         2       2e-4    98      u       1e-2      1320        l     5.e3
        X2-grid         1       0.0     1024    u       1.570796326794896
        X3-grid         1       0.0     1       u       6.283185307179586
        coars_increment         1
        coarsening              static  X2
        radiusCoars             1e-2
[Hydro]
        csiso           userdef
        gamma           1.666667
        solver          hll
[Output]
        dmp             0.001
        dmp_dir         ./Try5_resartTry3_newBoundary/
        log             100
        uservar         PhiP    Cs      InvDt   dV      T       AX3
        vtk             0.001
        vtk_dir         ./Try5_resartTry3_newBoundary/
[SelfGravity]
        boundary-X1-beg         origin
        boundary-X1-end         nullpot
        boundary-X2-beg         axis
        boundary-X2-end         nullgrad
        boundary-X3-beg         periodic
        boundary-X3-end         periodic
        maxIter         10000
        skip            1
        solver          PBICGSTAB
        targetError             1e-4
[Setup]
        Mc              1.
        T0              10.
        alpha           0.25
        beta            0.0
        mu              4.
[TimeIntegrator]
        CFL             0.8
        CFL_max_var             1.1
        check_nan               100
        first_dt                1.e-7
        max_runtime             96
        maxdivB         1e-2
        nstages         2
        tstop           24000.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
Input: Compiled with DOUBLE PRECISION arithmetic.
Input: DIMENSIONS=2.
Input: COMPONENTS=3.
Input: Kokkos CUDA target ENABLED.
Grid: full grid size is 
         Direction X1: userdef  0.01....4096....5000    outflow
         Direction X2: axis     0....1024....1.5708     reflective
Grid: static grid coarsening enabled in direction(s) X2 
Hydro: solving MHD equations.
Hydro: Using EXPERIMENTAL vector potential formulation for MHD.
Hydro: Reconstruction: 2nd order (PLM Van Leer)
EquationOfState: isothermal with user-defined cs function.
RiemannSolver: hll (MHD).
ConstrainedTransport: Using UCT_CONTACT averaging scheme.
Axis: Axis regularisation ENABLED.
Axis: Full 2pi regularisation around the axis.
Gravity: ENABLED.
Gravity: G=39.4769.
Gravity: self-gravity ENABLED.
SelfGravity: Using preconditionned BICGSTAB solver.
SelfGravity: using origin boundary with 280 additional radial points.
----> Bicgstab::ShowConfig...
Bicgstab: TargetError: 0.0001
Bicgstab: Maximum iterations: 10000
----> ...returned
TimeIntegrator: using 2nd Order (RK2) integrator.
TimeIntegrator: Using adaptive dt with CFL=0.8 .
TimeIntegrator: will stop after 96 hours.
Main: Creating initial conditions.
----> Setup::Initflow...
--------> DataBlockHost::DataBlockHost(DataBlock)...
--------> ...returned
InnitFlow:: Sarting a new collapse.
--------> DataBlockHost::SyncToDevice()...
--------> ...returned
----> ...returned
----> ConstrainedTransport::ComputeMagFieldfromA...
--------> idefix_for(ComputeMagFieldFromA)...
--------> ...returned
--------> Fluid::CoarsenMagField...
------------> idefix_for(FLUID_CoarsenFlow_BXsn)...
------------> ...returned
------------> idefix_for(FLUID_CoarsenFlow_BXsn)...
------------> ...returned
--------> ...returned
----> ...returned
----> DataBlock::ComputeGridCoarseningLevels...
--------> User-defined Coarsening function...
------------> idefix_for(set_coarsening)...
------------> ...returned
--------> ...returned
--------> DataBlock::CheckCoarseningLevels()...
------------> DataBlockHost::DataBlockHost(DataBlock)...
------------> ...returned
------------> DataBlockHost::SyncFromDevice()...
------------> ...returned
--------> ...returned
----> ...returned
----> Fluid::CoarsenFlow...
--------> idefix_for(FLUID_CoarsenFlow)...
--------> ...returned
----> ...returned
----> Fluid::CoarsenMagField...
--------> idefix_for(FLUID_CoarsenFlow_BXsn)...
--------> ...returned
--------> idefix_for(FLUID_CoarsenFlow_BXsn)...
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/aborderi/src/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:150
Traceback functionality not available

Aborted

runtime information:

Idefix version 2.0.04-00e3db93
Master branch

Kokkos 30500

Seen on GPU A100 and RTX 2080 SUPER

@Anto6453 Anto6453 added the bug Something isn't working label Feb 2, 2024
@glesur
Contributor

glesur commented Feb 6, 2024

Hi @Anto6453, thanks for the report. In order to reproduce this, could you send me the full setup you are using (i.e. setup.cpp + idefix.ini)? It can be by email if you don't want your setup to appear publicly.

@glesur
Contributor

glesur commented Feb 23, 2024

The problem is that gridCoarsening tries to reconstruct the normal component of B from the divergence of the two tangential components of B stored in Vs. When DIMENSIONS<3, one of these components is not defined (because it is not stored on cell faces), resulting in a segfault (as it turns out, also on CPUs). The fix #230 will be part of the Idefix v2.0.05 release.
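
To illustrate the idea described above, here is a minimal standalone sketch. It is not the code from coarsenFlow.hpp nor the change made in #230. It assumes a uniform Cartesian row of cells (Idefix itself works with face areas in curvilinear geometry), and the function name `reconstructNormalB` and its parameters are hypothetical. The normal face component is rebuilt cell by cell so that the discrete divergence vanishes, and the contribution of the third direction is only read when that component actually exists on cell faces.

```cpp
#include <cstddef>
#include <vector>

// Rebuild the face-normal field Bx along one row of cells so that
// (Bx[i+1]-Bx[i])/dx + dBy[i]/dy + dBz[i]/dz = 0 in every cell.
// dBy[i] and dBz[i] are the differences of the tangential face fields
// across cell i. dBz is left empty when the third component is not
// face-centred (the DIMENSIONS<3 case); it is then skipped instead of
// being read from memory that was never allocated.
std::vector<double> reconstructNormalB(double bxFirstFace,
                                       const std::vector<double>& dBy,
                                       const std::vector<double>& dBz,
                                       double dx, double dy, double dz) {
  std::vector<double> bx(dBy.size() + 1);
  bx[0] = bxFirstFace;
  for (std::size_t i = 0; i < dBy.size(); ++i) {
    double tangentialDiv = dBy[i] / dy;
    if (!dBz.empty()) {
      tangentialDiv += dBz[i] / dz;  // only when the component is defined
    }
    bx[i + 1] = bx[i] - dx * tangentialDiv;
  }
  return bx;
}
```

Reading the third term without such a guard is the kind of access that can go unnoticed on a CPU and fail as an illegal address on a GPU, which would be consistent with the behaviour reported in this issue.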

@glesur
Contributor

glesur commented Mar 29, 2024

v2.0.05 released with this bug fixed.

@glesur glesur closed this as completed Mar 29, 2024