
Hdf5 (1.8.18?) from bilder causing crashes in puffin on ubuntu 16.04 #52

Open
mightylorenzo opened this issue Jan 23, 2017 · 4 comments
Comments

@mightylorenzo
Collaborator

When building on Ubuntu 16.04 with the repo compilers and the Bilder-built hdf5 and fftw3 libs, the hdf5 writing routines crash when running Puffin. This is the same as a previous issue, which was assumed fixed, but it still appears to be a problem.

The workaround just now is to build with the Ubuntu repo libs.

@jdasmith
Collaborator

jdasmith commented Mar 8, 2017

Is this on all files or is there a specific test input file, number of ranks and machine where we can reproduce this problem?
This sounds like mixing of libs - did you also use the system OpenMPI, or Bilder's OpenMPI?
Trying to pin this down, as I've not seen the crash.

@jdasmith jdasmith changed the title CMake w Bilder on Ubuntu 16.04 Hdf5 from bilder causing crashes in puffin on ubuntu 16.04 Mar 8, 2017
@jdasmith jdasmith changed the title Hdf5 from bilder causing crashes in puffin on ubuntu 16.04 Hdf5 (1.8.18?) from bilder causing crashes in puffin on ubuntu 16.04 Mar 8, 2017
@mightylorenzo
Collaborator Author

This is using the system OpenMPI with GNU Fortran, the Bilder-supplied HDF5 (v1.8.13, but the same behaviour has now been confirmed on all Bilderized versions from 1.8.12 to 1.8.18) and CMake (v3.4.1), on Ubuntu 16.04. Bilder uses the system OpenMPI and Fortran libs to build everything. Everything builds, but when running, we get the output below.

 step size is ---    4.1887903213500980E-003
 ******************************
 
 WARNING - field mesh may not be large enough in z2 - fixing....
 Field mesh length in z2 now =    13.005309677124023     
 
 ******************************
 
 number of nodes in z2 ---         1657
  240 Step(s) and z-bar   1.0053
 There are no dispersive sections
 TRANS AREA =    6.2831854820251465     
 FIXING CHARGE 
 Q =     7.2413410011642439E-009
 SHOT-NOISE TURNED ON
 
 -----------------------------------------
 Total number of macroparticles =          800
 Avg num of real electrons per macroparticle Nk =    8991697.7115548234     
 Total number of real electrons modelled =    7193358169.2438583     
         287
[sebastion:2513] *** An error occurred in MPI_Comm_dup
[sebastion:2513] *** reported by process [2800222209,0]
[sebastion:2513] *** on communicator MPI_COMM_WORLD
[sebastion:2513] *** MPI_ERR_COMM: invalid communicator
[sebastion:2513] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sebastion:2513] ***    and potentially your MPI job)
[sebastion:02507] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[sebastion:02507] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

A bit of fishing around with print statements shows that the MPI error is coming from the calls to h5pset_fapl_mpio_f - i.e.

CALL h5pset_fapl_mpio_f(plist_id, tProcInfo_G%comm, mpiinfo, error)

which appears on multiple lines in hdf5PuffColl.f90.
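For context, the file-access setup around that call normally follows the standard parallel HDF5 Fortran sequence. The sketch below is illustrative only, not the exact Puffin code; `fname` and the declarations are placeholders, while `tProcInfo_G%comm` is Puffin's communicator as shown above.

```fortran
! Sketch of the standard parallel file-access sequence in the HDF5
! Fortran API -- illustrative only, not the exact Puffin code.
! tProcInfo_G%comm is Puffin's MPI communicator; fname is a placeholder.
USE hdf5
USE mpi

INTEGER(HID_T) :: plist_id, file_id
INTEGER :: error, mpiinfo
CHARACTER(LEN=*), PARAMETER :: fname = 'puffin_out.h5'

mpiinfo = MPI_INFO_NULL
CALL h5open_f(error)                                  ! initialise the HDF5 Fortran interface
CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, error)  ! new file-access property list
! The crash is reported from inside this call:
CALL h5pset_fapl_mpio_f(plist_id, tProcInfo_G%comm, mpiinfo, error)
CALL h5fcreate_f(fname, H5F_ACC_TRUNC_F, file_id, error, access_prp = plist_id)
CALL h5pclose_f(plist_id, error)
```

An `MPI_ERR_COMM` raised inside `h5pset_fapl_mpio_f` is consistent with the HDF5 library and the application resolving to different MPI implementations, since the communicator handle would then be meaningless to the MPI that HDF5 calls into.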

This is for any example Puffin input deck (using hdf5 output, which is now default).

Ubuntu supplied hdf5 seems to work fine.
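If this is lib mixing, one quick check (a hypothetical diagnostic, not part of Puffin or Bilder) is to list which MPI libraries each HDF5 build actually resolves to at link time, and compare against the MPI that Puffin itself was linked with:

```shell
#!/bin/sh
# Hypothetical diagnostic: list the MPI libraries a shared object links
# against. A Bilder HDF5 resolving to a different libmpi than the one
# the Puffin binary uses would point at lib mixing.
mpi_links() {
  ldd "$1" 2>/dev/null | awk '/libmpi/ {print $1, $3}'
}

# Example usage (paths are assumptions -- adjust to your install prefix):
# mpi_links /usr/lib/x86_64-linux-gnu/libhdf5_openmpi.so
# mpi_links "$HOME/bilder/hdf5/lib/libhdf5.so"
```

Running `mpi_links` on both the Bilder and Ubuntu HDF5 shared objects, and on the Puffin executable, should show whether two different `libmpi` builds are being pulled into the same process.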

@jdasmith
Collaborator

So I think the other workaround has to be to use Bilder to build MPICH or OpenMPI, rather than using Ubuntu's system MPI. It would also be interesting to know if this manifests itself on Fedora.
If it happens with Bilder's MPICH, then we need to check that no nasty bugs have got into the hdf mpi environment setup, which is going on at this stage (pset = property setting). There is a slim possibility that what I've done is only appropriate for a parallel filesystem, but I don't think that's the case. The "independent files" option was there to take care of that case.

@jdasmith
Collaborator

... all this of course is speculative, and we should do some testing to be sure. I used Bilder's MPICH as part of a Puffin build (though with an older branch), and did not experience such problems.
