Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make HDF5.jl compatible with profilers that use LD_PRELOAD #791

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

lyon-fnal
Copy link

@lyon-fnal lyon-fnal commented Dec 31, 2020

See MPI.jl PR#450. Use of ccall( (func, lib), ...) is not compatible with applications that use LD_PRELOAD to inject their own code for profiling and tracing. Darshan is an example of an application that does this for MPI and HDF5 I/O profiling.

This PR has changes to explicitly load the libhdf5 and libhdf5_hl shared objects with Libdl.dlopen in HDF5.__init__. Most ccall statements are changed to not specify the HDF5 library (e.g. I changed the generator code).

I tested on NERSC Cori Haswell nodes. HDF5 tests pass. I was able to get Darshan to work with this PR.

Happy New Year too!

@@ -38,7 +38,7 @@ else
if libhdf5_size != filesize(Libdl.dlpath(libhdf5))
error("HDF5 library has changed, re-run Pkg.build(\\\"HDF5\\\")")
end
if libversion < v"1.10.4"
if libversion[] < v"1.10.4"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libversion is now set in HDF5.__init__ so it needs to be a Ref.

@@ -709,7 +709,7 @@ dspace_scal = HDF5.Dataspace(HDF5.h5s_create(HDF5.H5S_SCALAR))
dspace_norm = dataspace((100, 4))
dspace_maxd = dataspace((100, 4), max_dims = (256, 4))
dspace_slab = HDF5.hyperslab(dataspace((100, 4)), 1:20:100, 1:4)
if HDF5.libversion ≥ v"1.10.7"
if HDF5.h5_get_libversion() ≥ v"1.10.7"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call the function instead of using the const global (to do the latter, it would need to be HDF5.libversion[] since it's a Ref).

@musm
Copy link
Member

musm commented Dec 31, 2020

Thanks for the PR!
At a quick glance this looks like a workaround in HDF5 and MPI where it feels like it should be handled by the Darshan wrapper (potentially through it's own code transformations) (is there an example of running Darshan with HDF5/MPI ?). In particular, our use of ccall here is the canonical way in Julia to call methods in other libraries, so I'm not sure if making a work around here is the most sensible long-term approach.

In particular, I don't understand why exactly:

Use of ccall( (func, lib), ...) is not compatible with applications that use LD_PRELOAD to inject their own code for profiling and tracing

Perhaps a tool like Cassete.jl could be used to inject code transformations here.

@lyon-fnal
Copy link
Author

I agree this isn't a small change. What Darshan and other applications do is inject their own "instrumenting" code into your code by using LD_PRELOAD. That is, their shared object is loaded first before, say, libhdf5.so. So when you call an HDF5 C function they've overridden, it calls theirs and then they call the real function in libhdf5.so (see an example). This allows them to do profiling and tracing. This technique is fairly common in C and C++ for profilers and tracers.

Unfortunately, the canonical use of ccall where you specify the library breaks use of LD_PRELOAD since it's not letting the dynamic linker figure out which function to call. I'm going to complain to the Julia developers and hope that they put at least a warning in the documentation about this.

I can understand your reluctance to make a fundamental change like this. But Darshan is really useful and I and others want to use it to profile HDF5 I/O. Having this capability with Julia would be nice.

I hadn't thought to try transformations with Cassette.jl. I can look into that if you'd prefer. Do you have much experience with Cassette? I don't. Thanks!

@musm
Copy link
Member

musm commented Jan 1, 2021

There are several real failures in CI to take a look at.

@giordano
Copy link
Member

giordano commented Jan 1, 2021

JLL packages already have way to allow you to point the library to something else, without having to write some custom code to do the same thing: https://docs.binarybuilder.org/dev/jll/#Non-dev'ed-JLL-packages. Would it work for you? I don't see the difference between using LD_PRELOAD and pointing the library to whatever you want through the override mechanism

@lyon-fnal
Copy link
Author

lyon-fnal commented Jan 23, 2021

JLL packages already have way to allow you to point the library to something else, without having to write some custom code to do the same thing: https://docs.binarybuilder.org/dev/jll/#Non-dev'ed-JLL-packages. Would it work for you? I don't see the difference between using LD_PRELOAD and pointing the library to whatever you want through the override mechanism

No, I don't think that mechanism will work. LD_PRELOAD allows for intercepting calls to a library, which is how Darshan works (e.g. an HDF5 call is intercepted by a Darshan function that wraps the HDF5 call with a timer). The nice thing about LD_PRELOAD is that you don't need to re-build the HDF5 library. It all happens at Runtime. The mechanism above doesn't do that.

Perhaps take a look at JuliaParallel/MPI.jl#451 . MPI.jl already wrapped nearly every ccall in a macro that would alter the ccall arguments for 32-bit windows. So now that macro also removes the library specification.

That doesn't really help HDF5.jl, since you have bare ccall calls.

If you want to make HDF5.jl compatible with Darshan (and that's up to you, of course) then I see two choices...

  1. Do this PR and omit the library specification from the ccall statements. This does match how C and C++ call functions (relying on the dynamic linker to find them).
  2. Some mechanism with Cassette.jl. This could potentially happen outside of the package. I'm not sure how easy or hard that would be. I can look into this.

In the meantime, I have my private version of HDF5.jl which I'm using to profile i/o in my application. I'm meeting with the Darshan developers next week to understand the profiling output better. I'll try to write up what I've learned so you can see if this is useful or not.

@jmert
Copy link
Contributor

jmert commented Jan 23, 2021

I'm not necessarily opposed to making these changes, but:

  1. I'd appreciate seeing a larger Julia-ecosystem discussion of how to handle situations like this (or have my admittedly maybe spotty understanding clarified). As far as I understand, this differs from the canonical Julia recommendation. (That being said, I'm sympathetic to your request! I'm currently still being forced to work in an ancient version of Matlab that only continues to work because I've used disgusting LD_PRELOAD tricks to keep it running, so preloaded libraries have saved my bacon! I'm personally just hesitant to support features/behaviors which differ from how the rest of the ecosystem works.)
  2. Obviously, the CI failures need to be resolved — I'm guessing some of the failures are due to the version-bounded build-time codegen which depends on the library version and would need to have the library explicitly loaded in this PR. Namely, see
    _libhdf5_build_ver_expr = quote
    _libhdf5_build_ver = let
    majnum, minnum, relnum = Ref{Cuint}(), Ref{Cuint}(), Ref{Cuint}()
    r = ccall((:H5get_libversion, libhdf5), herr_t,
    (Ref{Cuint}, Ref{Cuint}, Ref{Cuint}),
    majnum, minnum, relnum)
    r < 0 && error("Error getting HDF5 library version")
    VersionNumber(majnum[], minnum[], relnum[])
    end
    end
    , which likely needs its own private library load before use.
  3. HDF5_jll.libhdf5_handle and HDF5_jll.libhdf5_hl_handle already exist — is there any use in reusing those when the JLL is used and only manually opening the library when the system lib is in use? (I'm guessing that OS filesystem caching means basically no, but I'd appreciate evidence one way or the other.)

That being said, I think some of the minor items like using HD5.h5_get_libversion() instead of HDF5.libversion are pretty non-controversial and could be merged if submitted separately.

(After writing all of the above, I finally looked at your MPI.jl link — isn't the Sys.isunix() use incompatible with how this library uses a pre-generated src/api.jl that is not system dependent? Not saying that isn't impossible — we could add a @static if system-dependent expressions into the binding generator — but that would definitely be another consideration.)

@lyon-fnal
Copy link
Author

I agree that a larger discussion is needed. I submitted https://discourse.julialang.org/t/discussion-of-ccall-function-library/53887 . Let's see what happens.

@musm
Copy link
Member

musm commented Feb 24, 2021

The discussion on discourse is indeed very insightful. Given the feedback in the thread it does sound like the course of action is to incorporate the necessary changes in Base to support this use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants