-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Observed memory leakage in NanoeventsFactory.from_root() #1072
Comments
Can you please try this without using a dask distributed client and in single threaded mode. You will have to watch top to understand if there is a problem.
I assume the 1 GB memory limit above is only to make it fail sooner? |
Yes, 1 GiB memory limit is only to make it fail sooner. I suppose you would like me to use memray as well. I have never used that before, so it may take a while since I have to scan the documentations. Could you also be more specific when you say "watch top to understand if there is a problem". Like, what should I be looking for at top? The increasing memory usage? |
Increasing RSS over time, @yimuchen may have some helpful scripts here. |
One tool that you can use to test locally to see if this is from coffea.nanoevents import NanoEventsFactory
snapshot = tracemalloc.take_snapshot()
for filename in [f'/srv/Ntuples/{i}_RA2AnalysisTree.root' for i in range(20)]:
step_size = 500 # Force slicing the file into multiple chunks to see if this memory leaks appear per chunk
for start in range(0, 30000, step_size):
events = NanoEventsFactory.from_root(filename,
treepath='TreeMaker2/PreSelection',
schemaclass=BaseSchema,
metadata={
'dataset': 'test'
},
entry_start=start,
entry_stop=start + step_size).events()
run_my_analysis(events)
s2 = tracemalloc.take_snapshot()
for s in s2.statistics('lineno', cumulative=True)[:10]:
print(s) What you would want to look out for is if there is a singular function that monotonously increases in each subsequent |
There is also memory-profiler [1], but this only shows if there are leaks happening, not so much where the leak is appearing. But it makes plots which might be easier on the eyes then trying to guess if there are leaks based on text outputs (remember to use the |
I'm trying your demo repository (https://github.com/green-cabbage/dask_mem_leak_test) now, with
(from Though, I do see a lot of
I tried with and without
with no real difference. After 11 files, I get a read error:
|
Sorry for the late reply. The OSError IMO is primarily due to purdue XRootD not being the most reliable. I have updated the Moreover, observing lots of if you keep running the script, the unmanaged memory slowly creeps up until the client automatically goes through phases of pausing and unpausing until it completely freezes due to high unmanaged memory. On a seperate note, I have ran the local client test with and without Edit: I have made a dask report of the running through memleakage_demo.ipynb the with local root file paths using dask local client here, and you can see in the task stream tab that the gap between the "pt" workload increases dramatically due to significant compute time being dedicated to managing leaking memory from the client. |
As a small update, it does seem like it is a XRootD problem. I re-ran the reproducer with local root files path, but with memory increased from 0.7 GiB to 1 GiB, and while I had lots warnings like On a different note, I am not so sure 26% CPU time being used for garbage collection is normal. Is this something I need to look at it separately to optimize my workflow? |
I think you've found instead the "steady state" memory usage. Lots of GC time means you're making a bunch of temporaries. It might be worth it to give the latest coffea and dask-awkward/dask-histogram a try if you're not already. |
This is a quick report of an issue I have noticed with newer version of coffea (2024.3.0) with updates from interactions I had with @lgray . Reproducer of this issue is pushed onto my public repo: https://github.com/green-cabbage/dask_mem_leak_test. Please let me know if I am unclear some certain parts, as I am not experienced in making bug reports.
The basic gist is that loading root files using XRootD links and then doing simple processing leads to dask client's memory slowly increasing until memory overflow unless regularly restarted. This happens for both local cluster as well as external cluster (screen shot of residual memory after no more jobs is added below).
The leakage is observed for both
NanoEventsFactory.from_root()
anduproot.dask()
functions. However, memory leakage is not observed when explicitly usingXRootDSource
handler for uproot, as in:uproot.dask(, handler=uproot.XRootDSource)
. However, implementingXRootDSource
handler forNanoEventsFactory.from_root()
(ieNanoEventsFactory.from_root( ,uproot_options={"handler" : uproot.XRootDSource})
) still leads to memory overload.Moreover, using local filenames instead of XRootD links on
uproot.dask()
doesn't generate the leakage. The same withNanoEventsFactory.from_root()
has not been tried.The text was updated successfully, but these errors were encountered: