Replies: 3 comments 7 replies
-
Such a speedup indeed sounds quite interesting, even more so if it does not blow up memory uncontrollably, as it appears.
-
The idea sounds very promising. If we manage to include the additions suggested by Laurenz, I don't see an issue with adding it to the library. Of course, after implementation, we can run tests to see whether it works with the use cases.
-
I'd like to follow up on this discussion, as your suggestion, @daviddoji, is very interesting indeed. The current (Dask-based) system works fine, but it is certainly not the best one. If you could help develop an alternative, that would be great.
-
I've been playing around a bit with the second notebook, but, as @rettigl suggested, I artificially increased the number of files to check how Dask handles these datasets.
Initially, the downloaded dataset has 100 files of 116 MB each. I duplicated random files from the initial dataset to get 1000 of them, a 10x increase (113 GB in total).
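For reference, the duplication step was roughly along these lines (a sketch only; the directory names, file extension, and naming scheme are made up for illustration):

```python
# Inflate the dataset by copying randomly chosen files from the original run
# until 1000 files exist. Paths and naming are hypothetical.
import random
import shutil
from pathlib import Path

src_files = sorted(Path("original_run").glob("*.h5"))  # the 100 original files
dst_dir = Path("inflated_run")
dst_dir.mkdir(exist_ok=True)

for i in range(1000):
    src = random.choice(src_files)
    shutil.copy(src, dst_dir / f"file_{i:04d}.h5")
```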
It turns out that, on a Maxwell node where I ran the tests, calling `bin_dataframe` uses almost 18 GB of RAM and needs almost 9 min to run. An alternative implementation, not involving Dask but numba and multiprocessing instead, uses more RAM (32 GB) but does the calculations in roughly 30 s.
Considering that neither the current implementation nor mine would fit into a regular PC's memory (8 GB on my laptop, for example) with this kind of fake but realistic dataset (at least by XFEL standards), I'm wondering if you would like to incorporate it into the library.
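For context, here is a minimal sketch of what I mean by the numba + multiprocessing approach: each worker process loads one file, bins it with a jitted histogram loop, and the partial histograms are summed at the end. The column name, bin settings, file format, and worker count are assumptions for illustration, not the actual code:

```python
# Sketch of numba + multiprocessing binning (hypothetical names throughout).
import multiprocessing as mp

import numba
import numpy as np
import pandas as pd


@numba.njit
def hist_1d(values, bin_edges):
    """Fill a 1D histogram with a simple jitted loop (uniform bins assumed)."""
    hist = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    lo, hi = bin_edges[0], bin_edges[-1]
    width = (hi - lo) / (len(bin_edges) - 1)
    for v in values:
        if lo <= v < hi:
            hist[int((v - lo) / width)] += 1
    return hist


BIN_EDGES = np.linspace(0.0, 1000.0, 1001)  # hypothetical axis


def bin_one_file(path):
    """Load one file and return its partial histogram (column name assumed)."""
    df = pd.read_parquet(path)  # or pd.read_hdf, depending on the file format
    return hist_1d(df["dldTime"].to_numpy(), BIN_EDGES)


def bin_files(paths, n_workers=32):
    """Bin all files in parallel worker processes and sum the partial results."""
    with mp.Pool(n_workers) as pool:
        partial = pool.map(bin_one_file, paths)
    return np.sum(partial, axis=0)
```

The higher peak memory usage presumably comes mainly from several workers each holding a full file in RAM at the same time, which is the trade-off for the much shorter runtime.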
cc @steinnymir @zainsohail04