Replies: 3 comments 7 replies
-
Such a speedup indeed sounds quite interesting, even more so if it does not blow up memory uncontrollably, as it appears.
-
The idea sounds very promising. If we manage to include the additions suggested by Laurenz, I don't see an issue with adding it to the library. Of course, after implementation, we can run tests to see whether it works with the use cases.
-
I'd like to follow up on this discussion, as your suggestion, @daviddoji, is very interesting indeed. The current (Dask-based) system works fine, but it is certainly not the best one. If you could help develop an alternative, that would be great.
-
I've been playing around a bit with the second notebook, but, as @rettigl suggested, I artificially increased the number of files to check how Dask handles these datasets.
Initially, the downloaded dataset has 100 files of 116 MB each. I duplicated random files from the initial dataset to get 1000 of them, a 10x increase (113 GB in total).
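For reference, the duplication step was roughly along these lines (a sketch only; the directory names, file extension, and naming scheme are made up for illustration):

```python
# Inflate the dataset by copying randomly chosen files from the original run
# until 1000 files exist. Paths and naming are hypothetical.
import random
import shutil
from pathlib import Path

src_files = sorted(Path("original_run").glob("*.h5"))  # the 100 original files
dst_dir = Path("inflated_run")
dst_dir.mkdir(exist_ok=True)

for i in range(1000):
    src = random.choice(src_files)
    shutil.copy(src, dst_dir / f"file_{i:04d}.h5")
```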
It turns out that, on a Maxwell node where I ran the tests, calling `bin_dataframe` uses almost 18 GB of RAM and needs almost 9 min to run. An alternative implementation, not involving Dask but numba and multiprocessing instead, uses more RAM (32 GB) but does the calculations in roughly 30 s.
Considering that neither the current implementation nor mine would fit into a regular PC's memory (8 GB on my laptop, for example) with this kind of fake but realistic dataset (at least by XFEL standards), I'm wondering if you would like to incorporate it into the library.
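For context, here is a minimal sketch of what I mean by the numba + multiprocessing approach: each worker process loads one file, bins it with a jitted histogram loop, and the partial histograms are summed at the end. The column name, bin settings, file format, and worker count are assumptions for illustration, not the actual code:

```python
# Sketch of numba + multiprocessing binning (hypothetical names throughout).
import multiprocessing as mp

import numba
import numpy as np
import pandas as pd


@numba.njit
def hist_1d(values, bin_edges):
    """Fill a 1D histogram with a simple jitted loop (uniform bins assumed)."""
    hist = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    lo, hi = bin_edges[0], bin_edges[-1]
    width = (hi - lo) / (len(bin_edges) - 1)
    for v in values:
        if lo <= v < hi:
            hist[int((v - lo) / width)] += 1
    return hist


BIN_EDGES = np.linspace(0.0, 1000.0, 1001)  # hypothetical axis


def bin_one_file(path):
    """Load one file and return its partial histogram (column name assumed)."""
    df = pd.read_parquet(path)  # or pd.read_hdf, depending on the file format
    return hist_1d(df["dldTime"].to_numpy(), BIN_EDGES)


def bin_files(paths, n_workers=32):
    """Bin all files in parallel worker processes and sum the partial results."""
    with mp.Pool(n_workers) as pool:
        partial = pool.map(bin_one_file, paths)
    return np.sum(partial, axis=0)
```

The higher peak memory usage presumably comes mainly from several workers each holding a full file in RAM at the same time, which is the trade-off for the much shorter runtime.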
cc @steinnymir @zainsohail04