
Pandas support #50

Open
UniqASL opened this issue Mar 31, 2020 · 7 comments

Comments

@UniqASL

UniqASL commented Mar 31, 2020

Dear Even Solbraa,

Thank you for putting this very nice library online.
Do you have plans to add support for pandas DataFrames in the future? Running the code to calculate fluid properties for a single point (T, p) takes a few seconds, presumably due to the connection with Java, so I am a bit worried about how long it would take to run the code on a (large) DataFrame.

Best regards,

@EvenSol
Collaborator

EvenSol commented Apr 16, 2020

Yes, there will be more integration with pandas dataframes in future releases.
Support for creating fluids via dataframes was recently added.

See some examples in the Colab sheet:
https://colab.research.google.com/github/EvenSol/NeqSim-Colab/blob/master/notebooks/PVT/PVTreports.ipynb

or:

https://github.com/equinor/neqsimpython/blob/master/examples/createFluid.py
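
As a rough sketch of the idea in the linked createFluid.py example, a fluid composition can be held in a pandas DataFrame and handed to the library. The column names and the `fluid_df` call below are assumptions based on the example, not a confirmed API, so the NeqSim call is left commented out:

```python
import pandas as pd

# Composition table in roughly the shape used by the createFluid.py example;
# the exact column names expected by NeqSim are assumptions here.
composition = pd.DataFrame(
    {
        "ComponentName": ["methane", "ethane", "n-heptane"],
        "MolarComposition[-]": [0.85, 0.10, 0.05],
    }
)

# Hypothetical NeqSim call, per the linked example (not executed here):
# from neqsim.thermo import fluid_df
# fluid = fluid_df(composition)

print(composition)
```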

@EvenSol
Collaborator

EvenSol commented Apr 16, 2020

A benchmark running 5000 multiphase calculations for a simple gas/oil/water fluid is included in the linked Colab notebook. 5000 TPflash calculations take about 5-6 seconds in Colab/Python.

https://colab.research.google.com/drive/1JXaqqj1qkriY_DqT8nCpf0tCjNEcCvEW
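
A benchmark loop of this kind can be timed with `time.perf_counter`. The flash function below is a placeholder stand-in, not NeqSim's API; in the real benchmark each call crosses the Python/Java bridge, which is what dominates the runtime:

```python
import time

def tp_flash(temperature_k, pressure_bara):
    # Stand-in for one TPflash call; the returned value is placeholder
    # arithmetic, not a real flash result.
    return temperature_k / pressure_bara

n = 5000
start = time.perf_counter()
results = [tp_flash(303.15, 10.0) for _ in range(n)]
elapsed = time.perf_counter() - start
print(f"{n} calls in {elapsed:.4f} s ({elapsed / n * 1e6:.2f} us/call)")
```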

@EvenSol
Collaborator

EvenSol commented Apr 16, 2020

See this example filling a dataframe with properties:

https://github.com/equinor/neqsimpython/blob/master/examples/propertiesDataframes.py

@UniqASL
Author

UniqASL commented Apr 17, 2020

Thank you very much for taking the time to do that. I tried your last example. With a list of length 1,000, it runs in approximately 14 s on my laptop. When I increase this value to 10,000 (~one year of data at an hourly frequency), it takes > 2 min.

I guess the problem is that when applying the function calcProperties to the df, Python makes a call to Java for each individual point, which slows the calculation down. One option I can imagine would be to send the entire df to Java and then get the results back as a list or df once Java is done calculating everything.
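
The batched interface suggested here could look something like the sketch below: one round trip for the whole array of (T, p) points instead of one call per point. The function name and the placeholder arithmetic are illustrative assumptions, not NeqSim's actual API:

```python
from typing import Sequence

def tp_flash_batch(temperatures: Sequence[float], pressures: Sequence[float]) -> list:
    """One hypothetical round trip to the solver for the whole batch,
    instead of one bridge crossing per (T, p) point. The division below
    is a placeholder standing in for the actual flash calculation."""
    return [t / p for t, p in zip(temperatures, pressures)]

temps = [280.0, 281.0, 282.0]   # K
pres = [10.0, 20.0, 40.0]       # bara
props = tp_flash_batch(temps, pres)
print(props)
```

The point of the design is that the fixed per-call overhead is paid once per batch rather than once per point.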

@EvenSol
Collaborator

EvenSol commented Apr 17, 2020

Yes, there is some overhead when there are many calls to Java from Python. The benchmark (https://equinor.github.io/neqsimhome/benchmark.html) indicates that the calculation speed is 2-3 times faster directly in Java than via Python. I will look into the reason for this and whether it can be improved. I guess every call to a Java method has some overhead (even just reading a property), and that this can be reduced by returning more information in each call. If the calculation involves a process simulation for each time step (instead of just a flash and returning the properties of a fluid), I guess this overhead will be less significant.

I will look into your suggestion of sending the whole dataframe. Thanks for the suggestion.

@EvenSol
Collaborator

EvenSol commented Apr 21, 2020

A new method has been implemented to fill a dataframe based on lists of temperatures and pressures (method 2 in the example):

https://github.com/equinor/neqsimpython/blob/master/examples/propertiesDataframes.py

The dataframes in the PySpark project will probably be a better solution for future work. This can be looked into when PySpark 3.0 is released (the current version is based on an older version of py4j).
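
One common layout for such a method is a DataFrame with one row per (T, p) state, which a single batched call can then extend with property columns. The column names and the commented-out call below are illustrative assumptions, not NeqSim's exact schema:

```python
import pandas as pd

temperatures = [280.0, 300.0, 320.0]  # K
pressures = [1.0, 10.0, 50.0]         # bara

# One row per (T, p) pair, matching the two lists element-wise.
df = pd.DataFrame({"temperature_K": temperatures, "pressure_bara": pressures})

# In the linked example, a single batched call would then fill the property
# columns for all rows at once, e.g.:
# df = calc_properties_from_lists(temperatures, pressures)  # hypothetical name

print(df)
```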

@UniqASL
Author

UniqASL commented Apr 27, 2020

Thanks for this! The second method is slightly faster even though it returns many more results than method 1 (18 with method 1, 63 with method 2). The overall calculation still remains quite slow, however (~50 s with method 2 for 1,000 values). Maybe PySpark will improve that!

Otherwise, I noticed that you do your imports in the code itself. Normally in Python you should import everything at the beginning of the file. Moreover, it is recommended to import entire modules. You can have a look here, for instance, at the section "import".
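
The recommended layout looks like the small sketch below: all imports at the top of the module, importing whole modules rather than pulling individual names into the local namespace. The helper function and its coefficients are purely illustrative:

```python
# Imports grouped at the top of the module, importing the whole module
# (math) rather than individual names (from math import exp).
import math

def antoine_vapour_pressure(a, b, c, t):
    """Toy Antoine-style correlation (form and coefficients are illustrative)
    using the module-qualified call math.exp instead of a bare exp."""
    return math.exp(a - b / (t + c))

print(antoine_vapour_pressure(10.0, 2000.0, -30.0, 300.0))
```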
