
Be more defensive about inplace decompression in V2 #890

Merged
merged 1 commit into dask:main from fewer_inplace
Oct 18, 2023

Conversation

martindurant
Member

Fixes #889

@martindurant
Member Author

cc @miohtama

@miohtama

I can confirm the fix works.

Unrelated: while I was working on this, I also timed the PyArrow and FastParquet read methods:

import time

from pyarrow import parquet as pq
from fastparquet import ParquetFile

path = "./lending-reserves-all.parquet"

start = time.time()
df1 = pq.read_table(path).to_pandas()
print("PyArrow is", time.time() - start)

start = time.time()
pf2 = ParquetFile(path)
df2 = pf2.to_pandas()
print("FastParquet is", time.time() - start)

PyArrow is 6.349822044372559
FastParquet is 12.303039073944092


@martindurant
Member Author

For decimals, fastparquet currently converts to float, and this probably accounts for the time difference: after conversion, float operations will usually be faster. My attempt at implementing decimals natively was even faster than anything PyArrow can do, but due to lack of interest I didn't finish direct fastparquet->decimal reading.
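To illustrate the "float operations are usually faster" point, here is a stdlib-only micro-benchmark (not fastparquet code; the variable names and element count are made up for this sketch) that sums the same values stored as Python floats versus decimal.Decimal objects:

```python
import time
from decimal import Decimal

# Hypothetical micro-benchmark: sum the same values as float vs Decimal.
N = 200_000
values_f = [i / 100 for i in range(N)]
values_d = [Decimal(i) / 100 for i in range(N)]

t0 = time.perf_counter()
total_f = sum(values_f)
float_time = time.perf_counter() - t0

t0 = time.perf_counter()
total_d = sum(values_d, start=Decimal(0))  # Decimal arithmetic is object-based, not a C loop
decimal_time = time.perf_counter() - t0

print(f"float sum:   {float_time:.4f}s")
print(f"Decimal sum: {decimal_time:.4f}s")
```

On CPython the Decimal pass is typically several times slower, which is the trade-off behind fastparquet's decimal-to-float conversion: exactness is given up for throughput.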

@martindurant martindurant merged commit 165bec3 into dask:main Oct 18, 2023
20 checks passed
@martindurant martindurant deleted the fewer_inplace branch October 18, 2023 13:14
@miohtama

@martindurant Thank you for the feedback.

Note that the difference is not that huge and likely depends on the data.
