[Python][Parquet] Parquet files created from Pandas dataframes with Arrow-backed list columns cannot be read by pd.read_parquet #39914
Comments
Thanks for the report! This seems to be an issue on the pandas side. Could you open an issue there? (https://github.com/pandas-dev/pandas/issues/)
To add a solution for anyone who has the same problem and ended up here, you can do:
Anyone who is interested, this is the pandas issue: pandas-dev/pandas#57411. For my use case it was better to not store the metadata, so my workaround is on the exporting side: `df.to_parquet(output_path, engine="pyarrow", store_schema=False)` (see https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html).
@jorisvandenbossche I'm interested in your thoughts on this PR #44720 -- it doesn't solve the broader issue for non pyarrow backends, but it does prevent the multitude of workarounds that are getting used with the pyarrow backend. |
@bretttully thanks for the ping! That looks like a nice solution for now, to avoid an error if the types_mapper would get precedence anyway. FWIW, long term I think this logic should live on the pandas side, so this could all be much better controlled over there (#44068) |
I saw that extra ticket, and it would indeed be a nice change, although it would mean a bunch of stuff in geopandas would need to change... |
A workaround I have been using that reads the metadata to preserve the index etc. is to modify the metadata while reading:

```python
import itertools
import json
from typing import Any

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet


def _fix_pandas_metadata(table: pa.Table) -> pa.Table:
    """Workaround for apache/arrow#39914 -> many arrow type columns raise errors."""
    pd_md = table.schema.pandas_metadata  # mutable, since it's freshly JSON decoded
    if not pd_md:
        return table
    schema_md: dict = table.schema.metadata or {}
    for col_md in itertools.chain(
        pd_md.get("columns", []), pd_md.get("column_indexes", [])
    ):
        try:
            np.dtype(col_md["numpy_type"])
        except (TypeError, ValueError):
            # Should have always been "object", but the written metadata is wrong.
            col_md["numpy_type"] = "object"
    return table.replace_schema_metadata(schema_md | {b"pandas": json.dumps(pd_md)})


def read_arrow_parquet(
    path,
    columns: list[str] | None = None,
    filesystem: Any = None,
    filters: list[tuple] | list[list[tuple]] | None = None,
    dtype_backend="pyarrow",  # only for maintaining the pd.read_parquet interface
):
    try:
        return pd.read_parquet(
            path,
            columns=columns,
            dtype_backend=dtype_backend,
            filesystem=filesystem,
            filters=filters,
        )
    except (TypeError, ValueError, NotImplementedError):
        # Probably bad pandas metadata; fall through to the pyarrow path below.
        pass
    if dtype_backend != "pyarrow":
        raise NotImplementedError("Only Arrow-based table read implemented.")
    table = pa.parquet.read_table(
        path,
        columns=columns,
        filesystem=filesystem,
        filters=filters,
        use_pandas_metadata=True,
    )
    return _fix_pandas_metadata(table).to_pandas(types_mapper=pd.ArrowDtype)
```

I believe the long-term fix cannot be rolled out just on the pandas side, since the bad metadata is being written by the arrow library.
Describe the bug, including details regarding any error messages, version, and platform.
Simple repro:
Fails with
Environment:
OS: macOS (Darwin Kernel Version 22.1.0)
Python: 3.11.6
pandas: 2.2.0
PyArrow: 15.0.0
The same error is raised even if we use `pd.read_parquet("/tmp/list4.pqt", dtype_backend="numpy_nullable")`.
The non-Arrow-backed column version is read back correctly, but the column is Arrow-backed in the new dataframe, so it doesn't survive a further round trip.
I did some further digging at the Parquet Pandas metadata, and found that for the Parquet written from the Arrow-based table, we have
whereas for the numpy-based dataframe, the output is:
The problem seems to be caused by the `numpy_type` for the Arrow case being set to `'list<element: double>[pyarrow]'` rather than `object` or a numpy dtype string.
Component(s)
Parquet, Python