[Python][Parquet] Parquet files created from Pandas dataframes with Arrow-backed list columns cannot be read by pd.read_parquet #39914

Open
cvm-a opened this issue Feb 2, 2024 · 7 comments


@cvm-a

cvm-a commented Feb 2, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Simple repro:

import pyarrow as pa
import pandas as pd
pd.DataFrame({"x":pa.array(pd.Series([[2.2]*5]*10)).to_pandas(types_mapper=pd.ArrowDtype)}).to_parquet("/tmp/list4.pqt")
df2 = pd.read_parquet("/tmp/list4.pqt",  dtype_backend="pyarrow")

Fails with

File ~/<redacted>/lib/python3.11/site-packages/pandas/io/parquet.py:667, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs)
    664     use_nullable_dtypes = False
    665 check_dtype_backend(dtype_backend)
--> 667 return impl.read(
    668     path,
    669     columns=columns,
    670     filters=filters,
    671     storage_options=storage_options,
    672     use_nullable_dtypes=use_nullable_dtypes,
    673     dtype_backend=dtype_backend,
    674     filesystem=filesystem,
    675     **kwargs,
    676 )

File ~/<redacted>/lib/python3.11/site-packages/pandas/io/parquet.py:281, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs)
    273 try:
    274     pa_table = self.api.parquet.read_table(
    275         path_or_handle,
    276         columns=columns,
   (...)
    279         **kwargs,
    280     )
--> 281     result = pa_table.to_pandas(**to_pandas_kwargs)
    283     if manager == "array":
    284         result = result._as_manager("array", copy=False)

File ~/<redacted>/lib/python3.11/site-packages/pyarrow/array.pxi:884, in pyarrow.lib._PandasConvertible.to_pandas()

File ~/<redacted>/lib/python3.11/site-packages/pyarrow/table.pxi:4251, in pyarrow.lib.Table._to_pandas()

File ~/<redacted>/lib/python3.11/site-packages/pyarrow/pandas_compat.py:769, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
    766     table = _add_any_metadata(table, pandas_metadata)
    767     table, index = _reconstruct_index(table, index_descriptors,
    768                                       all_columns, types_mapper)
--> 769     ext_columns_dtypes = _get_extension_dtypes(
    770         table, all_columns, types_mapper)
    771 else:
    772     index = _pandas_api.pd.RangeIndex(table.num_rows)

File ~/<redacted>/lib/python3.11/site-packages/pyarrow/pandas_compat.py:828, in _get_extension_dtypes(table, columns_metadata, types_mapper)
    823 dtype = col_meta['numpy_type']
    825 if dtype not in _pandas_supported_numpy_types:
    826     # pandas_dtype is expensive, so avoid doing this for types
    827     # that are certainly numpy dtypes
--> 828     pandas_dtype = _pandas_api.pandas_dtype(dtype)
    829     if isinstance(pandas_dtype, _pandas_api.extension_dtype):
    830         if hasattr(pandas_dtype, "__from_arrow__"):

File ~/<redacted>/lib/python3.11/site-packages/pyarrow/pandas-shim.pxi:141, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File ~/<redacted>/lib/python3.11/site-packages/pyarrow/pandas-shim.pxi:144, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File ~/<redacted>/lib/python3.11/site-packages/pandas/core/dtypes/common.py:1630, in pandas_dtype(dtype)
   1625     with warnings.catch_warnings():
   1626         # GH#51523 - Series.astype(np.integer) doesn't show
   1627         # numpy deprecation warning of np.integer
   1628         # Hence enabling DeprecationWarning
   1629         warnings.simplefilter("always", DeprecationWarning)
-> 1630         npdtype = np.dtype(dtype)
   1631 except SyntaxError as err:
   1632     # np.dtype uses `eval` which can raise SyntaxError
   1633     raise TypeError(f"data type '{dtype}' not understood") from err

TypeError: data type 'list<item: double>[pyarrow]' not understood

Environment:
OS: macOS (Darwin Kernel Version 22.1.0)
Python: 3.11.6
Pandas: 2.2.0
PyArrow: 15.0.0

The same error is raised even if we use pd.read_parquet("/tmp/list4.pqt", dtype_backend="numpy_nullable").

The non-Arrow-backed column version

import pyarrow as pa
import pandas as pd
pd.DataFrame({"x": pd.Series([[2.2]*5]*10)}).to_parquet("/tmp/list2.pqt")
df2 = pd.read_parquet("/tmp/list2.pqt", dtype_backend="pyarrow")

is written and read back correctly, but the column comes back Arrow-backed in the new dataframe, so it doesn't survive a further round trip.
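
In other words, writing df2 back out reproduces the failure. A sketch of that second round trip (the file name /tmp/list3.pqt is just for illustration):

df2.to_parquet("/tmp/list3.pqt")  # "x" is now Arrow-backed, so the bad numpy_type metadata gets written
pd.read_parquet("/tmp/list3.pqt", dtype_backend="pyarrow")  # raises the same TypeError as above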

I did some further digging into the Parquet pandas metadata, and found that for the file written from the Arrow-backed dataframe, we have

{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 10,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'x',
   'field_name': 'x',
   'pandas_type': 'list[float64]',
   'numpy_type': 'list<element: double>[pyarrow]',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '15.0.0'},
 'pandas_version': '2.2.0'}

whereas for the numpy-based dataframe, the output is:

{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 10,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'x',
   'field_name': 'x',
   'pandas_type': 'list[float64]',
   'numpy_type': 'object',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '15.0.0'},
 'pandas_version': '2.2.0'}

The problem seems to be that numpy_type in the Arrow case is set to 'list<element: double>[pyarrow]' rather than 'object' or a NumPy dtype string.
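
(For reference, the metadata dumps above can be pulled straight from the files without going through pandas; a minimal sketch using pyarrow's schema reader:)

import pyarrow.parquet as pq

# .pandas_metadata JSON-decodes the b"pandas" entry of the file's key-value metadata
print(pq.read_schema("/tmp/list4.pqt").pandas_metadata)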

Component(s)

Parquet, Python

@cvm-a cvm-a added the Type: bug label Feb 2, 2024
@kou kou changed the title Parquet files created from Pandas dataframes with Arrow-backed list columns cannot be read by pd.read_parquet [Python][Parquet] Parquet files created from Pandas dataframes with Arrow-backed list columns cannot be read by pd.read_parquet Feb 2, 2024
@jorisvandenbossche
Member

Thanks for the report! This seems to be an issue on the pandas side. Could you open an issue there? (https://github.com/pandas-dev/pandas/issues/)

The "list<item: double>[pyarrow]" is created by pandas, and I think we expect that pandas can also parse that string in pd.api.types.pandas_dtype

@gbrlbrbs

gbrlbrbs commented Apr 4, 2024

To add a solution for anyone who has the same problem and ended up here, you can do:

from pyarrow.parquet import read_table
import pandas as pd

tab = read_table("table.pqt")
df = tab.to_pandas(types_mapper=pd.ArrowDtype, ignore_metadata=True)
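
Passing ignore_metadata=True makes to_pandas skip the stored "pandas" metadata (where the unparseable numpy_type string lives), and types_mapper=pd.ArrowDtype maps each Arrow column directly to an ArrowDtype; the trade-off is that any index stored via that metadata is not reconstructed.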

@mahadi

mahadi commented Sep 25, 2024

For anyone who is interested, this is the pandas issue: pandas-dev/pandas#57411

For my use case it was better not to store the metadata, so my workaround is on the exporting side.

df.to_parquet(output_path, engine="pyarrow", store_schema=False)

https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html

@bretttully

@jorisvandenbossche I'm interested in your thoughts on this PR #44720 -- it doesn't solve the broader issue for non-pyarrow backends, but it does avoid the multitude of workarounds currently in use with the pyarrow backend.

@jorisvandenbossche
Member

@bretttully thanks for the ping! That looks like a nice solution for now, to avoid an error if the types_mapper would get precedence anyway.

FWIW, long term I think this logic should live on the pandas side, so this could all be much better controlled over there (#44068)

@bretttully

> @bretttully thanks for the ping! That looks like a nice solution for now, to avoid an error if the types_mapper would get precedence anyway.
>
> FWIW, long term I think this logic should live on the pandas side, so this could all be much better controlled over there (#44068)

I saw that extra ticket, and it would indeed be a nice change, although it would mean a bunch of stuff in geopandas would need to change...

@cvm-a
Author

cvm-a commented Nov 14, 2024

A workaround I have been using that still reads the metadata (to preserve the index, etc.) is to patch the metadata while reading:

import itertools
import json
from typing import Any

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet  # makes pa.parquet.read_table below available


def _fix_pandas_metadata(table: pa.Table) -> pa.Table:
    """Workaround for pyarrow/#39914 -> many arrow type columns raise errors"""
    pd_md = table.schema.pandas_metadata  # mutable, since it's freshly JSON decoded.
    if not pd_md:
        return table
    schema_md: dict = table.schema.metadata or {}
    for col_md in itertools.chain(
        pd_md.get("columns", []), pd_md.get("column_indexes", [])
    ):
        try:
            np.dtype(col_md["numpy_type"])
        except (TypeError, ValueError):
            # Should have always been this, but pandas/ Pyarrow folks messed up.
            col_md["numpy_type"] = "object"
    return table.replace_schema_metadata(schema_md | {b"pandas": json.dumps(pd_md)})


def read_arrow_parquet(
    path,
    columns: list[str] | None = None,
    filesystem: Any = None,
    filters: list[tuple] | list[list[tuple]] | None = None,
    dtype_backend="pyarrow",  # only for maintaining interface with pd.read_parquet
):
    try:
        return pd.read_parquet(
            path,
            columns=columns,
            dtype_backend=dtype_backend,
            filesystem=filesystem,
            filters=filters,
        )
    except (TypeError, ValueError, NotImplementedError):
        # Probably bad pandas metadata
        pass
    if dtype_backend != "pyarrow":
        raise NotImplementedError("Only Arrow based table read implemented.")
    table = pa.parquet.read_table(
        path,
        columns=columns,
        filesystem=filesystem,
        filters=filters,
        use_pandas_metadata=True,
    )
    return _fix_pandas_metadata(table).to_pandas(types_mapper=pd.ArrowDtype)
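
Usage is a drop-in for the failing call from the repro at the top:

df = read_arrow_parquet("/tmp/list4.pqt")  # falls back to the metadata fix-up path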

I believe the long-term fix cannot be rolled out just on the pandas side, since the bad metadata is being written by the Arrow library.
