Some `fastparquet`-related tests are failing on Python 3.10 #896

jrbourbeau · 2023-10-26T15:02:46Z

I've seen

FAILED dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12] - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
FAILED dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13] - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
FAILED dask/dataframe/io/tests/test_parquet.py::test_timestamp96 - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
FAILED dask/dataframe/io/tests/test_parquet.py::test_with_tz[fastparquet] - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

with tracebacks like this

_______________________________ test_timestamp96 _______________________________
[gw1] linux -- Python 3.10.12 /usr/share/miniconda3/envs/test-environment/bin/python3.10

tmpdir = local('/tmp/pytest-of-runner/pytest-0/popen-gw1/test_timestamp960')

    @FASTPARQUET_MARK
    def test_timestamp96(tmpdir):
        fn = str(tmpdir)
        df = pd.DataFrame({"a": [pd.to_datetime("now", utc=True)]})
        ddf = dd.from_pandas(df, 1)
        ddf.to_parquet(fn, engine="fastparquet", write_index=False, times="int96")
        pf = fastparquet.ParquetFile(fn)
        assert pf._schema[1].type == fastparquet.parquet_thrift.Type.INT96
>       out = dd.read_parquet(fn, engine="fastparquet", index=False).compute()

dask/dataframe/io/tests/test_parquet.py:1883: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dask/base.py:342: in compute
    (result,) = compute(self, traverse=False, **kwargs)
dask/base.py:628: in compute
    results = schedule(dsk, keys, **kwargs)
dask/dataframe/io/parquet/core.py:96: in __call__
    return read_parquet_part(
dask/dataframe/io/parquet/core.py:654: in read_parquet_part
    dfs = [
dask/dataframe/io/parquet/core.py:655: in <listcomp>
    func(
dask/dataframe/io/parquet/fastparquet.py:1075: in read_partition
    return cls.pf_to_pandas(
dask/dataframe/io/parquet/fastparquet.py:1115: in pf_to_pandas
    df, views = pf.pre_allocate(size, columns, categories, index)
/usr/share/miniconda3/envs/test-environment/lib/python3.10/site-packages/fastparquet/api.py:797: in pre_allocate
    df, arrs = _pre_allocate(size, columns, categories, index, cats,
/usr/share/miniconda3/envs/test-environment/lib/python3.10/site-packages/fastparquet/api.py:1051: in _pre_allocate
    df, views = dataframe.empty(dtypes, size, cols=cols, index_names=index,
/usr/share/miniconda3/envs/test-environment/lib/python3.10/site-packages/fastparquet/dataframe.py:202: in empty
    values = type(bvalues)._from_sequence(values, copy=False, dtype=bvalues.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

pandas/_libs/tslibs/tzconversion.pyx:187: ValueError

showing up this morning on multiple PRs. See this CI build for full details.

Note all the errors involve fastparquet, which had a release yesterday. @martindurant any idea what might be happening here?

martindurant · 2023-10-26T15:04:18Z

Transferring to fastparquet, but will keep you in the loop @jrbourbeau

martindurant · 2023-10-26T15:04:38Z

(actually, I can't transfer, will duplicate)

jrbourbeau · 2023-10-26T15:06:14Z

Just transferred over

martindurant · 2023-10-26T15:08:09Z

Regression due to #893 @jbrockmendel

martindurant · 2023-10-26T15:10:44Z

Note that the same tests did pass in fastparquet's CI, e.g. https://github.com/dask/fastparquet/actions/runs/6615631492/job/17968182303#step:6:83
Maybe we have different versions of pandas?

jbrockmendel · 2023-10-26T15:12:41Z

This surfaces a bug upstream that I'll work on. Fortunately, it's easy to work around here: in the code from #893, pass int64 values to `_from_sequence` instead of dt64 values. That will also be more performant.

martindurant · 2023-10-26T15:28:22Z

values = type(bvalues)._from_sequence(values.view("int64"), copy=False, dtype=bvalues.dtype)

?

I am puzzled why only this invocation of the same method would need this, but if you say so...
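For readers following along, here is a minimal sketch of what the `.view("int64")` in that line does, using only NumPy. The array contents are illustrative and not taken from fastparquet; the point is just that the view reinterprets the same buffer rather than copying it.

```python
import numpy as np

# Illustrative datetime64[ns] buffer (hypothetical values).
values = np.array(
    ["2023-10-26T15:00:00", "2023-10-26T16:00:00"], dtype="M8[ns]"
)

# Reinterpret the same memory as int64 nanoseconds since the epoch.
as_int = values.view("int64")

# No copy is made: both arrays are backed by the same buffer.
print(np.shares_memory(values, as_int))

# Writing through the int64 view is visible through the datetime64 array.
as_int[0] = 0
print(values[0])
```

Because nothing is copied, passing the view should not add overhead compared with passing the dt64 array itself.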

jbrockmendel · 2023-10-26T15:32:54Z

> I am puzzled why only this invocation of the same method would need this, but if you say so...

You are not alone in this. The API design question from ages ago was: "when passing dt64 values and a pd.DatetimeTZDtype to DatetimeIndex (which has the same behavior as _from_sequence here), do we interpret them as wall-times or UTC times?" We eventually landed on wall-times, while i8 values get interpreted as UTC times. Wall-times need to go through a Cython function that converts them to UTC times. It is that Cython function that is raising.
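The distinction can be seen with a small sketch, assuming a recent pandas in which `DatetimeIndex` accepts a `dtype` argument alongside dt64 or int64 data. The time zone and timestamp below are illustrative choices, not taken from the failing tests.

```python
import numpy as np
import pandas as pd

# Illustrative inputs (assumptions, not from the failing tests).
dtype = pd.DatetimeTZDtype(tz="America/New_York")
dt64_values = np.array(["2023-10-26T12:00:00"], dtype="M8[ns]")

# dt64 input: read as wall-times in the target zone, so the values
# must be converted to UTC internally.
from_wall = pd.DatetimeIndex(dt64_values, dtype=dtype)

# int64 (i8) input: read as nanoseconds since the epoch in UTC,
# so no wall-time conversion is needed.
from_utc = pd.DatetimeIndex(dt64_values.view("int64"), dtype=dtype)

print(from_wall[0])
print(from_utc[0])
```

With these inputs the two results refer to different instants, separated by the zone's UTC offset, even though the underlying bytes passed in are identical.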

mrocklin · 2023-10-26T18:17:19Z

Dask CI continues to fail during this period. Should we xfail these tests in the meantime?

jrbourbeau · 2023-10-26T18:36:28Z

I believe a new fastparquet release is imminent after #899 is merged (though I don't object to xfail either)
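If the xfail route were taken, a minimal sketch could look like the following. The test name and body are hypothetical stand-ins (the real tests live in dask/dataframe/io/tests/test_parquet.py), and the `reason` text is only a suggestion.

```python
import pytest


# Hypothetical stand-in for one of the affected tests; the body only
# simulates the failure reported above.
@pytest.mark.xfail(
    raises=ValueError,
    strict=False,  # do not fail the suite once a fixed release makes it pass
    reason="fastparquet regression in _from_sequence call, see #896",
)
def test_timestamp96_placeholder():
    raise ValueError(
        "Buffer has wrong number of dimensions (expected 1, got 2)"
    )
```

Using `strict=False` means the marker can stay in place briefly after the fix is released without turning unexpected passes into failures.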

jrbourbeau transferred this issue from dask/dask Oct 26, 2023

martindurant mentioned this issue Oct 26, 2023

Regression due to _from_sequence #897

Closed
