Some fastparquet-related tests are failing on Python 3.10 #896

Open
jrbourbeau opened this issue Oct 26, 2023 · 10 comments

Comments

@jrbourbeau
Member

I've seen

FAILED dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12] - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
FAILED dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13] - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
FAILED dask/dataframe/io/tests/test_parquet.py::test_timestamp96 - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
FAILED dask/dataframe/io/tests/test_parquet.py::test_with_tz[fastparquet] - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

with tracebacks like this

_______________________________ test_timestamp96 _______________________________
[gw1] linux -- Python 3.10.12 /usr/share/miniconda3/envs/test-environment/bin/python3.10

tmpdir = local('/tmp/pytest-of-runner/pytest-0/popen-gw1/test_timestamp960')

    @FASTPARQUET_MARK
    def test_timestamp96(tmpdir):
        fn = str(tmpdir)
        df = pd.DataFrame({"a": [pd.to_datetime("now", utc=True)]})
        ddf = dd.from_pandas(df, 1)
        ddf.to_parquet(fn, engine="fastparquet", write_index=False, times="int96")
        pf = fastparquet.ParquetFile(fn)
        assert pf._schema[1].type == fastparquet.parquet_thrift.Type.INT96
>       out = dd.read_parquet(fn, engine="fastparquet", index=False).compute()

dask/dataframe/io/tests/test_parquet.py:1883: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dask/base.py:342: in compute
    (result,) = compute(self, traverse=False, **kwargs)
dask/base.py:628: in compute
    results = schedule(dsk, keys, **kwargs)
dask/dataframe/io/parquet/core.py:96: in __call__
    return read_parquet_part(
dask/dataframe/io/parquet/core.py:654: in read_parquet_part
    dfs = [
dask/dataframe/io/parquet/core.py:655: in <listcomp>
    func(
dask/dataframe/io/parquet/fastparquet.py:1075: in read_partition
    return cls.pf_to_pandas(
dask/dataframe/io/parquet/fastparquet.py:1115: in pf_to_pandas
    df, views = pf.pre_allocate(size, columns, categories, index)
/usr/share/miniconda3/envs/test-environment/lib/python3.10/site-packages/fastparquet/api.py:797: in pre_allocate
    df, arrs = _pre_allocate(size, columns, categories, index, cats,
/usr/share/miniconda3/envs/test-environment/lib/python3.10/site-packages/fastparquet/api.py:1051: in _pre_allocate
    df, views = dataframe.empty(dtypes, size, cols=cols, index_names=index,
/usr/share/miniconda3/envs/test-environment/lib/python3.10/site-packages/fastparquet/dataframe.py:202: in empty
    values = type(bvalues)._from_sequence(values, copy=False, dtype=bvalues.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

pandas/_libs/tslibs/tzconversion.pyx:187: ValueError

showing up this morning on multiple PRs. See this CI build for full details.

Note all the errors involve fastparquet, which had a release yesterday. @martindurant any idea what might be happening here?

@martindurant
Member

Transferring to fastparquet, but will keep you in the loop @jrbourbeau

@martindurant
Member

(actually, I can't transfer, will duplicate)

@jrbourbeau jrbourbeau transferred this issue from dask/dask Oct 26, 2023
@jrbourbeau
Member Author

Just transferred over

@martindurant
Member

Regression due to #893 @jbrockmendel

@martindurant
Member

Note that the same tests did pass in fastparquet's CI: e.g. https://github.com/dask/fastparquet/actions/runs/6615631492/job/17968182303#step:6:83
Maybe we have different versions of pandas?

@jbrockmendel

This surfaces an upstream bug that I'll work on. Fortunately it's easy to work around here: in the code from #893, pass int64 values to _from_sequence instead of dt64 values. That will also be more performant.
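
A small NumPy sketch of why handing over int64 is cheap: `.view("int64")` reinterprets the datetime64[ns] buffer in place instead of copying it. The array contents below are made up for illustration, and the 2D shape only mimics the two-dimensional buffer that the error message mentions; this is not fastparquet's actual allocation code.

```python
import numpy as np

# Hypothetical (1, 3) block of naive datetime64[ns] values.
dt64 = np.array(
    [["2023-10-26T00:00", "2023-10-26T12:00", "2023-10-27T00:00"]],
    dtype="datetime64[ns]",
)

# Reinterpret the same bytes as int64 nanoseconds since the epoch.
i8 = dt64.view("int64")

# The view shares memory with the original and keeps its shape,
# so no data is copied.
assert np.shares_memory(dt64, i8)
assert i8.shape == dt64.shape

# A write through the int64 view is visible in the datetime64 array:
# adding one hour of nanoseconds moves the first value to 01:00.
i8[0, 0] += 3_600 * 10**9
assert dt64[0, 0] == np.datetime64("2023-10-26T01:00", "ns")
```

Because the two arrays alias the same buffer, any views handed out earlier stay valid after the reinterpretation.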

@martindurant
Member

values = type(bvalues)._from_sequence(values.view("int64"), copy=False, dtype=bvalues.dtype)

?

I am puzzled why only this invocation of the same method would need this, but if you say so...

@jbrockmendel

I am puzzled why only this invocation of the same method would need this, but if you say so...

You are not alone in this. The API design question from ages ago was: "when passing dt64 values and a pd.DatetimeTZDtype to DatetimeIndex (which has the same behavior as _from_sequence here), do we interpret them as wall-times or UTC times?" We eventually landed on wall-times, while i8 values get interpreted as UTC times. Wall-times need to go through a Cython function that converts them to UTC times. It is that Cython function that is raising.
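
A minimal sketch of the wall-time versus UTC distinction described above, using the public `DatetimeIndex` constructor rather than the private `_from_sequence`. It assumes a reasonably recent pandas (1.0 or later); the timezone and timestamp are arbitrary illustration values, not taken from the failing tests.

```python
import numpy as np
import pandas as pd

# Arbitrary example timezone and naive timestamp (assumptions for illustration).
dtype = pd.DatetimeTZDtype(tz="America/New_York")
dt64 = np.array(["2023-10-26T12:00:00"], dtype="datetime64[ns]")
i8 = dt64.view("int64")  # same bytes, read as nanoseconds since the epoch

# datetime64 input is treated as wall-time in the target zone,
# so it must be converted to UTC internally.
from_dt64 = pd.DatetimeIndex(dt64, dtype=dtype)

# int64 input is treated as UTC already, so no conversion is needed.
from_i8 = pd.DatetimeIndex(i8, dtype=dtype)

# Same input bytes, two different instants.
assert from_dt64[0] == pd.Timestamp("2023-10-26 12:00", tz="America/New_York")
assert from_i8[0] == pd.Timestamp("2023-10-26 12:00", tz="UTC")
assert from_dt64[0] != from_i8[0]
```

The int64 route is only a valid workaround when the stored values are already UTC, since it skips the wall-time conversion entirely.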

@mrocklin
Member

Dask CI continues to fail during this period. Should we xfail these tests in the meantime?

@jrbourbeau
Member Author

I believe a new fastparquet release is imminent after #899 is merged (though I don't object to xfail either)
