Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: .drop() causing mismatch in RecordBatchReader schema #8393

Closed
1 task done
judahrand opened this issue Feb 19, 2024 · 7 comments
Closed
1 task done

bug: .drop() causing mismatch in RecordBatchReader schema #8393

judahrand opened this issue Feb 19, 2024 · 7 comments
Labels
bug Incorrect behavior inside of ibis

Comments

@judahrand
Copy link

judahrand commented Feb 19, 2024

What happened?

Calling .drop() on a Table which contains a column name longer than 59 characters and then calling .to_pyarrow_batches() results in a broken RecordBatchReader where the contained RecordBatches do not match the schema of the RecordBatchReader due to column name truncations. This breaks RecordBatchReader.read_all().

What is surprising, however, is that if the .drop() is the first method called after conn.register() the truncation does not seem to occur.

import ibis
import pyarrow


table = pyarrow.Table.from_pydict(
    {
        "a_short_column_name": [1, 2, 3],
        "a_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_long_column_name": [4, 5, 6],
    },
)

ibis.set_backend("duckdb")
ibis_table = ibis.memtable(table)

result = ibis_table.select(
    *ibis_table.schema().keys(),
).drop(
    "a_short_column_name",
)

record_batch_reader = result.to_pyarrow_batches(chunk_size=1)
print(record_batch_reader.read_next_batch().schema)
assert record_batch_reader.schema == record_batch_reader.read_next_batch().schema
a_very_very_very_very_very_very_very_very_very_very_very__1: int64
Traceback (most recent call last):
  File "<REDACTED>/minimal_example.py", line 23, in <module>
    assert record_batch_reader.schema == record_batch_reader.read_next_batch().schema
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

What version of ibis are you using?

❯ pip list | grep -e ibis -e duckdb
duckdb                                   0.10.0
duckdb-engine                            0.9.2
ibis-framework                           8.0.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@judahrand judahrand added the bug Incorrect behavior inside of ibis label Feb 19, 2024
@judahrand
Copy link
Author

This issue does not appear to be present in the Pandas backend.

@cpcloud
Copy link
Member

cpcloud commented Feb 19, 2024

Thanks for the issue.

What version of pyarrow are you using?

Copypasting (thank you!) your code works for me on 91f46b9 using:

  • duckdb 0.9.2
  • pyarrow 15.0.0

@judahrand
Copy link
Author

❯ pip list | grep -e pyarrow -e ibis-framework -e duckdb
duckdb                                   0.9.2
duckdb-engine                            0.9.2
ibis-framework                           8.0.0
pyarrow                                  15.0.0
pyarrow-hotfix                           0.4

❯ python minimal_example.py
a_very_very_very_very_very_very_very_very_very_very_very__1: int64
Traceback (most recent call last):
  File "<REDACTED>/minimal_example.py", line 25, in <module>
    assert record_batch_reader.schema == record_batch_reader.read_next_batch().schema
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

@judahrand
Copy link
Author

Thanks for the issue.

What version of pyarrow are you using?

Copypasting (thank you!) your code works for me on 91f46b9 using:

  • duckdb 0.9.2
  • pyarrow 15.0.0

That commit also is working for me! So it looks like this has been fixed on main already since the 8.0.0 release? Would it be worth me trying to bisect where it was fixed so we could back port whatever fixed it into a patch release? This issue silently broke some of our data pipelines so we're keen to see a fix available. If not I suppose we can pip install from a specific commit for now.

@cpcloud
Copy link
Member

cpcloud commented Feb 19, 2024

Bisecting will be tough, there are a number of commits where the codebase was not in a state where any CI was being run due to the refactor coming from the-epic-split branch.

We did that deliberately to incrementally re-enable each backend in CI as we ported them to sqlglot.

I will take a look at what's happening in 8.0.0 and see if I can track down the cause of the issue at least.

@cpcloud
Copy link
Member

cpcloud commented Feb 20, 2024

I was able to reproduce this without Ibis and without an explicit PyArrow import, but not with DuckDB alone.

This seems like an issue with duckdb_engine and perhaps sqlalchemy.

import sqlalchemy as sa

long_name = "a_very_very_very_very_very_very_very_very_very_long_column_name"

eng = sa.create_engine("duckdb:///test.db")

with eng.begin() as con:
    con.exec_driver_sql(f"CREATE OR REPLACE TABLE t ({long_name} INT)")

t = sa.Table("t", sa.MetaData(), autoload_with=eng)
t1 = sa.select(t.c[0]).select_from(t).subquery()
query = sa.select(t1.c[long_name])

with eng.begin() as con:
    result = con.execute(query)
    batch = result.connection.connection.fetch_record_batch()

assert batch.schema.names[0] == long_name, batch.schema.names[0]

One of the things we've done between 8.0.0 and now is completely remove Ibis's explicit dependency on SQLAlchemy, so duckdb_engine isn't in the mix anymore: we use the duckdb DBAPI directly.

I'm not sure the backport is worth the effort. I'd rather not introduce a workaround for a problem that doesn't exist.

Instead, we're happy to help get your code working against a specific commit, or you can use one of the pre-release wheels that we publish from main once a week to PyPI.

@cpcloud
Copy link
Member

cpcloud commented Feb 20, 2024

@judahrand I filed Mause/duckdb_engine#910.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants