bug: `.drop()` causing mismatch in RecordBatchReader schema #8393

judahrand · 2024-02-19T17:45:47Z

What happened?

Calling .drop() on a Table which contains a column name longer than 59 characters and then calling .to_pyarrow_batches() results in a broken RecordBatchReader where the contained RecordBatches do not match the schema of the RecordBatchReader due to column name truncations. This breaks RecordBatchReader.read_all().

What is surprising, however, is that if the .drop() is the first method called after conn.register() the truncation does not seem to occur.

import ibis
import pyarrow


table = pyarrow.Table.from_pydict(
    {
        "a_short_column_name": [1, 2, 3],
        "a_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_very_long_column_name": [4, 5, 6],
    },
)

ibis.set_backend("duckdb")
ibis_table = ibis.memtable(table)

result = ibis_table.select(
    *ibis_table.schema().keys(),
).drop(
    "a_short_column_name",
)

record_batch_reader = result.to_pyarrow_batches(chunk_size=1)
print(record_batch_reader.read_next_batch().schema)
assert record_batch_reader.schema == record_batch_reader.read_next_batch().schema

a_very_very_very_very_very_very_very_very_very_very_very__1: int64
Traceback (most recent call last):
  File "<REDACTED>/minimal_example.py", line 23, in <module>
    assert record_batch_reader.schema == record_batch_reader.read_next_batch().schema
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

What version of ibis are you using?

❯ pip list | grep -e ibis -e duckdb
duckdb                                   0.10.0
duckdb-engine                            0.9.2
ibis-framework                           8.0.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

judahrand · 2024-02-19T17:53:23Z

This issue does not appear to be present in the Pandas backend.

cpcloud · 2024-02-19T18:18:59Z

Thanks for the issue.

What version of pyarrow are you using?

Copypasting (thank you!) your code works for me on 91f46b9 using:

duckdb 0.9.2
pyarrow 15.0.0

judahrand · 2024-02-19T18:27:08Z

❯ pip list | grep -e pyarrow -e ibis-framework -e duckdb
duckdb                                   0.9.2
duckdb-engine                            0.9.2
ibis-framework                           8.0.0
pyarrow                                  15.0.0
pyarrow-hotfix                           0.4

❯ python minimal_example.py
a_very_very_very_very_very_very_very_very_very_very_very__1: int64
Traceback (most recent call last):
  File "<REDACTED>/minimal_example.py", line 25, in <module>
    assert record_batch_reader.schema == record_batch_reader.read_next_batch().schema
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

judahrand · 2024-02-19T18:31:04Z

Thanks for the issue.

What version of pyarrow are you using?

Copypasting (thank you!) your code works for me on 91f46b9 using:

duckdb 0.9.2

pyarrow 15.0.0

That commit also is working for me! So it looks like this has been fixed on main already since the 8.0.0 release? Would it be worth me trying to bisect where it was fixed so we could back port whatever fixed it into a patch release? This issue silently broke some of our data pipelines so we're keen to see a fix available. If not I suppose we can pip install from a specific commit for now.

cpcloud · 2024-02-19T19:23:17Z

Bisecting will be tough, there are a number of commits where the codebase was not in a state where any CI was being run due to the refactor coming from the-epic-split branch.

We did that deliberately to incrementally re-enable each backend in CI as we ported them to sqlglot.

I will take a look at what's happening in 8.0.0 and see if I can track down the cause of the issue at least.

cpcloud · 2024-02-20T12:18:21Z

I was able to reproduce this without Ibis and without an explicit PyArrow import, but not with DuckDB alone.

This seems like an issue with duckdb_engine and perhaps sqlalchemy.

import sqlalchemy as sa

long_name = "a_very_very_very_very_very_very_very_very_very_long_column_name"

eng = sa.create_engine("duckdb:///test.db")

with eng.begin() as con:
    con.exec_driver_sql(f"CREATE OR REPLACE TABLE t ({long_name} INT)")

t = sa.Table("t", sa.MetaData(), autoload_with=eng)
t1 = sa.select(t.c[0]).select_from(t).subquery()
query = sa.select(t1.c[long_name])

with eng.begin() as con:
    result = con.execute(query)
    batch = result.connection.connection.fetch_record_batch()

assert batch.schema.names[0] == long_name, batch.schema.names[0]

One of the things we've done between 8.0.0 and now is completely remove Ibis's explicit dependency on SQLAlchemy, so duckdb_engine isn't in the mix anymore: we use the duckdb DBAPI directly.

I'm not sure the backport is worth the effort. I'd rather not introduce a workaround for a problem that doesn't exist.

Instead, we're happy to help get your code working against a specific commit, or you can use one of the pre-release wheels that we publish from main once a week to PyPI.

cpcloud · 2024-02-20T14:01:03Z

@judahrand I filed Mause/duckdb_engine#910.

judahrand added the bug Incorrect behavior inside of ibis label Feb 19, 2024

github-project-automation bot added this to Ibis planning and roadmap Feb 19, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Feb 19, 2024

judahrand mentioned this issue Feb 19, 2024

fix(duckdb): to_pyarrow_batches can result in broken RecordBatchReader #8394

Closed

cpcloud closed this as completed Feb 20, 2024

github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Feb 20, 2024

cpcloud mentioned this issue Feb 20, 2024

[Bug]: Long column names are truncated Mause/duckdb_engine#910

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: `.drop()` causing mismatch in RecordBatchReader schema #8393

bug: `.drop()` causing mismatch in RecordBatchReader schema #8393

judahrand commented Feb 19, 2024 •

edited

Loading

judahrand commented Feb 19, 2024

cpcloud commented Feb 19, 2024 •

edited

Loading

judahrand commented Feb 19, 2024

judahrand commented Feb 19, 2024

cpcloud commented Feb 19, 2024

cpcloud commented Feb 20, 2024 •

edited

Loading

cpcloud commented Feb 20, 2024

bug: .drop() causing mismatch in RecordBatchReader schema #8393

bug: .drop() causing mismatch in RecordBatchReader schema #8393

Comments

judahrand commented Feb 19, 2024 • edited Loading

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

judahrand commented Feb 19, 2024

cpcloud commented Feb 19, 2024 • edited Loading

judahrand commented Feb 19, 2024

judahrand commented Feb 19, 2024

cpcloud commented Feb 19, 2024

cpcloud commented Feb 20, 2024 • edited Loading

cpcloud commented Feb 20, 2024

bug: `.drop()` causing mismatch in RecordBatchReader schema #8393

bug: `.drop()` causing mismatch in RecordBatchReader schema #8393

judahrand commented Feb 19, 2024 •

edited

Loading

cpcloud commented Feb 19, 2024 •

edited

Loading

cpcloud commented Feb 20, 2024 •

edited

Loading