Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Option To Coerce List Type on Parquet Write #6733

Closed
ggreco opened this issue Nov 15, 2024 · 8 comments · Fixed by #6828
Closed

Add Option To Coerce List Type on Parquet Write #6733

ggreco opened this issue Nov 15, 2024 · 8 comments · Fixed by #6828
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@ggreco
Copy link

ggreco commented Nov 15, 2024

Describe the bug
arrow-rs generated .parquet files where the schema implies a nested structure should call the list item element as of parquet specifications:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

... the files generated are instead using item, probably some legacy code was used to build the code.

A similar issue has been recently fixed in polars-rs:
pola-rs/polars#17803

Pyarrow let you use item instead of element (default) to support legacy files, but IMHO arrow-rs should not generate legacy parquet files ( https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
):
image

The code in arrow-rs that implement this is:
https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/field.rs#L147

IMHO the fix will just involve a single line change, I can create a PR, but I want to be sure I'm not reading the specs in the wrong way or there is a reason for hardcoding item since it seems too simple...

To Reproduce
Generate a nested parquet file, or use the one attached to this issue and verify (with an hex editor, parquet-schema from this REPO or with a GUI tool that shows the parquet schema like "parquet floor"), that the type name associated to the list item is always item instead of element.

Using
example_parquet.zip
the file attached to this ticket that follow the schema will be reported by arrow-schema :

{
  REQUIRED BYTE_ARRAY school (STRING);
  REQUIRED group students (LIST) {
    REPEATED group list {
      OPTIONAL group item {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
  REQUIRED group teachers (LIST) {
    REPEATED group list {
      OPTIONAL group item {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
}

the expected value was:

{
  REQUIRED BYTE_ARRAY school (STRING);
  REQUIRED group students (LIST) {
    REPEATED group list {
      OPTIONAL group element {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
  REQUIRED group teachers (LIST) {
    REPEATED group list {
      OPTIONAL group element {
        REQUIRED BYTE_ARRAY name (STRING);
        REQUIRED INT32 age;
      }
    }
  }
}

I can get parquet-schema to output element instead of item when generating the parquet file from python or .net.

In the hex editor you will see students.list.item.name instead of the expected students.list.element.name.

@ggreco ggreco added the bug label Nov 15, 2024
@tustvold
Copy link
Contributor

tustvold commented Nov 15, 2024

So I don't think this is a bug per-se, the parquet writer converts the arrow schema faithfully into parquet, preserving the field name of the list elements.

The problem arises because the default within the arrow ecosystem is to call this "item" and not "element".

>>> import pyarrow as pa
>>> pa.list_(pa.string())
ListType(list<item: string>)
>>> pa.list_(pa.string()).field(0).name
'item'

The reason this matters is because the parquet schema is authoritative, that is when reading back a parquet file with a field name of "element", the arrow schema should reflect this. Therefore if we coerced to "element" on write the schema would not roundtrip as people might expect.

In the short-term, if you adjust your arrow schema to use "element", the parquet writer will follow this.

Longer term I think the way to handle this is probably #1938, where we add an option to coerce arrow types to be more compatible with parquet's type system, with the understanding that things may not always roundtrip completely faithfully.

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Nov 15, 2024
@tustvold tustvold changed the title Generated nested parquet files use ITEM instead of ELEMENT for list nodes Add Option To Coerce List Type on Parquet Write Nov 15, 2024
@ggreco
Copy link
Author

ggreco commented Nov 18, 2024

BTW I tried to generate an arrow document in C++ and:

  • use the C interface to write the parquet file with zero-copy with rust
  • simply publish the parquet file with c++

In the first path the lists contain struct objects called item, while in the second path the lists contain struct objects called element, so the C++ parquet writing logic already does this conversion to better support the parquet standard.

I've found that at least a major parquet reader created following the parquet specs and not the arrow ones ( https://aloneguid.github.io/parquet-dotnet/starter-topic.html ) cannot properly read the nested parquet files created with arrow-rs, so I think the bug label should probably stay.

@tustvold
Copy link
Contributor

tustvold commented Nov 18, 2024

That is a bug in that parquet implementation, it should handle this case:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules

I agree the situation is unfortunate, parquet should have defined these things from the start, but it didn't, and efforts are in fact still ongoing to refine the specification w.r.t nested schemas.

Edit: TBC the reason the distinction matters is if this was a bug we could change this behaviour in a patch release, we cannot do this as it will break people's schemas

@etseidl
Copy link
Contributor

etseidl commented Dec 2, 2024

@tustvold @alamb do you think the coerce_types option added in #6313 would work for this, or should there be a separate coerce_names for lists and maps? The former seems to be for where data is lost, whereas the naming only results in a metadata loss.

@tustvold
Copy link
Contributor

tustvold commented Dec 2, 2024

I'd vote for putting this under the same coerce_types, but I don't feel strongly. From arrow's perspective the name is part of the type

@alamb
Copy link
Contributor

alamb commented Dec 2, 2024

I agree the existing coerce_types is a good location

Thank you for looking into this @etseidl

i think we could add some more documentation / tests to explain what this flag means too and help people find it more easily

@alamb
Copy link
Contributor

alamb commented Dec 17, 2024

label_issue.py automatically added labels {'parquet'} from #6828

@alamb alamb added the arrow Changes to the arrow crate label Dec 17, 2024
@alamb
Copy link
Contributor

alamb commented Dec 17, 2024

label_issue.py automatically added labels {'arrow'} from #6814

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants