Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: change arrow map root name to follow with parquet root name #2538

Merged
merged 1 commit into from
Aug 3, 2024

Conversation

sclmn
Copy link
Contributor

@sclmn sclmn commented May 24, 2024

Description

Change root field name for map to key_value

Related Issue(s)

#2182

Documentation

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label May 24, 2024
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@sclmn sclmn changed the title Fix map root type to comply with https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps. The indication is to have a key_value name instead of entries. fix: Change map root type to comply with https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps. The indication is to have a key_value name instead of entries. May 24, 2024
@sclmn sclmn changed the title fix: Change map root type to comply with https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps. The indication is to have a key_value name instead of entries. fix: change map root type to comply with https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps to use key_value instead of entries. May 24, 2024
@roeap
Copy link
Collaborator

roeap commented May 24, 2024

@sclmn - thanks for opening this PR.

Would you mind elaborating a bit on the use case / problem you are encountering? Main reason I am asking is that we have been struggling a bit to come up with a system that would be less opinionated about "internal" field names.

Here we are not really configuring names in parquet, but rather arrow, which may be written to parquet. In the arrow world, there is a difference between implementations which name is used, and it actually sometimes does not even matter.

In principle I would be fine merging this, but I would not want to promise, that this name stays like this indefinitely, as it conflicts with arrow, where we have a larger surface area.

@sclmn
Copy link
Contributor Author

sclmn commented May 24, 2024

@sclmn - thanks for opening this PR.

Would you mind elaborating a bit on the use case / problem you are encountering? Main reason I am asking is that we have been struggling a bit to come up with a system that would be less opinionated about "internal" field names.

Here we are not really configuring names in parquet, but rather arrow, which may be written to parquet. In the arrow world, there is a difference between implementations which name is used, and it actually sometimes does not even matter.

In principle I would be fine merging this, but I would not want to promise, that this name stays like this indefinitely, as it conflicts with arrow, where we have a larger surface area.

The delta lake protocol defines the type of various fields as map (eg partitionValues) and since the format of the checkpoint is parquet we expect key_value. If you look at the link above it states:

The outer-most level must be a group annotated with MAP that contains a single field named key_value. The repetition of this level must be either optional or required and determines whether the list is nullable.

Various code validates this to some degree (eg arrow-rs ).

@roeap
Copy link
Collaborator

roeap commented May 24, 2024

certainly, but I believe we may be mixing things up here a bit, so just making sure we are talking about the same things here.

  • the spec you reference is about the parquet - i.e. how data is represented on disk.
  • The parquet reader link from the arrow-rs crate checks for the existence of a single child, regardless its name. also subsequent processing is order based.
  • the link to the delta protocol has no opinion about the actual in-memory representation of the data.

SO the thing we want to change in this PR, only controls how data is represented in memory. The name entries is in accordance with how arrow-rs models map arrays - albeit I think arrow-rs does not require that specific name. IIRC, different arrow implementations made different choices.

SO ultimately I believe it is the responsibility of the parquet writer, to make sure that the "Inner" map field gets the right name according to the spec.

That said, we have seen several instances now, where the mismatch in conventions how these inner fields are named (also applied to the varies lists variants in parquet etc.) became quite bothersome.

As such I'd be keen to understand a little bit better where you are hitting an error based on this name to finally find a sustainable solution how we can handle the various schema conversions.

@sclmn
Copy link
Contributor Author

sclmn commented May 24, 2024

I'm trying to fix the on disk result (working on supporting this at BigQuery). The current code generates delta lake checkpoints that have an improper parquet schema. For instance, the tags field looks like:
optional group field_id=-1 tags (Map) {
repeated group field_id=-1 entries {
required binary field_id=-1 key (String);
optional binary field_id=-1 value (String);
}
}

@sclmn sclmn changed the title fix: change map root type to comply with https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps to use key_value instead of entries. fix: change map root type to comply with https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps to use key_value instead of entries May 29, 2024
@emkornfield
Copy link

In principle I would be fine merging this, but I would not want to promise, that this name stays like this indefinitely, as it conflicts with arrow, where we have a larger surface area.

In regards to this, the Arrow specification is a lot more lenient on the the layout of maps than parquet.

@alamb @tustvold are there any flags in Parquet-rs to make it transparently write map types with compliant naming?

@alamb
Copy link

alamb commented May 30, 2024

@alamb @tustvold are there any flags in Parquet-rs to make it transparently write map types with compliant naming?

Not that I know of. I think it will typically write whatever is in the arrow schema

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Aug 3, 2024

I think we should keep the Parquet field names here, cause we are going back and forth between arrow and parquet it's good to keep this consistent.

As @roeap has pointed out there is no real right here, but it seems that using parquet names gives less problems then using the arrow names.

This already happens here as well: delta-incubator/delta-kernel-rs#299, because most of our code expects the arrow datatypes to be structured the way parquet datatypes are, we are unfortunately getting schema mismatches when we try to cast recordbatches

@ion-elgreco ion-elgreco changed the title fix: change map root type to comply with https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps to use key_value instead of entries fix: change arrow map root name to comply with parquet root name Aug 3, 2024
@ion-elgreco ion-elgreco changed the title fix: change arrow map root name to comply with parquet root name fix: change arrow map root name to follow with parquet root name Aug 3, 2024
@ion-elgreco ion-elgreco added this pull request to the merge queue Aug 3, 2024
Merged via the queue into delta-io:main with commit dada188 Aug 3, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants