Enable string-based column projections from Parquet files #6871

Merged 2 commits into apache:main on Dec 18, 2024

etseidl (Contributor) commented Dec 12, 2024

Which issue does this PR close?

Closes #182.

It's an old issue, so perhaps this change is not wanted, in which case this can be closed.

Rationale for this change

Allows projecting columns by name rather than index.

What changes are included in this PR?

Adds a new method ProjectionMask::columns which takes a list of column names and returns a ProjectionMask.
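To illustrate what a name-based projection involves, here is a standalone sketch that does not use the parquet crate. It assumes that a dotted name selects every leaf column whose path starts with that name, so a parent name selects all leaves beneath it. The function `mask_from_names` and the example paths are hypothetical and are not the code added by this PR.

```rust
/// Build a boolean mask over leaf columns from dotted column names.
///
/// `leaf_paths` holds the dotted path of each leaf column in schema order.
/// A leaf is selected when any requested name equals its path or is a
/// whole-component prefix of it (assumed matching rule, for illustration).
fn mask_from_names(leaf_paths: &[&str], names: &[&str]) -> Vec<bool> {
    // Split each requested name into path components once, up front.
    let wanted: Vec<Vec<&str>> = names
        .iter()
        .map(|name| name.split('.').collect())
        .collect();

    leaf_paths
        .iter()
        .map(|leaf| {
            let parts: Vec<&str> = leaf.split('.').collect();
            // Compare component by component so that "a" does not match "ab".
            wanted
                .iter()
                .any(|w| parts.len() >= w.len() && parts[..w.len()] == w[..])
        })
        .collect()
}
```

Comparing path components rather than raw string prefixes keeps a name such as `a` from accidentally selecting an unrelated column named `ab`. Names that match nothing simply leave the mask unchanged in this sketch.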

Are there any user-facing changes?

New API call.

github-actions bot added the parquet (Changes to the parquet crate) label Dec 12, 2024

alamb (Contributor) left a comment

Looks like a useful API to me -- thank you @etseidl 🙏

message test_schema {
OPTIONAL INT32 a;
OPTIONAL INT32 b;
OPTIONAL INT32 a;
alamb (Contributor) commented:

This seems like a nasty thing to do (repeating the name of a field in the Parquet file), but it seems to be allowed and your code handles it.

etseidl (Contributor Author) replied:

Yeah, I'm not a fan of this behavior, but I think some query engines (Spark, perhaps) will produce duplicate names when joining tables. Necessary evil, I guess.
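The duplicate-name case discussed above can be sketched without the parquet crate. The assumption here is that a name cannot distinguish between leaves that share it, so a lookup by name returns every matching index, while an index-based projection can still pick exactly one. The helper `leaf_indices_named` is hypothetical.

```rust
/// Return the index of every leaf column whose name equals `name`.
/// With duplicate names in the schema, more than one index comes back.
fn leaf_indices_named(leaf_names: &[&str], name: &str) -> Vec<usize> {
    leaf_names
        .iter()
        .enumerate()
        .filter(|(_, leaf)| **leaf == name)
        .map(|(idx, _)| idx)
        .collect()
}
```

For leaf names `a`, `b`, `a`, looking up `a` returns both positions 0 and 2, which is why selecting only one of the two duplicates would still require an index-based projection.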

alamb (Contributor) commented Dec 18, 2024

🚀 -- thanks again @etseidl

alamb merged commit cbe1765 into apache:main on Dec 18, 2024
16 checks passed
CurtHagenlocher pushed a commit to CurtHagenlocher/arrow-rs that referenced this pull request Dec 28, 2024
* add function to create ProjectionMask from column names

* add some more tests
Linked issue: String-based path column projection
2 participants