Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parquet): Support struct schema evolution matching by name #5962

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rui-mo
Copy link
Collaborator

@rui-mo rui-mo commented Aug 2, 2023

The default behavior of the schema evolution for row type is matching by index.
This PR supports matching by name for Parquet file format. Missing subfields
are identified by matching the file type and requested type on the names of
subfileds. When all the subfields in the requested type are missing and the
number of subfields is more than one, the struct is set as null. Otherwise,
'null' occupies the position of the missing subfields. Below table summarizes
the supported cases.

Parquet column schema Requested output schema Result
row({"a", "c"}) row({"a", "b", "c"}) row(a_val, null, c_val)
row({"a", "c"}) row({"b"}) row(null)
row({"a", "c"}) row({"b", "d"}) null
row({"a", "c"}) row({}) empty

@netlify
Copy link

netlify bot commented Aug 2, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit b8b9fbc
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/679389a59fb0980008084717

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 2, 2023
@Yuhta Yuhta self-requested a review August 2, 2023 17:44
/**
* Get the output type of row reader.
*/
const RowTypePtr& getOutputType() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested type is available as getSelector()->getSchemaWithId()->type. We may want to convert it to a type directly in the future, but for now let's not keep 2 copies of the same thing.

for (auto i = 0; i < childSpecs.size(); ++i) {
if (childSpecs[i]->isConstant()) {
continue;
}
auto childDataType = fileType_->childByName(childSpecs[i]->fieldName());
const auto& fieldName = childSpecs[i]->fieldName();
if (outputType && !fileType_->containsChild(fieldName)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to decide what is the schema evolution strategy we want here. In our data warehouse, columns are not matched by name but by position, so any extra fields added need to be at the end of the children list. This allows column renaming. If we match by name here, we will lose the renaming functionality and this seems quite important in most data warehouse.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment. Does that mean for a row(a, c) struct schema in parquet, the expected output can only be like row(a, c, xxx, ...)? In Spark, there is no such limitation to extra child fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes new subfields can only be appended. So in plain vanilla Spark, field renaming is not supported? There is also a third way to match by field ID (e.g. Iceberg), we need to start draft some design about this to cover all three cases.

Copy link
Collaborator Author

@rui-mo rui-mo Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does field renaming is conducted in the data warehouse you mentioned? In Spark, for query like select a as b, it adds a projection node with Alias expression after scan.
And what do you suggest for the design, should I added some notes in this PR or something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With matching by name you need to know all the old field names (a in your query) in all old files, which is not practical in a normal data warehouse. I would suggest we pause this PR for a bit and design the right way to allow matching columns in different ways first.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That looks good to me. Convert this PR to draft for now.

@rui-mo rui-mo changed the title Support struct column reading with different schemas [GLUTEN] Support struct column reading with different schemas Aug 4, 2023
@rui-mo rui-mo marked this pull request as draft August 9, 2023 02:52
@rui-mo rui-mo changed the title [GLUTEN] Support struct column reading with different schemas Support struct column reading with different schemas Aug 28, 2023
@rui-mo rui-mo force-pushed the wip_struct branch 3 times, most recently from c03152b to c8c5132 Compare September 5, 2023 05:28
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from 2168dc9 to fda6ff8 Compare October 13, 2023 02:33
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from a8174d3 to 7abb820 Compare November 7, 2023 01:44
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from d307831 to 0364f89 Compare January 26, 2024 02:54
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from e7eab9e to 1021b22 Compare April 2, 2024 05:13
marin-ma pushed a commit to oap-project/velox that referenced this pull request Apr 2, 2024
marin-ma pushed a commit to oap-project/velox that referenced this pull request Apr 3, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Apr 4, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Apr 5, 2024
zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 26, 2024
zhztheplayer pushed a commit to zhztheplayer/velox that referenced this pull request Jul 27, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 29, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 30, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 31, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 1, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 2, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 3, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 4, 2024
Copy link

stale bot commented Sep 15, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@Yuhta
Copy link
Contributor

Yuhta commented Sep 20, 2024

@rui-mo This behavior needs to be configurable as the default behavior should be match by index, i.e.

Parquet column schema User-specified output schema Result
row({"a", "c"}) row({"a", "b", "c"}) row(a:a_val, b:c_cal, c:null)
row({"a", "c"}) row({"b"}) Should not be supported, no deletion of subfields
row({"a", "c"}) row({"b", "d"}) row(b:a_val, c:d_val)
row({"a", "c"}) row({}) Should not be supported

Also we should consider doing it in a format-agnostic way.

@rui-mo
Copy link
Collaborator Author

rui-mo commented Sep 23, 2024

@rui-mo This behavior needs to be configurable as the default behavior should be match by index, i.e.
Also we should consider doing it in a format-agnostic way.

@Yuhta Thanks for your feedback. It makes sense to me and I will take further look.

@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from 008e85f to 90ef210 Compare October 15, 2024 06:43
@rui-mo rui-mo marked this pull request as ready for review December 20, 2024 03:43
@rui-mo rui-mo requested a review from majetideepak as a code owner December 20, 2024 03:43
@rui-mo rui-mo changed the title Support struct column reading with different schemas feat: Support struct schema evolution matching by name Dec 20, 2024
@rui-mo rui-mo changed the title feat: Support struct schema evolution matching by name feat(parquet): Support struct schema evolution matching by name Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants