feat(parquet): Support struct schema evolution matching by name #5962

rui-mo · 2023-08-02T06:51:51Z

By default, schema evolution for row types matches subfields by index.
This PR adds support for matching by name for the Parquet file format. Missing
subfields are identified by comparing the subfield names of the file type with
those of the requested type. When all subfields of the requested type are
missing and the requested type has more than one subfield, the struct is set to
null. Otherwise, null fills the position of each missing subfield. The table
below summarizes the supported cases.

Parquet column schema	Requested output schema	Result
row({"a", "c"})	row({"a", "b", "c"})	row(a_val, null, c_val)
row({"a", "c"})	row({"b"})	row(null)
row({"a", "c"})	row({"b", "d"})	null
row({"a", "c"})	row({})	empty
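
The rules in the table can be sketched in a few lines of standalone C++. This is only an illustration of the described semantics, not the code in this PR: `FileRow`, `OutputRow`, `matchByName` and `checkTableRows` are hypothetical names, subfield values are plain ints, and `std::nullopt` stands in for null.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one struct value decoded from a file:
// subfield name -> value. Not a Velox type.
using FileRow = std::unordered_map<std::string, int>;

// One slot per requested subfield; std::nullopt models a null subfield.
using OutputRow = std::vector<std::optional<int>>;

// Resolves the requested subfields against the file row by name.
// Returns std::nullopt when the whole struct should be null.
std::optional<OutputRow> matchByName(
    const FileRow& fileRow,
    const std::vector<std::string>& requested) {
  OutputRow out;
  out.reserve(requested.size());
  std::size_t missing = 0;
  for (const auto& name : requested) {
    const auto it = fileRow.find(name);
    if (it == fileRow.end()) {
      out.push_back(std::nullopt); // Missing subfield: null in its position.
      ++missing;
    } else {
      out.push_back(it->second);
    }
  }
  // All requested subfields missing and more than one requested:
  // the struct itself is null.
  if (requested.size() > 1 && missing == requested.size()) {
    return std::nullopt;
  }
  return out;
}

// Walks the four table rows with a_val = 1 and c_val = 3.
void checkTableRows() {
  const FileRow file{{"a", 1}, {"c", 3}};

  const auto partial = matchByName(file, {"a", "b", "c"});
  assert(partial && (*partial)[0] == 1 && !(*partial)[1] && (*partial)[2] == 3);

  const auto single = matchByName(file, {"b"});
  assert(single && single->size() == 1 && !(*single)[0]);

  const auto none = matchByName(file, {"b", "d"});
  assert(!none);

  const auto empty = matchByName(file, {});
  assert(empty && empty->empty());
}
```

Returning an optional row keeps the "struct is null" case distinct from "struct with all-null subfields", which is the distinction the second and third table rows draw.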

netlify · 2023-08-02T06:51:56Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`b8b9fbc`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/679389a59fb0980008084717

Yuhta · 2023-08-02T17:47:02Z

velox/dwio/common/Options.h

+  /**
+   * Get the output type of row reader.
+   */
+  const RowTypePtr& getOutputType() const {


The requested type is already available as getSelector()->getSchemaWithId()->type. We may want to convert it to a type directly in the future, but for now let's not keep two copies of the same thing.

Yuhta · 2023-08-02T17:51:34Z

velox/dwio/parquet/reader/StructColumnReader.cpp

  for (auto i = 0; i < childSpecs.size(); ++i) {
    if (childSpecs[i]->isConstant()) {
      continue;
    }
-    auto childDataType = fileType_->childByName(childSpecs[i]->fieldName());
+    const auto& fieldName = childSpecs[i]->fieldName();
+    if (outputType && !fileType_->containsChild(fieldName)) {


We need to decide which schema evolution strategy we want here. In our data warehouse, columns are matched by position rather than by name, so any extra fields must be added at the end of the children list. This allows column renaming. If we match by name here, we lose the renaming functionality, which seems quite important in most data warehouses.

rui-mo · 2023-08-04

Thanks for your comment. Does that mean that for a row(a, c) struct schema in Parquet, the expected output can only be of the form row(a, c, xxx, ...)? In Spark, there is no such limitation on extra child fields.

Yuhta · 2023-08-04

Yes, new subfields can only be appended. So in plain vanilla Spark, field renaming is not supported? There is also a third way, matching by field ID (e.g. Iceberg). We need to start drafting a design that covers all three cases.

rui-mo · 2023-08-07

How is field renaming conducted in the data warehouse you mentioned? In Spark, a query like select a as b adds a projection node with an Alias expression after the scan.
What do you suggest for the design? Should I add some notes in this PR, or something else?

Yuhta · 2023-08-08

With matching by name you need to know all the old field names (a in your query) in all old files, which is not practical in a normal data warehouse. I would suggest we pause this PR for a bit and first design the right way to allow matching columns in different ways.

rui-mo · 2023-08-09

Thanks. That looks good to me. Converting this PR to draft for now.


stale · 2024-09-15T22:59:34Z

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

Yuhta · 2024-09-20T17:59:07Z

@rui-mo This behavior needs to be configurable, since the default behavior should be to match by index, i.e.

Parquet column schema	User-specified output schema	Result
row({"a", "c"})	row({"a", "b", "c"})	row(a:a_val, b:c_val, c:null)
row({"a", "c"})	row({"b"})	Should not be supported, no deletion of subfields
row({"a", "c"})	row({"b", "d"})	row(b:a_val, d:c_val)
row({"a", "c"})	row({})	Should not be supported

Also we should consider doing it in a format-agnostic way.
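
For comparison, positional matching as read from the table above can be sketched as follows: each requested name takes the file value at the same position, trailing requested subfields become null, and requesting fewer subfields than the file has is rejected. This is a standalone illustration under those assumptions, not Velox code: `NamedValue` and `matchByIndex` are hypothetical names, and throwing `std::invalid_argument` is just one way to model "should not be supported".

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Requested subfield name paired with its resolved value;
// std::nullopt models a null subfield. Hypothetical type, not from Velox.
using NamedValue = std::pair<std::string, std::optional<int>>;

// Resolves requested subfields against file values purely by position.
std::vector<NamedValue> matchByIndex(
    const std::vector<int>& fileValues,
    const std::vector<std::string>& requested) {
  // Subfields may only be appended, so the requested struct can never
  // have fewer subfields than the file struct.
  if (requested.size() < fileValues.size()) {
    throw std::invalid_argument("subfields can only be appended, not removed");
  }
  std::vector<NamedValue> out;
  out.reserve(requested.size());
  for (std::size_t i = 0; i < requested.size(); ++i) {
    if (i < fileValues.size()) {
      // The name is ignored for lookup, which is what permits renaming.
      out.emplace_back(requested[i], fileValues[i]);
    } else {
      // Appended subfield that the file does not have.
      out.emplace_back(requested[i], std::nullopt);
    }
  }
  return out;
}
```

With file values {a_val, c_val}, requesting {"b", "d"} yields b bound to a_val and d bound to c_val, so a renamed column still reads old files. The matching-by-name approach in this PR would resolve the same request to a null struct.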

rui-mo · 2024-09-23T09:36:28Z

> @rui-mo This behavior needs to be configurable as the default behavior should be match by index, i.e.
> Also we should consider doing it in a format-agnostic way.

@Yuhta Thanks for your feedback. It makes sense to me, and I will take a further look.

facebook-github-bot added the CLA Signed label Aug 2, 2023

Yuhta self-requested a review August 2, 2023 17:44

Yuhta reviewed Aug 2, 2023


rui-mo changed the title ~~Support struct column reading with different schemas~~ [GLUTEN] Support struct column reading with different schemas Aug 4, 2023

rui-mo marked this pull request as draft August 9, 2023 02:52

rui-mo changed the title ~~[GLUTEN] Support struct column reading with different schemas~~ Support struct column reading with different schemas Aug 28, 2023

rui-mo force-pushed the wip_struct branch 3 times, most recently from c03152b to c8c5132 Compare September 5, 2023 05:28

rui-mo force-pushed the wip_struct branch from c8c5132 to e65f832 Compare September 19, 2023 07:40

rui-mo force-pushed the wip_struct branch 2 times, most recently from 2168dc9 to fda6ff8 Compare October 13, 2023 02:33

rui-mo force-pushed the wip_struct branch from fda6ff8 to 6dc6b0f Compare October 26, 2023 01:11

rui-mo force-pushed the wip_struct branch 2 times, most recently from a8174d3 to 7abb820 Compare November 7, 2023 01:44

rui-mo force-pushed the wip_struct branch from 7abb820 to e847a3b Compare November 22, 2023 05:59

rui-mo force-pushed the wip_struct branch from e847a3b to 07949bb Compare January 2, 2024 09:48

rui-mo force-pushed the wip_struct branch 2 times, most recently from d307831 to 0364f89 Compare January 26, 2024 02:54

rui-mo force-pushed the wip_struct branch from 0364f89 to 12ca41d Compare March 5, 2024 03:34

rui-mo force-pushed the wip_struct branch from 12ca41d to 8af647b Compare March 20, 2024 04:06

rui-mo force-pushed the wip_struct branch 2 times, most recently from e7eab9e to 1021b22 Compare April 2, 2024 05:13

marin-ma pushed a commit to oap-project/velox that referenced this pull request Apr 2, 2024

Support struct column reading with different schemas (facebookincubat…

73da57b

…or#5962)

rui-mo force-pushed the wip_struct branch from 1021b22 to f2a890c Compare April 2, 2024 08:32

marin-ma pushed a commit to oap-project/velox that referenced this pull request Apr 3, 2024

Support struct column reading with different schemas (facebookincubat…

86d860c

…or#5962)

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Apr 4, 2024

Support struct column reading with different schemas (facebookincubat…

a9f262b

…or#5962)

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Apr 5, 2024

Support struct column reading with different schemas (facebookincubat…

2b4b1a4

…or#5962)

zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

c8eb37a

…t schemas (5962) facebookincubator#5962

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

5dd9414

…t schemas (5962) facebookincubator#5962

zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 26, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

d7a03a3

…t schemas (5962) facebookincubator#5962

zhztheplayer pushed a commit to zhztheplayer/velox that referenced this pull request Jul 27, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

5ea4764

…t schemas (5962) facebookincubator#5962

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 29, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

ed1bac0

…t schemas (5962) facebookincubator#5962

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 30, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

49b4379

…t schemas (5962) facebookincubator#5962

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 31, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

bed7978

…t schemas (5962) facebookincubator#5962

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 1, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

df0eeba

…t schemas (5962) facebookincubator#5962

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 2, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

584944b

…t schemas (5962) facebookincubator#5962

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 3, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

48caa64

…t schemas (5962) facebookincubator#5962

GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 4, 2024

[facebookincubator#5962 ] Support struct column reading with differen…

ff4b07f

…t schemas (5962) facebookincubator#5962

stale bot added the stale label Sep 15, 2024

rui-mo force-pushed the wip_struct branch from 2a1569b to 671fab2 Compare September 19, 2024 04:07

stale bot removed the stale label Sep 19, 2024

rui-mo force-pushed the wip_struct branch from 671fab2 to 532fec2 Compare September 20, 2024 03:11

rui-mo force-pushed the wip_struct branch 2 times, most recently from 008e85f to 90ef210 Compare October 15, 2024 06:43

rui-mo mentioned this pull request Nov 14, 2024

[GLUTEN-7267][CORE][CH] Support nested column pruning for HiveTableScan json/parquet/orc format apache/incubator-gluten#7268

Merged

rui-mo force-pushed the wip_struct branch from 90ef210 to 6f59f18 Compare December 2, 2024 06:38

rui-mo force-pushed the wip_struct branch from 6f59f18 to 8cc9ce4 Compare December 20, 2024 03:42

rui-mo marked this pull request as ready for review December 20, 2024 03:43

rui-mo requested a review from majetideepak as a code owner December 20, 2024 03:43

rui-mo changed the title ~~Support struct column reading with different schemas~~ feat: Support struct schema evolution matching by name Dec 20, 2024

rui-mo changed the title ~~feat: Support struct schema evolution matching by name~~ feat(parquet): Support struct schema evolution matching by name Dec 20, 2024

rui-mo added 2 commits January 24, 2025 20:37

Support struct column reading with different schemas

3d72f41

minor

b8b9fbc

rui-mo force-pushed the wip_struct branch from a82183b to b8b9fbc Compare January 24, 2025 12:37


