feat(parquet): Support struct schema evolution matching by name #5962
base: main
Conversation
velox/dwio/common/Options.h
/**
 * Get the output type of row reader.
 */
const RowTypePtr& getOutputType() const {
The requested type is already available as getSelector()->getSchemaWithId()->type. We may want to convert it to a type directly in the future, but for now let's not keep two copies of the same thing.
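To make the suggestion concrete, here is a standalone sketch; the classes below are hypothetical stand-ins and only the accessor chain getSelector()->getSchemaWithId()->type mirrors the comment.

#include <memory>

// Purely hypothetical stand-ins for the objects named in the review comment.
struct RowType {};
using RowTypePtr = std::shared_ptr<const RowType>;

struct TypeWithId {
  RowTypePtr type; // the requested type already lives here
};

struct ColumnSelector {
  std::shared_ptr<const TypeWithId> getSchemaWithId() const {
    return schemaWithId;
  }
  std::shared_ptr<const TypeWithId> schemaWithId;
};

struct RowReaderOptions {
  std::shared_ptr<const ColumnSelector> getSelector() const {
    return selector;
  }
  std::shared_ptr<const ColumnSelector> selector;
};

int main() {
  auto schema = std::make_shared<TypeWithId>();
  schema->type = std::make_shared<const RowType>();

  auto selector = std::make_shared<ColumnSelector>();
  selector->schemaWithId = schema;

  RowReaderOptions options;
  options.selector = selector;

  // The reviewer's point: the requested type is already reachable through the
  // selector, so Options does not need a second copy behind getOutputType().
  RowTypePtr requested = options.getSelector()->getSchemaWithId()->type;
  return requested ? 0 : 1;
}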
for (auto i = 0; i < childSpecs.size(); ++i) {
  if (childSpecs[i]->isConstant()) {
    continue;
  }
  auto childDataType = fileType_->childByName(childSpecs[i]->fieldName());
  const auto& fieldName = childSpecs[i]->fieldName();
  if (outputType && !fileType_->containsChild(fieldName)) {
We need to decide what schema evolution strategy we want here. In our data warehouse, columns are not matched by name but by position, so any extra fields need to be added at the end of the children list. This allows column renaming. If we match by name here, we will lose the renaming functionality, which seems quite important in most data warehouses.
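To illustrate the difference between the two strategies, here is a standalone sketch; it is not Velox code and the field names and types are hypothetical. With positional matching, a file written as row(a, c) can still serve a table schema that later renamed a to x and appended z, because fields are resolved by index; with name matching, the renamed column would be reported as missing.

#include <cstddef>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical field descriptor used only for this illustration.
struct Field {
  std::string name;
  std::string type;
};

// Positional matching: the i-th requested field reads the i-th file field,
// so renames are transparent; fields past the end of the file are missing.
std::optional<Field> matchByPosition(
    const std::vector<Field>& fileFields, std::size_t requestedIndex) {
  if (requestedIndex < fileFields.size()) {
    return fileFields[requestedIndex];
  }
  return std::nullopt; // Appended field, not present in old files.
}

// Name matching: a renamed field no longer matches and is reported missing.
std::optional<Field> matchByName(
    const std::vector<Field>& fileFields, const std::string& requestedName) {
  for (const auto& f : fileFields) {
    if (f.name == requestedName) {
      return f;
    }
  }
  return std::nullopt;
}

int main() {
  // Old file written as row(a, c); the table schema later renamed a -> x
  // and appended a new trailing field z, i.e. row(x, c, z).
  std::vector<Field> fileFields = {{"a", "BIGINT"}, {"c", "VARCHAR"}};

  // Positional matching still resolves the renamed column.
  std::cout << "by position, field 0 reads file field: "
            << matchByPosition(fileFields, 0)->name << "\n"; // "a"

  // Name matching treats the renamed column as missing.
  std::cout << "by name, field 'x' found: "
            << (matchByName(fileFields, "x").has_value() ? "yes" : "no")
            << "\n"; // "no"
  return 0;
}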
Thanks for your comment. Does that mean that for a row(a, c) struct schema in Parquet, the expected output can only be like row(a, c, xxx, ...)? In Spark, there is no such limitation on extra child fields.
Yes, new subfields can only be appended. So in plain vanilla Spark, field renaming is not supported? There is also a third way: matching by field ID (e.g. Iceberg). We need to start drafting a design that covers all three cases.
How is field renaming conducted in the data warehouse you mentioned? In Spark, for a query like select a as b, it adds a projection node with an Alias expression after the scan.
And what do you suggest for the design: should I add some notes to this PR, or something else?
With matching by name, you need to know all the old field names (a in your query) in all old files, which is not practical in a normal data warehouse. I would suggest we pause this PR for a bit and first design the right way to allow matching columns in these different ways.
Thanks, that looks good to me. Converting this PR to a draft for now.
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!
@rui-mo This behavior needs to be configurable, as the default behavior should be matching by index. We should also consider doing it in a format-agnostic way.
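One possible shape for such a knob, purely as an illustration: the enum and member names below are hypothetical and not existing Velox APIs.

// Hypothetical, format-agnostic reader option for choosing how requested
// columns are matched to file columns; all names here are illustrative only.
enum class ColumnMatchMode {
  kByPosition, // default: the i-th requested field reads the i-th file field
  kByName,     // match on field names (what this PR adds for Parquet structs)
  kByFieldId,  // match on stable field IDs (e.g. Iceberg)
};

struct SchemaEvolutionOptions {
  ColumnMatchMode matchMode{ColumnMatchMode::kByPosition};
};

A reader could then branch on matchMode when resolving each requested subfield, keeping the default positional behavior unchanged.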
The default behavior of schema evolution for the row type is matching by index. This PR supports matching by name for the Parquet file format. Missing subfields are identified by matching the file type and the requested type on the names of the subfields. When all subfields in the requested type are missing and the number of subfields is more than one, the struct is set to null. Otherwise, null occupies the positions of the missing subfields. The table below summarizes the supported cases.
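To make the null-filling rule above concrete, here is a standalone sketch; it is not the PR's implementation, and the helper and type names are hypothetical. If the file struct is row(a) and the requested struct is row(a, b), the output keeps a and fills b with null; if the requested struct is row(x, y) and neither name exists in the file, the whole struct becomes null.

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative-only result of matching a requested struct against a file
// struct by subfield name.
struct StructMatchResult {
  bool structIsNull = false;     // the whole struct becomes null
  std::vector<bool> childIsNull; // per-subfield null markers otherwise
};

StructMatchResult matchStructByName(
    const std::vector<std::string>& fileFieldNames,
    const std::vector<std::string>& requestedFieldNames) {
  std::unordered_set<std::string> fileNames(
      fileFieldNames.begin(), fileFieldNames.end());

  StructMatchResult result;
  std::size_t missing = 0;
  for (const auto& name : requestedFieldNames) {
    const bool isMissing = fileNames.count(name) == 0;
    result.childIsNull.push_back(isMissing);
    missing += isMissing ? 1 : 0;
  }

  // If every requested subfield is missing and there is more than one of
  // them, the whole struct is treated as null; otherwise only the missing
  // subfields are null.
  result.structIsNull =
      missing == requestedFieldNames.size() && requestedFieldNames.size() > 1;
  return result;
}

int main() {
  // File struct row(a), requested struct row(a, b): b is filled with null.
  auto partial = matchStructByName({"a"}, {"a", "b"});
  std::cout << "struct null: " << partial.structIsNull
            << ", b null: " << (partial.childIsNull[1] ? 1 : 0) << "\n";

  // File struct row(a), requested struct row(x, y): whole struct is null.
  auto allMissing = matchStructByName({"a"}, {"x", "y"});
  std::cout << "struct null: " << allMissing.structIsNull << "\n";
  return 0;
}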