Simplify Variant shredding and refactor for clarity #461
@wgtmac, I don't think this makes sense. A Variant value may have any structure. We cannot know ahead of time (when setting the Parquet type) that all parts of all encoded Variant values will be shredded.
I think the confusion might be around what this is intended to mean: whether the column can be left out entirely, or whether we are talking about the cardinality specifier (required/optional/repeated).

If so, we can maybe clarify that the field may be annotated as required or optional: when it is annotated as required there is no shredding, and it is only optional when shredding might be used?
(Or maybe clarify that the field must always be present, but the cardinality changes depending on whether the field is shredded.)
I've updated this to make it clear that this is referring to the repetition level. There are also examples, so I think that it is unambiguous.
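To make the repetition-level point concrete, here is an illustrative sketch (my paraphrase, not text copied from the PR) of how the `value` field's annotation can differ, assuming a shredded `typed_value` of type `int64`. Without shredding, `value` can be `required`:

```
optional group v (VARIANT) {
  required binary metadata;
  required binary value;
}
```

With shredding, `value` must be `optional` so it can be null for rows where `typed_value` is used instead:

```
optional group v (VARIANT) {
  required binary metadata;
  optional binary value;
  optional int64 typed_value;
}
```

In both sketches the field is always part of the schema; only its repetition level (required vs. optional) changes.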
For exact numerics, we should allow truncating trailing zeros. For example, `int8` value `1` and `decimal(5,2)` value `100` can both be represented as the JSON value `1`.

Also, should the example be quoted to stay consistent?
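A minimal Java sketch of the equivalence being discussed (illustrative only; `BigDecimal` stands in for the Variant decimal type):

```java
import java.math.BigDecimal;

public class ExactNumericEquivalence {
    public static void main(String[] args) {
        byte int8Value = 1;                          // int8 value 1
        BigDecimal dec = BigDecimal.valueOf(100, 2); // decimal(5,2), unscaled value 100, i.e. 1.00

        // Numerically equal (compareTo ignores scale), even though the
        // physical representations differ.
        System.out.println(dec.compareTo(BigDecimal.valueOf(int8Value)) == 0); // true

        // Truncating trailing zeros yields the same JSON text for both.
        System.out.println(dec.stripTrailingZeros().toPlainString());          // 1
    }
}
```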
I think the intent of considering Exact Numeric to be a single logical type is that we consider the `int8` value `1` to be logically equivalent to `decimal(5,2)` with unscaled value `100`. If that's the case, I think we'd want the produced JSON to be the same for both (probably `1` in both cases), and not recommend having the fraction match the scale.
@gene-db, @cashmand, these are concerns for the engine layer, not for storage. If Spark wants to automatically coerce between types that's fine, but the compromise that we talked about a couple months ago was to leave this out of the shredding spec and delegate the behavior to engines. Storage should always produce the data that was stored, without modification.
Yes, the engine should be the one concerned with changing types.

However, my original question was about the wording of this JSON representation. Currently, the representation requirements for an Exact Numeric say "Digits in fraction must match scale". However, because Exact Numeric is considered a logical type, the value `1` could be stored in the Variant as `int8` `1` or as `decimal(5,2)` with unscaled value `100`. Both of those would be the same numeric value, so we should allow truncating trailing zeros in the JSON representation, instead of requiring that the digits in the fraction match the scale.
@gene-db, the JSON representation should match the physical type as closely as possible. The reader can interpret the value however it chooses to, but a storage implementation should not discard the information.

If you want to produce `34` from `34.00` stored as `decimal(9, 2)`, then the engine is responsible for casting the value to `int8` and then producing JSON. The JSON representation for the original `decimal(9, 2)` value is `34.00`.
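A small Java sketch of the distinction being drawn here, using `BigDecimal` to stand in for a stored `decimal(9, 2)` (illustrative, not spec code):

```java
import java.math.BigDecimal;

public class FaithfulJson {
    public static void main(String[] args) {
        BigDecimal stored = new BigDecimal("34.00"); // decimal(9, 2) value as written

        // Storage-faithful JSON: preserve the scale of the physical type.
        System.out.println(stored.toPlainString()); // 34.00

        // Engine-side alternative: cast to an integer type first, then
        // produce JSON from the cast result.
        System.out.println(stored.longValueExact() == 34L);               // true: no fractional part, cast is exact
        System.out.println(stored.stripTrailingZeros().toPlainString()); // 34
    }
}
```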
@rdblue I am confused by this JSON chart, then. If we are talking about a "storage implementation", are you expecting that a storage implementation converts variant values to JSON? When would storage convert a variant value to a JSON string?
I originally thought this chart was trying to say, "When an engine wants to convert a variant value to a JSON string, here are the rules". Therefore, we should allow engines to cast integral decimals to integers before converting to JSON, as you already mentioned in your previous comment.
I think I agree with @gene-db on this. Any JSON representation that has semantically the same meaning in JSON should be allowed. Translation to JSON is inherently lossy, and I think trying to match semantics exactly will be more error-prone than it is worth (i.e., it should be a non-goal to expect to be able to reconstruct the exact same variant from the proposed JSON representation).
I think maybe the wording or presentation of this mapping is a bit confusing.

I think we are all on the same page about allowing engines to "normalize" the Variant value. For example, I think the Spark implementation already normalizes `1.00` to `1`. There are also many optimization and efficiency benefits to normalization, so we should not disallow it.

Maybe what this chart is trying to show is: "if you want to output a Variant value as a JSON string, this is the output format you should use". So, for numbers, the conversion should be like `1` or `1.23` (no quotes), not `"1"` or `"1.23"`. If this chart was about the JSON output formatting, would that be more clear?
Yes, this is correct. We want a clear way to convert to a JSON string. However, the normalization needs to happen first. We don't want to specify that the JSON must be any more lossy than it already is.
Why would we require an engine to produce a normalized value?
At least for me, I don't think it is about "requiring" an engine to produce a normalized value first. I think if an engine is reading variant and converting it to JSON, it is possibly doing so through an internal representation, so that it can still apply operators on top of the JSON value, and it is possibly even storing that internal representation. Conversion to a string is really only an end-user-visible thing. So when I read this, it seems to require an engine NOT to normalize, which could be hard to implement for some engines.
It might be worth noting that relying on the JSON values here might be lossy for some numbers (this is implied, but people still ask questions...).
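For example (a generic illustration, not from the PR): an `int64` that exceeds 2^53 cannot round-trip through a JSON reader that parses numbers as IEEE 754 doubles:

```java
public class JsonNumberLoss {
    public static void main(String[] args) {
        long big = (1L << 53) + 1;         // 9007199254740993, a valid int64
        double parsed = (double) big;      // what a double-based JSON parser would produce

        System.out.println(big);           // 9007199254740993
        System.out.println((long) parsed); // 9007199254740992 -- precision lost
    }
}
```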
I guess the other alternative here is a hex-encoded string, which will always be bigger. Base64 seems reasonable; I just want to call this out in case people have preferences.
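A quick size comparison in Java (illustrative; `HexFormat` requires Java 17+):

```java
import java.util.Base64;
import java.util.HexFormat;

public class BinaryJsonEncoding {
    public static void main(String[] args) {
        byte[] binary = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE,
                         (byte) 0xEF, (byte) 0x00, (byte) 0x01};

        // Base64: 4 chars per 3 bytes (plus padding).
        System.out.println(Base64.getEncoder().encodeToString(binary)); // 3q2+7wAB (8 chars)

        // Hex: 2 chars per byte, so larger for all but the smallest values.
        System.out.println(HexFormat.of().formatHex(binary));           // deadbeef0001 (12 chars)
    }
}
```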