
Simplify Variant shredding and refactor for clarity #461

Open · wants to merge 14 commits into master from variant-updates
Conversation

@rdblue rdblue commented Oct 20, 2024

Rationale for this change

Updating the Variant and shredding specs from a thorough review.

What changes are included in this PR?

Spec updates, mostly to the shredding spec, to minimize it and make it clearer. This also attempts to make the Variant spec more consistent (for example, by using `value` in both).

  • Removes object and array in favor of always using typed_value
  • Makes list element and object field groups required to avoid unnecessary null cases
  • Separates cases for primitives, arrays, and objects
  • Adds individual examples for primitives, arrays, and objects
  • Adds Variant to Parquet type mapping for shredded columns
  • Clarifies that metadata must be valid for all variant values without modification
  • Updates reconstruction algorithm to be more pythonic

Do these changes have PoC implementations?

No.

@rdblue force-pushed the variant-updates branch 2 times, most recently from c4b435f to 8352319 on October 20, 2024 at 22:22
We extract all homogeneous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`.
Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups.
Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing.
All fields for a variant, whether shredded or not, must be present in the metadata.
Contributor Author

This may be controversial. I'm trying to say that you should not need to modify the metadata when reading. The reconstructed object should be able to use the stored metadata without adding fields.


I'm a little confused. When the field is not shredded, we will not have metadata for it, right? When it's getting shredded, then it will be like a column and we will generate metadata so it can be used for filtering/pruning?

Contributor Author

@sfc-gh-aixu, this is saying that when writing, the metadata for a shredded value and the metadata for a non-shredded value should be identical. Writers should not alter the metadata by removing shredded field names so that readers do not need to rewrite the metadata (and values) to add it back.

For example, consider an event that looks like this:

{
  "id": 102,
  "event_type": "signup",
  "event_timestamp": "2024-10-21T20:06:34.198724",
  "payload": {
    "a": 1,
    "b": 2
  }
}

And a shredding schema:

optional group event (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value {
    required group event_type {
      optional binary value;
      optional binary typed_value (STRING);
    }
    required group event_timestamp {
      optional binary value;
      optional int64 typed_value (TIMESTAMP(true, MICROS));
    }
  }
}

The top-level event_type and event_timestamp fields are shredded, but this is saying that the Variant metadata must still include those field names. That ensures the existing binary metadata can be returned to the engine as-is when the entire Variant is projected, without adding event_type and event_timestamp back while merging those fields into the top-level Variant value.
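To make that concrete, here is an illustrative (non-normative) sketch of the metadata dictionary for the event above; the shredded field names stay in it:

```python
# Illustrative only: every key that appears anywhere in the event is kept in the
# metadata dictionary, including the shredded event_type and event_timestamp,
# so readers can return the stored metadata unmodified when reconstructing.
metadata_keys = ["id", "event_type", "event_timestamp", "payload", "a", "b"]
```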

Contributor

Thanks for the detailed explanation. I now realize this is about Variant metadata, whereas what I was talking about was column metadata (stats).

I get what you are saying: when the entire Variant is projected, we need to reconstruct the original value and metadata by merging back the shredded fields if the metadata after shredding excludes the shredded fields.

That makes sense to me as a way to reduce metadata reconstruction on the read side.


Similarly the elements of an `array` must be a group containing one or more of `object`, `array`, `typed_value` or `variant_value`.
Each shredded field is represented as a required group that contains a `variant_value` and a `typed_value` field.


Why each shredded field should be a required group is not clear to me. If fields were allowed to be optional, that would be another way of indicating non-existence of fields.

Contributor Author

The primary purpose is to reduce the number of cases that implementers have to deal with. If all of the cases can be expressed with 2 optional fields rather than 2 optional fields inside an optional group, then the group should be required to simplify as much as possible.

In addition, every level in Parquet that is optional introduces another repetition/definition level. That adds up quickly with nested structures and ends up taking unnecessary space.

The `typed_value` field may be any type that has a corresponding Variant type.
For each value in the data, at most one of the `typed_value` and `variant_value` may be non-null.
A writer may omit either field, which is equivalent to all rows being null.
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
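A minimal reader-side sketch of this rule for a shredded primitive (illustrative Python with a hypothetical helper name; it assumes `value` and `typed_value` have already been decoded for one row):

```python
def resolve_shredded_primitive(value, typed_value, strict=False):
    """Resolve one row of a shredded primitive (sketch, not normative).

    value: the residual Variant value for the row, or None if the column is null.
    typed_value: the decoded shredded column value, or None if the column is null.
    """
    if value is not None and typed_value is not None:
        # Invalid data: a conforming writer never populates both for a primitive.
        if strict:
            raise ValueError("both value and typed_value are non-null")
        return typed_value  # readers that do not fail must return typed_value
    if typed_value is not None:
        return typed_value
    return value  # may be None, meaning the value is missing
```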
Contributor Author
@rdblue rdblue Oct 24, 2024

@RussellSpitzer and @gene-db, this could use some attention.

Here, if both value and typed_value are non-null I initially thought it made more sense to prefer value because it doesn't need to be re-encoded and may have been coerced by an engine to the shredded type.

However, this conflicts with object fields, where the value of typed_value is preferred so that data skipping is correct. If the object's value contains a field that conflicts with a sub-field's typed_value, there is no way of knowing that from field stats. If we preferred the field value stored in the object's value, then data skipping could be out of sync with the value returned in the case of a conflict.

Member

the value is invalid

Suggested change
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
If both fields are non-null and either is not an object, the `value` is invalid. Readers must either fail or return the `typed_value`.

Member

Why aren't we just being proscriptive here? Isn't this essentially saying you can duplicate a subfield-field between typed_value and value? Wouldn't it be safer to just say this cannot be done?

Contributor Author

The problem is that readers won't actually implement restrictions like this and we can't fully prevent it. It is invalid for a writer to produce a value where value and typed_value conflict. But writer bugs happen and readers need to know what to do when they encounter that situation. Otherwise we would get different behaviors between readers that are processing the same data file.

It all comes down to end users -- if a writer bug produces data like this, readers will implement the ability to read because the data still exists and can be recovered. When that happens, we want to know how it is interpreted.

Contributor

My take is that if writers have bugs and produce invalid values, I'm not sure you can really trust most of the data at all (even metadata). It sounds like we are assuming one specific type of bug, where writers accidentally forget to clear a field?

Contributor Author

The point is not the bug. It is that we want to make it valid to read a projection without checking the value for bugs.


IMHO, trying to make the projection of an object into another one without having to read the value is too Spark-specific. For example, with keys "a" and "b" shredded, if I am casting {"a":2, "b":3, "c":4} to a struct with keys "a" and "b", I can easily imagine a cast semantic that would fail that cast, and such a semantic would force us to read both the typed_value and value unless value is null.

Contributor Author

I don't think I understand. If a and b are shredded, then there should be no fields with those names in value, and the spec is stating that you don't need to check for them. That means all of the information needed to continue is in the shredded Parquet columns. That's not specific to Spark.


What I was trying to say is: if I am casting that field to a struct with "a" and "b" but "c" exists in value, then some engines will require that cast to fail, because an object with keys "a", "b", and "c" does not match the struct schema. Apparently, with Spark such a cast succeeds by producing the struct with the subset of the keys that match the struct schema.

@rdblue changed the title from "WIP: Current work on Variant specs" to "Simplify Variant shredding and refactor for clarity" on Oct 24, 2024
|---------------|-----------|----------------------------------------------------------|--------------------------------------|
| Null type | null | `null` | `null` |
| Boolean | boolean | `true` or `false` | `true` |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
Contributor

For exact numerics, we should allow truncating trailing zeros. For example, int8 value 1 and decimal(5,2) value 100 can both be represented as a JSON value 1.

Also, should the example be quoted to stay consistent?

Suggested change
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, 34.00 |
| Exact Numeric | number | Digits in fraction must match scale, no exponent | `34`, `34.00` |


I think the intent of considering Exact Numeric to be a single logical type is that we consider the int8 value 1 to be logically equivalent to decimal(5,2) with unscaled value 100. If that's the case, I think we'd want the produced JSON to be the same for both (probably 1 in both cases), and not recommend having the fraction match the scale.

Contributor Author

@gene-db, @cashmand, these are concerns for the engine layer, not for storage. If Spark wants to automatically coerce between types that's fine, but the compromise that we talked about a couple months ago was to leave this out of the shredding spec and delegate the behavior to engines. Storage should always produce the data that was stored, without modification.

Contributor

Yes, the engine should be the one concerned with changing types.

However, my original question was about the wording of this JSON representation. Currently, the representation requirements for an Exact Numeric say that the digits in the fraction must match the scale. However, because Exact Numeric is considered a logical type, the value 1 could be stored in the Variant as int8 1 or as decimal(5,2) with unscaled value 100. Both of those are the same numeric value, so we should allow truncating trailing zeros in the JSON representation, instead of requiring that the digits in the fraction match the scale.

Contributor Author

@gene-db, the JSON representation should match the physical type as closely as possible. The reader can interpret the value however it chooses to, but a storage implementation should not discard the information.

If you want to produce 34 from 34.00 stored as decimal(9, 2) then the engine is responsible for casting the value to int8 and then producing JSON. The JSON representation for the original decimal(9, 2) value is 34.00.
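A small illustrative sketch of that rule (assuming Python's `decimal` module as a stand-in for the stored decimal type; not part of the spec):

```python
from decimal import Decimal

def exact_numeric_to_json(v):
    # Decimals keep exactly as many fraction digits as their scale; integers keep none.
    return format(v, "f") if isinstance(v, Decimal) else str(int(v))

exact_numeric_to_json(Decimal("34.00"))  # '34.00' (stored as decimal(9, 2))
exact_numeric_to_json(34)                # '34'    (stored as an integer type)
```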

Contributor

@rdblue I am confused by this JSON chart, then. If we are talking about a "storage implementation", are you expecting that a "storage implementation" converts variant values to JSON? When will storage convert a variant value to a JSON string?

I originally thought this chart was trying to say, "When an engine wants to convert a variant value to a JSON string, here are the rules". Therefore, we should allow engines to cast integral decimals to integers before converting to JSON, as you already mentioned in your previous comment.

Contributor

I think I agree with @gene-db on this. I think any JSON representation that has semantically the same meaning in JSON should be allowed. Translation to JSON is inherently lossy, and I think trying to match semantics will be more error-prone than it is worth (i.e. it should be a non-goal to expect to be able to reconstruct the exact same variant from the proposed JSON representation).

Contributor

I think maybe the wording or presentation of this mapping is a bit confusing.

I think we are all on the same page about allowing engines to "normalize" the Variant value. For example, I think the Spark implementation already normalizes 1.00 to 1. There are also many optimization and efficiency benefits from normalization, so we should not disallow that.

Maybe what this chart is trying to show is: "if you want to output a Variant value as a JSON string, this is the output format you should use". So, for numbers, the conversion should produce 1 or 1.23 (no quotes), not "1" or "1.23". If this chart were about the JSON output formatting, would that be more clear?

Contributor Author

When an engine wants to convert a variant value to a JSON string, here are the rules

Yes, this is correct. We want a clear way to convert to a JSON string. However, the normalization needs to happen first. We don't want to specify that the JSON must be any more lossy than it already is.

Why would we require an engine to produce a normalized value?

Contributor

Why would we require an engine to produce a normalized value?

At least for me, I don't think it is about "requiring" an engine to produce a normalized value first. If an engine is reading variant and converting it to JSON, it is possibly doing so through an internal representation so it can still apply operators on top of the JSON value, and possibly even storing it in that internal representation. Conversion to a string is really only an end-user-visible thing. So when I read this, it seems to require an engine NOT to normalize, which could be hard to implement for some engines.


Dictionary IDs in a `variant_value` field refer to entries in the top-level `metadata` field.
If a Variant is missing in a context where a value is required, readers must either fail or return a Variant null: basic type 0 (primitive) and physical type 0 (null).
For example, if a Variant is required (like `measurement` above) and both `value` and `typed_value` are null, the returned `value` must be `00` (Variant null).
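For illustration, a hedged sketch of that rule in Python (the single byte `00` encodes basic type 0, primitive, with physical type 0, null; helper name is hypothetical):

```python
VARIANT_NULL = bytes([0x00])  # basic type 0 (primitive), physical type 0 (null)

def fill_required_variant(value, typed_value):
    """If a required Variant position has neither field, readers that do not
    fail must substitute Variant null (illustrative sketch)."""
    if value is None and typed_value is None:
        return VARIANT_NULL, None
    return value, typed_value
```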


As mentioned in my previous comment, I think it would be invalid for measurement to have both value and typed_value be null, and that it should be an error. I don't understand why we're recommending returning Variant null as an option.

Contributor Author

This rule is to address the fact that arrays cannot contain a missing value. This is saying that if a value is required but both are null, the implementation must fill in a variant null.


This rule (both fields being null should be interpreted as JSON null) is valid only for top-level Variants and array elements? I wonder how a top-level Variant can be inserted with both value and typed_value being null if the top-level field is required. That seems inconsistent. For arrays, it looks like we could also require value to be a Variant-encoded null (JSON null) rather than allowing both fields to be null.

Contributor Author

@sfc-gh-saya, if the top-level field is required but both fields are null, then the reader must produce a variant null value, 00. We must state what happens in cases like this because it is possible for writers to produce them.

Contributor

If a writer produces nulls for both value and typed_value, it's like a corrupted file, and I feel it's reasonable for readers to error out rather than return a default value.

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.
For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and shredding can enable columnar projection that ignores the rest of the `event` Variant.
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
Member

Suggested change
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping while the rest of the Variant is lazily loaded for matching pages.

# Data Skipping
All elements of an array must be non-null because `array` elements in a Variant cannot be missing.
That is, either `typed_value` or `value` (but not both) must be non-null.
Null elements must be encoded in `value` as Variant null: basic type 0 (primitive) and physical type 0 (null).
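As an illustration (a hypothetical in-memory representation, not normative), an array shredded with an int64 `typed_value` might surface its elements as (`value`, `typed_value`) pairs like this:

```python
# Variant array [34, null, 17] with elements shredded as int64:
elements = [
    (None, 34),        # shredded into typed_value
    (b"\x00", None),   # Variant null: the element exists but is null
    (None, 17),
]
# (None, None) would be invalid here, because array elements cannot be missing.
```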
Member

Just for consistency: earlier in the doc it was written as `00` (Variant null), but this is fine too.

| `{"error_msg": "malformed: ..."}` | `{"error_msg", "malformed: ..."}` | null | | | | | Object with no shredding |
| `"malformed: not an object"` | `malformed: not an object` | null | | | | | Not an object (stored as Variant string) |
| `{"event_ts": 1729794240241, "click": "_button"}` | `{"click": "_button"}` | non-null | null | null | null | 1729794240241 | Field `event_type` is missing |
| `{"event_type": null, "event_ts": 1729794954163}` | null | non-null | `00` (field exists, is null) | null | null | 1729794954163 | Field `event_type` is present and is null |
Member

Some more requested examples,

Could we have an example where "event_ts" is a Date or something not transformable into a timestamp?
I assume this would make value be {"event_ts": "08-03-2025"} while typed_value would be null.

I also wonder if we could do a single example for a doubly nested field showing where typed_value.address.value != null. All the examples here cover a primitive field being typed, so it may be nice to show the behavior with an object being typed.

{
 Name
 Address {
    City 
    ZIP (Shredded as INT but some values as String?)
    }
}

Contributor Author

I added most other examples, but I don't think that we need the nested example because it would make the table much larger. I also cover nesting in the next section specifically.


The `typed_value` associated with any Variant `value` field can be any shredded type according to the rules above.
Member

I don't think I understand this sentence, but I believe I understand the intent is that you can have objects or elements within arrays also shredded?

I think the tables above are easier for me to follow than the parquet schema below. I understand though if that's difficult to depict.

Contributor Author

This is just saying that any time you have a value field, you can also have a typed_value field that might be any shredded type, like an array nested in a field or an object nested in an array.


Consider the following example:
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).
Member

Do we need to specify "null" vs "variant null"? I get a little confused sometimes in the doc.

Contributor Author

Isn't "null (missing)" clear that it is not variant null? Missing is only used to mean one form of null in the text.

"not an object"
]
```
When the corresponding `value` column is all nulls, all values must be the shredded `typed_value` field's type.
Member

Sometimes we refer to value as a column and sometimes as a field. Just wondering if we should take a pass to standardize, unless there is another meaning I'm not following here.

Contributor Author

I think I was referring to "value" as a Parquet column. All object variant fields should be referred to as "field".

Contributor

it might be good to have this in a glossary at the top.

When reconstructing, it would be valid for a reader to reconstruct 123 as an integer, or as a Decimal(9, 2).
Engines should not depend on the physical type of a Variant value, only the logical type.
If the value is a partially shredded object, the `value` must not contain the shredded fields.
If shredded fields are present in the variant object, it is invalid and readers must either fail or use the shredded values.
Contributor

I think we should probably prefer failing.

Contributor Author

The purpose is to establish that the shredded values can always be trusted and take precedence.

That makes it possible to project just the necessary Parquet columns. If failure is required, then readers always need to project the value column to detect the error cases to throw. Because we don't want to do that for most cases, where values are not invalid, the solution is to state that failing is not a requirement and that shredded values are the correct values.

Contributor

Suggested change
If shredded fields are present in the variant object, it is invalid and readers must either fail or use the shredded values.
Readers are not required to check for overlap between the fields in a `value` object and the `typed_value` schema. If there is an overlap, the behavior of the reader is undefined.

This makes sense, I think we should just state the requirement that the check is not necessary and that it is undefined behavior if there is overlap.

Contributor Author

I like the part about not required to check, but the behavior cannot be undefined. If you don't have to check, then it should be a valid read and we need to know what value should be returned.


Below is a sample of JSON that would be fully shredded in this example.
It contains an array of objects, containing an `a` field shredded as an array, and a `b` field shredded as an integer.
Readers can assume that a value is not an object if `typed_value` is null and that `typed_value` field values are correct; that is, readers do not need to read the `value` column if `typed_value` fields satisfy the required fields.
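A hedged Python sketch of that reconstruction rule for a partially shredded object (hypothetical helper; it assumes the residual `value` object and the shredded fields have already been decoded):

```python
def reconstruct_object(value_obj, shredded_fields):
    """Merge shredded fields into the residual object (sketch, not normative).

    value_obj: dict decoded from the `value` column, or None if it is null.
    shredded_fields: maps each shredded field name to its (value, typed_value) pair.
    """
    result = dict(value_obj or {})
    for name, (field_value, field_typed) in shredded_fields.items():
        if field_typed is not None:
            result[name] = field_typed   # shredded values can always be trusted
        elif field_value is not None:
            result[name] = field_value   # present, but stored as a Variant value
        # if both are None, the field is missing and is not added
    return result
```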
Contributor

I don't think I understand this, is it possible to illustrate with an example?

Contributor Author

This is again establishing that the shredded values can always be trusted even if there is an encoding problem.

For example, take the case above where two fields of an object are being projected. This is saying that it is not necessary to read the object's value field to check for invalid cases:

  • If the shredded fields are defined but value does not contain an object
  • If the shredded fields are duplicated or conflicting with fields in a value object

Contributor

Thanks for the clarification. I think maybe this should be grouped with line 159 to make the context clearer?

Contributor Author

Moved.

By the same token, `variant_value` may be absent, which is equivalent to their value being always null (in which case the field will always have the value Null or have the type of the `typed_value` column).
| Variant Type | Equivalent Parquet Type |
|-----------------------------|------------------------------|
| boolean | BOOLEAN |
Contributor

I think this goes back to earlier debates about allowed lossiness at the storage layer but it seems we might lose a lot of benefits to shredding if a single variant has multiple exact integer representations (e.g. mix of int32 and int64)

Contributor

I think the state of this is that the storage will not do any conversions at all. However, the engine itself is allowed to "normalize" variants to optimize. In this case, engines will probably normalize within "exact numerics" in order to make shredding more effective.

Contributor

As I digest the types, I feel like the concept of exact numeric is purely an engine concern and is an entirely new concept for parquet that isn't really explicitly modeled anywhere in the spec. Given this, I think my preferences would be:

  1. Remove the concept of this type of equality from the parquet spec and leave it in the engine.
  2. Make conversion to JSON an appendix as a recommendation
  3. Make sure we design the variant in parquet to be optimized with shredding independently from the engine (I think this means adding a type identifier discriminant field for the shredded column). Also, as I stated in another comment, it would be good to ensure the bit order for values makes it so that we can make the widest use of stats (e.g. type info is in the most significant bits) for the non-shredded value (IIUC this might require a second version for the non-shredded value, depending on what the current bit order means)

Thoughts?

Contributor Author

The intent is to leave normalization and conversion to engines, but I don't think that we need to remove the Variant logical types in order to do that.

I also think that we need to have a reliable JSON representation or else we will get different behavior and conventions across engines. We want people to be able to rely on what the JSON means, even if it is necessarily lossy in some cases.

Contributor

The intent is to leave normalization and conversion to engines, but I don't think that we need to remove the Variant logical types in order to do that.

I think this might just be a philosophical difference of opinion. But I believe Parquet implementations should have the flexibility to appropriately optimize across physical type.

I think we should either:

  1. Not have the notion of logical type correspondence as part of variant in parquet. Under this assumption we should provide enough modeling that parquet can do a reasonable job shredding and reproducing the exact physical types after shredding. Parquet is used in a lot of contexts and I think in many cases the engines might not be very sophisticated here.
  2. Not require parquet to reproduce the exact physical types when reading back values (i.e. allow it to normalization).

Both approaches have their strengths, but what I would like to avoid is a tight coupling between the engine and parquet libraries to achieve efficient storage of variants.

Contributor

The original motivation for introducing the "logical type" for the Variant binary encoding was to separate the "logical" data value, from the "physical" storage of the value, so that optimizations may be performed without altering the "logical" value. The value is important, not the storage type. For example, it shouldn't matter if the string "hello" is stored as a short-string or a string, and it shouldn't matter if the integer 1 is stored as an int8 or int16. Within the same "logical type" the equivalent value can be stored in different ways.

We haven't fully defined the equality semantics yet, but the goal for the logical types is that if we had 2 Variant values:

  • "hello" (as a short-string)
  • "hello" (as a string)
    those Variant values are considered equivalent, even though they are physically encoded differently.

Contributor

@gene-db This makes sense. I think we need to decide at the parquet level to either:

  1. Use this definition of logical type and make it valid for parquet to do normalization (e.g. physical types put in are not necessarily physical types you get out).
  2. Make it possible for parquet to do lossless normalization (e.g. store values in the widest type that makes sense, and then also store metadata to retrieve types).

Contributor

Use this definition of logical type and make it valid for parquet to do normalization (e.g. physical types put in are not necessarily physical types you get out).

My opinion is that parquet the format should not "do" normalization (though it would be fine if the writer implementations did)

I think the state of this is that the storage will not do any conversions at all. However, the engine itself is allowed to "normalize" variants to optimize.

I believe this writeup from @gene-db describes what I would expect from a format. In summary:

  1. The spec itself doesn't require / permit any normalization of values
  2. The implementations of writers (e.g. parquet-rs or java-rs) could have options to automatically coerce types

My rationale is that one normalization scheme is unlikely to work for all use cases, so mandating something in the spec will unnecessarily constrain users.

@etseidl, @dsgibbons and others have been heading this way in parquet-rs recently (for example apache/arrow-rs#6828), where normalizing data / schema might be preferred for wider compatibility, but users can still choose to use the raw data and maintain a lossless roundtrip.



optional group shredded_variant_name (VARIANT) {
  required binary metadata;
  optional binary value;
  optional int64 typed_value;
Contributor

Suggested change
optional int64 typed_value;
// The exact semantics of this field are discussed in detail below, but this column stores the variant value when it is an integer.
optional int64 typed_value;


Consider the following example:
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).
Contributor
@emkornfield emkornfield Dec 4, 2024

Suggested change
Statistics for `typed_value` columns can be used for file, row group, or page skipping when `value` is always null (missing).
Min/Max statistics for `typed_value` columns may be used for row group and page skipping when `value` is always null (missing). The statistics may also be used if it can be inferred that all non-null `value` entries are irrelevant to the filter (e.g. if all values are Variant null (`00`) and a filter specifies equality to an integer).

Contributor Author

I don't think that we should make this more specific because anything further risks making assumptions about the semantics of the filter. The example on the dev list demonstrates this. This can't state that strings can be ignored for numeric filters because the filter may cast a string and result in a valid value.

cloud-fan pushed a commit to apache/spark that referenced this pull request Dec 10, 2024
### What changes were proposed in this pull request?

It implements the variant rebuild functionality according to the current shredding spec in apache/parquet-format#461, and allows the Parquet reader to read shredded variant data.

### Why are the changes needed?

It gives Spark the basic ability to read shredded variant data. It can be improved in the future to read only requested fields.

### Does this PR introduce _any_ user-facing change?

Yes, the Parquet reader will be able to read shredded variant data.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48851 from chenhao-db/rebuild_variant.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit to apache/spark that referenced this pull request Dec 11, 2024
### What changes were proposed in this pull request?

The `variantShreddingSchema` method converts a human-readable schema for Variant to one that's a valid shredding schema. According to the shredding schema in apache/parquet-format#461, each shredded field in an object should be a required group - i.e. a non-nullable struct. This PR fixes the `variantShreddingSchema` to mark that struct as non-nullable.

### Why are the changes needed?

If we use `variantShreddingSchema` to construct a schema for Parquet, the schema would be technically non-conformant with the spec by setting the group as optional. I don't think this should really matter to readers, but it would waste a bit of space in the Parquet file by adding an extra definition level.

### Does this PR introduce _any_ user-facing change?

No, this code is not used yet.

### How was this patch tested?

Added a test to do some minimal validation of the `variantShreddingSchema` function.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49118 from cashmand/SPARK-48898-nullability.

Authored-by: cashmand <david.cashman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
These represent a fixed schema suitable for constructing the full Variant value for each row.
For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant.
Contributor

Suggested change
For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant.
A partial projection occurs with a query like `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl`. In this case an engine only needs to load field `event_ts`, and if `event_ts` column is shredded, it can be read without reading or deserializing the rest of the `event` Variant.


For an `object`, a null field means that the field does not exist in the reconstructed Variant object.
All elements of an `array` must be non-null, since array elements cannot be missing.
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
Contributor
@emkornfield emkornfield Dec 11, 2024

Suggested change
If both fields are non-null and either is not an object, the value is invalid. Readers must either fail or return the `typed_value`.
If both fields are non-null and either is not an object, the value is invalid. Reader behavior is undefined[^2].
[^2] The correct thing to do in this situation is rewrite the data file to fix the inconsistency. The correct way of rewriting the file can only be determined based on the bug that produced the inconsistency. Readers are encouraged to error this case to help identify writer bugs as soon as possible. Erroring might not be appropriate or possible given reader optimizations. If a reader does not error in this case, then favoring the shredded values would provide more consistent results (i.e. it works across both partial-projection and full reconstruction) but this does not make using the shredded values correct, simply more consistent. Similarly, any other type of "read side fix" is discouraged, but might be used by specific readers to help their consumers until data can be rewritten.

@rdblue @julienledem I think this language might capture all of the concerns (I think this wording applies to all potential inconsistencies).

Member

@rdblue and I also talked about this for a long time and I think I favor the current text. I feel like the additional text adds a bit of confusion around this.

A shredded reading of a field is always correct, since a shredded reader will not be able to check an unshredded value for inconsistency; a reader using an un-shredded value when the shredded value is present is always incorrect.

Contributor

@rdblue and I also talked about this for a long time and I think I favor the current text. I feel like the additional text adds a bit of confusion around this.

We also talked about it in the sync and didn't come to a conclusion. IIUC @rdblue wanted the error handling language to eliminate the possibility of one specific reader trying to define an alternative behavior down the road. I don't think we should be mandating this in the spec, but I do agree we should be clarifying this won't be relitigated.

A shredded reading of a field is always correct, since a shredded reader will not be able to check an unshredded value for inconsistency; a reader using an un-shredded value when the shredded value is present is always incorrect.

I think semantics are important here. I tried to cover this by saying it is "consistent", i.e. it would always provide consistent results to the end user, which is a nice property to have, and I agree most implementations should use it if they aren't going to error out.

"correct", I think, is a property of actually returning the variant data without modification. Without understanding the bugs that introduced inconsistent shredding, I don't think it is possible to say for sure that the shredded values are correct.

The reason why I think it is important to say it is not "correct" is that otherwise, in other instances (e.g. requiring no overlap between shredded/not-shredded fields), writers could produce out-of-spec files without facing any issues. (As an alternative, I think it would be OK to have overlapping values as long as they are consistent; then the real question becomes what happens if the values are not actually consistent, and in that case I think it would be very hard to understand what the correct results are.)

As an analogy from another part of parquet (and a real world example I've encountered): assume a schema like list<struct<required a int, required b int>>. We do not require readers to check that repetition levels and definition levels are consistent for columns a and b (and many don't). If they are not equal it is a bug; we can't really say which one, or whether either, is "correct" without understanding the bug that introduced the inconsistency. A "consistent" result would be to always use the left-most column's repetition and definition levels.

Contributor

To maybe go a little bit deeper here, I think it is worth enumerating a matrix of potential operations and data inconsistencies (see below). The main purpose of these examples is to show that:

  1. Requiring using typed_value or failure pessimizes some cases, or requires that different operations return inconsistent results (making the value of trying to force consistency in some cases have less utility).
  2. Preferring typed_value in the presence of some inconsistencies seems like an arbitrary choice made simply for the sake of not exposing bugs in the underlying data.

I think the way of rectifying this is either to add more cases that are considered "valid" (e.g. shredded object fields might overlap with fields in value as long as they are consistent) and/or leave the behavior undefined. My preference would simply be to do the latter.

I think there are three main operations that will be performed with Variant:

  1. Project as primitive, non-strict - Return a field as a specific primitive value; if a field mismatches the type, return null in its place.
  2. Project as primitive, strict - Return a field as a specific primitive value; if a field mismatches the type, the query fails.
  3. Project as variant - Project a field as a variant type, allowing mixed types to be returned (this would also cover projecting the top-level variant to get the original value).

Primitive inconsistencies

Assume for all these cases a column is shredded as int32 as its typed value.

Inconsistency: Both typed_value and value are present for the same cell and the values are consistent with one another (e.g. all values in the value column are int32 and exactly equal to those in typed_value).

Operation: Project field as int32, non-strict. In this case engines would prefer to always use typed_value and never need to read value.

Operation: Project field as string, non-strict. In this case only the value column would need to be read, but it would return results consistent with using typed_value.

Operation: Project field as int32, strict. If it is required that a field that can be projected always is (IIUC this isn't currently mandated by the spec), then this operation should fail on statistics, as there would be a non-null value column. Otherwise this effectively requires the same logic as projecting a variant, described below.

Operation: Project field as string, strict. This would fail based on statistics from typed_value.

Operation: Project field as variant. In this case both typed_value and value would need to be read and merged. When merging, the reader could choose to always take the non-null value from typed_value, ignoring any present values in value (assuming corresponding values are null). This would however lead to a strange state where, after all values are merged, fewer cells from value would be read than are present. An alternative would be to check for consistency between repetition/definition levels, realize they are both defined, and still favor typed_value, or to check for consistency and continue reading as long as value and typed_value are consistent.

Inconsistency: Both typed_value and value are present for the same cell and are inconsistent with one another (i.e. value contains a string).

Operation: Project field as int32, non-strict. In this case engines would prefer to always use typed_value and never need to read value. The int32 value is returned over the string value.

Operation: Project field as string, non-strict. In this case only the value column would need to be read, but it would return the inconsistent string value in the cells that conflict.

Operation: Project field as int32, strict. Same as the data inconsistency above.

Operation: Project field as string, strict. Same as the data inconsistency above (fail based on statistics).

Operation: Project as variant. Same as the data inconsistency above, but if the reader is checking for consistency between values to decide what to do, as the current spec is written it would arbitrarily choose the int32 value (valid alternatives seem like failure or returning the string value).

Object inconsistencies

In this case assume the object is shredded with a single child field "a" as int32. Objects overlap with the cases outlined above for consistency/inconsistency, so they are not covered here; only somewhat novel cases are discussed below (merging is taken one step further because field overlap either must be detected or duplicate fields could be added when reconstructing the full variant).

Inconsistency: The value object contains a non-object value (e.g. string) in the same cell that has a present value for the shredded column "a".

Operation: Project as variant. In this case, as I read the spec, it is valid for readers to return {a: <some int value>}. Other valid options seem to be using the non-object value or failing.

Inconsistency: The typed_value object is marked as non-present but the value column is present and contains an object.

Operation: Project field as object, only keeping field "a", non-strict. In this case readers would ignore the value column based on the spec and return null for the inconsistent fields.

Operation: Project field keeping only field "c" as variant. In this case the reader should be able to ignore typed_value (it only contains an object with "a" as a shredded column, so it can be ignored) and return {"c": <some value>}, but that is not what the specification states.

cloud-fan pushed a commit to apache/spark that referenced this pull request Dec 13, 2024
### What changes were proposed in this pull request?

The variantShreddingSchema method converts a human-readable schema for Variant to one that's a valid shredding schema. According to the shredding schema in apache/parquet-format#461, each shredded field in an object should be a required group - i.e. a non-nullable struct. This PR fixes the variantShreddingSchema to mark that struct as non-nullable.

### Why are the changes needed?

If we use variantShreddingSchema to construct a schema for Parquet, the schema would be technically non-conformant with the spec by setting the group as optional. I don't think this should really matter to readers, but it would waste a bit of space in the Parquet file by adding an extra definition level.

### Does this PR introduce _any_ user-facing change?

No, this code is not used yet.

### How was this patch tested?

Added a test to do some minimal validation of the variantShreddingSchema function.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49151 from cashmand/SPARK-48898-nullability-again.

Authored-by: cashmand <david.cashman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
On the other hand, shredding as a different logical type is not allowed.
For example, the integer value 123 could not be shredded to a string `typed_value` column as the string "123", since that would lose type information.
It would need to be written to the `variant_value` column.
Each shredded field in the `typed_value` group is represented as a required group that contains optional `value` and `typed_value` fields.
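For illustration, a writer-side sketch of that rule for an int64 `typed_value` column (Python; in a real writer the `value` side would be Variant-encoded bytes rather than the raw value):

```python
def shred_as_int64(v):
    """Route a value into typed_value only when its logical type matches the
    shredded type; anything else, e.g. the string "123", stays in value."""
    if isinstance(v, int) and not isinstance(v, bool):
        return None, v        # (value, typed_value): shredded
    return v, None            # kept in value with its original type

shred_as_int64(123)    # (None, 123)
shred_as_int64("123")  # ('123', None)
```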


Are there any cases where a field's name might not be a valid Parquet group name, making it impossible to shred? E.g. are there any restrictions on whitespace or non-printable characters in Parquet? Or any that should be discouraged for shredding, because readers are known to have trouble with them?

Contributor

There are no specific restrictions in the spec that I can recall. There are certainly engines that don't support some characters, but shredded columns should hopefully never reach engine schemas, since only the top-level variant would ideally be displayed. For non-displayable characters, general display and SQL projection functions would likely become cumbersome.

Some reference implementations might have issues with periods (.), but from an API perspective it isn't clear to me whether there should be a separate API anyway to interrogate whether certain variants are shredded. This might be a design question to raise on the mailing list before beginning implementation.

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Dec 17, 2024
}
required group measurement (VARIANT) {
  required binary metadata;
  optional binary value;

Hi @rdblue, I have thought of a specific scenario: suppose the JSON is a struct and has 4000 key-value pairs, of which 100 keys are intended to construct typed_value. The remaining 3900 keys will not be shredded; in this case, they will all be stored in value.

Is it possible for us to define a maximum number of keys contained in one binary value, such as, in this case, storing a maximum of 1000, so that v1, v2, v3, v4 can be generated to reduce scan I/O?

Furthermore, we could define the concept of a group and stipulate that certain keys are divided into one binary value1.

Contributor

@Zouxxyy could you expand on this use case? Is there a specific query pattern that this type of division would improve performance for?


@emkornfield For the cols that we do not intend to shred, if we categorize them into four groups and happen to query fields within one of those groups, then our scan I/O can still be reduced by three-quarters.
Of course, I feel that the concept of a group will make shredding more complex (currently, it is concise and clear enough); I'm just looking to explore the possibilities and hear opinions.
