Spec: add variant type #10831

aihuaxu · 2024-07-31T23:22:06Z

Spec: add variant type

Proposal: https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/edit

This lays out the spec for the variant type. The specs live in the Parquet project (see the variant spec and the shredding spec).

Fix: #10392

aihuaxu · 2024-07-31T23:37:24Z

cc @rdblue, @RussellSpitzer and @flyrain

RussellSpitzer · 2024-08-01T15:26:42Z

I do want to make sure we don't do a hostile fork of the Spark spec here, so let's make sure we get their support for moving the spec here before we merge. At the same time, we should start going through the wording and continue to discuss the specs. I still think that would be easier to do in a public Google Doc than in GitHub, though, IMHO.

sfc-gh-aixu · 2024-08-01T16:30:08Z

Definitely. It's not ready to merge yet; I'm mostly trying to get the comments in place. It makes sense to move that to a Google Doc and link it here.

format/spec.md

RussellSpitzer · 2024-10-17T18:28:53Z

This needs some notes in Partition Transforms; I think we should explicitly disallow identity.

For Appendix B, we should define something or state explicitly that we don't define it for variant.

Appendix C: we'll need some details on the JSON serialization, since that's going to have to define some string representations, I think.

Under Sort Orders, we should probably note that you cannot sort on a Variant?

Appendix D: Single-Value Serialization needs an entry; we can probably write "Not supported" for now. JSON needs an entry too.

RussellSpitzer · 2024-10-17T18:29:37Z

And an entry in the Parquet section: https://github.com/apache/iceberg/blob/main/format/spec.md#parquet

aihuaxu · 2024-10-18T18:50:58Z

Thanks @RussellSpitzer, I missed those sections and have just updated them.

I marked partition transforms, sorting, and hashing as not supported/allowed for now.
For Appendix C, I think it should just be `variant`, similar to a primitive type, since that section describes the Iceberg schema, as I understand it.

format/spec.md

rdblue · 2024-10-24T21:39:55Z

format/spec.md

@@ -444,6 +449,9 @@ Sorting floating-point numbers should produce the following behavior: `-NaN` < `

 A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.

+Note:
+
+1. The ability to sort `variant` columns and the specific sort order is determined by the engines.


Do we need this? I think anything we don't specify is up to engines already.

OK. I will remove that then. Do we need to call out "Variant values cannot be present in an Iceberg sort order"?

I think we should specifically forbid sort orders containing a variant. I think the spec is actually underspecified here.

We have the following checks in the reference implementation:

iceberg/api/src/main/java/org/apache/iceberg/SortOrder.java, lines 301 to 311 in 8a16a41:

    ValidationException.check(
        sourceType != null, "Cannot find source column for sort field: %s", field);
    ValidationException.check(
        sourceType.isPrimitiveType(),
        "Cannot sort by non-primitive source field: %s",
        sourceType);
    ValidationException.check(
        field.transform().canTransform(sourceType),
        "Invalid source type %s for transform: %s",
        sourceType,
        field.transform());

So currently, even though we don't specify this here, you cannot make a sort order with array or map. I think we should explicitly call this out and add variant as well. My real concern here is that we add the ability to sort on something but don't define what that sorting actually looks like.
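
To make the proposed rule concrete, here is a minimal, self-contained sketch of a sort-order check that rejects variant alongside list and map. The `TypeKind` enum and the `SortOrderCheck` class are hypothetical stand-ins, not the `org.apache.iceberg` API, and treating variant as non-primitive is an assumption made for the illustration.

```java
final class SortOrderCheck {
    // Hypothetical, simplified type model; not the real Iceberg type hierarchy.
    enum TypeKind {
        INT(true), STRING(true), STRUCT(false), LIST(false), MAP(false),
        VARIANT(false); // assumption: variant is treated as non-primitive

        private final boolean primitive;

        TypeKind(boolean primitive) {
            this.primitive = primitive;
        }

        boolean isPrimitive() {
            return primitive;
        }
    }

    private SortOrderCheck() {}

    // Mirrors the intent of the reference checks: a sort field needs a known,
    // primitive source column, so variant is rejected just like list and map.
    static void validateSortField(String fieldName, TypeKind sourceType) {
        if (sourceType == null) {
            throw new IllegalArgumentException(
                "Cannot find source column for sort field: " + fieldName);
        }
        if (!sourceType.isPrimitive()) {
            throw new IllegalArgumentException(
                "Cannot sort by non-primitive source field: " + sourceType);
        }
    }

    // Convenience wrapper that reports the outcome instead of throwing.
    static boolean isSortable(TypeKind sourceType) {
        try {
            validateSortField("example", sourceType);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (TypeKind kind : TypeKind.values()) {
            System.out.println(kind + " sortable: " + isSortable(kind));
        }
    }
}
```

With a check like this, the spec text and the implementation would agree: a sort order can only reference columns whose ordering is actually defined.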

format/spec.md

RussellSpitzer · 2024-11-01T18:50:47Z

format/spec.md

@@ -1436,6 +1457,7 @@ This serialization scheme is for storing single values as individual binary valu
 | **`struct`**                 | Not supported                                                                                                |
 | **`list`**                   | Not supported                                                                                                |
 | **`map`**                    | Not supported                                                                                                |
+| **`variant`**                | Not supported                                                                                                |


I do agree this should be "Not supported" for now. Then, when shredding is included, we can say something like: for shredded variants only, the binary value is the concatenation of metadata and value plus a separator byte, or something similar. We can figure that out with the shredding addition, though.

If we don't include Variant here, then we don't need to include it in the JSON section either.

It seems the binary representation is used for lower and upper bounds, and the JSON single-value serialization is used for default values. It looks like they are defined independently? But since there is no default value for Variant, I will remove it from the JSON section.

format/spec.md

rdblue · 2024-11-21T19:38:07Z

format/spec.md

@@ -178,6 +178,21 @@ A **`list`** is a collection of values with some element type. The element field

 A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.

+#### Semi-structured Types
+
+A **`variant`** is a value that stores semi-structured data. The structure and data types in a variant are not necessarily consistent across rows in a table or data file. The variant type and binary encoding are defined in the [Parquet project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md). Support for Variant is added in Iceberg v3.


This link should be to main/master rather than a specific SHA, right?

I worried about this, since aren't we syncing with a specific iteration of the file?

Basically, we link to the specific version to be implemented in Iceberg. Later, e.g., when we add additional data types, we should also update the link here.

I think linking to a specific version of the file does not make the intent very clear. We should be very specific in this doc about which parts are intended to be supported in Iceberg v3.

I think we don't want to duplicate the content of the actual spec in Parquet. Basically, whatever is mentioned in the Parquet spec should be included.

To be more specific, I think we should say that only V1 of the Parquet variant spec is included (i.e., not try to address it by a specific link but by a specific version from Parquet).

I agree that we would eventually do that once Parquet starts releasing it. Since it's in progress right now, I will link like this and update later.

format/spec.md

rdblue · 2024-11-21T19:39:57Z

format/spec.md

@@ -444,7 +459,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ

 | Transform name    | Description                                                  | Source types                                                                                              | Result type |
 |-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
-| **`identity`**    | Source value, unmodified                                     | Any                                                                                                       | Source type |
+| **`identity`**    | Source value, unmodified                                     | Any other than `variant`                                                                                  | Source type |


What about "except for"? I think that is clearer than "other than".

Should we also fix the bucket function definition by using an except for list?

Makes sense. Let me fix that in a follow-up PR.
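
The "Any except for `variant`" wording for identity can be read as the complement of an exclusion set. A minimal sketch, assuming a hypothetical and deliberately abbreviated `TypeKind` enum (it is not the Iceberg API and does not list every Iceberg type):

```java
import java.util.EnumSet;
import java.util.Set;

final class IdentitySourceTypes {
    // Hypothetical, abbreviated list of type names, for illustration only.
    enum TypeKind { INT, LONG, STRING, DATE, VARIANT }

    // "Any except for variant", expressed as the complement of an exclusion set.
    static final Set<TypeKind> IDENTITY_SOURCES =
        EnumSet.complementOf(EnumSet.of(TypeKind.VARIANT));

    private IdentitySourceTypes() {}

    static boolean canApplyIdentity(TypeKind source) {
        return IDENTITY_SOURCES.contains(source);
    }

    public static void main(String[] args) {
        for (TypeKind kind : TypeKind.values()) {
            System.out.println(kind + " allowed for identity: " + canApplyIdentity(kind));
        }
    }
}
```

An exclusion set stays correct when new types are added to the enum, whereas an allow-list has to be extended each time; that is one argument for the "except for" phrasing.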

rdblue

Looks good to me other than a few minor comments.

Proposal: https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit

emkornfield · 2024-11-22T23:10:09Z

format/spec.md

@@ -1154,6 +1169,7 @@ Maps with non-string keys must use an array representation with the `map` logica
 |**`struct`**|`record`||
 |**`list`**|`array`||
 |**`map`**|`array` of key-value records, or `map` when keys are strings (optional).|Array storage must use logical type name `map` and must store elements that are 2-field records. The first field is a non-null key and the second field is the value.|
+|**`variant`**|`record` with `metadata` and `value` fields. `metadata` and `value` must not be assigned field IDs. |Shredding is not supported in Avro.|


We should probably be consistent here and note across all formats that field IDs should not be assigned to these fields?

I added the same note for the Parquet and ORC formats. I see that ORC uses ICEBERG_ID_ATTRIBUTE to track the field ID, but the concept seems to be similar.

emkornfield · 2024-11-26T17:36:07Z

format/spec.md

+As a semi-structured type, there are important differences between variant and Iceberg's other types:
+1. Variant arrays are similar to lists, but may contain any variant value rather than a fixed element type.
+2. Variant objects are similar to structs, but may contain variable fields identified by name and field values may be any variant value rather than a fixed field type.
+3. Variant primitives are narrower than Iceberg's primitive types: time, timestamp_ns, timestamptz_ns, uuid, and fixed(L) are not supported.


I imagine the Parquet changes to the variant spec would be merged before we release v3?

Are you talking about the Variant spec changes apache/parquet-format#461 and apache/parquet-format#464? I think they will be.

Yes, mostly 464. Is there a reason not to support fixed(L)? I suppose it is redundant with string?

Yes. We can use string instead. Also, we would need to think about how to represent a fixed(L) of length L if we want to support it. We would need to encode L in the type to avoid losing that information, but our type field only has 5 bits.

I think that for fixed(L), using the same representation as string works (there can be multiple values of L encoded in a variant): with size == L, this should give enough flexibility, so the type would just be "Fixed". I suppose one could get more complex than this, but I'm not sure there is value, given how things are currently modelled.

I'm assuming that for the regular fixed(L) data type, when we insert a string, it will be truncated to length L if it's too long.

If we don't have the length L, do you mean we just store the original strings, which could be longer than the defined L? Then I guess we are not actually supporting, e.g., the fixed(16) type. The behavior I would expect is to truncate the string to 16 bytes when we store it and return fixed(16) when we read it back.

If the type is only "fixed", then we will not read it back or write it as fixed(16)? Let me know if I misunderstand.

Yes, I think we may be working from different assumptions. My assumption is that engines pass in valid fixed(L) values (i.e., the length of a value of this type is always exactly L). No truncation should happen at the storage layer.

So the encoding would be <value header fixed><L as int32 (maybe int16 is sufficient)><string value>

Yeah. I think that works. Even with truncation, that should happen on the engine side.

We can consider adding that to the spec if there is a need.
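
As an illustration of the layout floated in this thread, `<value header fixed><L as int32><value bytes>`, here is a minimal sketch. Nothing here is part of the Variant spec: the header byte is a made-up placeholder (no type ID is assigned for fixed), little-endian byte order is an assumption, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class FixedEncodingSketch {
    // Hypothetical placeholder; no "fixed" type ID exists in the Variant spec.
    static final byte HYPOTHETICAL_FIXED_HEADER = (byte) 0x7F;

    private FixedEncodingSketch() {}

    // Layout discussed in the thread: <header byte><L as int32><L value bytes>.
    static byte[] encode(byte[] value, int declaredLength) {
        if (value.length != declaredLength) {
            // No truncation at the storage layer: the engine must pass exactly L bytes.
            throw new IllegalArgumentException(
                "Expected " + declaredLength + " bytes, got " + value.length);
        }
        ByteBuffer buf = ByteBuffer.allocate(1 + Integer.BYTES + value.length)
            .order(ByteOrder.LITTLE_ENDIAN); // byte order is an assumption
        buf.put(HYPOTHETICAL_FIXED_HEADER);
        buf.putInt(declaredLength);
        buf.put(value);
        return buf.array();
    }

    // Reads the value back and checks that the stored L matches the payload size.
    static byte[] decode(byte[] encoded) {
        ByteBuffer buf = ByteBuffer.wrap(encoded).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.get() != HYPOTHETICAL_FIXED_HEADER) {
            throw new IllegalArgumentException("Not a fixed value");
        }
        int length = buf.getInt();
        if (length != buf.remaining()) {
            throw new IllegalArgumentException("Length prefix does not match payload");
        }
        byte[] value = new byte[length];
        buf.get(value);
        return value;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new byte[16], 16);
        System.out.println("encoded size: " + encoded.length);
        System.out.println("decoded size: " + decode(encoded).length);
    }
}
```

Storing L next to each value lets one variant column hold fixed values of different lengths, at the cost of four bytes per value (two if int16 turns out to be sufficient).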

findepi · 2024-12-05T13:43:59Z

format/spec.md

@@ -182,6 +182,21 @@ A **`list`** is a collection of values with some element type. The element field

 A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.

+#### Semi-structured Types
+
+A **`variant`** is a value that stores semi-structured data. The structure and data types in a variant are not necessarily consistent across rows in a table or data file. The variant type and binary encoding are defined in the [Parquet project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md). Support for Variant is added in Iceberg v3.


From the linked document:

> Important: This specification is still under active development, and has not been formally adopted.

Assuming the Parquet-level spec is subject to change, what are the conditions for releasing Iceberg 3 with variant support in the spec?

Secondly, the linked document talks about shredding.
How does this interact with Iceberg field IDs in the parquet metadata?
Do all the columns share field ID, or is only the first column supposed to be annotated with the field ID?
Let's make it explicit.

From the previous discussion, the community is interested in both basic Variant type support and shredding for better performance. As far as I can see, the basic variant encoding is settled (we could still add additional types), but I think we need to finalize the shredding spec so the encoding doesn't change.

Regarding shredding, are you referring to the shredded subcolumns of a Variant? I'm thinking that we can clarify this in the shredding spec (probably after https://github.com/apache/parquet-format/pull/461/files#diff-95f43ac21fdadae78c95da23444ed7a4036a4993e9faa2ee5d8b2c29ef6d8056). The top-level variant column has the field ID, and the subcolumns are accessed through a path like location.latitude.

findepi · 2024-12-05T13:45:47Z

format/spec.md

+
+A **`variant`** is a value that stores semi-structured data. The structure and data types in a variant are not necessarily consistent across rows in a table or data file. The variant type and binary encoding are defined in the [Parquet project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md). Support for Variant is added in Iceberg v3.
+
+Variants are similar to JSON with a wider set of primitive values including date, timestamp, timestamptz, binary, and floating points.


If this documents the difference from JSON, let's skip floats and add decimals.
If this documents all the types variant supports, let's add integers, decimals, and string.

Updated to add decimals and remove floats.

findepi · 2024-12-05T13:53:35Z

format/spec.md

@@ -1208,6 +1224,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo
 | **`struct`**       | `group`                                                            |                                             |                                                                |
 | **`list`**         | `3-level list`                                                     | `LIST`                                      | See Parquet docs for 3-level representation.                   |
 | **`map`**          | `3-level map`                                                      | `MAP`                                       | See Parquet docs for 3-level representation.                   |
+| **`variant`**      | `group` with `metadata` and `value` fields. `metadata` and `value` must not be assigned field IDs.| `VARIANT`                                   | See Parquet docs for Variant encoding and Variant shredding encoding. |


If these don't have field IDs, how should the reader locate the contents of a variant Iceberg field?
Are these mapped by name?

Maybe that's obvious, but let's make it explicit.

For variant type groups in Parquet, the `value` and `metadata` fields are fixed and are read by name. Let me add that.
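
A minimal sketch of that lookup rule, assuming a hypothetical `Node` class as a stand-in for a file schema node (this is not the parquet-java API): the variant column itself is resolved by its Iceberg field ID, and its `metadata` and `value` children, which carry no IDs, are resolved by name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class VariantLookupSketch {
    // Hypothetical stand-in for a schema node; not a real Parquet class.
    static final class Node {
        final String name;
        final Integer fieldId; // null when no field ID is assigned
        final List<Node> children;

        Node(String name, Integer fieldId, List<Node> children) {
            this.name = name;
            this.fieldId = fieldId;
            this.children = children;
        }
    }

    private VariantLookupSketch() {}

    // Top-level columns, including the variant group, are resolved by field ID.
    static Node findById(List<Node> columns, int fieldId) {
        for (Node column : columns) {
            if (column.fieldId != null && column.fieldId == fieldId) {
                return column;
            }
        }
        throw new IllegalArgumentException("No column with field ID " + fieldId);
    }

    // The children of a variant group have no IDs, so they are resolved by name.
    static Node childByName(Node group, String name) {
        for (Node child : group.children) {
            if (child.name.equals(name)) {
                return child;
            }
        }
        throw new IllegalArgumentException("No field named " + name + " in " + group.name);
    }

    public static void main(String[] args) {
        Node variant = new Node("payload", 2, Arrays.asList(
            new Node("metadata", null, Collections.<Node>emptyList()),
            new Node("value", null, Collections.<Node>emptyList())));
        List<Node> schema = Arrays.asList(
            new Node("id", 1, Collections.<Node>emptyList()), variant);

        Node found = findById(schema, 2);
        System.out.println(childByName(found, "metadata").name);
        System.out.println(childByName(found, "value").name);
    }
}
```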

XBaith · 2024-12-11T09:55:23Z

format/spec.md

@@ -1208,6 +1224,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo
 | **`struct`**       | `group`                                                            |                                             |                                                                |
 | **`list`**         | `3-level list`                                                     | `LIST`                                      | See Parquet docs for 3-level representation.                   |
 | **`map`**          | `3-level map`                                                      | `MAP`                                       | See Parquet docs for 3-level representation.                   |
+| **`variant`**      | `group` with `metadata` and `value` fields. `metadata` and `value` must not be assigned field IDs and the fields are accessed through names.| `VARIANT`                                   | See Parquet docs for Variant encoding and Variant shredding encoding. |


It might be helpful to include a link to the documentation for easier reference. What do you think?

See Parquet docs for Variant encoding and Variant shredding encoding.

Sure. There is a discussion above about whether to link to the files on the main branch or at a particular commit. For now, I will link to the files at a specific commit to reflect the current state.

github-actions bot added the Specification label (issues that may introduce spec changes) on Jul 31, 2024

aihuaxu force-pushed the variant-type-spec branch from b868ea6 to 1a0404b on July 31, 2024 23:24

flyrain reviewed Aug 1, 2024

aihuaxu force-pushed the variant-type-spec branch from e51c8e6 to 5a8acf1 on October 9, 2024 20:59

aihuaxu marked this pull request as ready for review on October 9, 2024 20:59

aihuaxu force-pushed the variant-type-spec branch from 5a8acf1 to 408ad2d on October 9, 2024 21:03

RussellSpitzer reviewed Oct 10, 2024

aihuaxu force-pushed the variant-type-spec branch from 285d009 to f7adbbc on October 17, 2024 15:53

aihuaxu requested a review from RussellSpitzer on October 17, 2024 15:53

aihuaxu force-pushed the variant-type-spec branch from f7adbbc to 3e91ce9 on October 17, 2024 16:05

aihuaxu requested review from flyrain and sfc-gh-aixu on October 17, 2024 17:22

aihuaxu force-pushed the variant-type-spec branch from 3e91ce9 to 0bc975e on October 18, 2024 18:43

RussellSpitzer reviewed Oct 18, 2024

flyrain reviewed Oct 18, 2024

aihuaxu requested review from RussellSpitzer and flyrain on October 21, 2024 18:23

aihuaxu force-pushed the variant-type-spec branch from 6673520 to 3aabac4 on October 21, 2024 20:46

rdblue reviewed Oct 24, 2024

aihuaxu force-pushed the variant-type-spec branch from d953b6e to 67df611 on October 29, 2024 06:24

sfc-gh-rspitzer reviewed Nov 1, 2024

RussellSpitzer reviewed Nov 1, 2024

aihuaxu force-pushed the variant-type-spec branch 2 times, most recently from c8f9e7e to 0550fcf, on November 5, 2024 17:58

aihuaxu requested review from RussellSpitzer and sfc-gh-rspitzer on November 5, 2024 17:59

rdblue reviewed Nov 21, 2024

rdblue approved these changes Nov 21, 2024

sfc-gh-aixu and others added 5 commits on November 22, 2024 09:23:

ac95432: Spec: add variant type (Proposal: https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit)

1c395ba: Address comments

8be7de1: Update sorting/bucketing/json sections for variant

ab122ea: Update sort info

40c3714: Update the Variant description and add Avro/ORC type mappings

aihuaxu force-pushed the variant-type-spec branch from 0550fcf to ad3a14b on November 22, 2024 19:12

emkornfield reviewed Nov 22, 2024

aihuaxu force-pushed the variant-type-spec branch 2 times, most recently from cd1e8d3 to ab3b0e1, on November 24, 2024 05:20

aihuaxu requested a review from emkornfield on November 24, 2024 05:22

emkornfield reviewed Nov 26, 2024

findepi reviewed Dec 5, 2024

aihuaxu force-pushed the variant-type-spec branch from ab3b0e1 to fd79e4d on December 8, 2024 21:41

XBaith reviewed Dec 11, 2024

a472370: Address comments

aihuaxu force-pushed the variant-type-spec branch from fd79e4d to a472370 on December 14, 2024 15:30
