[Protocol Change Request] Type Widening table feature (delta-io#2624)
* Type Widening Protocol RFC

* Clarified requirements

* Add requirement on garbage collecting type change metadata

* Update protocol_rfcs/type-widening.md

Co-authored-by: Ryan Johnson <ryan.johnson@databricks.com>

* Update protocol_rfcs/type-widening.md

Co-authored-by: Ryan Johnson <ryan.johnson@databricks.com>

* Update protocol_rfcs/type-widening.md

Co-authored-by: Bart Samwel <bart.samwel@databricks.com>

* Clarify reader & writer requirements re: unsupported type changes

* Fix metadata example: s/int/integer

---------

Co-authored-by: Ryan Johnson <ryan.johnson@databricks.com>
Co-authored-by: Bart Samwel <bart.samwel@databricks.com>
3 people authored Feb 27, 2024
1 parent 7d41fb7 commit 5d25578
Showing 2 changed files with 149 additions and 0 deletions.
1 change: 1 addition & 0 deletions protocol_rfcs/README.md
@@ -19,6 +19,7 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
| Date proposed | RFC file | Github issue | RFC title |
|:-|:-|:-|:-|
| 2023-02-02 | [in-commit-timestamps.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/in-commit-timestamps.md) | https://github.com/delta-io/delta/issues/2532 | In-Commit Timestamps |
| 2023-02-09 | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening |

### Accepted RFCs

148 changes: 148 additions & 0 deletions protocol_rfcs/type-widening.md
@@ -0,0 +1,148 @@
# Type Widening
**Associated Github issue for discussions: https://github.com/delta-io/delta/issues/2623**

This protocol change introduces the Type Widening feature, which enables changing the type of a column or field in an existing Delta table to a wider type.

--------

# Type Widening
> ***New Section after the [Clustered Table](#clustered-table) section***
The Type Widening feature enables changing the type of a column or field in an existing Delta table
to a wider type.

The **supported type changes** are:
- Integer widening: `Byte` -> `Short` -> `Int` -> `Long`
- Floating-point widening: `Float` -> `Double`
- Decimal widening: `Decimal(p, s)` -> `Decimal(p + k1, s + k2)` where `k1 >= k2 >= 0`. `p` and `s` denote the decimal precision and scale respectively.
- Date widening: `Date` -> `Timestamp without timezone`
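
As an illustration, the rules above can be sketched as a small predicate. This is a minimal sketch, not the reference implementation: the function name `is_supported_type_change` is hypothetical, the type names are assumed to be the Delta schema serialization names (`byte`, `short`, `integer`, `long`, `float`, `double`, `date`, `timestamp_ntz`), and the integer chain is assumed to be transitive (for example `byte` -> `long` in one step).

```python
import re

# Integer types ordered from narrowest to widest (assumed schema type names).
INTEGER_ORDER = ["byte", "short", "integer", "long"]
DECIMAL_RE = re.compile(r"^decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)$")


def is_supported_type_change(from_type: str, to_type: str) -> bool:
    """Return True if from_type -> to_type is one of the widenings listed above."""
    if from_type in INTEGER_ORDER and to_type in INTEGER_ORDER:
        # Byte -> Short -> Int -> Long: only a strictly wider integer type.
        return INTEGER_ORDER.index(from_type) < INTEGER_ORDER.index(to_type)
    if (from_type, to_type) == ("float", "double"):
        return True
    # Assumption: "Timestamp without timezone" is serialized as timestamp_ntz.
    if (from_type, to_type) == ("date", "timestamp_ntz"):
        return True
    src, dst = DECIMAL_RE.match(from_type), DECIMAL_RE.match(to_type)
    if src and dst:
        k1 = int(dst.group(1)) - int(src.group(1))  # precision increase
        k2 = int(dst.group(2)) - int(src.group(2))  # scale increase
        # Assumption: k1 == k2 == 0 is "no change" rather than a widening.
        return k1 >= k2 >= 0 and (k1, k2) != (0, 0)
    return False
```

Under these assumptions, `decimal(6, 2)` -> `decimal(10, 4)` is accepted (`k1 = 4`, `k2 = 2`), while `decimal(6, 2)` -> `decimal(7, 4)` is rejected because the scale would grow faster than the precision.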

To support this feature:
- The table must be on Reader Version 3 and Writer Version 7.
- The feature `typeWidening` must exist in the table `protocol`'s `readerFeatures` and `writerFeatures`, either during its creation or at a later stage.

When supported:
- A table may have a metadata property `delta.enableTypeWidening` in the Delta schema set to `true`. Writers must reject widening type changes when this property isn't set to `true`.
- The `metadata` for a column or field in the table schema may contain the key `delta.typeChanges` storing a history of type changes for that column or field.
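
The distinction between "supported" and "enabled" can be sketched as two checks. This is only an illustrative sketch: the helper names are hypothetical, the `protocol` argument is assumed to be the parsed `protocol` action with its `minReaderVersion`, `minWriterVersion`, `readerFeatures` and `writerFeatures` fields, and the table property is assumed to be stored as a string in the `configuration` map of the `metaData` action.

```python
FEATURE_NAME = "typeWidening"
PROPERTY_NAME = "delta.enableTypeWidening"


def supports_type_widening(protocol: dict) -> bool:
    """True if the protocol action lists the feature for both readers and writers."""
    return (
        protocol.get("minReaderVersion") == 3
        and protocol.get("minWriterVersion") == 7
        and FEATURE_NAME in (protocol.get("readerFeatures") or [])
        and FEATURE_NAME in (protocol.get("writerFeatures") or [])
    )


def is_type_widening_enabled(protocol: dict, configuration: dict) -> bool:
    """True if the feature is supported and the table property is set to true."""
    # Assumption: table properties are string-valued, e.g. "true" / "false".
    value = str(configuration.get(PROPERTY_NAME, "false")).lower()
    return supports_type_widening(protocol) and value == "true"
```

A table can therefore support the feature, so that existing `delta.typeChanges` metadata must be honored, while new widening type changes are still rejected because the property is not set to `true`.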

### Type Change Metadata

Type changes applied to a table are recorded in the table schema and stored in the `metadata` of their nearest ancestor [StructField](#struct-field) using the key `delta.typeChanges`.
The value for the key `delta.typeChanges` must be a JSON list of objects, where each object contains the following fields:
Field Name | optional/required | Description
-|-|-
`tableVersion`| required | The version of the table when the type change was applied.
`fromType`| required | The type of the column or field before the type change.
`toType`| required | The type of the column or field after the type change.
`fieldPath`| optional | When updating the type of a map key/value or array element only: the path from the struct field holding the metadata to the map key/value or array element that was updated.

The `fieldPath` value is "key", "value" or "element" when updating the type of a map key, a map value or an array element, respectively.
The `fieldPath` value for nested maps and nested arrays is prefixed by the path of its parent, separated by dots.
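
To make the path convention concrete, here is a minimal sketch of how a `fieldPath` could be resolved against the serialized type of the struct field that holds the metadata. The function name `resolve_field_path` is hypothetical; the `keyType`, `valueType` and `elementType` keys are the ones used in the schema examples of this section.

```python
# Maps (container type, path step) to the schema key holding the child type.
CHILD_KEY = {
    ("map", "key"): "keyType",
    ("map", "value"): "valueType",
    ("array", "element"): "elementType",
}


def resolve_field_path(data_type, field_path=None):
    """Return the type that field_path points at, starting from a field's type."""
    if not field_path:
        return data_type  # the change applies to the struct field itself
    current = data_type
    for step in field_path.split("."):
        kind = current.get("type") if isinstance(current, dict) else None
        key = CHILD_KEY.get((kind, step))
        if key is None:
            raise ValueError(f"step {step!r} does not match type {current!r}")
        current = current[key]
    return current
```

For an array of maps, `"element.value"` first steps into the array element and then into the map value, while `"element.key"` would resolve to the map key type instead.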

The following is an example for the definition of a column that went through two type changes:
```json
{
  "name" : "e",
  "type" : "long",
  "nullable" : true,
  "metadata" : {
    "delta.typeChanges": [
      {
        "tableVersion": 1,
        "fromType": "short",
        "toType": "integer"
      },
      {
        "tableVersion": 5,
        "fromType": "integer",
        "toType": "long"
      }
    ]
  }
}
```

The following is an example for the definition of a column after changing the type of a map key:
```json
{
  "name" : "e",
  "type" : {
    "type": "map",
    "keyType": "double",
    "valueType": "integer",
    "valueContainsNull": true
  },
  "nullable" : true,
  "metadata" : {
    "delta.typeChanges": [
      {
        "tableVersion": 2,
        "fromType": "float",
        "toType": "double",
        "fieldPath": "key"
      }
    ]
  }
}
```

The following is an example for the definition of a column after changing the type of a map value nested in an array:
```json
{
  "name" : "e",
  "type" : {
    "type": "array",
    "elementType": {
      "type": "map",
      "keyType": "string",
      "valueType": "decimal(10, 4)",
      "valueContainsNull": true
    },
    "containsNull": true
  },
  "nullable" : true,
  "metadata" : {
    "delta.typeChanges": [
      {
        "tableVersion": 2,
        "fromType": "decimal(6, 2)",
        "toType": "decimal(10, 4)",
        "fieldPath": "element.value"
      }
    ]
  }
}
```

## Writer Requirements for Type Widening

When Type Widening is enabled (when the table property `delta.enableTypeWidening` is set to `true`), then:
- Writers must reject applying any **unsupported type change**.
- Writers should allow updating the table schema to apply a **supported type change** to a column, struct field, map key/value or array element.
- Writers must record type change information in the `metadata` of the nearest ancestor [StructField](#struct-field). See [Type Change Metadata](#type-change-metadata).
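
The recording step in the last requirement could look roughly like the sketch below. It is an illustration only: `record_type_change` is a hypothetical helper, it operates on a struct field parsed from the schema JSON, and it assumes the writer has already verified that the feature is enabled and that the change is a supported one.

```python
import copy

# Path step -> schema key holding the child type of a map or array.
STEP_TO_KEY = {"key": "keyType", "value": "valueType", "element": "elementType"}


def record_type_change(field, table_version, to_type, field_path=None):
    """Return a copy of field with the type changed and the change recorded."""
    updated = copy.deepcopy(field)
    # Walk down to the dict that directly holds the type being changed.
    holder, key = updated, "type"
    for step in field_path.split(".") if field_path else []:
        holder, key = holder[key], STEP_TO_KEY[step]
    change = {
        "tableVersion": table_version,
        "fromType": holder[key],
        "toType": to_type,
    }
    if field_path:
        change["fieldPath"] = field_path
    holder[key] = to_type
    # Append to the history stored on the nearest ancestor struct field.
    updated.setdefault("metadata", {}).setdefault("delta.typeChanges", []).append(change)
    return updated
```

Applying it twice to a `short` column, at table versions 1 and 5, yields the two-entry history shown in the first example of [Type Change Metadata](#type-change-metadata).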

When Type Widening is supported (when the `writerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
- Writers must preserve the `delta.typeChanges` field in the metadata fields in the schema when a schema is updated.
- Writers can remove an element from a `delta.typeChanges` field in the metadata fields in the schema when all active `add` actions in the latest version of the table have a `defaultRowCommitVersion` value that is not NULL and that is greater than or equal to the `tableVersion` value of that `delta.typeChanges` element.
- Writers must set the `defaultRowCommitVersion` field in new `add` actions to the version number of the log entry containing the `add` action.
- Writers must set the `defaultRowCommitVersion` field in recommitted and checkpointed `add` actions and `remove` actions to the `defaultRowCommitVersion` of the last committed `add` action with the same `path`.

The last two requirements related to `defaultRowCommitVersion` are a subset of the requirements from [Writer Requirements for Row Tracking](#writer-requirements-for-row-tracking) that may be implemented separately without introducing a dependency on the [Row Tracking](#row-tracking) table feature.
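
The garbage collection rule for `delta.typeChanges` elements can be sketched as follows. This is a hedged illustration rather than a prescribed algorithm: `prune_type_changes` is a hypothetical helper, and `active_adds` is assumed to be the list of active `add` actions of the latest table version, parsed into dictionaries.

```python
def prune_type_changes(type_changes, active_adds):
    """Return the delta.typeChanges elements that must still be kept."""
    versions = [add.get("defaultRowCommitVersion") for add in active_adds]
    if any(version is None for version in versions):
        # At least one file has an unknown commit version: nothing is removable.
        return list(type_changes)
    # Keep an element only if some active file was committed before the change,
    # i.e. may still contain data written with the narrower type.
    return [
        change
        for change in type_changes
        if any(version < change["tableVersion"] for version in versions)
    ]
```

With type changes at table versions 1 and 5 and active files committed at versions 3 and 7, only the element for version 1 may be dropped: the file committed at version 3 predates the second change.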

## Reader Requirements for Type Widening
When Type Widening is supported (when the `readerFeatures` field of a table's `protocol` action contains `typeWidening`), then:
- Readers must allow reading data files written before the table underwent any **supported type change**, and must convert such values to the current, wider type.
- Readers must validate that type changes in the `delta.typeChanges` field in the table schema for the table version they are reading are supported and fail when finding any **unsupported type change**.
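
The validation requirement can be sketched as a walk over the table schema that visits every struct field, including fields nested in structs, arrays and maps. This is a minimal sketch under assumptions: `iter_fields` and `validate_type_changes` are hypothetical names, the schema is the parsed schema JSON, and the `is_supported` predicate, which decides whether a single change is a supported widening, is supplied by the caller.

```python
def iter_fields(data_type):
    """Yield every struct field found in data_type, including nested ones."""
    if not isinstance(data_type, dict):
        return  # primitive type name, nothing to visit
    kind = data_type.get("type")
    if kind == "struct":
        for field in data_type["fields"]:
            yield field
            yield from iter_fields(field["type"])
    elif kind == "array":
        yield from iter_fields(data_type["elementType"])
    elif kind == "map":
        yield from iter_fields(data_type["keyType"])
        yield from iter_fields(data_type["valueType"])


def validate_type_changes(schema, is_supported):
    """Raise ValueError on the first recorded type change that is not supported."""
    for field in iter_fields(schema):
        for change in field.get("metadata", {}).get("delta.typeChanges", []):
            if not is_supported(change["fromType"], change["toType"]):
                raise ValueError(
                    f"unsupported type change on field {field['name']!r}: "
                    f"{change['fromType']} -> {change['toType']}"
                )
```

Failing at validation time, before any data file is read, keeps a reader from silently producing wrong values for a type change it does not know how to apply.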

### Column Metadata
> ***Change to existing section (underlined)***
A column metadata stores various information about the column.
For example, this MAY contain some keys like [`delta.columnMapping`](#column-mapping) or [`delta.generationExpression`](#generated-columns) or [`CURRENT_DEFAULT`](#default-columns).
Field Name | Description
-|-
delta.columnMapping.*| These keys are used to store information about the mapping between the logical column name to the physical name. See [Column Mapping](#column-mapping) for details.
delta.identity.*| These keys are for defining identity columns. See [Identity Columns](#identity-columns) for details.
delta.invariants| JSON string contains SQL expression information. See [Column Invariants](#column-invariants) for details.
delta.generationExpression| SQL expression string. See [Generated Columns](#generated-columns) for details.
<ins>delta.typeChanges</ins>| <ins>JSON string containing information about previous type changes applied to this column. See [Type Change Metadata](#type-change-metadata) for details.</ins>
