
materialize-databricks: materialize base64 encoded strings as BINARY columns #1563

Merged
Merged 4 commits into main from wb/mat-contenttype on May 9, 2024

Conversation

williamhbaker
Member

@williamhbaker commented May 7, 2024

Description:

Adds the capability to materialize base64 encoded strings as BINARY columns to materialize-databricks.

This change is not fully backward compatible: existing materializations with base64-annotated string fields will fail validation against their previously created STRING columns. In addition, the new cast from base64 to bytes would insert raw bytes into the STRING column, which technically doesn't break the materialization but is a departure from previous behavior.

As such, this change has a version bump of the connector, and tasks can be migrated over to it by re-backfilling any bindings with base64 strings in their source collections.
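The behavioral difference can be sketched in Python (hypothetical values; Spark's unbase64 plays the role of base64.b64decode here):

```python
import base64

# Hypothetical document value: a base64-annotated string field.
encoded = "aGVsbG8gd29ybGQ="  # base64 encoding of b"hello world"

# v1 behavior: the base64 string itself is stored in a STRING column.
v1_stored = encoded

# v2 behavior: the string is decoded (as Spark's unbase64 does) and the
# raw bytes are stored in a BINARY column.
v2_stored = base64.b64decode(encoded)

print(v1_stored)  # aGVsbG8gd29ybGQ=
print(v2_stored)  # b'hello world'
```

This is why updating an existing v1 task in place would either fail validation (STRING column vs. expected BINARY) or, absent that check, write raw bytes where a base64 string used to go.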

Manual testing done, in addition to running all of the unit and integration tests:

  • Materialize a collection with base64-encoded strings using the existing v1 version
  • See that the v1 version of the task creates string columns for these
  • Try to update the task to the v2 image built from these changes and see that it fails validation
  • Re-backfill the tables as part of updating to v2 and see that it works
  • For good measure, add some more data to the collection so that it does more inserts and merges, and see that those work too

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



@williamhbaker force-pushed the wb/mat-contenttype branch 3 times, most recently from 727b8a9 to 2edd184, May 8, 2024 21:00
@williamhbaker marked this pull request as ready for review May 8, 2024 21:35
@williamhbaker requested a review from mdibaiee May 8, 2024 21:35
@@ -21,29 +21,29 @@ SELECT 0, `a-schema`.target_table.flow_document
 USING (
   (
     SELECT
-      key1, key2, boolean, integer, `unsigned-integer`, number, string, flow_document
+      key1::BIGINT, key2::BOOLEAN, unbase64(binary)::BINARY as binary, boolean::BOOLEAN, integer::BIGINT, `unsigned-integer`::DECIMAL(20), number::DOUBLE, string::STRING, flow_document::STRING
Member Author:

I think this is one of the more interesting changes: doing the casting in the SELECT and then not doing it in the rest of the merge query seems to work, and it simplifies the weird stuff I had to do in the casting template for handling binary columns, which need their base64 strings unbase64'd into bytes to store correctly in BINARY columns. Just calling this out since conceptually I think it works, and it seems to work in tests, but I know we have learned lessons about casting that I might be missing here.
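A rough Python analogy for the simplification described above, with hypothetical field values: decoding once in the source projection yields the same bytes as decoding at every point of use, so the rest of the merge template no longer needs to know about the cast.

```python
import base64

row = {"binary": "aGVsbG8="}  # hypothetical base64-encoded field value

# Old template shape: apply the conversion at each point of use
# (e.g. separately in the INSERT and UPDATE branches of the merge).
inserted = base64.b64decode(row["binary"])
updated = base64.b64decode(row["binary"])

# New template shape: convert once in the source projection (the inner
# SELECT) and reuse the already-decoded value downstream.
projected = {**row, "binary": base64.b64decode(row["binary"])}

assert inserted == updated == projected["binary"] == b"hello"
```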

Member:

LGTM, it is simpler actually, and I can't spot any practical difference between this and the previous approach.

@@ -213,10 +213,6 @@ type binding struct {
 	// into the target table. Note that in case of delta updates, "needsMerge"
 	// will always be false
 	needsMerge bool
-
-	mergeInto string
Member Author:
These variables weren't used anywhere, so this is some opportunistic cleanup.

@mdibaiee (Member) left a comment:

LGTM, thanks!

materialize-databricks/sqlgen.go — review comment (outdated, resolved)
…columns

Adds the capability to materialize base64 encoded strings as BINARY columns to
materialize-databricks.

This change is not fully backward compatible: existing materializations with base64-annotated
string fields will fail validation against their previously created STRING columns. In addition,
the new cast from base64 to bytes would insert raw bytes into the STRING column, which technically
doesn't break the materialization but is a departure from previous behavior.

As such, this change has a version bump of the connector, and tasks can be migrated over to it by
re-backfilling any bindings with base64 strings in their source collections.
Updates the snapshot outputs for materializations, now that there is a "binary_field" with a
base64-encoded string in the tests.

All of the SQL materializations other than Databricks still handle this as a string.
Elasticsearch and DynamoDB have always handled binary fields, and these test updates reflect that.
…fields with pre-existing string columns

We need this hack for now to make sure that v1 tasks cannot be updated to v2 without failing
validation for pre-existing string columns that should be binary. A re-backfill of these tables is
required.
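A minimal sketch of the constraint described above, using a hypothetical function and messages rather than the connector's actual validation code: a base64-annotated field whose pre-existing column is STRING is rejected, forcing a re-backfill before v2 can apply.

```python
def validate_binary_field(existing_column_type):
    """Hypothetical validation check for a base64-annotated field.

    Returns (ok, reason). existing_column_type is None when the target
    table (or column) does not exist yet, in which case a BINARY column
    can simply be created.
    """
    if existing_column_type == "STRING":
        return False, "pre-existing STRING column must be re-backfilled as BINARY"
    return True, ""

# A v1 table has a STRING column for this field: validation fails.
ok, reason = validate_binary_field("STRING")
print(ok, reason)  # False pre-existing STRING column must be re-backfilled as BINARY

# A re-backfilled (fresh) table has no pre-existing column: validation passes.
ok, _ = validate_binary_field(None)
print(ok)  # True
```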
@williamhbaker merged commit 7a9f8ac into main May 9, 2024
44 of 46 checks passed
@williamhbaker deleted the wb/mat-contenttype branch May 9, 2024 16:50