materialize-databricks: materialize base64 encoded strings as BINARY columns #1563
Force-pushed from 727b8a9 to 2edd184
@@ -21,29 +21,29 @@ SELECT 0, `a-schema`.target_table.flow_document
 USING (
   (
     SELECT
-      key1, key2, boolean, integer, `unsigned-integer`, number, string, flow_document
+      key1::BIGINT, key2::BOOLEAN, unbase64(binary)::BINARY as binary, boolean::BOOLEAN, integer::BIGINT, `unsigned-integer`::DECIMAL(20), number::DOUBLE, string::STRING, flow_document::STRING
I think this is one of the more interesting changes: doing the casting in the SELECT, and then not doing it in the rest of the merge query, seems to work, and it simplifies the awkward casting template I had for handling binary columns, which need their base64 strings unbase64'd into bytes to store correctly in BINARY columns. Just calling this out since conceptually I think it works, and it seems to work in tests, but I know we have learned lessons about casting that I might be missing here.
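The pattern being described can be sketched roughly as below. This is an illustrative outline only, not the connector's actual merge template: the table, view, and column names are hypothetical, and the real query handles many more columns and types.

```sql
-- Sketch: all casts happen once, in the inner SELECT of the USING clause.
-- The ON / UPDATE / INSERT clauses then reference already-typed columns,
-- so no casting needs to be repeated anywhere else in the merge query.
MERGE INTO target_table AS t
USING (
  SELECT
    key1::BIGINT AS key1,
    unbase64(binary)::BINARY AS binary,  -- decode base64 text into raw bytes
    flow_document::STRING AS flow_document
  FROM staging_view
) AS s
ON t.key1 = s.key1
WHEN MATCHED THEN
  UPDATE SET t.binary = s.binary, t.flow_document = s.flow_document
WHEN NOT MATCHED THEN
  INSERT (key1, binary, flow_document) VALUES (s.key1, s.binary, s.flow_document)
```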
LGTM, it's actually simpler, and I can't spot any practical difference between this and the previous approach.
@@ -213,10 +213,6 @@ type binding struct {
 	// into the target table. Note that in case of delta updates, "needsMerge"
 	// will always be false
 	needsMerge bool
-
-	mergeInto string
These variables weren't used anywhere, so this is some opportunistic cleanup.
LGTM, thanks!
…columns
Adds the capability to materialize base64 encoded strings as BINARY columns to materialize-databricks. This change is not fully backward compatible, since existing materializations with string fields annotated as base64-encoded will fail validation against their previously created STRING columns. Also, the new casting from base64 to bytes will result in raw bytes being inserted into the STRING column, which technically doesn't break the materialization, but is a departure from previous behavior. As such, this change has a version bump of the connector, and tasks can be migrated over to it by re-backfilling any bindings with base64 strings in their source collections.
Updates the snapshot outputs for materializations, now that there is a "binary_field" with a base64-encoded string in the tests. All of the SQL materializations other than databricks still just handle this as a string. Elasticsearch and Dynamodb have always handled binary fields, and these test updates reflect that.
…fields with pre-existing string columns
We need this hack for now to make sure that v1 tasks cannot be updated to v2 without failing validation for pre-existing STRING columns that should be BINARY. A re-backfill of these tables is required.
Force-pushed from 0522254 to 6c3ce60
Description:
Adds the capability to materialize base64 encoded strings as BINARY columns to materialize-databricks.
This change is not fully backward compatible: existing materializations with string fields annotated as base64-encoded will fail validation against their previously created STRING columns. Also, the new casting from base64 to bytes will result in raw bytes being inserted into the STRING column, which technically doesn't break the materialization, but is a departure from previous behavior.
As such, this change has a version bump of the connector, and tasks can be migrated over to it by re-backfilling any bindings with base64 strings in their source collections.
Manual testing done, in addition to running all of the unit and integration tests.
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)