Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

source-mysql: Correctly handle unsigned integer values #1564

Merged
merged 5 commits into from
May 8, 2024

Conversation

willdonnelly
Copy link
Member

@willdonnelly willdonnelly commented May 7, 2024

Description:

Previously there was a bug where values of an unsigned integer column which were received via replication would incorrectly be cast to the bit-equivalent signed integer value. This applied to all widths of integer -- a TINYINT UNSIGNED column could demonstrate this with value 222 being incorrectly captured as -34.

This applied only to replication, backfilled values were and are still handled correctly.

The root of the issue is that (unlike the value encoding used for backfill query results) the MySQL binlog value representation doesn't distinguish based on the signedness of integer values, only their bit-width. Consequently the client library parsed all such values as signed integers of the appropriate width and we serialized them unmodified.

The fix was to keep track of an additional signed/unsigned flag as part of the column type metadata for integer columns. When the column is unsigned we cast the value to an unsigned integer and where necessary (primarily for the 24-bit MEDIUMINT type but we'll handle any weirdness that might theoretically come up by going intN -> int64 -> uint64 -> mask) trim off any sign-extended upper bits.

Workflow steps:

No user action is required, except in cases where we believe that unsigned integer values might previously have been captured incorrectly. In those cases we should re-backfill the impacted bindings, ideally discarding old data and materializing into a fresh destination table.


This change is Reviewable

@willdonnelly willdonnelly requested a review from a team May 7, 2024 22:36
Comment on lines 479 to 486
if sval, ok := asInt64(val); ok && t.Unsigned {
return uint64(sval) & 0xFF, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to work, but I don't grok why we need to cast to int64 and then to uint64, instead of directly from int8 to uint8?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem like asInt64 could instead be an asUnsigned function which does a dynamic check of the actual integer type, and then does a re-interpret cast from signed -> unsigned (if you have an int8 value, return uint8(value)).

Though I suppose mediumint may require a bit more specialized handling (an isMedium bool)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, on reflection I think the asInt64 helper is an unnecessary bit of complexity. I'll replace it with type assertions for the appropriate integer sizes in each case (int8 for tinyint, int16 for smallint, etc) and only apply a mask to mediumint values.

Copy link
Member

@mdibaiee mdibaiee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one comment for me to understand how this works, and probably a good idea to have that answer as a comment

Copy link
Member

@jgraettinger jgraettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments

Also, can you speak to how migration of the state checkpoints work, from "foo": "int" => "foo": {"type": "int"} ?

Has this dynamic "maybe it's a string or object" always been supported, and you're just using it here?

]
},
"id": {
"type": "integer"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nice-to-have is that we could add a minimum: 0 constraint here 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 479 to 486
if sval, ok := asInt64(val); ok && t.Unsigned {
return uint64(sval) & 0xFF, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem like asInt64 could instead be an asUnsigned function which does a dynamic check of the actual integer type, and then does a re-interpret cast from signed -> unsigned (if you have an int8 value, return uint8(value)).

Though I suppose mediumint may require a bit more specialized handling (an isMedium bool)?

@willdonnelly
Copy link
Member Author

willdonnelly commented May 8, 2024

Also, can you speak to how migration of the state checkpoints work, from "foo": "int" => "foo": {"type": "int"} ?

Has this dynamic "maybe it's a string or object" always been supported, and you're just using it here?

Yes, we previously implemented support for struct-valued type information in MySQL so that we could store the list of enum/set options, and we're just reusing that same capability here to store metadata about integer signedness.

Regarding migration, I don't believe that any of the changes I made in this PR would cause older metadata with a bare string "___int" type to stop working the same way it always has. So captures with metadata from before this fix will continue to have the capturing-unsigned-as-signed bug going forward, but since backfilling the binding is necessary to be sure you've got corrected values and that will also update the metadata this seems acceptable to me.

Currently we capture the values of an unsigned integer column
correctly via backfills, but incorrectly cast uintN->intN when
processing them as replication events (because the value encoding
used in the MySQL binlog doesn't distinguish between signed and
unsigned integers of a particular bit-width).

The behavior is currently incorrect and the snapshot demonstrates
the incorrect negative values obtained via replication.
The bulk of this churn is because of the type metadata changes to support unsigned
integer columns better. There's also a handful of snapshot changes from the recent
PR which made the default backfill mode be unfiltered when the key of a table is a
text column type. Since CI builds have been passing those changes must be from the
long tests that we skip during CI.
…GNED`

The `TestUnsignedIntegers` test shows that we correctly handle replicated
values of unsigned integer columns given correct column metadata from the
initial discovery. This test shows that the DDL query parsing logic also
produces correct metadata for such column types.
Adds a `minimum: 0` clause to the generated JSON schema for unsigned
integer column types.
@willdonnelly willdonnelly force-pushed the wgd/mysql-unsigned-bigint-20240507 branch from fa4c14e to 87c4886 Compare May 8, 2024 16:10
@willdonnelly willdonnelly merged commit 3cd7467 into main May 8, 2024
45 of 46 checks passed
@willdonnelly willdonnelly deleted the wgd/mysql-unsigned-bigint-20240507 branch May 8, 2024 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants