Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

source-mysql: Treat CDC updates which change the primary key as implicit deletions #2185

Merged
merged 4 commits into from
Dec 10, 2024

Conversation

willdonnelly
Copy link
Member

@willdonnelly willdonnelly commented Dec 5, 2024

Description:

This PR adds tests to source-{postgres,mysql,sqlserver} exercising the capture behavior when a row is updated in a way which modifies its primary key. There are two ways this could be handled:

  1. Just transcribe the update literally, with the new row as the collection key and the old primary-key value in the before-state metadata.
  2. Generate a synthetic pair of a delete event for the old primary-key value, and an insert event for the new value.

Option (2) is more correct because it results in a deletion of the old row being propagated to any standard / merge update materializations, and also maintains the "the first mention of any previously-nonexisting key should be an insert event" invariant that is sometimes useful. Especially when combined with hard deletions, this means that the destination table will continue to precisely match the source, whereas option (1) would leave behind a bunch of vestigial rows corresponding to old primary-key values which were superseded by later updates.

Previously, SQL Server implemented option (2) while MySQL and Postgres implemented option (1). This was the path of least resistance for each of these captures. The actual meat of this PR is that now source-mysql implements option (2) as well by explicitly checking whether the row key is changed between the before and after states of an update event, and emitting the aforementioned synthetic pair of delete/inserts if so.

Postgres is not updated in this PR, because while the actual detection logic would be basically identical there's some additional change event plumbing which assumes that each CDC event from the database will yield at most one change event internally, and fixing that is slightly more involved so I'm punting on it for the moment in order to get MySQL fixed sooner. We should come back and address that soon so all the SQL CDC connectors are consistent.

Workflow steps:

Going forward, updates to MySQL tables which modify the primary key of a row will be captured as implicitly deleting the old row-state and inserting the new row-state.

Any preexisting stale rows from primary-key updates which occurred before this change went live will not be cleaned up, remedying that will require manual work or a complete backfill of the impacted dataflows.


This change is Reviewable

@willdonnelly willdonnelly requested a review from a team December 5, 2024 21:25
Copy link
Contributor

@Alex-Bair Alex-Bair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

sqlcapture/capture.go Outdated Show resolved Hide resolved
@willdonnelly willdonnelly force-pushed the wgd/sql-cdc-deletions-20241205 branch from ee46b90 to c880b40 Compare December 9, 2024 22:51
New test demonstrating current PK update behavior.
New test demonstrating current PK update behavior.
New test demonstrating current PK update behavior. Unlike Postgres
and MySQL, the SQL Server behavior is already the more-correct one
where it issues a separate delete and create event.
This is fairly straightforward: if the primary key is changed by
a CDC update event, then instead of an update document we should
capture a synthetic deletion of the old row-state and an insert
of the new row-state.

Interestingly, since we base the implicit-delete decision on
the serialized row key, which is based on the configured key
columns of a binding, which typically come from the collection
spec, this should work correctly even if the output collection
key differs from the source DB primary key. But I haven't tested
that, I was just thinking of the obvious weird cases and I think
that's accurate.
@willdonnelly willdonnelly force-pushed the wgd/sql-cdc-deletions-20241205 branch from c880b40 to 291023b Compare December 9, 2024 22:55
@willdonnelly willdonnelly merged commit f6b2444 into main Dec 10, 2024
52 of 53 checks passed
@willdonnelly willdonnelly deleted the wgd/sql-cdc-deletions-20241205 branch December 10, 2024 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants