source-mysql: Treat CDC updates which change the primary key as implicit deletions #2185
Description:
This PR adds tests to source-{postgres,mysql,sqlserver} exercising the capture behavior when a row is updated in a way which modifies its primary key. There are two ways this could be handled:

1. Capture the change as an ordinary update event for the row under its new primary key.
2. Capture the change as a synthetic pair of events: a deletion of the old row-state followed by an insertion of the new row-state.

Option (2) is more correct because it propagates a deletion of the old row to any standard / merge update materializations, and it also maintains the sometimes-useful invariant that the first mention of any previously-nonexistent key should be an insert event. Especially when combined with hard deletions, this means that the destination table will continue to precisely match the source, whereas option (1) would leave behind vestigial rows corresponding to old primary-key values which were superseded by later updates.
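To make the difference concrete, here is a toy model of a destination table keyed by primary key. It is only a sketch: the map stands in for a merge-style materialization with hard deletions enabled, and `applyPlainUpdate` and `applyDeleteInsert` are hypothetical helper names, not part of any connector.

```go
package main

import "fmt"

// applyPlainUpdate models option (1): the destination only sees an upsert
// under the new key, so nothing ever removes the row stored under the old key.
// (Hypothetical helper for illustration.)
func applyPlainUpdate(dest map[int]string, newKey int, value string) {
	dest[newKey] = value
}

// applyDeleteInsert models option (2): the old key is deleted and the new
// row-state is inserted, assuming the destination applies hard deletions.
// (Hypothetical helper for illustration.)
func applyDeleteInsert(dest map[int]string, oldKey, newKey int, value string) {
	delete(dest, oldKey)
	dest[newKey] = value
}

func main() {
	// The source row starts as id=1 and is then updated to id=2.
	optionOne := map[int]string{1: "alice"}
	applyPlainUpdate(optionOne, 2, "alice")

	optionTwo := map[int]string{1: "alice"}
	applyDeleteInsert(optionTwo, 1, 2, "alice")

	// Option (1) keeps a vestigial row for id=1; option (2) matches the source.
	fmt.Println(len(optionOne), len(optionTwo)) // prints: 2 1
}
```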
Previously, SQL Server implemented option (2) while MySQL and Postgres implemented option (1). This was the path of least resistance for each of these captures. The actual meat of this PR is that source-mysql now implements option (2) as well, by explicitly checking whether the row key changes between the before and after states of an update event and, if so, emitting the aforementioned synthetic delete/insert pair.

Postgres is not updated in this PR. While the actual detection logic would be basically identical, there's some additional change event plumbing which assumes that each CDC event from the database will yield at most one change event internally. Fixing that is slightly more involved, so I'm punting on it for the moment in order to get MySQL fixed sooner. We should come back and address that soon so all the SQL CDC connectors are consistent.
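The key-change check described above could look roughly like the following. This is a minimal sketch rather than the connector's actual code: `ChangeEvent`, `ChangeOp`, `keyChanged`, and `translateUpdate` are hypothetical names, and comparing column values with `reflect.DeepEqual` is an assumption (a real implementation might compare encoded row keys instead).

```go
package main

import (
	"fmt"
	"reflect"
)

// ChangeOp and ChangeEvent are hypothetical types used only for this sketch.
type ChangeOp string

const (
	OpInsert ChangeOp = "Insert"
	OpUpdate ChangeOp = "Update"
	OpDelete ChangeOp = "Delete"
)

type ChangeEvent struct {
	Op     ChangeOp
	Before map[string]any
	After  map[string]any
}

// keyChanged reports whether any primary-key column differs between the
// before and after states of an updated row.
func keyChanged(keyColumns []string, before, after map[string]any) bool {
	for _, col := range keyColumns {
		if !reflect.DeepEqual(before[col], after[col]) {
			return true
		}
	}
	return false
}

// translateUpdate turns one update from the database into one or two change
// events: a delete of the old row-state plus an insert of the new row-state
// when the key changed, otherwise a single ordinary update.
func translateUpdate(keyColumns []string, before, after map[string]any) []ChangeEvent {
	if keyChanged(keyColumns, before, after) {
		return []ChangeEvent{
			{Op: OpDelete, Before: before},
			{Op: OpInsert, After: after},
		}
	}
	return []ChangeEvent{{Op: OpUpdate, Before: before, After: after}}
}

func main() {
	before := map[string]any{"id": 1, "name": "a"}
	after := map[string]any{"id": 2, "name": "a"}
	for _, ev := range translateUpdate([]string{"id"}, before, after) {
		fmt.Println(ev.Op)
	}
}
```

Note that `translateUpdate` returns a slice, so one database event can yield two change events. That is the one-to-many shape the Postgres plumbing currently does not accommodate.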
Workflow steps:
Going forward, updates to MySQL tables which modify the primary key of a row will be captured as implicitly deleting the old row-state and inserting the new row-state.
Any preexisting stale rows from primary-key updates which occurred before this change went live will not be cleaned up. Remedying that will require manual work or a complete backfill of the impacted dataflows.