Overview of the Issue
With the keyspace_events buffer implementation, we see that buffering sometimes stops before the failover has actually been detected and processed by the healthcheck stream, causing buffered queries to be sent to the demoted primary.
Here's the log output from one vtgate process:
{"_time":"2024-07-22T13:23:30.708+00:00","message":"Starting buffering for shard: <redacted>/20-30 (window: 5s, size: 1000, max failover duration: 5s) (A failover was detected by this seen error: vttablet: rpc error: code = Code(17) desc = The MySQL server is running with the --super-read-only option so it cannot execute this statement (errno 1290) (sqlstate HY000) (CallerID: issues_pull_requests_rw_1).)"}
{"_time":"2024-07-22T13:23:33.840+00:00","message":"Starting buffering for shard: <redacted>/30-40 (window: 5s, size: 1000, max failover duration: 5s) (A failover was detected by this seen error: vttablet: rpc error: code = Code(17) desc = The MySQL server is running with the --super-read-only option so it cannot execute this statement (errno 1290) (sqlstate HY000) (CallerID: issues_pull_requests_rw_1).)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"Adding 1 to PrimaryPromoted counter for target: keyspace:\"<redacted>\" shard:\"20-30\" tablet_type:REPLICA, tablet: <redacted>-0171233832, tabletType: PRIMARY"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/80-90 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/c0-d0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/b0-c0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/e0-f0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/-10 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/f0- resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/60-70 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/40-50 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/d0-e0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/50-60 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/90-a0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/10-20 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/a0-b0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/70-80 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/20-30 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"Stopping buffering for shard: <redacted>/20-30 after: 3.1 seconds due to: a primary promotion has been detected. Draining 50 buffered requests now."}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/30-40 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"Stopping buffering for shard: <redacted>/30-40 after: 0.0 seconds due to: a primary promotion has been detected. Draining 1 buffered requests now."}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"Draining finished for shard: <redacted>/30-40 Took: 195.979µs for: 1 requests."}
{"_time":"2024-07-22T13:23:33.960+00:00","message":"FailoverTooRecent-<redacted>/30-40: NOT starting buffering for shard: <redacted>/30-40 because the last failover which triggered buffering is too recent (104.215357ms < 1m0s). (A failover was detected by this seen error: Code: CLUSTER_EVENT"}
{"_time":"2024-07-22T13:23:34.016+00:00","message":"Draining finished for shard: <redacted>/20-30 Took: 159.349351ms for: 50 requests."}
{"_time":"2024-07-22T13:23:34.642+00:00","message":"not marking healthy primary <redacted>-0171231759 as Up for <redacted>/20-30 because its PrimaryTermStartTime is smaller than the highest known timestamp from previous PRIMARYs <redacted>-0171233832: -62135596800 < 1721654613 "}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"Adding 1 to PrimaryPromoted counter for target: keyspace:\"<redacted>\" shard:\"30-40\" tablet_type:REPLICA, tablet: <redacted>-0171233041, tabletType: PRIMARY"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/50-60 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/60-70 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/40-50 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/d0-e0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/70-80 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/90-a0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/10-20 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/a0-b0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/30-40 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/20-30 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/f0- resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/80-90 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/c0-d0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/b0-c0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/e0-f0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/-10 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:37.282+00:00","message":"not marking healthy primary <redacted>-0171232808 as Up for <redacted>/30-40 because its PrimaryTermStartTime is smaller than the highest known timestamp from previous PRIMARYs <redacted>-0171233041: -62135596800 < 1721654616 "}
{"_time":"2024-07-22T13:23:38.961+00:00","message":"FailoverTooRecent-<redacted>/30-40: skipped 6 log messages"}
{"_time":"2024-07-22T13:23:38.961+00:00","message":"Execute: skipped 3 log messages"}
I think what's happening here is that the primaries of the 20-30 and 30-40 shards went into read-only mode due to the external failover at roughly the same time, which caused buffering to start on both shards in quick succession.
Once the primary failover on shard 20-30 was done and Vitess was notified of the new primary via a TabletExternallyReparented call, the whole keyspace was detected as consistent again, including the 30-40 shard, which was still in the middle of its external failover. This stopped buffering on both the 20-30 and the 30-40 shards, even though the 30-40 shard had not failed over yet.
Write queries against the 30-40 shard then failed noticeably until the external failover finished.
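To make the race concrete, here is a minimal sketch of the behavior described above. These are not Vitess's actual types; `shardState` and `onKeyspaceConsistent` are simplified stand-ins modeling a keyspace-wide "consistent" event that stops buffering on every shard, regardless of whether that shard's own failover has completed.

```go
package main

import "fmt"

// shardState is a simplified per-shard view: buffering starts per shard,
// but the keyspace event watcher resolves consistency keyspace-wide.
type shardState struct {
	name       string
	buffering  bool
	failedOver bool // has this shard's external failover actually completed?
}

// onKeyspaceConsistent models the problematic behavior: when the keyspace
// is reported consistent, buffering stops on every shard. Shards whose
// failover has not finished drain their buffered queries to the demoted,
// read-only primary.
func onKeyspaceConsistent(shards []*shardState) (drainedTooEarly []string) {
	for _, s := range shards {
		if s.buffering {
			s.buffering = false // drain buffered requests now
			if !s.failedOver {
				drainedTooEarly = append(drainedTooEarly, s.name)
			}
		}
	}
	return drainedTooEarly
}

func main() {
	shards := []*shardState{
		{name: "20-30", buffering: true, failedOver: true},  // TER already received
		{name: "30-40", buffering: true, failedOver: false}, // still mid-failover
	}
	fmt.Println(onKeyspaceConsistent(shards)) // drains 30-40 too early
}
```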
Reproduction Steps
N/A
Binary Version
v17+
Operating System and Environment details
N/A
Log Fragments
N/A
@deepthi @vmg This wasn't an issue in v17 and earlier with --buffer_implementation=healthcheck, but that implementation was deprecated and removed in v18.
I'm a bit at a loss as to how this could be fixed. Buffering starts because the vtgate notices that the vttablet is in read-only mode (but still serving), but keyspace events don't know about this and instead make decisions based solely on the serving state of the primary (which in this case is happily reporting that it's up and healthy even though it's in read-only mode during the external failover).
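One possible direction, sketched below with hypothetical types (neither `primaryHealth` nor `shardResolved` exists in Vitess): the per-shard resolution check could require not just that the primary is serving, but also that it is not still rejecting writes with a read-only error, before the shard counts toward keyspace consistency.

```go
package main

import "fmt"

// primaryHealth is a hypothetical summary of a shard primary's state,
// combining the serving flag with whether recent writes failed with a
// read-only error (e.g. errno 1290 / --super-read-only, as in the logs).
type primaryHealth struct {
	serving  bool
	readOnly bool
}

// shardResolved is a hypothetical per-shard check: a shard only counts
// as resolved when its primary is serving AND writable.
func shardResolved(h primaryHealth) bool {
	return h.serving && !h.readOnly
}

func main() {
	// During the external failover the demoted primary reports serving=true
	// but is read-only, so the shard would not be treated as consistent yet.
	fmt.Println(shardResolved(primaryHealth{serving: true, readOnly: true}))
	fmt.Println(shardResolved(primaryHealth{serving: true, readOnly: false}))
}
```

This is only a sketch of the decision logic; the hard part would be plumbing the read-only signal from the vtgate's error observations into the keyspace event watcher.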