
VReplication: Improve query buffering behavior during MoveTables traffic switching #15701

Merged · 14 commits · Apr 25, 2024

Conversation

@mattlord (Contributor) commented Apr 11, 2024

Description

This PR is the result of investigating #15707 and tracing the full call path for how we handle keyspace events — specifically, how they are used to detect a MoveTables SwitchTraffic operation, both when it starts and when it finishes, and how queries are handled during that window. The main changes are:

  • Tests
    1. Increasing the QPS for our end-to-end test load generator — taking it from ~500 QPS to ~10,000 QPS in local tests — so that I could reproduce some of the problems
    2. Switching traffic back and forth much more frequently
    3. Waiting between switches to be sure we catch any errant additional buffering windows resulting from a single traffic switch
  • Code
    1. Changing how keyspace events are processed so that a vtgate never misses or skips one
    2. Preventing the query plan cache from being polluted by a query plan that was built against the wrong side of the switch (the before side rather than the after side) while the keyspace event was concurrently being processed and resolved (and thus draining the buffers)
    3. Improving the performance of processing a keyspace event when there is an active MoveTables workflow by getting topo information in parallel
    4. Performing only a single RebuildSrvVSchema operation — which kicks off the keyspace event that is supposed to end the buffering window — per traffic switch (previously, if you were switching all traffic, we did it twice: once for reads and once for writes)
    5. Basing the query retry wait times on the buffering configuration so that queries don't regularly wait far beyond the buffer window and we typically fit the maximum number of retries within the window
    6. Making 1s the minimum for --buffer_min_time_between_failovers so that we don't start a new buffering window concurrently with resolving the keyspace event. Doing so could start an errant second buffering window while the first was being resolved, which caused a lot of additional query latency and errors and left a bad query plan in the cache; future queries with the same plan then went to the wrong side and continuously kicked off more buffering windows, each hitting the maximum window duration because there was no further keyspace event to resolve them. This change addresses the first part of that issue, the second/concurrent buffering window (see the sketch after this list).
      • When retrying a query because a MoveTables traffic switch was detected, we now always clear the plan cache if the final retry was not successful. This prevents the second part of the issue, where the cached bad plan caused the same query shape to fail indefinitely and to continuously kick off more buffering windows. (This is item 2 in the list above.)
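For illustration, here is a minimal sketch of the kind of floor described in item 6. The constant and helper names are assumptions made for this example; this is not the actual Vitess flag-handling code.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed floor from the description above: a new buffering window should not be
// able to start while the previous keyspace event is still being resolved.
const minTimeBetweenFailoversFloor = 1 * time.Second

// clampMinTimeBetweenFailovers enforces the floor on the configured
// --buffer_min_time_between_failovers value (hypothetical helper).
func clampMinTimeBetweenFailovers(configured time.Duration) time.Duration {
	if configured < minTimeBetweenFailoversFloor {
		return minTimeBetweenFailoversFloor
	}
	return configured
}

func main() {
	fmt.Println(clampMinTimeBetweenFailovers(100 * time.Millisecond)) // 1s
	fmt.Println(clampMinTimeBetweenFailovers(5 * time.Second))        // 5s
}
```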

Related Issue(s)

Fixes: #15707

Checklist

Signed-off-by: Matt Lord <mattalord@gmail.com>
vitess-bot (bot) commented Apr 11, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes); new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test; enhancements and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot (bot) added the NeedsBackportReason, NeedsDescriptionUpdate, NeedsIssue, and NeedsWebsiteDocsUpdate labels on Apr 11, 2024
@github-actions github-actions bot added this to the v20.0.0 milestone Apr 11, 2024
@mattlord removed the NeedsWebsiteDocsUpdate and NeedsBackportReason labels on Apr 11, 2024
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>

codecov bot commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 31.52174%, with 63 lines in your changes missing coverage. Please review.

Project coverage is 68.42%. Comparing base (4c2df48) to head (7c8d572).

File                                       Patch %   Lines missing
go/vt/discovery/keyspace_events.go         22.22%    35
go/vt/vtgate/plan_execute.go               57.14%     9
go/vt/topo/etcd2topo/watch.go              60.00%     4
go/vt/vtctl/workflow/traffic_switcher.go    0.00%     4
go/vt/vtctl/workflow/server.go              0.00%     3
go/vt/vtctl/workflow/switcher_dry_run.go    0.00%     3
go/vt/vtctl/workflow/switcher.go            0.00%     2
go/vt/vtgate/buffer/buffer.go               0.00%     2
go/vt/vtgate/buffer/flags.go               50.00%     1
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15701      +/-   ##
==========================================
- Coverage   68.44%   68.42%   -0.02%     
==========================================
  Files        1558     1558              
  Lines      195822   195865      +43     
==========================================
- Hits       134025   134024       -1     
- Misses      61797    61841      +44     


Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord force-pushed the ks_events_improvements branch 4 times, most recently from fddaf21 to be18fc1 on April 12, 2024 14:51
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord removed the NeedsIssue and NeedsDescriptionUpdate labels on Apr 13, 2024
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord marked this pull request as ready for review on April 15, 2024 15:19
@mattlord requested review from vmg and removed the review request for GuptaManan100 on April 15, 2024 15:19
@mattlord added the NeedsWebsiteDocsUpdate label on April 15, 2024
Signed-off-by: Matt Lord <mattalord@gmail.com>
@deepthi (Member) left a comment

@harshit-gangal can you please review the changes to plan_execute.go?

Rest mostly LGTM. I have some minor questions and comments. Once we have harshit's review and your responses, I can review once again.

go/vt/discovery/keyspace_events.go (two review threads resolved)
Version: EtcdVersion(initial.Kvs[0].ModRevision),
Version: EtcdVersion(initial.Kvs[0].Version),
Member:

What is the difference between these two?

@mattlord (Author) commented Apr 17, 2024

Revision, ModRevision, and Version are related but distinct things:

I'll try to summarize:

  • Revision: a logical clock used to track changes to the etcd keyspace
  • Version: a logical clock used to track the changes to an individual key since it was created
  • ModRevision: the etcd keyspace revision number for the most recent update to an individual key (the latest version)

What I tried to do is align the Vitess comments and code with etcd's usage so that we're consistent about which of these we use. As you can see in this example, we were using Version in the Vitess code but setting it to the ModRevision value from etcd. When I first started the investigation I thought that, aside from being odd, this could perhaps have led us to miss or skip an update to the key in the watch.
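For reference, here is a minimal standalone sketch (not the Vitess etcd2topo code) showing where each of these values lives on an etcd clientv3 response. The endpoint and key are placeholders, and it assumes a reachable etcd with an existing key.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	resp, err := cli.Get(context.Background(), "/example/key") // placeholder key
	if err != nil {
		panic(err)
	}
	if len(resp.Kvs) == 0 {
		panic("key not found")
	}
	kv := resp.Kvs[0]

	// Revision: logical clock for the whole etcd keyspace (store-wide).
	fmt.Println("store revision:", resp.Header.Revision)
	// ModRevision: the store revision at which this particular key was last modified.
	fmt.Println("key ModRevision:", kv.ModRevision)
	// Version: how many times this key has been modified since it was created.
	fmt.Println("key Version:", kv.Version)

	// A watch that should not miss updates resumes from a store revision
	// (e.g. the revision after the read), not from the per-key Version.
	wch := cli.Watch(context.Background(), string(kv.Key), clientv3.WithRev(resp.Header.Revision+1))
	for wresp := range wch {
		for _, ev := range wresp.Events {
			fmt.Printf("%s %q -> ModRevision %d\n", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
		}
	}
}
```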

Member:

Good one

var currVersion = initial.Header.Revision
var rev = initial.Header.Revision
Member:

Is there any other actual diff in this file? Everything else looks cosmetic.

go/vt/vtctl/workflow/server.go (review thread resolved)
go/vt/vtctl/workflow/switcher_dry_run.go (review thread resolved)
go/vt/vtgate/buffer/flags.go (review thread resolved)
@rohit-nayak-ps (Contributor) left a comment

Looks good, nice work!

I had just one point where I was not clear about the functionality: where we clear the plan cache.

func (kew *KeyspaceEventWatcher) Subscribe() chan *KeyspaceEvent {
	kew.subsMu.Lock()
	defer kew.subsMu.Unlock()
	c := make(chan *KeyspaceEvent, 2)
	// Use a decent size buffer to:
Contributor:

We can probably indeed process the most recent version, discarding the rest. We can revisit this separately in the context of multi-tenant migrations, for example, where we may be switching several per-tenant workflows per second for smaller tenants.

rootCause := vterrors.RootCause(err)
if rootCause != nil && strings.Contains(rootCause.Error(), "enforce denied tables") {
	log.V(2).Infof("Retry: %d, will retry query %s due to %v", try, query, err)
	lastVSchemaCreated = vs.GetCreated()
	if try == 0 { // We are going to retry at least once
Contributor:

Can you explain why this deferred clearing of plans when the executor.newExecutor() returns helps?

When we retry, we wait for a newer vschema. Shouldn't we clear the plans before executing the query with the new vschema?

@mattlord (Author) commented Apr 19, 2024

> Can you explain why this deferred clearing of plans when the executor.newExecutor() returns helps?

As noted in the description, this is the part that prevented the query shape from failing indefinitely and causing a never-ending cycle of keyspace events when there was a ~ concurrent buffer event.

The executor's VSchemaManager clears the plan cache when it receives a new vschema via its SrvVSchema watcher (it calls executor.SaveVSchema() in its watch's subscriber callback). This happens concurrently with the KeyspaceEventWatcher also receiving the new vschema in its SrvVSchema watcher and in its subscriber callback processing it (which includes getting info on all shards from the topo), and eventually determining that the keyspace is consistent and ending the buffering window. So there was a race with query retries such that a query could be planned against the wrong side just as the keyspace event is getting resolved and the buffers drained. Then that bad plan is the cached plan for the query until you do another topo.RebuildSrvVSchema/vtctldclient RebuildVSchemaGraph which then causes the VSchemaManager to clear the plan cache. It's essentially a race between the two SrvVSchema watchers and the work they do when a new one is received.

> When we retry, we wait for a newer vschema. Shouldn't we clear the plans before executing the query with the new vschema?

As I noted above, the VSchemaManager already clears the plan cache when a new vschema is received. We wait for a newer vschema, but not indefinitely as the wait times out and we retry without it. So if we timeout just as the new vschema is coming in, we could plan with the old vschema, putting that into the just cleared plan cache for all subsequent executions. Do you remember when you, @harshit-gangal, and I were discussing this on Zoom? That's where this idea originated as you and I were both confused by how the query failures continued even after the buffering window ended due to hitting the max duration.

Since we're now making 1s the minimum time between failovers, the likelihood of hitting this should be drastically reduced. But because the consequences are worse than if we had not buffered at all, I think it's better to be extra safe here when doing a retry. We should retry VERY rarely overall, and when we do, let's be sure to prevent this issue from ever happening, since it's fairly catastrophic and it's not at all clear to the user what is happening or how to clear it out (vtctldclient RebuildVSchemaGraph) without restarting the vtgate. That was my thinking. If we DID retry AND the last retry (the 3rd try) still encountered an error, we know that the plan used was 1) probably not valid/correct, since it failed, and 2) will remain the cached plan for the query unless we clear the plans after it was added to the cache.

Make sense?
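To make the pattern concrete, here is a rough, self-contained sketch of the behavior being described. It is not the actual plan_execute.go code: executor, runOnce, maxRetries, and clearPlans are stand-ins, and the "enforce denied tables" check only mirrors the error matching shown in the diff above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

const maxRetries = 3 // stand-in for the vtgate's buffering retry limit

type executor struct{ planCacheEntries int }

func (e *executor) clearPlans() { e.planCacheEntries = 0 }

func (e *executor) runOnce(ctx context.Context, query string) error {
	// Stand-in for planning + execution; pretend the plan cache gained an entry
	// and the query was routed to the frozen (pre-switch) side.
	e.planCacheEntries++
	return errors.New("enforce denied tables")
}

func isDeniedTablesErr(err error) bool {
	return err != nil && strings.Contains(err.Error(), "enforce denied tables")
}

// newExecute mimics the shape of the retry loop: the named return err lets the
// deferred cleanup see the final outcome, so the plan cache is only cleared when
// a buffering-related retry happened and the last attempt still failed.
func (e *executor) newExecute(ctx context.Context, query string) (err error) {
	retried := false
	defer func() {
		if retried && err != nil {
			e.clearPlans()
		}
	}()
	for try := 0; try < maxRetries; try++ {
		if err = e.runOnce(ctx, query); err == nil || !isDeniedTablesErr(err) {
			return err
		}
		retried = true
		// In the real code we would wait (bounded) for a newer vschema here.
	}
	return err
}

func main() {
	e := &executor{}
	fmt.Println(e.newExecute(context.Background(), "select * from t"), e.planCacheEntries)
}
```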

Contributor Author:

I improved the behavior and comments here: 7675cf8

Sorry for not explaining this more clearly ahead of time! 🙏

Contributor:

This issue was clarified in a call: my misunderstanding was that I assumed the err checked in the defer was a closure-local value. However, err is scoped to the whole function (it is the named return value), so we will not clear the cache if the query succeeds. Apologies for the misunderstanding and thanks for the additional comments.

Member:

This is good.

@@ -64,11 +64,11 @@ func (e *Executor) newExecute(
logStats *logstats.LogStats,
execPlan planExec, // used when there is a plan to execute
recResult txResult, // used when it's something simple like begin/commit/rollback/savepoint
) error {
// 1: Prepare before planning and execution
) (err error) {
Contributor:

Any reason why we have error as a named return parameter here? That is not our usual practice.

Contributor Author:

To be sure that we're checking the returned error in the defer.
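A tiny generic Go example of why the named return matters here: the deferred function can only observe the value the function actually returns if that value is a named result.

```go
package main

import (
	"errors"
	"fmt"
)

func withNamedReturn() (err error) {
	defer func() {
		// err here is the function's actual return value, including whatever
		// is assigned on the final return statement.
		fmt.Println("deferred sees:", err)
	}()
	return errors.New("boom")
}

func main() {
	_ = withNamedReturn() // prints "deferred sees: boom"
}
```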

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
if waitForNewerVSchema(ctx, e, lastVSchemaCreated) {
// Without a wait we fail non-deterministically since the previous vschema will not have
// the updated routing rules.
timeout := e.resolver.scatterConn.gateway.buffer.GetConfig().MaxFailoverDuration / MaxBufferingRetries
Member:

Could you explain why this timeout is calculated this way?

@mattlord (Author) commented Apr 23, 2024

We retry 2 times before giving up. How long we wait before each retry — IF we don't see a newer vschema come in — was previously hardcoded at 30s while the max buffer failover duration defaults to 20s. I thought it made more sense to instead base the wait time on how long the buffering window is.

My thinking for the calculation was that this way we should be able to perform the max retries within the given window of time for many queries (certainly not all) and we should not end up waiting too long after the buffer window has ended, retrying old queries.

Before, we could end up waiting 60 seconds across the 2 retries and still be retrying after the buffering window had ended 30+ seconds earlier. That's of course assuming we didn't receive a newer vschema.

Thinking about it again... I think this improves some failure scenarios/conditions but doesn't make the typical happy path much better as there's no point in retrying when we haven't gotten a new vschema UNLESS the traffic switch failed and we backed out and those queries should now succeed with the original targets/routing. Let's discuss tomorrow. I'm happy to change this... and should add a comment either way. 🙂
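For concreteness, a small sketch of the arithmetic being described, using the 20s default --buffer_max_failover_duration mentioned above; the retry counts shown are illustrative rather than taken from the code.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative only: derive the per-retry wait from the buffering window so that
// the retries roughly fit inside it. maxFailoverDuration mirrors the default
// --buffer_max_failover_duration of 20s; maxBufferingRetries is a stand-in for
// the vtgate constant of the same name.
func perRetryTimeout(maxFailoverDuration time.Duration, maxBufferingRetries int) time.Duration {
	return maxFailoverDuration / time.Duration(maxBufferingRetries)
}

func main() {
	fmt.Println(perRetryTimeout(20*time.Second, 2)) // 10s
	fmt.Println(perRetryTimeout(20*time.Second, 3)) // ~6.67s
	// Compare with the previous behavior: a fixed 30s wait per retry, which could
	// leave queries retrying long after the 20s buffering window had ended.
}
```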

Contributor Author:

Noting that we discussed this directly and were both OK with the calculation. I will add a comment, however, to describe the reasoning (which I should have already done).

select {
case c <- th:
default:
}
c <- ev
}
Member:

This change ensures that the broadcast actually happens (events are no longer dropped when a subscriber's channel is full).
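A small generic illustration (not the Vitess watcher itself) of the difference: a non-blocking send with a default branch silently drops an event when the subscriber's buffer happens to be full, while a plain send waits until it can be delivered.

```go
package main

import "fmt"

func main() {
	// A subscriber channel whose small buffer is momentarily full.
	c := make(chan string, 1)
	c <- "event-1"

	// Old pattern: a non-blocking send silently drops the event when the buffer
	// is full, which is how a vtgate could miss a keyspace event entirely.
	select {
	case c <- "event-2":
		fmt.Println("delivered event-2")
	default:
		fmt.Println("dropped event-2")
	}

	// New pattern: drain the subscriber (normally a separate goroutine) and use
	// a plain blocking send, so every event is eventually delivered.
	fmt.Println("subscriber got:", <-c)
	c <- "event-3" // blocks until there is room; never dropped
	fmt.Println("subscriber got:", <-c)
}
```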

@harshit-gangal (Member) left a comment

Changes look good overall; a few clarification questions.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord removed the NeedsWebsiteDocsUpdate label on Apr 25, 2024
@mattlord merged commit ee6b837 into vitessio:main on Apr 25, 2024
104 of 105 checks passed
@mattlord deleted the ks_events_improvements branch on April 25, 2024 16:58
@mattlord added and then removed the Backport to: release-18.0 and Backport to: release-19.0 labels on May 15, 2024
arthurschreiber pushed a commit to github/vitess-gh that referenced this pull request Nov 8, 2024
…fic switching (vitessio#15701)

Signed-off-by: Matt Lord <mattalord@gmail.com>
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Bug Report: VTGate persistent query errors after MoveTables SwitchTraffic
4 participants