Retry VStream errors with exponential backoff and switch tablets for repeated errors #16536
Conversation
… if the current one fails
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
go/vt/vtgate/vstream_manager.go
Outdated
if ignoreTablet {
	ignoreTablets = append(ignoreTablets, tablet.GetAlias())
}
ignoreTablets = append(ignoreTablets, tablet.GetAlias())
I don't think that we should do this, at least not behind a flag. It would mean paying a heavy network cost in most cases where the error was an ephemeral / recoverable one. I think we should instead try to identify exactly which error message(s) we expect in the new case(s) that we should consider unrecoverable.
You also have to keep in mind here that you could potentially quickly ignore all possible tablets unnecessarily.
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Add a function for determining whether we should fail immediately and return the error. There is one case in our tests that needs this: a journal event with partial participants. Although this looks similar to the previous implementation, it is fundamentally more robust. This approach uses an include-list only for immediate-failure errors; all other cases are handled with a generic strategy of exponential backoff and, for persistent errors, switching tablets. The previous implementation required include-lists for (1) retry on the same tablet and (2) retry on a different tablet. We remove the need to maintain those two include-lists, which have no guarantee of staying constant. We only need to maintain the immediate-failure list, and it is up to us to decide what we want to fail loudly on.
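As a rough illustration of that include-list approach, here is a minimal sketch; the helper name and the specific error codes are assumptions for illustration, not the PR's actual list:

```go
package vstreamsketch // illustrative only

import (
	vtrpcpb "vitess.io/vitess/go/vt/proto/vtrpc"
	"vitess.io/vitess/go/vt/vterrors"
)

// isUnrecoverableVStreamError is a hypothetical helper implementing the
// include-list idea: only errors we explicitly list fail immediately;
// everything else falls through to retry with backoff and, for persistent
// errors, a tablet switch. The codes below are illustrative assumptions.
func isUnrecoverableVStreamError(err error) bool {
	switch vterrors.Code(err) {
	case vtrpcpb.Code_FAILED_PRECONDITION, vtrpcpb.Code_INVALID_ARGUMENT:
		return true
	default:
		return false
	}
}
```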
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #16536      +/-   ##
==========================================
+ Coverage   68.74%   68.77%   +0.03%
==========================================
  Files        1556     1556
  Lines      199705   199721      +16
==========================================
+ Hits       137292   137364      +72
+ Misses      62413    62357      -56

☔ View full report in Codecov by Sentry.
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Remove the need for any error include-lists. Raise whatever the last error was when we run out of tablets.
I am still working on fixing some end-to-end tests, but the bulk of the changes will stay the same, so I want to get some early feedback.
Thanks, @twthorn!
Can you help me understand how this will improve a specific scenario on main? I'm not sure that exponential backoff fits here: we should already be choosing a "random" tablet on main after #12282, and I'm not sure it makes sense to back off when the frequency of attempts is already quite low, when retries are not necessarily made against the same endpoint, and when we often want to select a tablet ASAP.
It would be much easier to think this through if we had a clear use case that is problematic on main today, one that some kind of test (manual or otherwise) clearly demonstrates, and that we can then use to demonstrate, test, and confirm the improved behavior in the PR branch.
If we do keep the backoff strategy, is there any reason not to make it and the related config part of the tabletPickerOptions? Ideally we could keep the existing behavior by default and allow callers to alter it at their choosing. The tablet picker is used by so many things that we should be extra safe/careful with it; it's very easy for unexpected edge cases to show up and cause issues. Opting into new behavior, for those that want different behavior, would be ideal.
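A minimal sketch of what that opt-in could look like, assuming a hypothetical options struct; the field names and defaults below are illustrative, not the tablet picker's actual API:

```go
package vstreamsketch // illustrative only

import "time"

// tabletPickerOptions sketch: backoff is off by default, so existing callers
// keep today's behavior; callers that want backoff opt in explicitly.
type tabletPickerOptions struct {
	RetryBackoffEnabled bool          // opt-in flag; false keeps current behavior
	RetryBackoffBase    time.Duration // initial delay between pick attempts
	RetryBackoffMax     time.Duration // cap so a stuck stream still retries regularly
}

func defaultTabletPickerOptions() tabletPickerOptions {
	return tabletPickerOptions{
		RetryBackoffEnabled: false,
		RetryBackoffBase:    100 * time.Millisecond, // assumed value
		RetryBackoffMax:     30 * time.Second,       // assumed value
	}
}
```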
That seems nice. I had never even noticed it. I feel like it should probably be in go/vt/grpcbackoff instead though (with grpcbackoff as the package name too), since it isn't really error related and it is gRPC related: it's really wrapping/extending google.golang.org/grpc/backoff.
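For context, google.golang.org/grpc/backoff exposes a small Config struct that such a wrapper would mostly be re-exporting or extending. This is a sketch; the specific values (other than gRPC's documented defaults for Multiplier and Jitter) are assumptions:

```go
package grpcbackoff // hypothetical package name, per the suggestion above

import (
	"time"

	"google.golang.org/grpc/backoff"
)

// DefaultVStreamBackoff is an illustrative config: exponential growth with
// jitter, capped so a failing stream still retries at a reasonable cadence.
var DefaultVStreamBackoff = backoff.Config{
	BaseDelay:  100 * time.Millisecond, // first retry delay (assumed)
	Multiplier: 1.6,                    // gRPC's default growth factor
	Jitter:     0.2,                    // gRPC's default jitter fraction
	MaxDelay:   30 * time.Second,       // cap between retries (assumed)
}
```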
vs.lastError.Record(err)
prevErr = err

if vs.lastError.ShouldRetry() {
This should be done in the vstream.shouldRetry function so that the last-error consideration is taken into account along with the other considerations. I think this would likely also address the test failures.
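A hedged sketch of that suggestion, folding the last-error check into the same function that inspects the error; the signature and helper names are assumptions (isUnrecoverableVStreamError is the hypothetical helper sketched earlier):

```go
// shouldRetry sketch: decide retries in one place, combining the nature of the
// error with the repeated-error window tracked by lastError. Illustrative only.
func (vs *vstream) shouldRetry(err error) bool {
	// Fail immediately on errors we have explicitly decided not to retry.
	if isUnrecoverableVStreamError(err) {
		return false
	}
	// Otherwise retry only while the repeated-error window still allows it.
	return vs.lastError.ShouldRetry()
}
```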
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed.
This PR was closed because it has been stale for 7 days with no activity.
Description
Improve the retry error logic such that
For retries, we follow the best practice of exponential backoff with jitter. Rather than reimplementing what already exists in vtadmin, we move that package so it is available to other packages. We also use the last error to track whether this is the same error repeating.
Note: an edge case here is that if a tablet keeps erroring out but the error keeps changing value, we would never retry another tablet (since each new error does not equal the previous one, the timer keeps resetting). Typically we see the same error persisting, but this isn't guaranteed. To avoid this edge case we could instead simply use exponential backoff and switch tablets regardless of whether the error is repeating. I am not sure if that's desired, or if we only want to switch off the local tablet when it's the same error.
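To make the edge case concrete, here is a minimal sketch of a repeated-error tracker in the spirit of the lastError calls shown in the diff above; the type, fields, and behavior are illustrative assumptions, not the actual implementation:

```go
package vstreamsketch // illustrative only

import "time"

// lastErrorTracker illustrates the edge case: the retry window only advances
// while the same error repeats. If the error text keeps changing, firstSeen
// keeps resetting, ShouldRetry never returns false, and we never switch tablets.
type lastErrorTracker struct {
	err            error
	firstSeen      time.Time
	maxTimeInError time.Duration // how long we tolerate the same repeating error
}

// Record notes the latest error; a different error resets the window.
func (t *lastErrorTracker) Record(err error) {
	if t.err == nil || t.err.Error() != err.Error() {
		t.err = err
		t.firstSeen = time.Now()
		return
	}
	t.err = err
}

// ShouldRetry reports whether we are still inside the tolerated window.
func (t *lastErrorTracker) ShouldRetry() bool {
	return time.Since(t.firstSeen) < t.maxTimeInError
}
```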
Related Issue(s)
Checklist
Deployment Notes