Fix to prevent stopping buffering prematurely #17013
Conversation
Codecov Report

Attention: Patch coverage is

@@            Coverage Diff             @@
##              main   #17013      +/-   ##
============================================
- Coverage    67.32%   67.31%   -0.01%
============================================
  Files         1569     1569
  Lines       252548   252627      +79
============================================
+ Hits        170023   170067      +44
- Misses       82525    82560      +35
It might be easier to reproduce this via an "external" failover, as it's simpler to get the sequencing right:
Thanks Arthur! I was able to reproduce the problem now!
I added backport labels. This is a bug, and it is quite unexpected and confusing when run into.
❤️
@GuptaManan100 Is there anything outstanding before this can land? I can try and see if I can give these changes a go in our environments to see if the issues we've run into before are all fixed.
@arthurschreiber It's complete from my side, just waiting for Deepthi to review it. It would be great if you can test it out though!
@GuptaManan100 Just wanted to let you know that we've been running a backport of this (plus a few of the other buffering-related bugfixes) to v18 for the past few days, and have not encountered any issues whatsoever. 🎉
Perfect! Thank you!
Description
I'll summarise the issue described in #16438. Please see the comments on the issue for more details.
The problem noticed was that in the new keyspace-events implementation of buffering, when two shards were being externally reparented around the same time, the buffering implementation would stop buffering for both shards as soon as the first shard completed its reparent. This fails the writes on the second shard, which is still in the midst of reparenting.
I was able to find the underlying problem. Basically, the order of steps is something like this:
The problem was that when we started buffering, we weren't coordinating with the keyspace event watcher. If it doesn't receive a health check from the primary tablet saying it is not serving, it doesn't even know that the shard is in the midst of a reparent.
After realising this, I tried out a change wherein we mark the shard as non-serving in the keyspace event watcher's shard state when we start buffering. This solution, however, didn't work: it created a new issue wherein we would stop buffering even when the primary tablet being demoted sends a serving health check. This is possible during the grace period of PRS, and also during external reparents.
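For illustration, here is a minimal sketch of that first attempt, assuming a small per-shard state inside the keyspace event watcher. All names here (shardState, startBuffering, onPrimaryHealthCheck) are hypothetical stand-ins, not the actual Vitess identifiers.

```go
package buffer

// Hypothetical sketch of the rejected first attempt. The keyspace event
// watcher keeps a small per-shard state, and the buffer flips it to
// non-serving as soon as buffering starts.
type shardState struct {
	serving bool
}

// startBuffering is what the buffer would call when buffering starts for
// the shard.
func (ss *shardState) startBuffering() {
	ss.serving = false
}

// onPrimaryHealthCheck is called for every health check received from the
// shard's primary tablet. The flaw: the primary that is being demoted can
// still send a serving health check (e.g. during the PRS grace period or an
// external reparent), which flips the shard back to serving and stops
// buffering prematurely.
func (ss *shardState) onPrimaryHealthCheck(serving bool) {
	ss.serving = serving
}
```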
To mitigate this problem, I've implemented a different solution in this PR. It is somewhat similar to how the previous healthcheck implementation used to handle stopping of buffering. The proposed solution takes the previous change a bit further: when we start buffering after receiving an error that tells us a reparent has started, we ask the keyspace event watcher to only mark the shard serving again after it has seen a serving primary with a higher timestamp than the one it knows of.
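To make the mechanism concrete, here is a revised version of the hypothetical sketch above. The names (waitForReparent, lastPrimaryTimestamp, onServingPrimaryHealthCheck) are illustrative stand-ins, and treating the "timestamp" as a time.Time value is an assumption on my part.

```go
package buffer

import "time"

// Hypothetical sketch: besides the serving flag, the per-shard state now
// remembers whether we are waiting for a reparent to finish and the
// timestamp of the last primary we saw serving.
type shardState struct {
	serving bool
	// waitForReparent is set when buffering starts because of a
	// reparent-in-progress error.
	waitForReparent bool
	// lastPrimaryTimestamp is the timestamp of the last primary we saw
	// as serving.
	lastPrimaryTimestamp time.Time
}

// startBuffering marks the shard as non-serving and records that we must
// see a newer primary before marking it serving again.
func (ss *shardState) startBuffering() {
	ss.serving = false
	ss.waitForReparent = true
}

// onServingPrimaryHealthCheck is called when a serving health check arrives
// from a primary tablet of this shard.
func (ss *shardState) onServingPrimaryHealthCheck(primaryTimestamp time.Time) {
	// While waitForReparent is set, a serving health check from the old
	// primary (same or older timestamp) must not stop buffering.
	if ss.waitForReparent && !primaryTimestamp.After(ss.lastPrimaryTimestamp) {
		return
	}
	ss.waitForReparent = false
	ss.serving = true
	ss.lastPrimaryTimestamp = primaryTimestamp
	// Marking the shard serving again is what lets buffering stop.
}
```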
This change works well for all the cases we want it to, but it has one downside. If a PRS call fails midway and we end up reverting the primary demotion, we won't be able to stop buffering because we will never see a new primary. This drawback existed in the previous healthcheck implementation of buffering as well. However, I have tried to mitigate it somewhat: in our implementation, if we receive a non-serving health check from a primary tablet after buffering has started, we drop the restriction of needing to see a new primary with a higher timestamp. We only use waitForReparent to ensure that we don't prematurely stop buffering when we receive a serving health check from the primary that is being demoted; if we receive a non-serving check, we know that we won't receive any more serving health checks until the reparent finishes. Specifically, this helps when PRS fails but stops gracefully because the new candidate couldn't catch up in time. In that case we promote the previous primary back, and this too will stop buffering, since we are no longer explicitly waiting for a new primary.
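Continuing the same hypothetical sketch, the mitigation could look like the following: a non-serving health check from the primary clears the extra restriction, so any later serving primary health check (including one from the old primary being promoted back after a failed PRS) can stop buffering.

```go
// Continuation of the hypothetical shardState sketch above. A non-serving
// health check from the primary means no more serving health checks will
// arrive until the reparent (or its rollback) finishes, so the "must see a
// newer primary timestamp" restriction can be dropped.
func (ss *shardState) onNotServingPrimaryHealthCheck() {
	ss.waitForReparent = false
	ss.serving = false
}
```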
Related Issue(s)
Checklist
Deployment Notes