-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VTOrc checks and fixes replication misconfiguration issues #15881
Conversation
…rtbeat interval Signed-off-by: Manan Gupta <manan@planetscale.com>
…ormation for the tablet Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
if analysisEntry.Analysis == inst.ReplicaMisconfigured { | ||
heartBeatInterval = float64(analysisEntry.ReplicaNetTimeout) / 2 | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could use some advice/opinions on this front. I have so far chosen to only specify heartBeatInterval
in the express case where VTOrc has registered a misconfiguration.
The advantage of this approach is that if the replica only has its replication stopped, then VTOrc won't make it run a CHANGE REPLICATION SOURCE
command.
The drawback is that if the replica has no replication source configured and also heartbeat interval misconfigured, then VTOrc will first register NotConnectedToPrimary
and it won't change the heartbeat settings. Only after the replication source has been fixed, will it register ReplicaMisconfigured
. So the whole process of the fix will take 2 fixes.
Another disadvantage is that so far fixReplica
has been fixing all the problems on the replica by configuring it again without caring for what was the initial problem that triggered this fix. We will be changing this assumption, as the analysis will also impact how the fix runs. Moreover, this adds another layer of complexity in replication analysis detection wherein VTOrc must register misconfiguration issues before registering replication not running issues. Because if the replication is not running because of misconfiguration, we want to fix the misconfiguration and not just restart replication. So we must register the misconfiguration problem before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative to this approach is to set the heartbeat interval value always. This would mean that whenever we try to fix the replica, we run the CHANGE SOURCE REPLICATION
command always.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another approach that we could take is to add a method to the analysis
like HeartbeatMisconfigured
which returns a boolean if the heartbeat settings are incorrect, and we use this method in detecting the misconfiguration and also in the fixReplica phase to only set the heartbeat value if it is misconfigured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest taking the simpler code path, even if that means that the process fix takes 2 steps. That's immaterial IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, we have decided to take the simpler, least risk approach of just fixing everything on the replica when we notice anything is broken
Signed-off-by: Manan Gupta <manan@planetscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with just a bit of concern comapring float64
values.
|
||
// check that writes succeed | ||
utils.VerifyWritesSucceed(t, clusterInfo, curPrimary, []*cluster.Vttablet{replica, otherReplica}, 15*time.Second) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -55,6 +55,8 @@ CREATE TABLE database_instance ( | |||
binary_log_pos bigint NOT NULL, | |||
source_host varchar(128) NOT NULL, | |||
source_port smallint NOT NULL, | |||
replica_net_timeout int NOT NULL, | |||
heartbeat_interval decimal(11,4) NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Although: if we use decimal
, then we should not be doing parseFloat
nor using float64
in the rest of the code.
@@ -468,6 +473,10 @@ func GetReplicationAnalysis(keyspace string, shard string, hints *ReplicationAna | |||
a.Analysis = NotConnectedToPrimary | |||
a.Description = "Not connected to the primary" | |||
// | |||
} else if topo.IsReplicaType(a.TabletType) && !a.IsPrimary && math.Round(a.HeartbeatInterval*2) != float64(a.ReplicaNetTimeout) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the db value is decimal
, and we read and compare it with a float64(...)
then we lose the decimal
accuracy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the heartbeat interval is configured correctly then it should either be an integer or have .5
as its fractional part because it is half of an integral value. Because we lose some precision, we're using math.Round to round it to the nearest integer after multiplying by 2.
@@ -495,6 +500,8 @@ func readInstanceRow(m sqlutils.RowMap) *Instance { | |||
instance.LogReplicationUpdatesEnabled = m.GetBool("log_replica_updates") | |||
instance.SourceHost = m.GetString("source_host") | |||
instance.SourcePort = m.GetInt("source_port") | |||
instance.ReplicaNetTimeout = m.GetInt32("replica_net_timeout") | |||
instance.HeartbeatInterval = m.GetFloat64("heartbeat_interval") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
if analysisEntry.Analysis == inst.ReplicaMisconfigured { | ||
heartBeatInterval = float64(analysisEntry.ReplicaNetTimeout) / 2 | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest taking the simpler code path, even if that means that the process fix takes 2 steps. That's immaterial IMO.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15881 +/- ##
==========================================
+ Coverage 68.43% 68.46% +0.02%
==========================================
Files 1559 1559
Lines 196516 196753 +237
==========================================
+ Hits 134479 134699 +220
- Misses 62037 62054 +17 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Description
This PR adds the feature request for #15880.
If the user changes
replic_net_timeout
(https://dev.mysql.com/doc/refman/8.0/en/replication-options-replica.html#sysvar_replica_net_timeout) settings, then VTOrc should detect the misconfiguration of heartbeat intervals, and should set it correctly in https://dev.mysql.com/doc/refman/8.0/en/change-replication-source-to.html#crs-opt-source_heartbeat_period.This PR adds this capability. As part of
FullStatus
RPC, we now also read the heartbeat interval, and the replica net timeout values. These are used by VTOrc to figure out if there is a misconfiguration. If there is, then VTOrc goes ahead and fixes the problem.VTOrc will only be able to fix the problems properly after both VTOrc and vttablet have been upgraded.
If only VTOrc is upgraded (and not VTTablets), then VTOrc will register the misconfiguration issue, but won't be able to fix it, since the vttablet won't have the RPC changes to accept the heartbeat interval to set.
If only VTTablet is upgraded, old VTOrc won't even register the issue.
Related Issue(s)
Checklist
Deployment Notes