Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vtorc/vttablet: can't downgrade from v20 to v19 #16300

Closed
derekperkins opened this issue Jun 28, 2024 · 17 comments
Closed

vtorc/vttablet: can't downgrade from v20 to v19 #16300

derekperkins opened this issue Jun 28, 2024 · 17 comments
Assignees
Labels

Comments

@derekperkins
Copy link
Member

Overview of the Issue

We're seeing vttablet OOM incredibly fast on v20.0.0 for some reason, after running fine for a couple weeks. We attempted to downgrade to v19.0.4 to see if that changed anything, but the primary was unable to start. vtorc attempted to recover UndoDemotePrimary and couldn't ever succeed.

SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0) failed: Unknown system variable 'rpl_semi_sync_master_enabled'

When I reverted that change back to v20.0.0, vtorc was able to successfully run UndoDemotePrimary

Related issues:

Reproduction Steps

This was tested on a single node keyspace with only a single tablet.

Binary Version

v20.0.0 for most components
downgrading vttablet to v19.0.4

Operating System and Environment details

GKE v1.29

Log Fragments

I0628 23:33:10.017377       1 vtorc.go:205] Force discovered - &{Hostname:uscentral1-keywordsu-0-replica-c-0.vttablet Port:3306 InstanceAlias:uscentral1-0288422300 ServerID:1511431603 ServerUUID:e99beb6f-5ea1-11ee-9732-76800cdb5542 Version:8.0.36 VersionComment:Percona Server (GPL), Release '28', Revision '47601f19'$ FlavorName: ReadOnly:true BinlogFormat:ROW BinlogRowImage:FULL LogBinEnabled:true LogReplicationUpdatesEnabled:true SelfBinlogCoordinates:vt-0288422300-bin.033393:317 SourceHost: SourcePort:0 SourceUUID: AncestryUUID:e99beb6f-5ea1-11ee-9732-76800cdb5542 ReplicaNetTimeout:0 HeartbeatInterval:0 ReplicationSQLThreadRuning:false ReplicationIOThreadRuning:false ReplicationSQLThreadState:-1 ReplicationIOThreadState:-1 HasReplicationFilters:false GTIDMode:ON SupportsOracleGTID:true UsingOracleGTID:false UsingMariaDBGTID:false UsingPseudoGTID:false ReadBinlogCoordinates::0 ExecBinlogCoordinates::0 IsDetached:false RelaylogCoordinates::0 LastSQLError: LastIOError: SecondsBehindPrimary:{Int64:0 Valid:false} SQLDelay:0 ExecutedGtidSet:0c4b8ef3-253a-11ec-98d9-3235e408d1e5:1-609569,14e91046-1cdc-11ed-a65f-d60bf275c515:1-1589772,28d7debe-6d30-11ec-9696-826bd9c444b9:1-32830237,4dda5b4e-0ea6-11ed-a83b-c2d1e4afc67b:1-111205389,684e23fe-0eb4-11ed-a604-b6763a422c36:1-3484661,a351f01d-fb7d-11ed-a660-bee2b573f3bf:1-39363505,a5ef9c7b-0eac-11ed-8bbe-46781259a843:1-75497684,a8de8790-6c23-11ed-af0a-62dad3b25097:1-71206353,bd907242-5ea4-11ee-b501-aa7ed7190bc9:1-2490747,cd9e16c9-2539-11ec-9499-2a7ac907fc94:1-282508005,e99beb6f-5ea1-11ee-9732-76800cdb5542:1-1350063514 GtidPurged:0c4b8ef3-253a-11ec-98d9-3235e408d1e5:1-609569,14e91046-1cdc-11ed-a65f-d60bf275c515:1-1589772,28d7debe-6d30-11ec-9696-826bd9c444b9:1-32830237,4dda5b4e-0ea6-11ed-a83b-c2d1e4afc67b:1-111205389,684e23fe-0eb4-11ed-a604-b6763a422c36:1-3484661,a351f01d-fb7d-11ed-a660-bee2b573f3bf:1-39363505,a5ef9c7b-0eac-11ed-8bbe-46781259a843:1-75497684,a8de8790-6c23-11ed-af0a-62dad3b25097:1-71206353,bd907242-5ea4-11ee-b501-aa7ed7190bc9:1-2490747,cd9e16c9-2539-11ec-9499-2a7ac907fc94:1-282508005,e99beb6f-5ea1-11ee-9732-76800cdb5542:1-1345434303 GtidErrant: primaryExecutedGtidSet: ReplicationLagSeconds:{Int64:0 Valid:false} DataCenter:uscentral1 Region: PhysicalEnvironment: ReplicationDepth:0 IsCoPrimary:false HasReplicationCredentials:false SemiSyncEnforced:false SemiSyncPrimaryEnabled:false SemiSyncReplicaEnabled:false SemiSyncPrimaryTimeout:0 SemiSyncPrimaryWaitForReplicaCount:0 SemiSyncPrimaryStatus:false SemiSyncPrimaryClients:0 SemiSyncReplicaStatus:false LastSeenTimestamp: IsLastCheckValid:true IsUpToDate:true IsRecentlyChecked:true SecondsSinceLastSeen:{Int64:0 Valid:false} AllowTLS:false Problems:[] LastDiscoveryLatency:10.705519ms}, err - <nil>
I0628 23:33:10.017476       1 locks.go:455] Unlocking shard keywordsu/0 for action VTOrc Recovery for PrimaryIsReadOnly on uscentral1-0288422300 with error rpc error: code = Unknown desc = TabletManager.UndoDemotePrimary on uscentral1-0288422300: can't set semi-sync mode: ExecuteFetch(SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0) failed: Unknown system variable 'rpl_semi_sync_master_enabled' (errno 1193) (sqlstate HY000) during query: SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0; make sure plugins are loaded in my.cnf
E0628 23:33:10.022063       1 topology_recovery.go:654] rpc error: code = Unknown desc = TabletManager.UndoDemotePrimary on uscentral1-0288422300: can't set semi-sync mode: ExecuteFetch(SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0) failed: Unknown system variable 'rpl_semi_sync_master_enabled' (errno 1193) (sqlstate HY000) during query: SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0; make sure plugins are loaded in my.cnf
@derekperkins derekperkins added Type: Bug Needs Triage This issue needs to be correctly labelled and triaged Component: VTorc Vitess Orchestrator integration Component: VTTablet labels Jun 28, 2024
@deepthi
Copy link
Member

deepthi commented Jul 2, 2024

@derekperkins what is your MySQL version? That is important in this case. The PR where we adopted the new commands is #15907 and the way it's written, we'd expect this to work.
We probably also need to see your cnf file because code thinks the semi-sync plugin has not been loaded.

@derekperkins
Copy link
Member Author

We aren't using semi-sync, with durability-policy=none

Percona Server v8.0.36

# ENGINE SETTINGS #
default_storage_engine          = InnoDB
default-tmp-storage-engine      = InnoDB

# OTHER CONFIG #
default_authentication_plugin   = mysql_native_password
secure_file_priv                = NULL
explicit_defaults_for_timestamp = 1
group_concat_max_len            = 4M
event_scheduler                 = 0
symbolic-links                  = 0
character-set-server            = utf8mb4
collation-server                = utf8mb4_unicode_ci
binlog_format                   = ROW
binlog_row_image                = full
binlog_expire_logs_seconds      = 259200 # 3 days
sync-binlog                     = 0
binlog-transaction-compression  = ON
log-error-suppression-list      = MY-013360
slow_query_log                  = OFF

# PERCONA
binlog_space_limit              = 10G
userstat                        = OFF
long_query_time                 = 0
log_slow_rate_type              = query
log_slow_verbosity              = full
log_slow_rate_limit             = 100
max_slowlog_size                = 1G
slow_query_log_always_write_time  = 2
slow_query_log_use_global_control = all

# CACHES AND LIMITS #
tmp-table-size                  = 32M
max-heap-table-size             = 32M
max-connections                 = 2500
thread-cache-size               = 50
open-files-limit                = 65535
table-definition-cache          = 4096
table-open-cache                = 4096

# INNODB #
innodb-flush-method             = O_DIRECT
innodb-log-files-in-group       = 2
innodb-log-file-size            = 4G
innodb-flush-log-at-trx-commit  = 2
innodb_buffer_pool_instances    = 10
innodb_buffer_pool_chunk_size   = 1G
innodb-buffer-pool-size         = 10G
innodb_lock_wait_timeout        = 300
innodb_io_capacity              = 2000
innodb_io_capacity_max          = 4000
innodb_lru_scan_depth           = 2000
innodb_flush_neighbors          = 0
innodb_read_io_threads          = 16
innodb_write_io_threads         = 16
innodb_purge_threads            = 4

@mattlord mattlord removed the Needs Triage This issue needs to be correctly labelled and triaged label Jul 2, 2024
@deepthi
Copy link
Member

deepthi commented Jul 2, 2024

OK. attempting to set semi-sync properties with a durability policy of "none" is clearly a bug. We'll need to fix this, and backport to release-20.0

@derekperkins
Copy link
Member Author

v19 ran fine with these settings for months, and v20 runs fine now, I just wasn't able to downgrade to v19

@derekperkins
Copy link
Member Author

And FWIW, I don't need to downgrade anymore, so this isn't urgent from my perspective.

@GuptaManan100
Copy link
Member

Hello @derekperkins!
This problem seems to be occurring because of #15791. I don't see anything else that changed between v20 and v19.

In order to debug this properly could you tell me the outputs of running the following 2 queries in your MySQL -

SHOW VARIABLES LIKE 'rpl_semi_sync_%_enabled'
SELECT COUNT(*) > 0 AS plugin_loaded FROM information_schema.plugins WHERE plugin_name LIKE 'rpl_semi_sync%'

@deepthi
Copy link
Member

deepthi commented Jul 3, 2024

this isn't urgent from my perspective.

We are treating this as semi-urgent because it breaks upgrade/downgrade for the "none" DurabilityPolicy

@derekperkins
Copy link
Member Author

+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| rpl_semi_sync_replica_enabled | OFF   |
| rpl_semi_sync_source_enabled  | OFF   |
+-------------------------------+-------+

+---------------+
| plugin_loaded |
+---------------+
| 1             |
+---------------+

@GuptaManan100
Copy link
Member

Okay, this is very interesting. From the outputs that you show, it looks like the plugin for semi-sync is loaded.
So, the code in v19, checks that the plugin is loaded, and if it is, then it sets the semi-sync settings with SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0
In v20, we added the code to change the semi-sync command based on which plugin is loaded but that doesn't exist in v19.
Did you by any chance upgrade your MySQL version too, because according to the docs -

From MySQL 8.0.26, new versions of the source and replica plugins are supplied, which replace the terms “master” and “slave” with “source” and “replica” in system variables and status variables. 
If you install the new rpl_semi_sync_source and rpl_semi_sync_replica plugins, the new system variables and status variables are available but the old ones are not. 
If you install the old rpl_semi_sync_master and rpl_semi_sync_slave plugins, the old system variables and status variables are available but the new ones are not. 
You cannot have both the new and the old version of the relevant plugin installed on an instance.

After upgrading to v20, if you upgrade your MySQL version to 8.0.26 or above, then I guess it is not possible to downgrade back to v19.

@derekperkins
Copy link
Member Author

derekperkins commented Jul 4, 2024

We didn't change MySQL versions at any point during this upgrade. We upgraded from v8.0.34 to v8.0.36 while on v19. We've been >= 8.0.26 since Nov 2021 and Vitess v11

@GuptaManan100
Copy link
Member

Did you by any chance change the plugin that was being loaded?

plugin-load = rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so

VS

plugin-load = rpl_semi_sync_source=semisync_source.so;rpl_semi_sync_replica=semisync_replica.so

@GuptaManan100
Copy link
Member

Oh, maybe you didn't do anything explicitly and it just happened implicitly because mysqlctl after the changes in #15791 would start using the new mycnf file for versions over 8.0.26 that loads the new plugin. Let me look into when we use a my.cnf file. Maybe there is a flow that causes Vitess to start using the new my.cnf file which loads the new plugin, but when you downgrade, then the old my.cnf file doesn't take effect 🤷‍♂️. I'll look into when we re-initalize my.cnf files and maybe that will tell us more.

For an immediate workaround though, for anyone facing this issue. To downgrade, just load the old plugins instead of the new ones in my.cnf and restart mysql so that they take effect.

@GuptaManan100
Copy link
Member

GuptaManan100 commented Jul 5, 2024

I looked into this further today, and here is what I found. We reinitialize the my.cnf file in the following cases -

  1. init, init-config, reinit_config etc commands of mysqlctl.
  2. Whenever mysqlctld is started, it either creates the new my.cnf file or it updates the old one incase the generated one is different
  3. Whenever a restore happens, and we remove all the old files and reinitialize mysql that causes a new my.cnf to be generated.

So, what I believe happened is as follows -

  1. You were running all the vitess components at v19 and the my.cnf was loading the old plugin values and everything was working as intended.
  2. Then you upgraded everything to v20, (including mysqlctl I'm assuming) and that would have caused a change to the my.cnf file to start using the new plugin values. Since v20 vttablet supports both the plugin variants, everything was still working fine.
  3. Some issue caused you to downgrade vttablet to v19 (but not mysqlctl, I'm assuming), and when this happened, your my.cnf was still using the new plugin, and v19 of vttablet doesn't support it!

@derekperkins Could you let me know if ☝️ if this is the correct sequence of operations?

If it is, then the fix is to also downgrade mysqlctl which would change the my.cnf to use old plugins again. We'll have to backport some part of #15791 to v19 and do a patch or we'll have to do some changes in v20 so that it doesn't use the new plugins immediately and reintroduce the change in v21.

@GuptaManan100
Copy link
Member

I was able to reproduce the problem and have a fix for it - #16357

@derekperkins
Copy link
Member Author

@GuptaManan100 sorry for the late reply, but yes, your assumptions were exactly correct

@GuptaManan100
Copy link
Member

Awesome! Thank you for confirming! We have a fix ready and should be able to merge it soon 🚀

@GuptaManan100
Copy link
Member

#16357 has been merged. Once we do a patch release, this won't be a problem. Until then the workaround is to downgrade both mysqlctl and vttablet if downgrading from v20.0.0 to v19.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants