Crash during 3.13 -> 4.0 upgrade when there is an enabled feature flag of a disabled plugin #12963
Replies: 7 comments 19 replies
-
One difference between 3.13.7 with the mqtt plugin and 4.0.x with the management plugin is that on 3.13.7, with the mqtt plugin disabled, I got the following error printout:
And out of the 3 feature flags defined by the mqtt plugin (…
-
Such exotic scenarios where not all nodes have all the plugins enabled, or where plugins are being disabled and re-enabled during an upgrade… are they really a good use of our time? I highly doubt any feature flag- or upgrade-related integration suites enable plugins "mid-upgrade". Feature flags are considered and documented as cluster-wide state. This means that the enabled set of plugins must be stable during an upgrade, or else the state is mutated by something the feature flag controller has no knowledge of, or control over.
-
@gomoripeti I have a fundamental problem with our small team's resources being spent on highly exotic scenarios like this. Disabling and re-enabling plugins during a version upgrade is not a good idea. During a routine restart where the list of plugins changes, feature flags should play no role if the version is stable. In a hypothetical scenario where one of the enabled plugins needs to automatically enable feature flags, it will not be able to do that until all nodes are compatible. We can spend a few person-weeks on this, or we can spend them on improvements such as Ra 2.16, which delivers double-digit efficiency improvements on workflows that affect most QQ users.
-
Enabled plugins are a node's state. Feature flags are cluster-wide state. It would likely raise a lot of eyebrows if a plugin was listed as "disabled" or unavailable due to feature flag state.
This would imply that when a plugin is disabled, so should be its feature flags, and that a feature flag cannot be disabled once enabled, or at least that it won't be possible while other nodes have the plugin and the FF enabled. Instead of trying to change RabbitMQ, reduce the number of changes that are allowed during a version upgrade in your system. A version upgrade is not the time when plugins should be enabled or disabled; they can be enabled or disabled once all nodes are on the same version, in which case you won't run into any surprises around feature flags.
-
I echo Michael's thoughts. I think this scenario is quite unusual, perhaps even artificial. That being said, I think we should print a friendly error, as we did in 3.13.x.
-
This scenario is already supported. This is clearly a bug: the feature flag controller isn't supposed to run a callback on a node that doesn't know the feature flag. The list of nodes is already filtered, so there is an issue with either the filtering or the input of that filtering.
This is expected because a node can have a plugin enabled while other nodes in the cluster don't have it enabled or don't have it at all. When that plugin is enabled elsewhere, it is supposed to pick up the feature flag state from the rest of the cluster. The sync is the same as when a node joins a cluster or is restarted.
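To make that concrete, here is a minimal sketch of the kind of filtering being described, under the assumption that a remote node's known flags can be fetched with rabbit_feature_flags:list/1; the module and helper names are hypothetical, and this is not the actual RabbitMQ controller code:

```erlang
%% Hypothetical sketch: only run a feature flag's enable callback on nodes
%% that actually know the flag. Not the real controller implementation.
-module(ff_filter_sketch).
-export([nodes_to_run_callback_on/2]).

%% Keep only the nodes whose local feature flag registry contains FlagName;
%% the enable callback should never be invoked on the remaining nodes.
nodes_to_run_callback_on(FlagName, Nodes) ->
    [Node || Node <- Nodes, knows_flag(Node, FlagName)].

%% Ask a remote node for its feature flag map and check for FlagName.
%% rabbit_feature_flags:list/1 returning a map keyed by flag name is an
%% assumption and may differ between versions.
knows_flag(Node, FlagName) ->
    case rpc:call(Node, rabbit_feature_flags, list, [all]) of
        Flags when is_map(Flags) -> maps:is_key(FlagName, Flags);
        _Error                   -> false
    end.
```

In terms of this sketch, the crash in this report would point at either the filtering step or the node/flag information fed into it not accounting for a flag whose defining plugin has since been disabled.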
-
Sorry for the misleading and bad repro steps using 4.0.x and the management plugin. I tried to come up with an example on a recent version but failed. I'd like to clarify that there is no change happening during the upgrade. The real-world issue we faced is the following (which I also described and attached the logs for).
...while typing this, the core team already provided a potential patch :) Thank you very much, I am giving it a try.
-
Describe the bug
When a plugin is enabled, its feature flags are discovered; however, when the same plugin is disabled, its feature flags remain discovered. It is even possible to enable such a feature flag, leading to various problems, crashes, and nodes refusing to start.
We originally faced this issue with the MQTT plugin and the rabbit_mqtt_qos0_queue feature flag. On a new 3.13.7 cluster the mqtt plugin was enabled, then disabled. Then all feature flags were enabled, which also enabled rabbit_mqtt_qos0_queue. Finally, during a rolling upgrade to 4.0.4, the crash below was seen, preventing the first node from starting (boot crash log).
But similar issues can be reproduced on a new 4.0.4 cluster with the rabbitmq_management plugin and the detailed_queues_endpoint feature flag; see the reproduction steps. It is possible that feature flag handling changed between 3.13.7 and 4.0, so that this crash, which happened during the upgrade, could not happen for later upgrades (e.g. 4.0.x -> 4.1.x).
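For context, a plugin declares its feature flags as module attributes in its own code, roughly as in the sketch below (the module name, description, and map fields here are illustrative and vary by plugin and RabbitMQ version). Since the declaration ships with the plugin's modules, the flag only becomes known once the plugin is enabled, which appears consistent with the behaviour described above, where disabling the plugin does not make an already-discovered flag go away.

```erlang
%% Illustrative sketch of a plugin-owned feature flag declaration; the
%% real declaration for detailed_queues_endpoint lives in the management
%% plugin's own modules and may carry additional fields.
-module(my_plugin_ff_sketch).

-rabbit_feature_flag(
   {detailed_queues_endpoint,
    #{desc      => "Example of a feature flag defined by a plugin",
      stability => stable
     }}).
```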
Reproduction steps
1. Start a local 3-node 4.0.x cluster from the git repo.
2. Plugins are not enabled and the detailed_queues_endpoint feature flag is not available.
3. Enable the management plugin.
4. The feature flag is listed on node-1 as disabled. This is already unexpected; I would expect it to be "unavailable", since it is not available on all nodes.
5. Disable the management plugin.
6. The feature flag is still listed on node-1 (a quick check is sketched after these steps).
7. Trying to enable the plugin on node-2 leads to a crash and node-2 goes down.
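One way to observe step 6, assuming rabbit_feature_flags:list/1 is still exported under that name and returns a map keyed by flag name (this may differ between versions), is to evaluate the following expression on node-1, e.g. via rabbitmqctl eval or a remote shell:

```erlang
%% Run on node-1 after the management plugin has been disabled again.
%% rabbit_feature_flags:list/1 returning a map keyed by flag name is an
%% assumption here and may vary between RabbitMQ versions.
maps:is_key(detailed_queues_endpoint, rabbit_feature_flags:list(all)).
%% Per this report, this still returns true even though the plugin that
%% defines the flag is no longer enabled anywhere.
```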
Expected behavior
Additional context
No response