PXC-4500: When innodb_thread_concurrency is set, the cluster can get stuck during SST #1963

kamil-holubicki · 2024-10-09T13:51:44Z

https://perconadev.atlassian.net/browse/PXC-4500

Problem 1:
When innodb_thread_concurrency is set and the node has a heavy user workload, the node can become stuck when it receives a CC (configuration change) event from Galera.

Cause:
Let's say innodb_thread_concurrency = 2.

1. We have two user threads that have been granted access to InnoDB. They will release the InnoDB access lock after Galera certification, but to certify they first need to acquire LocalMonitor.
2. At the same time, a CC happens. The applier thread acquires LocalMonitor and notifies the application about the new view. The application tries to store the view in the wsrep schema. Before doing that, it turns off the thread's wsrep_on variable, so the thread is no longer marked as a wsrep-enabled thread. Then it tries to enter InnoDB. Because two user threads are already in InnoDB and our thread is not treated as a wsrep thread, it has to wait.

The above results in a deadlock:

1. The user threads hold the InnoDB lock and wait for LocalMonitor.
2. The applier thread holds LocalMonitor and waits for the InnoDB lock.
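
The circular wait above can be checked mechanically with a wait-for graph. The following is only an illustrative sketch, not code from this PR: the thread names and the helpers `has_deadlock()`, `problem1_graph()` and `problem1_graph_fixed()` are hypothetical, and the graph simply encodes the two statements above for innodb_thread_concurrency = 2.

```cpp
#include <map>
#include <set>
#include <string>

// Wait-for graph: an edge A -> B means "A waits for a resource that B holds".
using WaitForGraph = std::map<std::string, std::set<std::string>>;

// Depth-first search: returns true if a cycle is reachable from `node`.
static bool reaches_cycle(const WaitForGraph &graph, const std::string &node,
                          std::set<std::string> &on_path,
                          std::set<std::string> &done) {
  if (on_path.count(node)) return true;  // back edge: node is on current path
  if (done.count(node)) return false;    // already explored, no cycle there
  on_path.insert(node);
  auto it = graph.find(node);
  if (it != graph.end()) {
    for (const auto &next : it->second) {
      if (reaches_cycle(graph, next, on_path, done)) return true;
    }
  }
  on_path.erase(node);
  done.insert(node);
  return false;
}

// A deadlock exists exactly when the wait-for graph contains a cycle.
bool has_deadlock(const WaitForGraph &graph) {
  std::set<std::string> on_path, done;
  for (const auto &entry : graph) {
    if (reaches_cycle(graph, entry.first, on_path, done)) return true;
  }
  return false;
}

// Problem 1 (hypothetical thread names).
WaitForGraph problem1_graph() {
  return {
      // User threads hold the InnoDB slots and wait for LocalMonitor,
      // which the applier holds.
      {"user_thread_1", {"applier"}},
      {"user_thread_2", {"applier"}},
      // The applier holds LocalMonitor and waits for an InnoDB slot,
      // both of which are held by the user threads.
      {"applier", {"user_thread_1", "user_thread_2"}},
  };
}

// Same situation once the applier is always admitted to InnoDB:
// it no longer waits on the user threads, so the cycle disappears.
WaitForGraph problem1_graph_fixed() {
  return {
      {"user_thread_1", {"applier"}},
      {"user_thread_2", {"applier"}},
      {"applier", {}},
  };
}
```

Calling `has_deadlock(problem1_graph())` reports a cycle, while `has_deadlock(problem1_graph_fixed())` does not, which is the effect the fix aims for.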

Solution:
innobase_srv_conc_enter_innodb() used the wrong condition to detect the wsrep applier thread. The applier thread should always be granted access. Fixed.

Problem 2:
Even after fixing Problem 1, the cluster was stuck during SST.

Cause:
The SST thread creates an SST user. For this, it needs to enter InnoDB. We end up in a situation similar to Problem 1: the applier thread holds LocalMonitor and waits for SST to finish, the SST thread waits for InnoDB, and the user threads hold the InnoDB lock and wait for LocalMonitor.

Solution:
Always allow the SST thread to enter InnoDB.
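
The idea behind both solutions can be sketched as an admission gate that limits only ordinary user threads. This is a simplified model under stated assumptions, not the actual change in ha_innodb.cc: `ThreadRole`, `ConcurrencyGate`, `try_enter()` and `bypasses_limit()` are hypothetical names, the sketch assumes bypassing threads are not counted against the limit, and it returns `false` where the real server would make the thread wait.

```cpp
#include <mutex>

// Hypothetical thread roles used only for this illustration.
enum class ThreadRole { User, WsrepApplier, Sst };

// Simplified model of a gate limited by innodb_thread_concurrency.
class ConcurrencyGate {
 public:
  explicit ConcurrencyGate(int limit) : limit_(limit) {}

  // Applier and SST threads must never queue behind user threads,
  // otherwise the circular waits described above become possible.
  static bool bypasses_limit(ThreadRole role) {
    return role == ThreadRole::WsrepApplier || role == ThreadRole::Sst;
  }

  // Returns true if the thread is admitted. A real implementation would
  // block the caller instead of returning false.
  bool try_enter(ThreadRole role) {
    if (bypasses_limit(role)) return true;  // not counted, never queued
    std::lock_guard<std::mutex> guard(mutex_);
    if (active_ >= limit_) return false;    // limit reached: would wait
    ++active_;
    return true;
  }

  // Releases the slot taken by a user thread; a no-op for bypassing roles.
  void exit(ThreadRole role) {
    if (bypasses_limit(role)) return;
    std::lock_guard<std::mutex> guard(mutex_);
    if (active_ > 0) --active_;
  }

 private:
  const int limit_;
  int active_ = 0;
  std::mutex mutex_;
};
```

With a limit of 2, two `ThreadRole::User` callers are admitted and a third is refused, while `ThreadRole::WsrepApplier` and `ThreadRole::Sst` callers are still admitted, so neither of them can end up waiting on user threads that in turn wait for LocalMonitor.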

kamil-holubicki · 2024-10-09T13:52:11Z

https://pxc.cd.percona.com/view/8.0%20parallel%20MTR/job/pxc-8.0-pipeline-parallel-mtr/2895/

mysql-test/suite/galera/t/galera_innodb_thread_concurrency.test

storage/innobase/handler/ha_innodb.cc


kamil-holubicki requested a review from venkatesh-prasad-v October 9, 2024 13:51

venkatesh-prasad-v requested changes Oct 10, 2024


mysql-test/suite/galera/t/galera_innodb_thread_concurrency.test (outdated)

storage/innobase/handler/ha_innodb.cc (outdated)

storage/innobase/handler/ha_innodb.cc (outdated)

kamil-holubicki force-pushed the PXC-4500-8.0 branch from f64278f to f8b2f56 Compare October 11, 2024 11:10

kamil-holubicki requested a review from venkatesh-prasad-v October 11, 2024 11:10

venkatesh-prasad-v approved these changes Oct 15, 2024


kamil-holubicki merged commit aa205c5 into percona:8.0 Oct 21, 2024
10 of 23 checks passed
