PXC-4500: When innodb_thread_concurrency is set, the cluster can get stuck during SST #1963

Merged Oct 21, 2024 (1 commit)

Conversation

kamil-holubicki
https://perconadev.atlassian.net/browse/PXC-4500

Problem 1:
When innodb_thread_concurrency is set and the node is under a heavy user workload, the node can become stuck when it receives a CC (configuration change) event from Galera.

Cause:
Let's say innodb_thread_concurrency = 2.

  1. Two user threads have been granted access to InnoDB. They release the InnoDB access lock only after Galera certification, but certification requires them to acquire LocalMonitor.
  2. At the same time, a CC happens. The applier thread acquires LocalMonitor and notifies the application about the new view. The application tries to store the view in the wsrep schema. Before doing so, it turns off the thread's wsrep_on variable, so the thread is no longer marked as a wsrep-enabled thread. Then it tries to enter InnoDB. Because two user threads are already in InnoDB and this thread is not treated as a wsrep thread, it has to wait.

The above results in a deadlock:

  1. The user threads hold the InnoDB lock and wait for LocalMonitor.
  2. The applier thread holds LocalMonitor and waits for the InnoDB lock.
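
The circular wait described above can be modelled as a small wait-for graph. This is only an illustrative sketch: `WaitFor` and `has_cycle` are hypothetical names and are not part of PXC, Galera, or InnoDB. Each entry maps a waiting party to the party that holds the resource it needs.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model: key = waiting party, value = party holding the
// resource the key is waiting for. Not taken from the PXC code base.
using WaitFor = std::map<std::string, std::string>;

// Follows wait-for edges starting at `start`. Returns true if the walk
// comes back to a party it has already visited, i.e. a deadlock cycle.
bool has_cycle(const WaitFor& graph, const std::string& start) {
  std::set<std::string> seen;
  std::string current = start;
  while (true) {
    if (!seen.insert(current).second) {
      return true;  // already visited: circular wait
    }
    auto it = graph.find(current);
    if (it == graph.end()) {
      return false;  // this party waits for nobody, so the chain ends
    }
    current = it->second;
  }
}
```

With the two edges from this scenario (the user thread waits for LocalMonitor held by the applier thread, and the applier thread waits for an InnoDB slot held by the user thread), `has_cycle` reports a cycle. Removing the applier's wait on InnoDB breaks it.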

Solution:
innobase_srv_conc_enter_innodb() used the wrong condition to detect the wsrep applier thread. The applier thread should always be granted access, so the condition has been corrected.
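
To illustrate the idea of always admitting the applier thread, here is a minimal sketch of a concurrency gate with an exemption. It is not the actual innobase_srv_conc_enter_innodb() code: the `AdmissionGate` class, its methods, and the `exempt` flag are assumptions made for illustration, and the sketch is non-blocking and not thread-safe.

```cpp
#include <cstddef>

// Hypothetical stand-in for a thread-concurrency limit such as
// innodb_thread_concurrency. Not the real InnoDB implementation.
class AdmissionGate {
 public:
  explicit AdmissionGate(std::size_t limit) : limit_(limit) {}

  // Returns true if the caller may proceed. An exempt caller (for
  // example the applier thread) is admitted without consuming a slot.
  bool try_enter(bool exempt) {
    if (exempt) {
      return true;
    }
    if (active_ >= limit_) {
      return false;  // limit reached: a real gate would make the caller wait
    }
    ++active_;
    return true;
  }

  // Releases a slot taken by a non-exempt caller.
  void leave(bool exempt) {
    if (!exempt && active_ > 0) {
      --active_;
    }
  }

  std::size_t active() const { return active_; }

 private:
  std::size_t limit_;
  std::size_t active_ = 0;
};
```

With a limit of 2 and two user threads already admitted, a third non-exempt caller is refused, while an exempt caller still gets through. That is the property the applier thread needs so it can release LocalMonitor.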

Problem 2:
Even after fixing Problem 1, the cluster was stuck during SST.

Cause:
The SST thread creates an SST user, and for this it needs to enter InnoDB. This leads to a situation similar to Problem 1: the applier thread holds LocalMonitor and waits for SST to finish, the SST thread waits for InnoDB, and the user threads hold the InnoDB lock and wait for LocalMonitor.

Solution:
Always allow the SST thread to enter InnoDB.
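
Taken together, the two solutions amount to an exemption rule keyed on the kind of thread. The sketch below expresses that rule under stated assumptions: `ThreadRole`, `bypasses_concurrency_limit`, and `may_enter` are hypothetical names, and the real patch detects these threads through its own checks in ha_innodb.cc.

```cpp
#include <cstddef>

// Hypothetical classification of the threads discussed in this PR.
enum class ThreadRole { User, Applier, Sst };

// Applier and SST threads are never subject to the concurrency limit,
// because other threads may be blocked until they finish.
constexpr bool bypasses_concurrency_limit(ThreadRole role) {
  return role == ThreadRole::Applier || role == ThreadRole::Sst;
}

// Decides whether a thread may enter, given how many non-exempt threads
// are currently inside and the configured limit (0 means unlimited).
constexpr bool may_enter(ThreadRole role, std::size_t active,
                         std::size_t limit) {
  return limit == 0 || bypasses_concurrency_limit(role) || active < limit;
}
```

For example, with a limit of 2 and two user threads inside, `may_enter(ThreadRole::User, 2, 2)` is false, while the applier and SST roles are still admitted.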

@kamil-holubicki kamil-holubicki merged commit aa205c5 into percona:8.0 Oct 21, 2024
10 of 23 checks passed