PXC-4173: PXC node stalls with parallel replication workers executing DDLs via async node #1927

venkatesh-prasad-v · 2024-06-12T08:21:25Z

https://perconadev.atlassian.net/browse/PXC-4173

PXC-4173: PXC node stalls with parallel replication workers executing DDLs via async node

https://perconadev.atlassian.net/browse/PXC-4173

Problem
=======
PXC replica could enter into a deadlock with multi threaded replication when
--replica-preserve-commit-order is enabled.

The deadlock is more likely to occur on a replica with
--replica-preserve-commit-order enabled when the source server's workload
consists of concurrent executions of DMLs and DDLs. In such a scenario, the
replica shall try to commit them in the same order as that of the source which
will conflict with galera's ordering resulting in a deadlock.

MySQL ordering: Commits are serialized as per their commit order on source server.
Galera ordering: Commits are serialized in a FIFO manner.

Scenario
========
Let's assume we have the following two transactions in the relay log with same
last_committed.

T1: DML, seqno=10
T2: DDL, seqno=11

+-----------+--------------+-------------------------------------------------------+
| Timestamp |    Thread    |                        Activity                       |
+-----------+--------------+-------------------------------------------------------+
|     t0    | Co-ordinator | Assigns T1 to Applier-1                               |
+-----------+--------------+-------------------------------------------------------+
|     t1    | Co-ordinator | Assigns T2 to Applier-2                               |
+-----------+--------------+-------------------------------------------------------+
|     t2    |   Applier-1  | DML starts with seqno = 10                            |
+-----------+--------------+-------------------------------------------------------+
|     t3    |   Applier-2  | DDL starts with seqno = 11                            |
+-----------+--------------+-------------------------------------------------------+
|     t4    |   Applier-2  | DDL calls TOI_begin(), acquires commit monitor        |
+-----------+--------------+-------------------------------------------------------+
|     t5    |   Applier-2  | DDL executes                                          |
+-----------+--------------+-------------------------------------------------------+
|     t5    |   Applier-1  | DML executes                                          |
+-----------+--------------+-------------------------------------------------------+
|     t6    |   Applier-2  | DDL reaches commit_order_manager, finds that          |
|           |              | it needs to wait until Applier-1 is committed         |
+-----------+--------------+-------------------------------------------------------+
|     t7    |   Applier-1  | DML reaches commit, calls commit_order_enter_local()  |
|           |              | and will wait for Applier-2 to release commit monitor |
+-----------+--------------+-------------------------------------------------------+

In the end it will result in
Applier-1: DML executing Xid_log_event::commit waiting for DDL to release commit order
Applier-2: DDL executing ALTER TABLE waiting for applier-1 to commit first

Solution
========
This commit introduces "Async Monitor" at server layer to ensure that the
ordering in galera is done in the same order in the relay log.

Async Monitor is similar to Commit Monitor used in galera which internally
keeps track of the last_left seqno.

If a transaction with seqno > last_left, then such transactions shall wait
until the condition `seqno = last_left + 1` is met, meaning it is its turn to
enter the Commit Monitor in galera.  This ensures that transactions are
registered in galera is the same sequence as on the source.

Lifetime of the Async Monitor:
Async Monitor is created on START REPLICA and is destroyed on STOP REPLICA, and
whenever applier thread exits. Before it is destroyed, the value of the
last_left is stored in the Relay_log_info object and the value is restored on
subsequent START SLAVE.

it-percona-cla · 2024-09-18T09:16:42Z

All committers have signed the CLA.

sql/wsrep_async_monitor.cc

sql/wsrep_mysqld.cc

sql/wsrep_async_monitor.cc

… DDLs via async node https://perconadev.atlassian.net/browse/PXC-4173 Problem ======= PXC replica could enter into a deadlock with multi threaded replication when --replica-preserve-commit-order is enabled. The deadlock is more likely to occur on a replica with --replica-preserve-commit-order enabled when the source server's workload consists of concurrent executions of DMLs and DDLs. In such a scenario, the replica shall try to commit them in the same order as that of the source which will conflict with galera's ordering resulting in a deadlock. MySQL ordering: Commits are serialized as per their commit order on source server. Galera ordering: Commits are serialized in a FIFO manner. Scenario ======== Let's assume we have the following two transactions in the relay log with same last_committed. T1: DML, seqno=10 T2: DDL, seqno=11 +-----------+--------------+-------------------------------------------------------+ | Timestamp | Thread | Activity | +-----------+--------------+-------------------------------------------------------+ | t0 | Co-ordinator | Assigns T1 to Applier-1 | +-----------+--------------+-------------------------------------------------------+ | t1 | Co-ordinator | Assigns T2 to Applier-2 | +-----------+--------------+-------------------------------------------------------+ | t2 | Applier-1 | DML starts with seqno = 10 | +-----------+--------------+-------------------------------------------------------+ | t3 | Applier-2 | DDL starts with seqno = 11 | +-----------+--------------+-------------------------------------------------------+ | t4 | Applier-2 | DDL calls TOI_begin(), acquires commit monitor | +-----------+--------------+-------------------------------------------------------+ | t5 | Applier-2 | DDL executes | +-----------+--------------+-------------------------------------------------------+ | t5 | Applier-1 | DML executes | +-----------+--------------+-------------------------------------------------------+ | t6 | Applier-2 | DDL reaches commit_order_manager, finds that | | | | it needs to wait until Applier-1 is committed | +-----------+--------------+-------------------------------------------------------+ | t7 | Applier-1 | DML reaches commit, calls commit_order_enter_local() | | | | and will wait for Applier-2 to release commit monitor | +-----------+--------------+-------------------------------------------------------+ In the end it will result in Applier-1: DML executing Xid_log_event::commit waiting for DDL to release commit order Applier-2: DDL executing ALTER TABLE waiting for applier-1 to commit first Solution ======== This commit introduces "Async Monitor" at server layer to ensure that the ordering in galera is done in the same order in the relay log. Async Monitor is similar to Commit Monitor used in galera, but differs in how seqnois are tracked within the monitor. The Async Monitor has the following methods. 1. schedule() - Schedules the transactions in as per their order in the the relay log. 2. enter() - Ensures that threads are serialized as per their order. 3. leave() - Ensures that threads are removed from the monitor in the serialized order. 4. skip() - Provides some flexibilty to the monitor to skip transactions. This will be required for handling skipped and filtered transactions. Lifetime of the Async Monitor: Async Monitor is created on START REPLICA and is destroyed on STOP REPLICA, or whenever applier thread exits.

kamil-holubicki · 2024-11-19T12:24:16Z

sql/log_event.cc

+          if (wsrep_async_monitor) {
+            auto seqno = gtid_log_ev->sequence_number;
+            wsrep_async_monitor->schedule(seqno);
+            fprintf(stderr, "Scheduled the seqno: %llu in async monitor\n", seqno);


fprintf
multiple occurences

kamil-holubicki · 2024-11-19T12:26:55Z

sql/rpl_gtid_execution.cc

@@ -37,6 +37,9 @@
 #include "sql/mysqld.h"  // connection_events_loop_aborted
 #include "sql/rpl_gtid.h"
 #include "sql/rpl_rli.h"    // Relay_log_info
+#ifdef WITH_WSREP


Can we keep WSREP-dependant includes after the standard ones, like we do in other cases?

kamil-holubicki · 2024-11-19T12:28:20Z

sql/rpl_rli.h

@@ -63,6 +63,9 @@
 #include "sql/sql_class.h"    // THD
 #include "sql/system_variables.h"
 #include "sql/table.h"
+#ifdef WITH_WSREP
+#include "sql/wsrep_async_monitor.h"


It looks like forward type declaration is enough, and the header can be included in *.cc file

kamil-holubicki · 2024-11-19T12:44:07Z

sql/wsrep_async_monitor.cc

+  std::unique_lock<std::mutex> lock(m_mutex);
+
+  // Check if the sequence number matches the front of the queue.
+  // In a correctly functioning monitor this should not happen


In a correctly functioning monitor this should not happen ->
In a correctly functioning monitor this should always be true

kamil-holubicki · 2024-11-19T12:45:56Z

sql/wsrep_async_monitor.cc

+  }
+
+  // Mark the seqno as skipped
+  skipped_seqnos.insert(seqno);


I can't see any place where seqnos are removed from skipped_seqnos

kamil-holubicki · 2024-11-19T12:55:14Z

sql/wsrep_mysqld.cc

+      if (opt_replica_preserve_commit_order && opt_mts_replica_parallel_workers > 1) {
+        WSREP_ERROR("Async replica thread %u is not monitored by wsrep async monitor",
+                     thd->thread_id());
+        exit(1);


exit?
maybe unireg_abort(), or assert and handle it somehow?

kamil-holubicki · 2024-11-19T12:55:46Z

sql/wsrep_mysqld.cc

+      if (opt_replica_preserve_commit_order && opt_mts_replica_parallel_workers > 1) {
+        WSREP_ERROR("Async replica thread %u is not monitored by wsrep async monitor",
+                     thd->thread_id());
+        exit(1);


kamil-holubicki · 2024-11-19T13:06:17Z

sql/log_event.cc

+          Wsrep_async_monitor *wsrep_async_monitor {rli->get_wsrep_async_monitor()};
+          if (wsrep_async_monitor) {
+            auto seqno = gtid_log_ev->sequence_number;
+            wsrep_async_monitor->schedule(seqno);


Is my understandig correct, that in this point transactions appear in the order they were committed on the source node?
We create the queue of ordered transactions, and will use AsyncMonitor to enforce this order of commit in Galera, right?

kamil-holubicki · 2024-11-19T13:10:58Z

sql/wsrep_async_monitor.cc

+// Method for main thread to add scheduled seqnos
+void Wsrep_async_monitor::schedule(seqno_t seqno) {
+  std::unique_lock<std::mutex> lock(m_mutex);
+  scheduled_seqnos.push(seqno);


seqons should be growing monolithically, right?
Can we add an assert like seqno = prev+1?
If not monolithically growing (why?), at least let's add an assert seqno > prev.
A good code comment would be appreciated

kamil-holubicki · 2024-11-19T13:12:26Z

sql/wsrep_async_monitor.cc

+
+  // Wait until this transaction is at the head of the scheduled queue
+  m_cond.wait(lock, [this, seqno] {
+    // Remove skipped transactions


When this can happen? Could you add a code comment?
I suspect this is the case when A, B, C have been scheduled, then B skipped, but skip() removes only from the front of scheduled_seqno queue

kamil-holubicki · 2024-11-19T13:14:43Z

The commit message says about ordering transactions basing on their seqnos i not way this = prev + 1. But the code does it in a different way (I can understand that logically it is still the same). It would be good to add more details about how the current implementation actually works.

venkatesh-prasad-v self-assigned this Jun 12, 2024

venkatesh-prasad-v force-pushed the 8.0-PXC-4173-monitor branch from 45b94f6 to 91ab128 Compare September 18, 2024 09:16

venkatesh-prasad-v force-pushed the 8.0-PXC-4173-monitor branch 4 times, most recently from 63a4df8 to 25c09e8 Compare September 19, 2024 14:53

kamil-holubicki reviewed Sep 19, 2024

View reviewed changes

venkatesh-prasad-v force-pushed the 8.0-PXC-4173-monitor branch from 830bcf5 to 1a9149c Compare September 24, 2024 10:45

venkatesh-prasad-v force-pushed the 8.0-PXC-4173-monitor branch from 1a9149c to a1f0c52 Compare October 10, 2024 08:18

venkatesh-prasad-v force-pushed the 8.0-PXC-4173-monitor branch 10 times, most recently from 30112dc to 9387f0a Compare November 18, 2024 05:42

venkatesh-prasad-v force-pushed the 8.0-PXC-4173-monitor branch from 9387f0a to eafc945 Compare November 18, 2024 08:24

venkatesh-prasad-v marked this pull request as ready for review November 18, 2024 08:25

venkatesh-prasad-v requested a review from kamil-holubicki November 18, 2024 08:25

venkatesh-prasad-v force-pushed the 8.0-PXC-4173-monitor branch from eafc945 to 0972579 Compare November 18, 2024 13:51

venkatesh-prasad-v force-pushed the 8.0-PXC-4173-monitor branch from 0972579 to fe1bd96 Compare November 18, 2024 13:54

kamil-holubicki requested changes Nov 19, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PXC-4173: PXC node stalls with parallel replication workers executing DDLs via async node #1927

PXC-4173: PXC node stalls with parallel replication workers executing DDLs via async node #1927

venkatesh-prasad-v commented Jun 12, 2024 •

edited

Loading

it-percona-cla commented Sep 18, 2024 •

edited

Loading

kamil-holubicki Nov 19, 2024

kamil-holubicki Nov 19, 2024

kamil-holubicki Nov 19, 2024

kamil-holubicki Nov 19, 2024

kamil-holubicki Nov 19, 2024

kamil-holubicki Nov 19, 2024

kamil-holubicki Nov 19, 2024

kamil-holubicki Nov 19, 2024

kamil-holubicki Nov 19, 2024

kamil-holubicki Nov 19, 2024

kamil-holubicki commented Nov 19, 2024

PXC-4173: PXC node stalls with parallel replication workers executing DDLs via async node #1927

Are you sure you want to change the base?

PXC-4173: PXC node stalls with parallel replication workers executing DDLs via async node #1927

Conversation

venkatesh-prasad-v commented Jun 12, 2024 • edited Loading

it-percona-cla commented Sep 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kamil-holubicki commented Nov 19, 2024

venkatesh-prasad-v commented Jun 12, 2024 •

edited

Loading

it-percona-cla commented Sep 18, 2024 •

edited

Loading