Handling deadlock in MariaDB Galera Cluster

Recently I was working on a data migration script with a requirement to migrate data from one database to another, and from one schema to another, on a daily basis. The script ran successfully on the development server, but in production it kept failing on one query with a deadlock error.

Issue

REPLACE INTO db2.table2 (id, some_identifier_id, name, created_at, updated_at)
    (SELECT id, some_identifier_id, name, created_at, updated_at
     FROM db1.table1
     WHERE some_identifier_id IS NOT NULL
       AND some_identifier_id NOT IN
           (SELECT some_identifier_id FROM db2.table1
            WHERE some_other_identifier_id IS NOT NULL));

In production this fails with:
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

The sample schemas behind it:

MariaDB [db2]> desc db2.table2;
+--------------------+--------------+------+-----+---------+----------------+
| Field              | Type         | Null | Key | Default | Extra          |
+--------------------+--------------+------+-----+---------+----------------+
| id                 | int(11)      | NO   | PRI | NULL    | auto_increment |
| some_identifier_id | varchar(255) | YES  | UNI | NULL    |                |
| name               | varchar(255) | YES  |     | NULL    |                |
| created_at         | datetime     | YES  |     | NULL    |                |
| updated_at         | datetime     | YES  |     | NULL    |                |
+--------------------+--------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

MariaDB [db2]> desc db1.table1;
+--------------------+--------------+------+-----+---------+----------------+
| Field              | Type         | Null | Key | Default | Extra          |
+--------------------+--------------+------+-----+---------+----------------+
| id                 | int(11)      | NO   | PRI | NULL    | auto_increment |
| some_identifier_id | varchar(255) | YES  | MUL | NULL    |                |
| name               | varchar(255) | YES  |     | NULL    |                |
| created_at         | datetime     | YES  |     | NULL    |                |
| updated_at         | datetime     | YES  |     | NULL    |                |
+--------------------+--------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

MariaDB [db2]> desc db2.table1;
+--------------------------+--------------+------+-----+---------+----------------+
| Field                    | Type         | Null | Key | Default | Extra          |
+--------------------------+--------------+------+-----+---------+----------------+
| id                       | int(11)      | NO   | PRI | NULL    | auto_increment |
| some_identifier_id       | varchar(255) | YES  | UNI | NULL    |                |
| some_other_identifier_id | varchar(255) | YES  |     | NULL    |                |
| created_at               | datetime     | YES  |     | NULL    |                |
| updated_at               | datetime     | YES  |     | NULL    |                |
+--------------------------+--------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

The source table, db1.table1, has only about 50k records.

  1. The query works fine on the development server but fails on the production server; both run server version 10.0.15-MariaDB.
  2. All the tables are InnoDB. db1.table1 uses collation latin1_swedish_ci, while both tables in db2 use utf8_unicode_ci.
  3. Executing the query on any individual node within the cluster also fails.

Let’s assume that in production I have 5 DB servers with multi-master synchronous replication using a Galera cluster. The points above were enough for me to start researching how Galera works. I even tried some easy suggestions, e.g. LOCK IN SHARE MODE, ordering by id, and checking the output of SHOW ENGINE INNODB STATUS, but none of these actually solved the problem.

By then I was clueless about the issue, so the only way forward was to check out whatever documentation was available. After some research I found some useful articles, among them Known limitations of MariaDB Galera Cluster and Differences from a standalone MySQL server, which helped me realize the actual problem:

Do not rely on auto-increment values to be sequential. Galera uses a mechanism based on autoincrement increment to produce unique non-conflicting sequences, so on every single node the sequence will have gaps.
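You can see the mechanism on any node: with wsrep_auto_increment_control enabled (the default), Galera sets auto_increment_increment to the cluster size and auto_increment_offset to the node's position. On a 5-node cluster the third node would show something like this (values illustrative):

MariaDB [db2]> SHOW VARIABLES LIKE 'auto_increment%';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| auto_increment_increment | 5     |
| auto_increment_offset    | 3     |
+--------------------------+-------+
2 rows in set (0.00 sec)

This node would generate ids 3, 8, 13, 18, … while the other nodes fill the gaps with their own interleaved sequences.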

Galera Cluster uses optimistic concurrency control at the cluster level, which can result in transactions that issue a COMMIT aborting at that stage.
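In other words, a transaction can take all of its locks locally without any conflict and still be aborted at COMMIT, when certification detects a conflicting writeset from another node. A rough illustration (table t and the row are hypothetical):

-- On node 1:
BEGIN;
UPDATE t SET name = 'a' WHERE id = 1;
COMMIT; -- succeeds: this writeset certifies first

-- Concurrently on node 2:
BEGIN;
UPDATE t SET name = 'b' WHERE id = 1;
COMMIT; -- ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

The losing transaction gets the same ERROR 1213 as a classic InnoDB deadlock, even though no lock-wait cycle ever existed.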

Solution

So it’s quite possible to get a deadlock while the REPLACE INTO query syncs id values from db1.table1 to db2.table2. Explicitly inserting or replacing an auto-increment primary key is the most probable and obvious cause of deadlocks in Galera here. I removed id from that query and kept some_identifier_id as the unique key, so the same REPLACE logic still works. That almost eliminated the deadlock error.
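The reworked query looked roughly like this (a sketch of the change, not the exact production SQL; the REPLACE now matches rows on the some_identifier_id unique key and lets each node assign its own id):

REPLACE INTO db2.table2 (some_identifier_id, name, created_at, updated_at)
    (SELECT some_identifier_id, name, created_at, updated_at
     FROM db1.table1
     WHERE some_identifier_id IS NOT NULL
       AND some_identifier_id NOT IN
           (SELECT some_identifier_id FROM db2.table1
            WHERE some_other_identifier_id IS NOT NULL));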

But the same deadlock error still came up at random, maybe 1 time in 10. By now, though, it was a known behavior of Galera: its optimistic locking can abort a transaction at COMMIT, and the usual advice in that scenario is simply to retry the transaction. So to solve the problem completely I followed that suggestion and modified my script to retry the transaction on failure, with a few minutes of sleep before each attempt: retry the transaction 3 times on failure, with an interval of 5 minutes. It worked!
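To illustrate, here is a minimal sketch of that retry logic in Python, assuming the migration script talks to MariaDB through PyMySQL; the migrate() wrapper and the connection details are my illustration, not the original script:

import time

import pymysql  # assumption: any DB-API driver would work similarly

DEADLOCK_ERRNO = 1213         # ER_LOCK_DEADLOCK, the error shown above
MAX_ATTEMPTS = 3              # retry the transaction 3 times on failure
RETRY_DELAY_SECONDS = 5 * 60  # with an interval of 5 minutes

MIGRATION_SQL = """
REPLACE INTO db2.table2 (some_identifier_id, name, created_at, updated_at)
    (SELECT some_identifier_id, name, created_at, updated_at
     FROM db1.table1
     WHERE some_identifier_id IS NOT NULL
       AND some_identifier_id NOT IN
           (SELECT some_identifier_id FROM db2.table1
            WHERE some_other_identifier_id IS NOT NULL))
"""

def migrate(connection):
    """Run the migration, retrying only when Galera reports a deadlock."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with connection.cursor() as cursor:
                cursor.execute(MIGRATION_SQL)
            connection.commit()
            return
        except pymysql.MySQLError as exc:
            connection.rollback()
            # Anything other than a deadlock, or the final attempt, is a real failure.
            if exc.args[0] != DEADLOCK_ERRNO or attempt == MAX_ATTEMPTS:
                raise
            time.sleep(RETRY_DELAY_SECONDS)

# Hypothetical connection parameters:
migrate(pymysql.connect(host="localhost", user="migrator", password="secret"))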

[Note: first discussed here]


2 thoughts on “Handling deadlock in MariaDB Galera Cluster”

  1. Nice article! As per my understanding, the number of Galera nodes in the development and production clusters has to be the same to avoid deadlock situations, because auto_increment jumps by the number of nodes in the cluster. If the dev cluster machines don’t match the prod cluster, the auto_increment values will collide and eventually cause deadlocks.

