Recently I was working on a data migration script. The requirement was to migrate data from one database to another, and from one schema to another, on a daily basis. The script ran successfully on the development server, but in production it kept failing on one query with a deadlock error.
Issue
REPLACE INTO db2.table2 (id, some_identifier_id, name, created_at, updated_at)
  (SELECT id, some_identifier_id, name, created_at, updated_at
     FROM db1.table1
    WHERE some_identifier_id IS NOT NULL
      AND some_identifier_id NOT IN
          (SELECT some_identifier_id
             FROM db2.table1
            WHERE some_other_identifier_id IS NOT NULL));
Which leads to
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
Sample schema behind it
MariaDB [db2]> desc db2.table2;
+----------------------+--------------+------+-----+---------+----------------+
| Field                | Type         | Null | Key | Default | Extra          |
+----------------------+--------------+------+-----+---------+----------------+
| id                   | int(11)      | NO   | PRI | NULL    | auto_increment |
| some_identifier_id   | varchar(255) | YES  | UNI | NULL    |                |
| name                 | varchar(255) | YES  |     | NULL    |                |
| created_at           | datetime     | YES  |     | NULL    |                |
| updated_at           | datetime     | YES  |     | NULL    |                |
+----------------------+--------------+------+-----+---------+----------------+
9 rows in set (0.00 sec)

MariaDB [db2]> desc db1.table1;
+----------------------+--------------+------+-----+---------+----------------+
| Field                | Type         | Null | Key | Default | Extra          |
+----------------------+--------------+------+-----+---------+----------------+
| id                   | int(11)      | NO   | PRI | NULL    | auto_increment |
| some_identifier_id   | varchar(255) | YES  | MUL | NULL    |                |
| name                 | varchar(255) | YES  |     | NULL    |                |
| created_at           | datetime     | YES  |     | NULL    |                |
| updated_at           | datetime     | YES  |     | NULL    |                |
+----------------------+--------------+------+-----+---------+----------------+
9 rows in set (0.00 sec)

MariaDB [db2]> desc db2.table1;
+--------------------------+--------------+------+-----+---------+----------------+
| Field                    | Type         | Null | Key | Default | Extra          |
+--------------------------+--------------+------+-----+---------+----------------+
| id                       | int(11)      | NO   | PRI | NULL    | auto_increment |
| some_identifier_id       | varchar(255) | YES  | UNI | NULL    |                |
| some_other_identifier_id | varchar(255) | YES  |     | NULL    |                |
| created_at               | datetime     | YES  |     | NULL    |                |
| updated_at               | datetime     | YES  |     | NULL    |                |
+--------------------------+--------------+------+-----+---------+----------------+
9 rows in set (0.00 sec)
The source table has only about 50k records.
- The query works fine on the development server but fails on the production server. Both run server version 10.0.15-MariaDB.
- All the tables are InnoDB; db1.table1 has collation latin1_swedish_ci, while both tables in db2 have collation utf8_unicode_ci.
- Executing the query on any individual node within the cluster also fails.
In production I have 5 DB servers with multi-master synchronous replication using Galera Cluster. The points above were enough to make me start researching how Galera works. I even tried some easy suggestions, e.g. LOCK IN SHARE MODE, ordering by id, and checking the log with SHOW ENGINE INNODB STATUS. But none of these actually solved the problem.
I was clueless about the issue at this point, so the only way forward was to read whatever documentation was available. After some research I found some useful articles, such as known limitations of mariadb galera cluster and Differences from a standalone MySQL server, which helped me realize the actual problem:
Do not rely on auto-increment values to be sequential. Galera uses a mechanism based on autoincrement increment to produce unique non-conflicting sequences, so on every single node the sequence will have gaps.
Galera Cluster uses at the cluster-level optimistic concurrency control, which can result in transactions that issue a COMMIT aborting at that stage.
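The auto-increment behavior quoted above is worth seeing concretely. The following is a minimal sketch (plain Python, not Galera itself) of how `auto_increment_increment` and `auto_increment_offset` produce unique but gapped sequences: in a 5-node cluster each node effectively counts in steps of 5 from its own offset, so no two nodes collide, but no single node's ids are contiguous.

```python
def node_ids(offset, increment, count):
    """Ids generated by one node: offset, offset+increment, offset+2*increment, ..."""
    return [offset + i * increment for i in range(count)]

# Assume a 5-node cluster: increment = cluster size, each node gets a
# distinct offset in 1..5 (this mirrors how Galera sets the variables).
cluster_size = 5
ids_node1 = node_ids(1, cluster_size, 4)  # [1, 6, 11, 16]
ids_node2 = node_ids(2, cluster_size, 4)  # [2, 7, 12, 17]

# No two nodes ever produce the same id, but every node's sequence has
# gaps, so id values cannot be assumed sequential across the cluster.
```

This is why copying id values generated on one node into another table, as my REPLACE query did, is asking for conflicts.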
Solution
So it is quite possible to get a deadlock when the REPLACE INTO query syncs id values from db1.table1 to db2.table2. Inserting or replacing an auto-increment primary key is the most likely and obvious cause of deadlocks in Galera. I removed id from the query and relied on some_identifier_id, the unique key, to drive the same REPLACE. That almost eliminated the deadlock errors.
But the same deadlock error still appeared at random, roughly 1 time in 10. By now, though, this was a known behavior of Galera: its optimistic locking can result in deadlocks, and the usual advice in that scenario is simply to retry the transaction. So to solve the problem completely, I followed that suggestion and modified my script to retry the transaction on failure, with a sleep of a few minutes before each retry: retry the transaction 3 times on failure, with an interval of 5 minutes. It worked!
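The retry logic can be sketched roughly as below. This is a hypothetical Python version (my actual script differs); `run_with_retry` and the DB-API-style `conn` are illustrative names, and the SQL string is the rewritten query with id removed.

```python
import time

# The rewritten migration query: id is no longer copied, so the target
# table generates its own auto-increment values.
MIGRATION_SQL = """
REPLACE INTO db2.table2 (some_identifier_id, name, created_at, updated_at)
  (SELECT some_identifier_id, name, created_at, updated_at
     FROM db1.table1
    WHERE some_identifier_id IS NOT NULL
      AND some_identifier_id NOT IN
          (SELECT some_identifier_id FROM db2.table1
            WHERE some_other_identifier_id IS NOT NULL));
"""

def run_with_retry(execute, attempts=3, delay=300):
    """Call `execute`; on failure, retry up to `attempts` times total,
    sleeping `delay` seconds (5 minutes by default) between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return execute()
        except Exception:
            # Ideally catch only the driver's deadlock error (MySQL/MariaDB
            # error 1213) rather than every exception.
            if attempt == attempts:
                raise
            time.sleep(delay)

# Usage (assuming a DB-API connection object `conn`):
# run_with_retry(lambda: conn.cursor().execute(MIGRATION_SQL))
```

Since Galera may abort a conflicting transaction at COMMIT time by design, treating error 1213 as retryable rather than fatal is the intended pattern, not a hack.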
[Note: at first discussed here]
Nice article!! As per my understanding, the number of Galera nodes in the development and production clusters has to be the same to avoid deadlock situations, because auto_increment jumps by the number of nodes in the cluster. If the dev cluster size doesn't match prod, the auto_increment values will collide and eventually cause deadlocks.
This is a nice and simple workaround. It's easy to understand and doesn't require some abstruse feature.
Lesson learnt: MySQL with Galera is a bit more complex than just MySQL.