st-replication-patcher #4

gordan-bobic · 2024-03-06T18:55:05Z

Sometimes databases in replication can get out of sync for various reasons.
And sometimes re-seeding nodes in complex replication topologies is extremely expensive and difficult. Especially if we use strict gtid mode.

If we tell the database to ignore diplicate or missing row errors and carry on regardless, Something like this will end up in the error log to keep a record of what went wrong with the rows we had to skip:

# grep -A1 'ERROR' mysqld.log | tail
--
2024-03-06 11:13:46 54 [ERROR] mariadbd: Can't find record in 'pbs'
2024-03-06 11:13:46 54 [Warning] Slave SQL: Could not execute Update_rows_v1 event on table pffdb.pbs; Can't find record in 'pbs', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000026, end_log_pos 1059031232, Gtid 77-4012-4354934, Internal MariaDB error code: 1032
2024-03-06 11:13:46 70 [ERROR] mariadbd: Can't find record in 'pbst'
2024-03-06 11:13:46 70 [Warning] Slave SQL: Could not execute Update_rows_v1 event on table pffdb.pbst; Can't find record in 'pbst', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000026, end_log_pos 1059045525, Gtid 77-4012-4354938, Internal MariaDB error code: 1032
--
2024-03-06 11:14:22 27 [ERROR] mariadbd: Can't find record in 'pbs'
2024-03-06 11:14:22 27 [Warning] Slave SQL: Could not execute Update_rows_v1 event on table pffdb.pbs; Can't find record in 'pbs', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000027, end_log_pos 113662791, Gtid 77-4012-4371264, Internal MariaDB error code: 1032
2024-03-06 11:14:22 91 [ERROR] mariadbd: Can't find record in 'pbst'
2024-03-06 11:14:22 91 [Warning] Slave SQL: Could not execute Update_rows_v1 event on table pffdb.pbst; Can't find record in 'pbst', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000027, end_log_pos 113664427, Gtid 77-4012-4371266, Internal MariaDB error code: 1032

The purpose of this took is to find such errors, and do what it can to fix the discrepancies. In the above example, for each error it should:

Find the transaction listed in the binlog
Infer the row by PK
Infer the master and establish master access details using master..info
Go to the master and find the row.
Check the row with same PK upstream and locally to make sure the discrepancy actually exists - if it does, do nothing to fix it and emit a notification. It is important that we don't even try to fix things that are verifiably not broken.
Check that none of the tables involved have any triggers that could react in ways we cannot predict. If there are triggers on the table we are about to modify, abort with an error.
Check that the table we are modifying has no inbound foreign keys pointing at it. If there are, abort with an error because such FKs could be doing something like ON DELETE CASCADE and deleting and re-adding a row could cause data destruction.
Delete and replace the row with the exact same values, to make sure that any row that doesn't exist downstream is replaced, and this fix the discrepancy encountered

Parameters should follow a similar pattern to pt-archiver from Percona Toolkit, for example

st-replication_patcher --upstream DSN --local DSN --log-error=/path/to/file --error-filter=[1032|1062] --dry-run

If --local is not provided, check, use normal precedence of attempts that mysql CLI would use:

~/.my.cnf
local socket without username/password

If --upstream is not present (or not present in full), infer anything not provided from $datadir/master.info file for MariaDB, or mysql.slave_master_info table for MySQL (IP, port, username, password).

if --log-rerror is not specified, it should be inferred from log_error global variable on the --local connection

If --error-filter is specified, only consider errors with this error code, e.g. 1032 = missing row, 1062 = duplicate key, it is listed in the example above as error code: 1032. Different types of errors may require different special handling. More may be added in the future, so the fixes should be made in a modular way to facilitate additional error handling for specific errors to be added in the future.

If --dry-run is specified, don't modify any data or logs, only output what has been found and what would be done if --dry-run was not specified.

A very simple local database/log should be kept in ~/.st-replication-patcher.log listing the errors that have already been patched, so that we are not patching the same errors over and over. This can be overridden by deleting the database/log. Ideally the log should be human readable so that it is easy for an operator to selectively delete errors they want re-patched, or paste in an error that they don't want patched to exclude it.

The text was updated successfully, but these errors were encountered:

gordan-bobic assigned oblitorum Mar 6, 2024

gordan-bobic added the enhancement New feature or request label Mar 6, 2024

oblitorum transferred this issue from another repository Mar 11, 2024

gordan-bobic transferred this issue from another repository Aug 2, 2024

gordan-bobic transferred this issue from shatteredsilicon/ssm-submodules Aug 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

st-replication-patcher #4

st-replication-patcher #4

gordan-bobic commented Mar 6, 2024 •

edited

Loading

st-replication-patcher #4

st-replication-patcher #4

Comments

gordan-bobic commented Mar 6, 2024 • edited Loading

gordan-bobic commented Mar 6, 2024 •

edited

Loading