Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

st-replication-patcher #4

Open
gordan-bobic opened this issue Mar 6, 2024 · 0 comments
Open

st-replication-patcher #4

gordan-bobic opened this issue Mar 6, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@gordan-bobic
Copy link
Contributor

gordan-bobic commented Mar 6, 2024

Sometimes databases in replication can get out of sync for various reasons.
And sometimes re-seeding nodes in complex replication topologies is extremely expensive and difficult. Especially if we use strict gtid mode.

If we tell the database to ignore diplicate or missing row errors and carry on regardless, Something like this will end up in the error log to keep a record of what went wrong with the rows we had to skip:

# grep -A1 'ERROR' mysqld.log | tail
--
2024-03-06 11:13:46 54 [ERROR] mariadbd: Can't find record in 'pbs'
2024-03-06 11:13:46 54 [Warning] Slave SQL: Could not execute Update_rows_v1 event on table pffdb.pbs; Can't find record in 'pbs', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000026, end_log_pos 1059031232, Gtid 77-4012-4354934, Internal MariaDB error code: 1032
2024-03-06 11:13:46 70 [ERROR] mariadbd: Can't find record in 'pbst'
2024-03-06 11:13:46 70 [Warning] Slave SQL: Could not execute Update_rows_v1 event on table pffdb.pbst; Can't find record in 'pbst', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000026, end_log_pos 1059045525, Gtid 77-4012-4354938, Internal MariaDB error code: 1032
--
2024-03-06 11:14:22 27 [ERROR] mariadbd: Can't find record in 'pbs'
2024-03-06 11:14:22 27 [Warning] Slave SQL: Could not execute Update_rows_v1 event on table pffdb.pbs; Can't find record in 'pbs', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000027, end_log_pos 113662791, Gtid 77-4012-4371264, Internal MariaDB error code: 1032
2024-03-06 11:14:22 91 [ERROR] mariadbd: Can't find record in 'pbst'
2024-03-06 11:14:22 91 [Warning] Slave SQL: Could not execute Update_rows_v1 event on table pffdb.pbst; Can't find record in 'pbst', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000027, end_log_pos 113664427, Gtid 77-4012-4371266, Internal MariaDB error code: 1032

The purpose of this took is to find such errors, and do what it can to fix the discrepancies. In the above example, for each error it should:

  • Find the transaction listed in the binlog
  • Infer the row by PK
  • Infer the master and establish master access details using master..info
  • Go to the master and find the row.
  • Check the row with same PK upstream and locally to make sure the discrepancy actually exists - if it does, do nothing to fix it and emit a notification. It is important that we don't even try to fix things that are verifiably not broken.
  • Check that none of the tables involved have any triggers that could react in ways we cannot predict. If there are triggers on the table we are about to modify, abort with an error.
  • Check that the table we are modifying has no inbound foreign keys pointing at it. If there are, abort with an error because such FKs could be doing something like ON DELETE CASCADE and deleting and re-adding a row could cause data destruction.
  • Delete and replace the row with the exact same values, to make sure that any row that doesn't exist downstream is replaced, and this fix the discrepancy encountered

Parameters should follow a similar pattern to pt-archiver from Percona Toolkit, for example

st-replication_patcher --upstream DSN --local DSN --log-error=/path/to/file --error-filter=[1032|1062] --dry-run

If --local is not provided, check, use normal precedence of attempts that mysql CLI would use:

  1. ~/.my.cnf
  2. local socket without username/password

If --upstream is not present (or not present in full), infer anything not provided from $datadir/master.info file for MariaDB, or mysql.slave_master_info table for MySQL (IP, port, username, password).

if --log-rerror is not specified, it should be inferred from log_error global variable on the --local connection

If --error-filter is specified, only consider errors with this error code, e.g. 1032 = missing row, 1062 = duplicate key, it is listed in the example above as error code: 1032. Different types of errors may require different special handling. More may be added in the future, so the fixes should be made in a modular way to facilitate additional error handling for specific errors to be added in the future.

If --dry-run is specified, don't modify any data or logs, only output what has been found and what would be done if --dry-run was not specified.

A very simple local database/log should be kept in ~/.st-replication-patcher.log listing the errors that have already been patched, so that we are not patching the same errors over and over. This can be overridden by deleting the database/log. Ideally the log should be human readable so that it is easy for an operator to selectively delete errors they want re-patched, or paste in an error that they don't want patched to exclude it.

@gordan-bobic gordan-bobic added the enhancement New feature or request label Mar 6, 2024
@oblitorum oblitorum transferred this issue from another repository Mar 11, 2024
@gordan-bobic gordan-bobic transferred this issue from another repository Aug 2, 2024
@gordan-bobic gordan-bobic transferred this issue from shatteredsilicon/ssm-submodules Aug 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants