title | summary |
---|---|
TiDB Binlog FAQs | Learn about the frequently asked questions (FAQs) and answers about TiDB Binlog. |
This document collects the frequently asked questions (FAQs) about TiDB Binlog.
- There is no impact on the query.
- There is a slight performance impact on `INSERT`, `DELETE`, and `UPDATE` transactions. In terms of latency, a p-binlog is written concurrently in the TiKV prewrite stage before the transactions are committed. Generally, writing binlog is faster than TiKV prewrite, so it does not increase latency. You can check the response time of writing binlog in Pump's monitoring panel.
The latency of TiDB Binlog replication is measured in seconds, which is generally about 3 seconds during off-peak hours.
To replicate data to the downstream MySQL or TiDB cluster, Drainer must have the following privileges:
- Insert
- Update
- Delete
- Create
- Drop
- Alter
- Execute
- Index
- Select
- Create View
- Check whether Pump's GC works well:
    - Check whether the gc_tso time in Pump's monitoring panel is consistent with the GC setting in the configuration file.
- If GC works well, perform the following steps to reduce the amount of space required for a single Pump:
    - Modify the GC parameter of Pump to reduce the number of days to retain data.
    - Add Pump instances.
Execute the following command to check whether the status of Pump is normal and whether all the Pump instances that are not in the `offline` state are running.
{{< copyable "shell-regular" >}}
binlogctl -cmd pumps
Then, check whether the Drainer monitor or log outputs corresponding errors. If so, resolve them accordingly.
Check the following monitoring items:
- For the Drainer Event monitoring metric, check the speed of Drainer replicating `INSERT`, `UPDATE`, and `DELETE` transactions to the downstream per second.
- For the SQL Query Time monitoring metric, check the time Drainer takes to execute SQL statements in the downstream.
Possible causes and solutions for slow replication:
- If the replicated database contains a table without a primary key or unique index, add a primary key to the table.
- If the latency between Drainer and the downstream is high, increase the value of the `worker-count` parameter of Drainer. For cross-datacenter replication, it is recommended to deploy Drainer in the downstream.
- If the load in the downstream is not high, increase the value of the `worker-count` parameter of Drainer.
If a Pump instance crashes, Drainer cannot replicate data to the downstream because it cannot obtain the data of this instance. If this Pump instance can recover to the normal state, Drainer resumes replication; if not, perform the following steps:
- Use binlogctl to change the state of this Pump instance to `offline` to discard the data of this Pump instance.
- Because Drainer cannot obtain the data of this Pump instance, the data in the downstream and upstream is inconsistent. In this situation, perform full and incremental backups again. The steps are as follows:
    - Stop the Drainer.
    - Perform a full backup in the upstream.
    - Clear the data in the downstream, including the `tidb_binlog.checkpoint` table.
    - Restore the full backup to the downstream.
    - Deploy Drainer and use `initialCommitTs` (set `initialCommitTs` to the snapshot timestamp of the full backup) as the start point of initial replication.
Checkpoint records the `commit-ts` that Drainer replicates to the downstream. When Drainer restarts, it reads the checkpoint and then replicates data to the downstream starting from the corresponding `commit-ts`. A Drainer log entry such as `["write save point"] [ts=411222863322546177]` means that the checkpoint is saved with the corresponding timestamp.
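To make the checkpoint log entry easier to interpret, the following minimal sketch extracts the timestamp from a `write save point` line and converts it to wall-clock time. It assumes the usual TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits); the function names `checkpoint_ts` and `tso_to_datetime` are hypothetical helpers and are not part of Drainer or binlogctl.

```python
import re
from datetime import datetime, timezone

# Assumption: the low 18 bits of a TSO are a logical counter and the
# remaining high bits are a physical timestamp in milliseconds.
LOGICAL_BITS = 18


def checkpoint_ts(log_line):
    """Return the ts value from a "write save point" log line, or None."""
    if "write save point" not in log_line:
        return None
    match = re.search(r"\[ts=(\d+)\]", log_line)
    return int(match.group(1)) if match else None


def tso_to_datetime(tso):
    """Convert the physical part of a TSO to a UTC datetime."""
    physical_ms = tso >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)


# The log fragment quoted in this FAQ.
line = '["write save point"] [ts=411222863322546177]'
ts = checkpoint_ts(line)
print(ts, tso_to_datetime(ts).isoformat())
```

Comparing the decoded time with the current time gives a rough idea of how far the saved checkpoint lags behind.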
Checkpoint is saved in different ways for different types of downstream platforms:
- For MySQL/TiDB, it is saved in the `tidb_binlog.checkpoint` table.
- For Kafka/file, it is saved in the file of the corresponding configuration directory.

The data of Kafka/file contains `commit-ts`, so if the checkpoint is lost, you can check the latest `commit-ts` of the downstream data by consuming the latest data in the downstream.
Drainer reads the checkpoint when it starts. If Drainer cannot read the checkpoint, it uses the configured `initialCommitTs` as the start point of the initial replication.
How to redeploy Drainer on the new machine when Drainer fails and the data in the downstream remains?
If the data in the downstream is not affected, you can redeploy Drainer on the new machine as long as the data can be replicated from the corresponding checkpoint.
- If the checkpoint is not lost, perform the following steps:
    - Deploy and start a new Drainer (Drainer can read the checkpoint and resume replication).
    - Use binlogctl to change the state of the old Drainer to `offline`.
- If the checkpoint is lost, perform the following steps:
    - To deploy a new Drainer, obtain the `commit-ts` of the old Drainer and use it as the `initialCommitTs` of the new Drainer.
    - Use binlogctl to change the state of the old Drainer to `offline`.
- Clean up the cluster and restore a full backup.
- To restore the latest data of the backup file, use Reparo and set `start-tso` = {snapshot timestamp of the full backup + 1} and `stop-tso` = 0 (or you can specify a point in time).
How to redeploy Drainer when enabling `ignore-error` in Primary-Secondary replication triggers a critical error?
If a critical error is triggered when TiDB fails to write binlog after enabling `ignore-error`, TiDB stops writing binlog and binlog data loss occurs. To resume replication, perform the following steps:
- Stop the Drainer instance.
- Restart the `tidb-server` instance that triggered the critical error to resume writing binlog (TiDB does not write binlog to Pump after a critical error is triggered).
- Perform a full backup in the upstream.
- Clear the data in the downstream, including the `tidb_binlog.checkpoint` table.
- Restore the full backup to the downstream.
- Deploy Drainer and use `initialCommitTs` (set `initialCommitTs` to the snapshot timestamp of the full backup) as the start point of initial replication.
Refer to TiDB Binlog Cluster Operations to learn the description of the Pump or Drainer state and how to start and exit the process.
Pause a Pump or Drainer node when you need to temporarily stop the service. For example:
- Version upgrade

    Use the new binary to restart the service after the process is stopped.
- Server maintenance

    When the server needs downtime for maintenance, exit the process and restart the service after the maintenance is finished.
Close a Pump or Drainer node when you no longer need the service. For example:
- Pump scale-in

    If you no longer need as many Pump services, close some of them.
- Cancelling replication tasks

    If you no longer need to replicate data to a downstream database, close the corresponding Drainer node.
- Service migration

    If you need to migrate the service to another server, close the service and re-deploy it on the new server.
- Directly kill the process.

    Note: Do not use the `kill -9` command. Otherwise, the Pump or Drainer node cannot process signals.
- If the Pump or Drainer node runs in the foreground, pause it by pressing Ctrl+C.
- Use the `pause-pump` or `pause-drainer` command in binlogctl.
Can I use the `update-pump` or `update-drainer` command in binlogctl to pause the Pump or Drainer service?
No. The `update-pump` or `update-drainer` command directly modifies the state information saved in PD without notifying Pump or Drainer to perform the corresponding operation. Misusing the two commands can interrupt data replication and might even cause data loss.
Can I use the `update-pump` or `update-drainer` command in binlogctl to close the Pump or Drainer service?
No. The `update-pump` or `update-drainer` command directly modifies the state information saved in PD without notifying Pump or Drainer to perform the corresponding operation. Misusing the two commands interrupts data replication and might even cause data inconsistency. For example:
- When a Pump node runs normally or is in the `paused` state, if you use the `update-pump` command to set the Pump state to `offline`, the Drainer node stops pulling the binlog data from the `offline` Pump. In this situation, the newest binlog cannot be replicated to the Drainer node, causing data inconsistency between upstream and downstream.
- When a Drainer node runs normally, if you use the `update-drainer` command to set the Drainer state to `offline`, a newly started Pump node only notifies Drainer nodes in the `online` state. In this situation, the `offline` Drainer fails to pull the binlog data from the Pump node in time, causing data inconsistency between upstream and downstream.
In some abnormal situations, Pump fails to correctly maintain its state. In such cases, use the `update-pump` command to modify the state.
For example, when a Pump process exits abnormally (the process exits directly when a panic occurs, or it is mistakenly killed with the `kill -9` command), the Pump state information saved in PD is still `online`. In this situation, if you do not need to restart Pump to recover the service at the moment, use the `update-pump` command to update the Pump state to `paused`. This avoids interruptions when TiDB writes binlogs and Drainer pulls binlogs.
In some abnormal situations, the Drainer node fails to correctly maintain its state, which affects the replication task. In such cases, use the `update-drainer` command to modify the state.
For example, when a Drainer process exits abnormally (the process exits directly when a panic occurs, or it is mistakenly killed with the `kill -9` command), the Drainer state information saved in PD is still `online`. When a Pump node is started, it fails to notify the exited Drainer node (the `notify drainer ...` error), which causes the Pump node to fail. In this situation, use the `update-drainer` command to update the Drainer state to `paused` and restart the Pump node.
Currently, you can only use the `offline-pump` or `offline-drainer` command in binlogctl to close a Pump or Drainer node.
You can use the `update-pump` command to set the Pump state to `offline` in the following situations:
- When a Pump process exits abnormally and the service cannot be recovered, the replication task is interrupted. If you want to recover the replication and can accept the loss of some binlog data, use the `update-pump` command to set the Pump state to `offline`. Then, the Drainer node stops pulling binlog from the Pump node and continues replicating data.
- Some stale Pump nodes are left over from historical tasks. Their processes have exited and their services are no longer needed. In this case, use the `update-pump` command to set their state to `offline`.
For other situations, use the `offline-pump` command to close the Pump service, which is the regular process.
Warning:
Do not use the `update-pump` command unless you can tolerate binlog data loss and data inconsistency between upstream and downstream, or you no longer need the binlog data stored in the Pump node.
Can I use the `update-pump` command in binlogctl to set the Pump state to `offline` if I want to close a Pump node that is exited and set to `paused`?
When a Pump process has exited and the node is in the `paused` state, the downstream Drainer node might not have consumed all the binlog data in the node. Therefore, setting the state to `offline` risks data inconsistency between upstream and downstream. In this situation, restart the Pump and use the `offline-pump` command to close the Pump node.
Some stale Drainer nodes are left over from historical tasks. Their processes have exited and their services are no longer needed. In this case, use the `update-drainer` command to set their state to `offline`.
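The guidance in the preceding answers can be condensed into a small lookup. The sketch below is only an illustrative summary of this FAQ: the situation labels and the `recommend` function are hypothetical, and it returns binlogctl command names without their arguments.

```python
# Command choices summarized from the FAQ answers above. The situation
# labels are hypothetical wording chosen for this sketch.
BINLOGCTL_GUIDANCE = {
    ("pump", "pause a running node"): "pause-pump",
    ("drainer", "pause a running node"): "pause-drainer",
    ("pump", "close a running node"): "offline-pump",
    ("drainer", "close a running node"): "offline-drainer",
    ("pump", "exited abnormally, still online in PD"): "update-pump (state: paused)",
    ("drainer", "exited abnormally, still online in PD"): "update-drainer (state: paused)",
    ("pump", "stale or unrecoverable, data loss accepted"): "update-pump (state: offline)",
    ("drainer", "stale node no longer needed"): "update-drainer (state: offline)",
}


def recommend(node, situation):
    """Return the command this FAQ suggests for a node type and situation."""
    try:
        return BINLOGCTL_GUIDANCE[(node, situation)]
    except KeyError:
        raise ValueError(f"no guidance for {node!r} in situation {situation!r}") from None


print(recommend("pump", "close a running node"))
```

The `update-*` entries appear only for abnormal or stale nodes, which mirrors the warning that these commands change the state in PD without notifying the node.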
Can I use SQL operations such as `change pump` and `change drainer` to pause or close the Pump or Drainer service?
No. For more details on these SQL operations, refer to Use SQL statements to manage Pump or Drainer.
These SQL operations directly modify the state information saved in PD and are functionally equivalent to the `update-pump` and `update-drainer` commands in binlogctl. To pause or close the Pump or Drainer service, use the binlogctl tool.
What can I do when some DDL statements supported by the upstream database cause an error when executed in the downstream database?
To solve the problem, follow these steps:
- Check `drainer.log`. Search for `exec failed` to find the last failed DDL operation before the Drainer process exited.
- Change the DDL statement to a version that is compatible with the downstream. Perform this step manually in the downstream database.
- Check `drainer.log`. Search for the failed DDL operation and find the `commit-ts` of this operation. For example: ``[2020/05/21 09:51:58.019 +08:00] [INFO] [syncer.go:398] ["add ddl item to syncer, you can add this commit ts to `ignore-txn-commit-ts` to skip this ddl if needed"] [sql="ALTER TABLE `test` ADD INDEX (`index1`)"] ["commit ts"=416815754209656834]``.
- Modify the `drainer.toml` configuration file. Add the `commit-ts` to the `ignore-txn-commit-ts` item and restart the Drainer node.
TiDB fails to write to binlog and gets stuck, and `listener stopped, waiting for manual stop` appears in the log
In TiDB v3.0.12 and earlier versions, a binlog write failure causes TiDB to report a fatal error. TiDB does not automatically exit but only stops the service, which looks as if TiDB is stuck. You can see the `listener stopped, waiting for manual stop` error in the log.
You need to determine the specific cause of the binlog write failure. If the failure occurs because binlog is written slowly, consider scaling out Pump or increasing the timeout for writing binlog.
Since v3.0.13, the error-reporting logic is optimized. A binlog write failure causes the transaction execution to fail and TiDB returns an error, but TiDB does not get stuck.
This issue does not affect the downstream and replication logic.
When a binlog write fails or times out, TiDB retries writing the binlog to the next available Pump node until the write succeeds. Therefore, if the binlog write to a Pump node is slow and causes a TiDB timeout (15s by default), TiDB determines that the write has failed and tries to write to the next Pump node. If the binlog was in fact successfully written to the Pump node that caused the timeout, the same binlog is written to multiple Pump nodes. When Drainer processes the binlog, it automatically de-duplicates binlogs with the same TSO, so this duplicate write does not affect the downstream and replication logic.
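The effect of de-duplicating by TSO can be pictured with the minimal sketch below. It is not Drainer's actual implementation: the `Binlog` record and the `dedupe_by_ts` function are hypothetical, and the timestamps are placeholder values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binlog:
    commit_ts: int  # TSO of the transaction
    pump: str       # Pump node that stored this copy (placeholder name)
    payload: str    # placeholder for the binlog content


def dedupe_by_ts(binlogs):
    """Keep only the first binlog seen for each commit_ts, preserving order."""
    seen = set()
    unique = []
    for binlog in binlogs:
        if binlog.commit_ts in seen:
            continue  # same TSO already collected from another Pump
        seen.add(binlog.commit_ts)
        unique.append(binlog)
    return unique


# The second entry is the same transaction written again after a timeout.
collected = [
    Binlog(100, "pump-a", "txn-1"),
    Binlog(100, "pump-b", "txn-1"),
    Binlog(101, "pump-b", "txn-2"),
]
print([b.commit_ts for b in dedupe_by_ts(collected)])
```

Because the duplicate copies carry the same TSO, dropping all but one leaves the downstream with each transaction exactly once.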
Reparo is interrupted during the full and incremental restore process. Can I use the last TSO in the log to resume replication?
Yes. Reparo does not automatically enable the safe-mode when you start it. You need to perform the following steps manually:
- After Reparo is interrupted, record the last TSO in the log as `checkpoint-tso`.
- Modify the Reparo configuration file: set the configuration item `start-tso` to `checkpoint-tso + 1`, set `stop-tso` to `checkpoint-tso + 80,000,000,000` (approximately five minutes after the `checkpoint-tso`), and set `safe-mode` to `true`. Start Reparo. Reparo replicates data up to `stop-tso` and then stops automatically.
- After Reparo stops automatically, set `start-tso` to `checkpoint-tso + 80,000,000,001`, set `stop-tso` to `0`, and set `safe-mode` to `false`. Start Reparo to resume replication.
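The arithmetic in these steps can be checked with a short sketch. It assumes the usual TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits), under which 80,000,000,000 corresponds to roughly 305 seconds. The function names and the example `checkpoint-tso` value are hypothetical.

```python
# Assumption: the low 18 bits of a TSO are a logical counter, so shifting
# right by 18 bits yields milliseconds.
LOGICAL_BITS = 18
SAFE_WINDOW = 80_000_000_000  # offset used in the steps above


def resume_plan(checkpoint_tso, window=SAFE_WINDOW):
    """Return the Reparo settings for the safe-mode run and the normal run."""
    safe_run = {
        "start-tso": checkpoint_tso + 1,
        "stop-tso": checkpoint_tso + window,
        "safe-mode": True,
    }
    normal_run = {
        "start-tso": checkpoint_tso + window + 1,
        "stop-tso": 0,
        "safe-mode": False,
    }
    return safe_run, normal_run


def window_seconds(window=SAFE_WINDOW):
    """Approximate wall-clock length of a TSO offset, in seconds."""
    return (window >> LOGICAL_BITS) / 1000


# Placeholder value; use the last TSO from your own Reparo log.
checkpoint_tso = 416815754209656834
safe_run, normal_run = resume_plan(checkpoint_tso)
print(safe_run)
print(normal_run)
print(window_seconds())
```

The second run starts exactly one TSO after the first run stops, so no binlog is skipped or replayed between the two phases.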