This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

RestartReplicationQuick causing increase in replication lag #1308

Closed
gsraman opened this issue Feb 16, 2021 · 1 comment

Comments

@gsraman
Contributor

gsraman commented Feb 16, 2021

When UnreachableMasterWithLaggingReplicas is detected on the master, Orchestrator restarts both the SQL thread and the I/O thread on the replicas as part of its emergent action.

We noticed that stopping and starting the SQL thread on the replicas increases replication lag, because the transaction being applied has to be rolled back and re-applied from the start.

This behavior was introduced in #1010, which restarts the SQL thread as well; we believe that is what causes this issue.

func RestartReplicationQuick(instanceKey *InstanceKey) error {
	for _, cmd := range []string{`stop slave sql_thread`, `stop slave io_thread`, `start slave io_thread`, `start slave sql_thread`} {
		if _, err := ExecInstance(instanceKey, cmd); err != nil {
			return log.Errorf("%+v: RestartReplicationQuick: '%q' failed: %+v", *instanceKey, cmd, err)
		} else {
			log.Infof("%s on %+v as part of RestartReplicationQuick", cmd, *instanceKey)
		}
	}
	return nil
} 

Orchestrator would still be able to detect the "Too Many Connections" issue even if only the I/O thread of the replica is restarted.

@shlomi-noach As discussed, I will submit a PR that reverts the code to restart only the I/O thread.
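
For illustration only, here is a minimal sketch of what an I/O-thread-only restart could look like. It is not the code from the PR that closed this issue. The name `RestartIOThreadOnly`, the simplified `InstanceKey` struct, and the `execFunc` hook (a stand-in for `ExecInstance` so the sketch runs without a MySQL server) are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"log"
)

// InstanceKey is a simplified, hypothetical stand-in for orchestrator's
// InstanceKey type; only the fields needed for logging are included.
type InstanceKey struct {
	Hostname string
	Port     int
}

// execFunc is a hypothetical hook standing in for ExecInstance, so the
// sketch can be exercised without connecting to a real replica.
type execFunc func(key *InstanceKey, query string) error

// ioThreadRestartCommands deliberately leaves the SQL thread alone, so a
// transaction that is currently being applied is not rolled back.
var ioThreadRestartCommands = []string{
	"stop slave io_thread",
	"start slave io_thread",
}

// RestartIOThreadOnly (hypothetical name) stops and starts only the
// replication I/O thread, returning on the first failed command.
func RestartIOThreadOnly(key *InstanceKey, exec execFunc) error {
	for _, cmd := range ioThreadRestartCommands {
		if err := exec(key, cmd); err != nil {
			return fmt.Errorf("%+v: RestartIOThreadOnly: %q failed: %w", *key, cmd, err)
		}
		log.Printf("%s on %+v as part of RestartIOThreadOnly", cmd, *key)
	}
	return nil
}

func main() {
	key := &InstanceKey{Hostname: "replica-1.example.com", Port: 3306}

	// Record the issued statements instead of sending them to MySQL.
	var issued []string
	record := func(_ *InstanceKey, query string) error {
		issued = append(issued, query)
		return nil
	}

	if err := RestartIOThreadOnly(key, record); err != nil {
		log.Fatal(err)
	}
	fmt.Println(issued)
}
```

Passing the executor as a parameter is only a convenience for this sketch; in orchestrator itself the statements would go through `ExecInstance`, as in the function quoted above.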

@gsraman changed the title from "RestartReplicationQuick causing increase in replication delay" to "RestartReplicationQuick causing increase in replication lag" on Feb 16, 2021
@shlomi-noach
Collaborator

closed by #1309
