RestartReplicationQuick causing increase in replication lag #1308

gsraman · 2021-02-16T17:09:30Z

When UnreachableMasterWithLaggingReplicas is detected for a master, Orchestrator restarts both the SQL thread and the I/O thread on the replicas as part of its emergency action.

We noticed that stopping and starting the SQL thread on the replicas increases replication lag, because the transaction being applied has to be rolled back and re-applied from the start.

This behavior was introduced in #1010, which added the SQL thread restart; we believe that restart is causing this issue.

func RestartReplicationQuick(instanceKey *InstanceKey) error {
	for _, cmd := range []string{`stop slave sql_thread`, `stop slave io_thread`, `start slave io_thread`, `start slave sql_thread`} {
		if _, err := ExecInstance(instanceKey, cmd); err != nil {
			return log.Errorf("%+v: RestartReplicationQuick: '%q' failed: %+v", *instanceKey, cmd, err)
		} else {
			log.Infof("%s on %+v as part of RestartReplicationQuick", cmd, *instanceKey)
		}
	}
	return nil
}

Orchestrator would still be able to detect the "Too Many Connections" issue even if only the replica's I/O thread is restarted.

@shlomi-noach As discussed, I will submit a PR reverting the code to restart only the I/O thread.

shlomi-noach · 2021-02-17T11:07:30Z

closed by #1309

gsraman changed the title ~~RestartReplicationQuick causing increase in replication delay~~ RestartReplicationQuick causing increase in replication lag Feb 16, 2021

gsraman mentioned this issue Feb 16, 2021

Do not restart SQL thread in RestartReplicationQuick #1309

Merged

shlomi-noach closed this as completed Feb 17, 2021

shlomi-noach mentioned this issue Jun 8, 2021

Ensure to start SQL thread on recovery emergency operation #1366

Closed
