This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Slaves lagging by couple of hours are elected as master by orchestrator #83

Closed
adkhare opened this issue Feb 20, 2017 · 20 comments · Fixed by #1115

Comments

@adkhare

adkhare commented Feb 20, 2017

We are seeing instances where slaves with a couple of hours of lag are elected as masters. Is there any configuration to prevent that from happening?

@shlomi-noach

@sougou
Collaborator

sougou commented Feb 20, 2017

I believe the root cause was that the master reached its connection limit, and orchestrator couldn't connect to it. The master was otherwise healthy.
At the same time, the replicas were heavily lagged.

I'm thinking we could look at special-casing the connection limit error also.

@shlomi-noach
Collaborator

I believe the root cause was that the master reached its connection limit, and orchestrator couldn't connect to it

That shouldn't trigger a failover. orchestrator would fail over only when the master is inaccessible by orchestrator AND all of its replicas.

We are seeing instances where slaves with couple of hours of lag are elected as masters

Assuming there was a justified failover scenario in the first place, were there any replicas that were not lagging (or lagging by less than a "couple of hours")?

  • If so, what was the configuration for such replicas? log-bin? log-slave-updates?
  • If not, then promoting an hours-lagging replica isn't a good solution, but the least-evil one, and orchestrator did the right thing, by current design. We can certainly consider a configuration such as DoNotFailoverIfAllReplicasAreLaggingByMoreThanSeconds.
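
Purely as an illustration of the setting floated in the last bullet (the option name is hypothetical and did not exist at the time, and the 300-second threshold is just an example value), such a guard might be expressed in orchestrator's JSON config file along these lines:

    {
      "DoNotFailoverIfAllReplicasAreLaggingByMoreThanSeconds": 300
    }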

@adkhare
Author

adkhare commented Feb 21, 2017

@shlomi-noach we do have the following config for binlogs:
log-bin = /opt/data/vitess-data/vt_0102548541/bin-logs/vt-0102548541-bin
log-slave-updates

We have 2 replicas configured, and neither of them was completely caught up.

I would love to have a config to avoid electing lagging slaves as master, because that would definitely cause issues and eventually cause data loss.

@sougou please feel free to pitch in.

@adkhare
Author

adkhare commented Feb 21, 2017

Another option could be to do a failover and elect the slave nearest to the master, but not allow any writes to flow in until everything from the relay logs is caught up.

@sougou
Collaborator

sougou commented Feb 21, 2017

If connection limit couldn't have triggered the failover, we need to find out what did. Is there some log we can check to see what caused it?

I don't think a replica would be reparented until relay logs are caught up. This raises another question: I assume you have semi-sync turned on. This means that at least one of the replicas must have had all the transactions from the master.

The other question is: could the master have been network-partitioned? If so, commits should have failed or at least timed out during that time. Did you see anything like that?

@shlomi-noach
Collaborator

we need to find out what did. Is there some log we can check to see what caused it?

@sougou Looking at the GUI, check out Audit -> Failure Detection, then click the "i" (information) icon on the relevant detection; it will show you the sequence of events leading up to the recovery. For example, you'd see:

Changelog :

    2017-02-20 01:30:54 DeadMaster,
    2017-02-20 01:30:49 UnreachableMaster,
    2017-02-20 01:16:04 NoProblem,
...

What do you see on your dashboard?

Logs-wise, sure: find /var/log/orchestrator.log on the active orchestrator box at the time; grep topology_recovery /var/log/orchestrator.log will give you most of the relevant lines, and otherwise just look at the entire log around the time of recovery. I can help diagnose it if you like.

@sougou
Collaborator

sougou commented Feb 22, 2017

Amit said he's going on vacation soon. Let's hope he can check this before he leaves.

@adkhare
Author

adkhare commented Feb 22, 2017

@shlomi-noach it was DeadMaster. I have turned off orchestrator for now, since it was creating data loss issues for us.

@sjmudd
Collaborator

sjmudd commented Feb 24, 2017

I think I have bumped into this in the past, though very infrequently and would agree the current behaviour may not be ideal.

As far as I'm aware, orchestrator doesn't check whether the "downstream slaves" have read and applied all their relay logs. In the situation you mention, if for some reason there is replication delay and there are unprocessed relay logs, it would seem wise to wait for this to catch up (to a point) prior to completing failover, or possibly leave this to the DBA to handle manually (not ideal). I think it's been assumed the most up-to-date slave was fine and had probably processed all relay logs.

So checking for this situation seems like a wise thing to do; deciding how to handle it would be the next step.

@shlomi-noach
Collaborator

I've discussed this internally with my team, and none of us see a "correct" behavior.

Again, the scenario:

  • Master dies
  • All replicas are lagging by hours

options:

  • Failover to most up-to-date replica -- data loss
  • Failover to most up-to-date replica and block writes until it catches up -- outage for potentially hours
  • Do nothing -- outage

I don't see a correct behavior on orchestrator's side, and in this particular case we think it's correct to throw the problem back at the user:

  • Use recovery hooks to make an informed decision
  • Use PreFailoverProcesses to sleep until replicas catch up with their relay logs
  • Unset read_only by yourself, based on your own logic
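
A minimal sketch of the hook-based approach, assuming a site-specific script (wait-for-relay-logs.sh is hypothetical) that blocks, up to some timeout, until the promotion candidate has applied its relay logs. As I understand PreFailoverProcesses semantics, a non-zero exit code from any of these hooks aborts the recovery, so the same script could also veto the promotion outright; {failureCluster} and {failedHost} are assumed to be the standard hook placeholders:

    {
      "PreFailoverProcesses": [
        "/usr/local/bin/wait-for-relay-logs.sh {failureCluster} {failedHost}"
      ]
    }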

@sougou
Collaborator

sougou commented Feb 24, 2017

The data loss option will probably be universally unpopular :). Waiting for a replica to catch up sounds better than doing nothing.

Throwing the error back at the user while waiting for the replicas would be ideal because they get the notification, which allows them to intervene if necessary.

@shlomi-noach
Collaborator

The data loss option will probably be universally unpopular

Let me think about the idea of making this configurable in orchestrator.

@bbeaudreault

As someone who's currently integrating Vitess into a rather large mysql deployment and looking to leverage Orchestrator for failover, I've been watching this thread. I have to agree with @sougou that for us data loss is unforgivable. Admittedly this whole scenario would be unlikely to happen in our case, as we would have been paged for slave lag before it got too far. But we currently get paged for master failures, since we don't have automatic failover. So it's already an outage, for us. Orchestrator will cut down on the vast majority of our pages, so I'd be more than happy to get paged and have an outage rather than risk any data loss, in this case.

I think for us, we'd prefer to have orchestrator do nothing if it can't find a healthy replica. But I could see a case for the other option, so making it configurable would be nice.

@sjmudd
Collaborator

sjmudd commented Feb 24, 2017

Shlomi: I think we agree here that the situation is not ideal and probably needs some preferred behaviour, something like: do_nothing (leaves stuff broken, so not good, but orchestrator could alert you via the hooks), failover_fast (and lose data, which might not be good at all), failover_after_catchup_completes (so you have "extended" downtime with no master), or maybe even call_out_to_external_hook_to_decide.

Also bear in mind that one answer to all monitored replication chains/clusters may not be optimal.

The other thing to think of is "how many seconds" of delay triggers these options. Again, it may depend.

I have quite a large deployment and have seen this maybe once or twice, but mainly because delayed slaves just don't happen here, this sort of situation has come up very infrequently.

This will be a nice issue to resolve cleanly.

@shlomi-noach
Collaborator

shlomi-noach commented Mar 20, 2017

So it's already an outage, for us. Orchestrator will cut down on the vast majority of our pages, so I'd be more than happy to get paged and have an outage rather than risk any data loss, in this case.

I'm going to go with this flow. Given some threshold, if orchestrator sees/thinks (to be determined just how) that the best candidate for promotion is still very much behind, it will detect the problem but avoid promotion.

Also bear in mind that one answer to all monitored replication chains/clusters may not be optimal.

Agreed. However: if I make this configurable, it may have a domino effect on other configurations/behavior, and then we'd have even more potential variations of behavior. orchestrator already has many variations of behavior based on topology status, and I don't want to further multiply that number.

@shlomi-noach
Collaborator

addressed by #104

@shlomi-noach
Collaborator

@adkhare @sougou https://github.com/github/orchestrator/releases/tag/v2.1.1-BETA should solve your case; you would need to set:

  "FailMasterPromotionIfSQLThreadNotUpToDate": true,

in orchestrator's config file.

Are you able to, and interested in, testing this?

@arthurnn

arthurnn commented Sep 29, 2017

Is this done? Should we close it?

@shlomi-noach
Collaborator

yep.

@shlomi-noach
Collaborator

#104 was not the only solution. #1115 complements #104 in supporting a new variable, FailMasterPromotionOnLagMinutes.
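
For reference, a combined config that refuses to promote a candidate whose SQL thread is not up to date, or which lags beyond a threshold, might look like the sketch below; the 10-minute value is only an example:

    {
      "FailMasterPromotionIfSQLThreadNotUpToDate": true,
      "FailMasterPromotionOnLagMinutes": 10
    }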
