[2pt] Revise healthcheck predicates #1139

Closed
rosik opened this issue Nov 25, 2020 · 3 comments · Fixed by #1687 or #1731

rosik commented Nov 25, 2020

Cartridge currently uses several different health check predicates.

Eventual failover

    local member = membership.get_member(server.uri)
    if member ~= nil
    and (member.status == 'alive' or member.status == 'suspect')
    and member.payload.uuid == instance_uuid
    and (
        member.payload.state == 'ConfiguringRoles' or
        member.payload.state == 'RolesConfigured'
    ) then
        appointments[replicaset_uuid] = instance_uuid
        break
    end

Stateful failover

    local server = vars.topology_cfg.servers[instance_uuid]
    if server == nil or not topology.not_disabled(instance_uuid, server) then
        return false
    end
    local member = members[server.uri]
    if member ~= nil
    and (member.status == 'alive' or member.status == 'suspect')
    and (member.payload.uuid == instance_uuid)
    then
        return true
    end
    return false

RPC

    local function member_is_healthy(uri, instance_uuid)
        local member = membership.get_member(uri)
        return (
            (member ~= nil)
            and (member.status == 'alive')
            and (member.payload.uuid == instance_uuid)
            and (
                member.payload.state == 'ConfiguringRoles' or
                member.payload.state == 'RolesConfigured'
            )
        )
    end

I see two potential problems here:

  1. Neither eventual nor stateful failover is triggered until a 'suspect' member becomes dead, but RPC already considers such a member unhealthy. As a result, get_active_leaders may return a suspect leader, and the RPC call would prematurely return the error "No remotes with role %q available".
  2. Stateful failover isn't triggered for OperationError, but eventual failover is. This seems slightly inconsistent.
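
To make the two mismatches easier to see, here is a minimal standalone sketch of the three predicates. It is an illustration only, not the Cartridge code: the helper names (`state_ok`, `status_ok`, `eventual_healthy`, `stateful_healthy`, `rpc_healthy`) and the sample members are hypothetical, the predicates take a plain member table instead of calling `membership.get_member`, and the disabled-server check from stateful failover is left out.

```lua
-- Hypothetical helpers mirroring the conditions quoted above.
local function state_ok(member)
    local state = member.payload.state
    return state == 'ConfiguringRoles' or state == 'RolesConfigured'
end

local function status_ok(member, allow_suspect)
    return member.status == 'alive'
        or (allow_suspect and member.status == 'suspect')
end

-- Eventual failover: alive or suspect, matching uuid, good state.
local function eventual_healthy(member, instance_uuid)
    return member ~= nil
        and status_ok(member, true)
        and member.payload.uuid == instance_uuid
        and state_ok(member)
end

-- Stateful failover: alive or suspect, matching uuid, state ignored.
-- (The disabled-server check is omitted in this sketch.)
local function stateful_healthy(member, instance_uuid)
    return member ~= nil
        and status_ok(member, true)
        and member.payload.uuid == instance_uuid
end

-- RPC: alive only, matching uuid, good state.
local function rpc_healthy(member, instance_uuid)
    return member ~= nil
        and status_ok(member, false)
        and member.payload.uuid == instance_uuid
        and state_ok(member)
end

-- Sample members (made-up uuid).
local suspect_leader = {
    status = 'suspect',
    payload = {uuid = 'aaaa', state = 'RolesConfigured'},
}
local failed_leader = {
    status = 'alive',
    payload = {uuid = 'aaaa', state = 'OperationError'},
}

-- Problem 1: both failovers keep the suspect leader, RPC rejects it.
print(eventual_healthy(suspect_leader, 'aaaa'))  -- true
print(stateful_healthy(suspect_leader, 'aaaa'))  -- true
print(rpc_healthy(suspect_leader, 'aaaa'))       -- false

-- Problem 2: only stateful failover keeps a leader in OperationError.
print(eventual_healthy(failed_leader, 'aaaa'))   -- false
print(stateful_healthy(failed_leader, 'aaaa'))   -- true
```

In the first case the leader is never replaced, yet RPC refuses to call it. In the second case the two failover modes disagree about whether the leader should be replaced.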
rosik commented Nov 26, 2020

After a brief discussion with @olegrok and @mtrempoltsev, we agreed that RPC should treat 'suspect' members as healthy.
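
One way the agreed change could look, as a rough sketch rather than the actual patch from the linked fixes: the RPC predicate from the issue description with 'suspect' accepted alongside 'alive'. The `get_member` parameter and the in-memory `members` table are assumptions added so the example runs without the membership module; in Cartridge the lookup would presumably stay `membership.get_member(uri)`.

```lua
-- Sketch only: `get_member` is injected for testability (an assumption).
local function member_is_healthy(get_member, uri, instance_uuid)
    local member = get_member(uri)
    return (
        (member ~= nil)
        -- Changed condition: 'suspect' is now accepted as well.
        and (member.status == 'alive' or member.status == 'suspect')
        and (member.payload.uuid == instance_uuid)
        and (
            member.payload.state == 'ConfiguringRoles' or
            member.payload.state == 'RolesConfigured'
        )
    )
end

-- Hypothetical in-memory member list used only for illustration.
local members = {
    ['localhost:3301'] = {
        status = 'suspect',
        payload = {uuid = 'aaaa', state = 'RolesConfigured'},
    },
    ['localhost:3302'] = {
        status = 'dead',
        payload = {uuid = 'bbbb', state = 'RolesConfigured'},
    },
}
local function get_member(uri)
    return members[uri]
end

print(member_is_healthy(get_member, 'localhost:3301', 'aaaa'))  -- true
print(member_is_healthy(get_member, 'localhost:3302', 'bbbb'))  -- false
```

With this condition RPC and both failover modes agree on the status check, so a leader returned by get_active_leaders is not rejected by RPC merely for being suspect.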

rosik commented Nov 26, 2020

I also think that OperationError shouldn't be considered unhealthy, even if that sounds silly. An RPC DoS and additional config applications are even worse, given that OperationError is the result of an apply_config (or init) failure.

rosik changed the title from "Revise healthcheck predicates" to "[2pt] Revise healthcheck predicates" on Dec 3, 2020
rosik added this to the Q4 2020 milestone on Dec 4, 2020
rosik removed this from the Q4 2020 milestone on Dec 22, 2020
sharonovd commented

Dmitry Sharonov, [15.06.21 14:00]
I don't understand why the failover didn't trigger

Dmitry Sharonov, [15.06.21 14:00]
yes, the master died while the config was being applied

Yaroslav Dynnikov, [15.06.21 14:10]
#1139

Yes, stateful failover isn't triggered on OperationError.
