
Fix rpc.get_candidates function #1575

Closed
hackallcode opened this issue Oct 15, 2021 · 6 comments · Fixed by #1724
Labels
bug (Something isn't working), customer

Comments

@hackallcode
Contributor

hackallcode commented Oct 15, 2021

local log = require('log')
local rpc = require('cartridge.rpc')

local _, err = rpc.call('my_role', ...)
local candidates = rpc.get_candidates('my_role')
log.info(err)
log.info(candidates)

Sometimes the code above may log both ["localhost:11002"] and RemoteCallError: "localhost:11002": Role "my_role" unavailable at the same time. This means that rpc.get_candidates returns instances on which the role is not actually available, i.e. nothing can be done on them. So rpc.get_candidates behaves incorrectly.

This seems to be caused by the slow writing of the configuration backup to disk. As a result, we may end up in a situation where all instances are in the RolesConfigured state, but the config has not yet begun to be applied on some of them (localhost:11002 in the example).

To reproduce this error, you can:

  • create 2 instances (on ports 11001 and 11002);

  • assign my_role only to the second instance;

  • insert the following code (to artificially delay config apply on the second instance):

    if confapplier.get_advertise_uri() == 'localhost:11002' then
        local fiber = require('fiber')
        fiber.sleep(3)
    end

    at line 135 of the cartridge/twophase.lua file, i.e. inside the commit_2pc function:

    local function commit_2pc()
        Commit2pcError:assert(
            vars.prepared_config ~= nil,
            "commit isn't prepared"
        )
        local workdir = confapplier.get_workdir()

@yngvar-antonsson added the teamS Scaling and bug labels Oct 15, 2021
@Steap2448 self-assigned this Oct 28, 2021
@Steap2448
Contributor

Steap2448 commented Nov 18, 2021

The original problem was get_candidates (called on A1) returning instances not suitable for the requested role (B1). For example, this occurs when the two-phase commit on B1 is stuck in the commit phase and the new role has not been applied on it yet. A1 has no idea that B1 hasn't applied the new config and, judging from its local topology, rightfully thinks that B1 is a good candidate for a role that was previously assigned to it. The root of the problem is that the instances have different applied configs.

Currently there is no reliable way to know which config is applied on different replicas. All solutions boil down to increasing the chances of success; there is no guarantee.

@olegrok suggests using the topology from the applied configuration and specifying the uri manually. See #1588 for details.
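
A minimal sketch of that idea (not the #1588 patch itself); it assumes the uri option of rpc.call available in cartridge 2.x, and my_role / my_function are placeholder names:

local rpc = require('cartridge.rpc')

-- enumerate candidates and pin each call to a concrete uri,
-- falling back to the next candidate if the role is unavailable there
local candidates = rpc.get_candidates('my_role')
local result, err
for _, uri in ipairs(candidates) do
    result, err = rpc.call('my_role', 'my_function', nil, {uri = uri})
    if err == nil then
        break
    end
end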

Also consider using retries, and be vigilant.
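
For instance, a simple retry wrapper might look like the sketch below (retry_rpc_call, the attempt count, and the delay are hypothetical choices, not cartridge API):

local fiber = require('fiber')
local rpc = require('cartridge.rpc')

-- hypothetical helper: retry an rpc.call a few times, because a
-- candidate may not have finished applying the new config yet
local function retry_rpc_call(role, fn, args, attempts, delay)
    local res, err
    for _ = 1, attempts do
        res, err = rpc.call(role, fn, args)
        if err == nil then
            return res
        end
        fiber.sleep(delay)
    end
    return nil, err
end

-- usage: tolerate a short window of "Role unavailable" errors
local res, err = retry_rpc_call('my_role', 'my_function', {}, 5, 0.1)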

@artur-barsegyan
Contributor

What is the reason for that triage?

@yngvar-antonsson
Collaborator

You can't rely on the result of rpc.get_candidates: even right after a get_candidates call you can lose one of the candidates. We can't fix this issue for now, so I'm closing it.
To avoid this bug you can:

  • use stateful failover to make your cluster more reliable
  • implement the patch suggested by @Steap2448
  • get rid of rpc module calls (see the sketch after this list)
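
For the last option, a rough sketch of a direct call through cartridge.pool instead of rpc (the uri and function name are placeholders):

local log = require('log')
local pool = require('cartridge.pool')

-- connect to a concrete instance instead of letting rpc pick one;
-- pool.connect returns a net.box connection from cartridge's pool
local conn, err = pool.connect('localhost:11002')
if conn == nil then
    log.error(err)
else
    -- net.box call raises on error; wrap it in pcall in real code
    local result = conn:call('my_function', {})
end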

@hackallcode
Contributor Author

You're wrong! In our case, all instances are alive and in the same state when both functions are called. It's just that the get_candidates function does not make a strict enough check. So stateful failover and getting rid of rpc module calls will not help.

@hackallcode reopened this Dec 17, 2021
@filonenko-mikhail
Contributor

get_candidates does not make a strict enough check

What kind of checks do you mean?

@hackallcode
Contributor Author

What kind of checks do you mean?

I don't remember where exactly this check is, but if you perform the same actions as in the issue, you will get the error message generated by this check.

@filonenko-mikhail added the teamX label and removed the teamS Scaling label Jan 12, 2022
olegrok added a commit that referenced this issue Jan 23, 2022
There is a race condition. We could commit the config locally while
it is still in progress on some instance. Before this patch the user
got an unexpected "Role X unavailable" from an instance where such a
role was assumed.
The solution is an optimistic approach: detect a config apply in
progress and try to wait until it is finished.

Closes #1575
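
A rough sketch of that optimistic approach (not the actual code from #1724); it assumes the get_state / wish_state helpers of cartridge.confapplier and the state names used in cartridge 2.x:

local confapplier = require('cartridge.confapplier')

-- before reporting a role as unavailable, give an in-progress
-- config apply a chance to finish
local function wait_for_config_applied(timeout)
    local state = confapplier.get_state()
    if state == 'ConfiguringRoles' then
        -- wish_state blocks until the desired state is reached or
        -- the timeout expires, and returns the resulting state
        state = confapplier.wish_state('RolesConfigured', timeout)
    end
    return state == 'RolesConfigured'
end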
olegrok added a commit that referenced this issue Jan 23, 2022
olegrok added a commit that referenced this issue Jan 29, 2022
olegrok added a commit that referenced this issue Jan 29, 2022
olegrok added a commit that referenced this issue Jan 31, 2022
filonenko-mikhail pushed a commit that referenced this issue Jan 31, 2022