
Fix rpc.get_candidates function #1575

Closed
hackallcode opened this issue Oct 15, 2021 · 6 comments · Fixed by #1724
Labels
bug (Something isn't working), customer

Comments

@hackallcode
Contributor

hackallcode commented Oct 15, 2021

local log = require('log')
local rpc = require('cartridge.rpc')

local _, err = rpc.call('my_role', ...)
local candidates = rpc.get_candidates('my_role')
log.info(err)
log.info(candidates)

Sometimes the code above may log both ["localhost:11002"] and RemoteCallError: "localhost:11002": Role "my_role" unavailable at the same time. This means that rpc.get_candidates returns instances on which the role is not actually available, i.e. nothing can be done on them. So rpc.get_candidates behaves incorrectly.

This seems to be caused by the slow writing of the configuration backup to disk. As a result, we may end up in a situation where all instances are in the RolesConfigured state, but the config has not yet begun to be applied on some of them (localhost:11002 in the example).

To reproduce this error, you can:

  • create 2 instances (on ports 11001 and 11002);

  • assign my_role only to the second instance;

  • insert the following code (to artificially delay config apply on the second instance):

    if confapplier.get_advertise_uri() == 'localhost:11002' then
        local fiber = require('fiber')
        fiber.sleep(3)
    end

    at line 135 of the cartridge/twophase.lua file, i.e. inside the commit_2pc function:

    local function commit_2pc()
        Commit2pcError:assert(
            vars.prepared_config ~= nil,
            "commit isn't prepared"
        )
        local workdir = confapplier.get_workdir()

@yngvar-antonsson added the teamS Scaling and bug labels Oct 15, 2021
@Steap2448 self-assigned this Oct 28, 2021
@Steap2448
Contributor

Steap2448 commented Nov 18, 2021

The original problem was get_candidates (called on A1) returning instances not suitable for the requested role (B1). For example, this occurs when the two-phase commit on B1 is stuck in the commit phase and the new role has not been applied on it yet. A1 has no idea that B1 hasn't applied the new config and, judging from its local topology, rightfully thinks that B1 is a good candidate for a role that was previously assigned to it. The root of the problem is that the instances have different applied configs.

Currently there is no reliable way to know which config is applied on different replicas. All solutions boil down to increasing the chances of success; there is no guarantee.

@olegrok suggests using the topology from the applied configuration and specifying the uri manually. See #1588 for details.
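
A minimal sketch of that idea (not the #1588 patch itself); it assumes the uri option of rpc.call available in cartridge 2.x, and my_role / my_function are placeholder names:

local rpc = require('cartridge.rpc')

-- enumerate candidates and pin each call to a concrete uri,
-- falling back to the next candidate if the role is unavailable there
local candidates = rpc.get_candidates('my_role')
local result, err
for _, uri in ipairs(candidates) do
    result, err = rpc.call('my_role', 'my_function', nil, {uri = uri})
    if err == nil then
        break
    end
end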

Also consider using retries, and be vigilant.
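
For instance, a simple retry wrapper might look like the sketch below (retry_rpc_call, the attempt count, and the delay are hypothetical choices, not cartridge API):

local fiber = require('fiber')
local rpc = require('cartridge.rpc')

-- hypothetical helper: retry an rpc.call a few times, because a
-- candidate may not have finished applying the new config yet
local function retry_rpc_call(role, fn, args, attempts, delay)
    local res, err
    for _ = 1, attempts do
        res, err = rpc.call(role, fn, args)
        if err == nil then
            return res
        end
        fiber.sleep(delay)
    end
    return nil, err
end

-- usage: tolerate a short window of "Role unavailable" errors
local res, err = retry_rpc_call('my_role', 'my_function', {}, 5, 0.1)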

@artur-barsegyan
Contributor

What is the reason for that triage?

@yngvar-antonsson
Collaborator

You can't rely on the result of rpc.get_candidates: even right after a get_candidates call you can lose one of the candidates. We can't fix this issue for now, so I'm closing it.
To avoid this bug you can:

  • use stateful failover to make your cluster more reliable
  • implement the patch suggested by @Steap2448
  • get rid of rpc module calls (see the sketch after this list)
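
For the last option, a rough sketch of a direct call through cartridge.pool instead of rpc (the uri and function name are placeholders):

local log = require('log')
local pool = require('cartridge.pool')

-- connect to a concrete instance instead of letting rpc pick one;
-- pool.connect returns a net.box connection from cartridge's pool
local conn, err = pool.connect('localhost:11002')
if conn == nil then
    log.error(err)
else
    -- net.box call raises on error; wrap it in pcall in real code
    local result = conn:call('my_function', {})
end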

@hackallcode
Contributor Author

You're wrong! In our case, all instances are alive and in the same state when both functions are called. It's just that the get_candidates function does not make a strict enough check. So stateful failover and getting rid of rpc module calls will not help.

@hackallcode reopened this Dec 17, 2021
@filonenko-mikhail
Contributor

get_candidates does not make a strict enough check

What kind of checks do you mean?

@hackallcode
Contributor Author

What kind of checks do you mean?

I don't remember where exactly this check is, but if you perform the same actions as in the issue, you will get the error message generated by this check.

@filonenko-mikhail added the teamX label and removed the teamS Scaling label Jan 12, 2022
olegrok added a commit that referenced this issue Jan 23, 2022
There is a race condition. We could commit the config locally while
it is still in progress on some instance. Before this patch the user
got an unexpected "Role X unavailable" from an instance where such a
role was assumed.
The solution is an optimistic approach: detect a config apply in
progress and try to wait until it is finished.

Closes #1575
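
A rough sketch of that optimistic approach (not the actual code from #1724); it assumes the get_state / wish_state helpers of cartridge.confapplier and the state names used in cartridge 2.x:

local confapplier = require('cartridge.confapplier')

-- before reporting a role as unavailable, give an in-progress
-- config apply a chance to finish
local function wait_for_config_applied(timeout)
    local state = confapplier.get_state()
    if state == 'ConfiguringRoles' then
        -- wish_state blocks until the desired state is reached or
        -- the timeout expires, and returns the resulting state
        state = confapplier.wish_state('RolesConfigured', timeout)
    end
    return state == 'RolesConfigured'
end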
olegrok added a commit that referenced this issue Jan 23, 2022
olegrok added a commit that referenced this issue Jan 29, 2022
olegrok added a commit that referenced this issue Jan 29, 2022
olegrok added a commit that referenced this issue Jan 31, 2022
filonenko-mikhail pushed a commit that referenced this issue Jan 31, 2022