Trigger manual failover on SIGTERM / shutdown to cluster primary #1091

Open
wants to merge 1 commit into base: unstable

Conversation

enjoy-binbin
Member

When a primary disappears, its slots are not served until an automatic
failover happens, which takes roughly the node timeout plus a few seconds.
That is too long a window in which writes cannot be accepted.

If the host machine is about to shut down for any reason, the processes
typically receive a SIGTERM and have some time to shut down gracefully. In
Kubernetes, this grace period is 30 seconds by default.

When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a failover
to one of its replicas as part of the graceful shutdown. This reduces the
unavailability window: normally a replica has to detect the primary failure
within the node timeout before initiating an election, but now it can
initiate the election immediately, win it, and gossip the result.

This closes #939.

Signed-off-by: Binbin <binloveplay1314@qq.com>
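
For context, the trigger is a plain CLUSTER FAILOVER FORCE command pushed to the chosen replica over its connection. The standalone sketch below only illustrates the RESP framing used in the patch (as quoted in the review comments); it is not the server code itself, which writes this buffer to the replica's connection with connWrite().

/* Standalone illustration of the RESP payload used in the patch; not the
 * actual server code. Build: cc -o resp_demo resp_demo.c && ./resp_demo */
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "CLUSTER FAILOVER FORCE" as a RESP array of three bulk strings. */
    const char *buf = "*3\r\n$7\r\nCLUSTER\r\n$8\r\nFAILOVER\r\n$5\r\nFORCE\r\n";

    /* In the patch this buffer is written to best_replica->conn via
     * connWrite(); here we only print its length and contents. */
    printf("payload is %zu bytes:\n%s", strlen(buf), buf);
    return 0;
}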
@enjoy-binbin added the run-extra-tests label (Run extra tests on this PR: runs all tests from daily except valgrind and RESP) on Sep 30, 2024

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.50%. Comparing base (bb57dfe) to head (6ab8888).
Report is 5 commits behind head on unstable.

Files with missing lines    Patch %    Lines
src/server.c                91.66%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1091      +/-   ##
============================================
- Coverage     70.61%   70.50%   -0.11%     
============================================
  Files           114      114              
  Lines         61694    61714      +20     
============================================
- Hits          43564    43514      -50     
- Misses        18130    18200      +70     
Files with missing lines    Coverage           Δ
src/cluster_legacy.c        86.07% <100.00%>   (-0.10%) ⬇️
src/config.c                78.69% <ø>         (ø)
src/server.h                100.00% <ø>        (ø)
src/server.c                88.64% <91.66%>    (-0.03%) ⬇️

... and 10 files with indirect coverage changes

@zuiderkwast (Contributor) left a comment
Nice! Thanks for doing this.

The PR description can be updated to explain the solution. Now it is just copy-pasted from the issue. :)

I'm thinking that doing the failover in finishShutdown() may be too late. finishShutdown() is only called once all replicas have a replication offset equal to the primary's (checked by isReadyToShutdown()), or after a timeout (10 seconds). If one replica is very slow, it will delay the failover. I think we can do the manual failover earlier.

This is the sequence:

  1. SHUTDOWN or SIGTERM calls prepareForShutdown(). Here, clients are paused for writes and we start waiting for the replicas' offsets.
  2. In serverCron(), we check isReadyToShutdown(), which checks whether all replicas have repl_ack_off == primary_repl_offset. If yes, finishShutdown() is called; otherwise we wait some more.
  3. finishShutdown() runs.

I think we can send CLUSTER FAILOVER FORCE to the first replica that has repl_ack_off == primary_repl_offset. We could do it in isReadyToShutdown(). (We could rename that function to indicate that it does more than check readiness.) Then we also wait for the replica to send its failover auth request, and for the primary to cast its vote, before isReadyToShutdown() returns true.

What do you think?
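
Roughly like this, as a sketch only; the shutdown_failover_* fields are made up for illustration and error handling is omitted:

/* Hypothetical sketch of the suggested flow inside isReadyToShutdown():
 * once a replica is fully caught up, send CLUSTER FAILOVER FORCE to it,
 * then keep reporting "not ready" until the primary has also voted on the
 * replica's failover auth request. */
if (replica->repl_ack_off == server.primary_repl_offset) {
    if (server.auto_failover_on_shutdown && server.cluster_enabled &&
        !server.shutdown_failover_sent) {                  /* hypothetical flag */
        const char *buf = "*3\r\n$7\r\nCLUSTER\r\n$8\r\nFAILOVER\r\n$5\r\nFORCE\r\n";
        if (connWrite(replica->conn, buf, strlen(buf)) == (int)strlen(buf))
            server.shutdown_failover_sent = 1;             /* hypothetical flag */
    }
    /* Ready only once we have granted our vote to that replica. */
    return server.shutdown_failover_voted;                 /* hypothetical flag */
}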

if (server.auto_failover_on_shutdown && server.cluster_enabled && best_replica) {
    /* Sending a CLUSTER FAILOVER FORCE to the best replica. */
    const char *buf = "*3\r\n$7\r\nCLUSTER\r\n$8\r\nFAILOVER\r\n$5\r\nFORCE\r\n";
    if (connWrite(best_replica->conn, buf, strlen(buf)) == (int)strlen(buf)) {
Contributor
This is in finishShutdown(), just before the primary does exit().

Is there a risk that the written command has not been fully sent to the replica by the time the primary exits? If we can't rule that out, can we do it in an earlier shutdown stage and make sure the replica has received the command? The replica doesn't send any OK reply to the primary on the replication stream, but maybe the primary could wait for the failover auth request from the replica on the cluster bus before it shuts down?

Btw, maybe this code should be in clusterHandleServerShutdown(), which is called a few lines below:

/* Handle cluster-related matters when shutdown. */
if (server.cluster_enabled) clusterHandleServerShutdown();

Comment on lines +4440 to +4444
    }

    if (server.auto_failover_on_shutdown && server.cluster_enabled && !best_replica) {
        serverLog(LL_WARNING, "Unable to find a replica to perform an auto failover on shutdown.");
    }
Contributor
Probably this should be an else to the above, so we avoid repeating the same logic.
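
For example (a sketch that just folds the two quoted hunks together; the elided success path is unchanged from the patch):

if (server.auto_failover_on_shutdown && server.cluster_enabled) {
    if (best_replica) {
        /* Sending a CLUSTER FAILOVER FORCE to the best replica. */
        const char *buf = "*3\r\n$7\r\nCLUSTER\r\n$8\r\nFAILOVER\r\n$5\r\nFORCE\r\n";
        if (connWrite(best_replica->conn, buf, strlen(buf)) == (int)strlen(buf)) {
            /* ... success path as in the patch ... */
        }
    } else {
        serverLog(LL_WARNING, "Unable to find a replica to perform an auto failover on shutdown.");
    }
}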

Comment on lines +8 to +12

proc test_main {how} {
    test "auto-failover-on-shutdown will always pick a best replica and send CLUSTER FAILOVER - $how" {
        set primary [srv 0 client]
        set replica1 [srv -3 client]
Contributor

This function requires some nodes to be primaries and some to be replicas. Please add a comment describing how the cluster needs to be set up for this function to work.

Comment on lines +32 to +34
# Wait for the replica2 to become a primary.
wait_for_condition 1000 50 {
    [s -6 role] eq {master}
Contributor

Can we make sure we wait a shorter time than the node timeout, so we actually test that this is faster than an automatic failover? This is 50 seconds (1000 retries × 50 ms), right? We could maybe reduce it to 10 seconds and set the nodes' node timeout to, for example, 20 seconds when we start them.

Comment on lines +4599 to +4603
/* todo: see if this is needed. */
/* This is a failover triggered by my primary, let's count its vote. */
if (server.cluster->mf_is_primary_failover) {
    server.cluster->failover_auth_count++;
}
Contributor

It's not great that we add conditional logic to how we count the votes; it makes the voting algorithm more complex to analyze.

I understand we need this special case if the primary exits immediately after sending CLUSTER FAILOVER FORCE. Maybe we can avoid it if the primary waits for the failover auth request and actually votes before it shuts down?
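
One way to picture that alternative (a hypothetical sketch only, reusing the made-up shutdown_failover_voted flag from the earlier sketch): keep the normal vote path and merely record that the shutting-down primary has voted, so the replica's tally needs no special case.

/* Hypothetical: where the primary grants the vote (around
 * clusterSendFailoverAuth()), note that it has voted so that
 * isReadyToShutdown() can finally return true; the
 * failover_auth_count++ special case above then becomes unnecessary. */
clusterSendFailoverAuth(node);
if (server.shutdown_asap && server.auto_failover_on_shutdown)
    server.shutdown_failover_voted = 1; /* hypothetical flag */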

Successfully merging this pull request may close these issues.

[NEW] Trigger manual failover on SIGTERM to primary (cluster)