To avoid bouncing -REDIRECT during FAILOVER #871

Merged
merged 2 commits into from
Aug 14, 2024
25 changes: 24 additions & 1 deletion src/server.c
@@ -3902,7 +3902,30 @@ int processCommand(client *c) {

     if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host && !mustObeyClient(c) &&
         (is_write_command || (is_read_command && !c->flag.readonly))) {
-        addReplyErrorSds(c, sdscatprintf(sdsempty(), "-REDIRECT %s:%d", server.primary_host, server.primary_port));
+        if (server.failover_state == FAILOVER_IN_PROGRESS) {
+            /* During the FAILOVER process, when conditions are met (such as
+             * when the force time is reached or the primary and replica offsets
+             * are consistent), the primary actively becomes the replica and
+             * transitions to the FAILOVER_IN_PROGRESS state.
+             *
+             * After the primary becomes the replica, and after handshaking
+             * and other operations, it will eventually send the PSYNC FAILOVER
+             * command to the replica, then the replica will become the primary.
+             * This means that the upgrade of the replica to the primary is an
+             * asynchronous operation, which implies that during the
+             * FAILOVER_IN_PROGRESS state, there may be a period of time where
+             * both nodes are replicas.
+             *
+             * In this scenario, if a -REDIRECT is returned, the request will be
+             * redirected to the replica and then redirected back, causing back
+             * and forth redirection. To avoid this situation, during the
+             * FAILOVER_IN_PROGRESS state, we temporarily suspend the clients
+             * that need to be redirected until the replica truly becomes the primary,
+             * and then resume the execution. */
+            blockPostponeClient(c);
Contributor

@gmbnomis gmbnomis Aug 5, 2024

I think this may block reading clients as well, which is a behavioral change since we are only in CLIENT PAUSE WRITE during this phase.

The simplest solution I came up with was to add && server.failover_state == NO_FAILOVER to the if for the redirect case (in line 3903) and to the if for the read only replica case in https://github.com/valkey-io/valkey/pull/871/files#diff-1abc5651133d108c0c420d9411925373c711133e7748d9e4f4c97d5fb543fdd9R4012.

The rationale for this is that during a failover, we should prefer to block clients (which will happen here https://github.com/valkey-io/valkey/pull/871/files#diff-1abc5651133d108c0c420d9411925373c711133e7748d9e4f4c97d5fb543fdd9R4082) instead of redirecting them or giving them answers that may not be valid anymore after the failover (keep in mind that failover may still fail at this point in time).
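The alternative proposed in this comment can be sketched as a small decision function. This is a simplified model, not the actual `processCommand` code: `node_model`, `decision`, and `decide_with_guard` are hypothetical names, the cluster and `mustObeyClient` checks are left out, and it assumes that the later pause check postpones only write commands (the `CLIENT PAUSE WRITE` behavior mentioned above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the server state referenced in the review. */
typedef enum { NO_FAILOVER, FAILOVER_WAIT_FOR_SYNC, FAILOVER_IN_PROGRESS } failover_state_model;
typedef enum { DO_EXECUTE, DO_REDIRECT, DO_READONLY_ERROR, DO_POSTPONE } decision;

typedef struct {
    bool is_replica;               /* models server.primary_host being set */
    bool writes_paused;            /* models CLIENT PAUSE WRITE being active */
    failover_state_model failover; /* models server.failover_state */
} node_model;

/* Models the proposal: both replica-side checks apply only when no failover
 * is running, so everything else falls through to the pause check. */
decision decide_with_guard(const node_model *n, bool capa_redirect, bool client_readonly,
                           bool is_write, bool is_read) {
    assert(n != NULL);
    bool no_failover = (n->failover == NO_FAILOVER);

    /* Redirect case with the proposed guard appended. */
    if (n->is_replica && capa_redirect && no_failover &&
        (is_write || (is_read && !client_readonly))) {
        return DO_REDIRECT;
    }
    /* Read-only replica case with the same guard. */
    if (n->is_replica && no_failover && is_write) return DO_READONLY_ERROR;

    /* Pause check: in this model only writes are postponed. */
    if (n->writes_paused && is_write) return DO_POSTPONE;
    return DO_EXECUTE;
}
```

Under these assumptions, a write that arrives during `FAILOVER_IN_PROGRESS` is postponed while a read is still executed, which is the point the rest of the thread goes on to discuss.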

Member Author

> I think this may block reading clients as well, which is a behavioral change since we are only in CLIENT PAUSE WRITE during this phase.

When the client does not execute the readonly command, read operations will also be redirected, so read operations also need to be suspended. This is a special state of FAILOVER_IN_PROGRESS.

Member Author

> The rationale for this is that during a failover, we should prefer to block clients (which will happen here https://github.com/valkey-io/valkey/pull/871/files#diff-1abc5651133d108c0c420d9411925373c711133e7748d9e4f4c97d5fb543fdd9R4082) instead of redirecting them or giving them answers that may not be valid anymore after the failover (keep in mind that failover may still fail at this point in time).

This is exactly what this PR does.

Contributor

> When the client does not execute the readonly command, read operations will also be redirected, so read operations also need to be suspended. This is a special state of FAILOVER_IN_PROGRESS.

If I understand you correctly, you are saying that reading (which is possible in both failover states up to now) is a problem during FAILOVER_IN_PROGRESS. Sorry, but I don't understand the reason why this is the case.

But if so, why are we only blocking clients that understand REDIRECT? Shouldn't we block all clients in this phase, then?

Member Author

Both read and write commands may receive a -REDIRECT. The issue we are addressing is that when the primary has been demoted to a replica and is in the FAILOVER_IN_PROGRESS state, the replica may not have become the primary yet. The solution is to pause both read and write commands during FAILOVER_IN_PROGRESS; there is no need to pause non-read-write commands at any time.

> why are we only blocking clients that understand REDIRECT?

We don't care about clients without the redirect capability: they never receive -REDIRECT, although they can receive -READONLY.
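The behavior described in this reply can be summarized as a small classifier. This is a sketch under stated assumptions rather than the code from `src/server.c`: `model_server`, `model_client`, `outcome`, and `classify` are hypothetical names, the cluster and `mustObeyClient` checks are omitted, pause handling on a primary is not modeled, and a replica is assumed to reject writes from clients without the redirect capability with -READONLY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fields used in the diff. */
typedef enum { NO_FAILOVER, FAILOVER_WAIT_FOR_SYNC, FAILOVER_IN_PROGRESS } failover_state_model;
typedef enum { OUT_EXECUTE, OUT_REDIRECT, OUT_READONLY_ERROR, OUT_POSTPONE } outcome;

typedef struct {
    bool capa_redirect; /* client announced the redirect capability */
    bool readonly;      /* models c->flag.readonly */
} model_client;

typedef struct {
    bool has_primary;              /* models server.primary_host being set */
    failover_state_model failover; /* models server.failover_state */
} model_server;

outcome classify(const model_server *s, const model_client *c, bool is_write, bool is_read) {
    assert(s != NULL && c != NULL);
    /* A node without a primary is out of scope for this sketch. */
    if (!s->has_primary) return OUT_EXECUTE;

    if (c->capa_redirect && (is_write || (is_read && !c->readonly))) {
        /* Would be redirected: postpone instead while the failover is in progress. */
        return s->failover == FAILOVER_IN_PROGRESS ? OUT_POSTPONE : OUT_REDIRECT;
    }
    /* Assumption: clients without the capability get -READONLY on writes. */
    if (is_write) return OUT_READONLY_ERROR;
    /* Everything else, including non-read-write commands, runs normally. */
    return OUT_EXECUTE;
}
```

In this model a command that is neither a read nor a write is never postponed, and a client without the redirect capability never reaches the postpone branch.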

Contributor

I don't seem to get my message across, but maybe I understand now why you consider FAILOVER_IN_PROGRESS to be special:

During the entire failover procedure, no change can happen on the primary since all writing commands are blocked (by postponing them). Therefore, I thought that continuing to answer reads is fine in FAILOVER_WAIT_FOR_SYNC as well as in FAILOVER_IN_PROGRESS.

But there is a time delay between the new primary becoming primary and the old primary realizing that the switch happened. If a new client happens to connect to the new primary and writes during this time, we may return stale data to a reading "READWRITE" client on the old primary if we allow reading instead of blocking it.

Is this why we need to block reads as well?

Do we need to document this change? (the documentation of the FAILOVER command does say that only writing clients are blocked)
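The stale-read window raised in this comment can be illustrated with a toy model. It is only a sketch: `old_primary_model`, `read_result`, and `read_from_old_primary` are hypothetical names, a single integer stands in for the dataset, and only the interval before the old primary notices the promotion is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the old primary while the new primary may already
 * be accepting writes. */
typedef struct {
    int value;            /* value the old primary still holds for a key */
    bool switch_observed; /* has the old primary noticed the promotion? */
} old_primary_model;

typedef enum { READ_ANSWERED, READ_POSTPONED } read_result;

/* With postpone_reads set, a read that arrives before the old primary has
 * observed the switch is held back instead of being answered locally. */
read_result read_from_old_primary(const old_primary_model *old, bool postpone_reads, int *out) {
    assert(old != NULL && out != NULL);
    if (!old->switch_observed && postpone_reads) return READ_POSTPONED;
    *out = old->value;
    return READ_ANSWERED;
}
```

If another client has already written a newer value on the new primary, the call with `postpone_reads` set to `false` hands back the old value, while the call with it set to `true` answers nothing until the switch is observed.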

Member Author

> Do we need to document this change? (the documentation of the FAILOVER command does say that only writing clients are blocked)

Yes, see valkey-io/valkey-doc#162

+        } else {
+            addReplyErrorSds(c, sdscatprintf(sdsempty(), "-REDIRECT %s:%d", server.primary_host, server.primary_port));
+        }
         return C_OK;
     }
