Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

To avoid bouncing -REDIRECT during FAILOVER #871

Merged
merged 2 commits into from
Aug 14, 2024

Conversation

soloestoy
Copy link
Member

@soloestoy soloestoy commented Aug 5, 2024

Fix #821

During the FAILOVER process, when conditions are met (such as when the force time is reached or the primary and replica offsets are consistent), the primary actively becomes the replica and transitions to the FAILOVER_IN_PROGRESS state. After the primary becomes the replica, and after handshaking and other operations, it will eventually send the PSYNC FAILOVER command to the replica, after which the replica will become the primary. This means that the upgrade of the replica to the primary is an asynchronous operation, which implies that during the FAILOVER_IN_PROGRESS state, there may be a period of time where both nodes are replicas. In this scenario, if a -REDIRECT is returned, the request will be redirected to the replica and then redirected back, causing back and forth redirection. To avoid this situation, during the FAILOVER_IN_PROGRESS state, we temporarily suspend the clients that need to be redirected until the replica truly becomes the primary, and then resume the execution.

During the FAILOVER process, when conditions are met (such as
when the force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the FAILOVER_IN_PROGRESS state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
PSYNC FAILOVER command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
FAILOVER_IN_PROGRESS state, there may be a period of time where both
nodes are replicas. In this scenario, if a -REDIRECT is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
FAILOVER_IN_PROGRESS state, we temporarily suspend the clients that need
to be redirected until the replica truly becomes the primary, and then
resume the execution.

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
@soloestoy soloestoy requested a review from madolson August 5, 2024 09:59
Copy link

codecov bot commented Aug 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.18%. Comparing base (3cca268) to head (82894ee).
Report is 93 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #871      +/-   ##
============================================
- Coverage     70.35%   70.18%   -0.18%     
============================================
  Files           112      112              
  Lines         61310    61473     +163     
============================================
+ Hits          43136    43146      +10     
- Misses        18174    18327     +153     
Files with missing lines Coverage Δ
src/server.c 88.54% <100.00%> (-0.02%) ⬇️

... and 21 files with indirect coverage changes

Copy link
Contributor

@gmbnomis gmbnomis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi, a potential test case for this bug is part of my PR #848 , see FIXME at https://github.com/valkey-io/valkey/pull/848/files#diff-fdef28e02ac8048541ebebbb829193ffd25622dc438bd8d921b167e51564bdd0R177. Just saying 😉

* FAILOVER_IN_PROGRESS state, we temporarily suspend the clients
* that need to be redirected until the replica truly becomes the primary,
* and then resume the execution. */
blockPostponeClient(c);
Copy link
Contributor

@gmbnomis gmbnomis Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may block reading clients as well, which is a behavioral change since we are only in CLIENT PAUSE WRITE during this phase.

The simplest solution I came up with was to add && server.failover_state == NO_FAILOVER to the if for the redirect case (in line 3903) and to the if for the read only replica case in https://github.com/valkey-io/valkey/pull/871/files#diff-1abc5651133d108c0c420d9411925373c711133e7748d9e4f4c97d5fb543fdd9R4012.

The rationale for this is that during a failover, we should prefer to block clients (which will happen here https://github.com/valkey-io/valkey/pull/871/files#diff-1abc5651133d108c0c420d9411925373c711133e7748d9e4f4c97d5fb543fdd9R4082) instead of redirecting them or giving them answers that may not be valid anymore after the failover (keep in mind that failover may still fail at this point in time).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may block reading clients as well, which is a behavioral change since we are only in CLIENT PAUSE WRITE during this phase.

When the client does not execute the readonly command, read operations will also be redirected, so read operations also need to be suspended. This is a special state of FAILOVER_IN_PROGRESS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale for this is that during a failover, we should prefer to block clients (which will happen here https://github.com/valkey-io/valkey/pull/871/files#diff-1abc5651133d108c0c420d9411925373c711133e7748d9e4f4c97d5fb543fdd9R4082) instead of redirecting them or giving them answers that may not be valid anymore after the failover (keep in mind that failover may still fail at this point in time).

This is exactly what this PR does.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the client does not execute the readonly command, read operations will also be redirected, so read operations also need to be suspended. This is a special state of FAILOVER_IN_PROGRESS.

If I understand you correctly, you are saying that reading (which is possible in both failover states up to now) is a problem during FAILOVER_IN_PROGRESS. Sorry, but I don't understand the reason why this is the case.

But if so, why are we only blocking clients that understand REDIRECT? Shouldn't we block all clients in this phase, then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both read and write commands may receive a -REDIRECT. The issue we are currently addressing is when the primary is demoted to a replica and is in the FAILOVER_IN_PROGRESS state, the replica may not have become the primary yet. The solution is to pause both read and write commands during FAILOVER_IN_PROGRESS, there is no need to pause non-read-write commands any time.

why are we only blocking clients that understand REDIRECT?

We don't care about clients without redirect capa, they would never receive -REDIRECT, instead they can receive -READONLY.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't seem to get my message across, but maybe I understand now why you consider FAILOVER_IN_PROGRESS to be special:

During the entire failover procedure, no change can happen on the primary since all writing commands are blocked (by postponing them). Therefore, I thought that continuing to answer reads is fine in FAILOVER_WAIT_FOR_SYNC as well as in FAILOVER_IN_PROGRESS.

But there is a time delay between the new primary becoming primary and the old primary realizing that the switch happened. If a new client happens to connect to the new primary and writes during this time, we may return stale data to a reading "READWRITE" client on the old primary if we allow reading instead of blocking it.

Is this why we need to block reads as well?

Do we need to document this change? (the documentation of the FAILOVER command does say that only writing clients are blocked)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to document this change? (the documentation of the FAILOVER command does say that only writing clients are blocked)

Yes, see valkey-io/valkey-doc#162

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me. It would be nice to add a test though.

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
@soloestoy soloestoy added the needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. label Aug 14, 2024
@soloestoy soloestoy merged commit 131857e into valkey-io:unstable Aug 14, 2024
47 checks passed
soloestoy added a commit to valkey-io/valkey-doc that referenced this pull request Aug 14, 2024
To valkey-io/valkey#871

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
mapleFU pushed a commit to mapleFU/valkey that referenced this pull request Aug 21, 2024
Fix valkey-io#821

During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS state`, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: mwish <maplewish117@gmail.com>
mapleFU pushed a commit to mapleFU/valkey that referenced this pull request Aug 22, 2024
Fix valkey-io#821

During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS state`, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: mwish <maplewish117@gmail.com>
madolson pushed a commit that referenced this pull request Sep 2, 2024
Fix #821

During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS state`, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
madolson pushed a commit that referenced this pull request Sep 3, 2024
Fix #821

During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS state`, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Sep 4, 2024
PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
Fix valkey-io#821

During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS state`, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: Ping Xie <pingxie@google.com>
PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
Fix valkey-io#821

During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS state`, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: Ping Xie <pingxie@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. release-notes This issue should get a line item in the release notes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] CLIENT CAPA redirect may result in bouncing redirects during failover
4 participants