
Enhance SENTINEL FAILOVER to use the FAILOVER command to avoid data loss #1238

Closed

Conversation

enjoy-binbin (Member) commented Oct 29, 2024

Currently, SENTINEL FAILOVER selects a replica and sends it REPLICAOF NO ONE,
waits for the promotion, and then sends REPLICAOF new_ip new_port to the
old primary and the other replicas to complete the failover. The problem is
that any writes that reach the old primary before its role change during
the failover will be lost.

We can use the FAILOVER command to avoid this data loss. FAILOVER was added
in 0d18a1e; it coordinates the failover between the primary and the replica.

Before the original REPLICAOF NO ONE step, we first try to send FAILOVER TO ip port
to the primary so that it goes through the FAILOVER process. If the primary does not
support FAILOVER (for example, it returns an error), we fall back to the old
logic, i.e. REPLICAOF NO ONE.
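
For illustration only, here is a minimal standalone sketch of the intended sequence with a plain hiredis client. This is not the actual sentinel.c change; the addresses, ports, and control flow are placeholders. It tries FAILOVER TO on the old primary first and falls back to REPLICAOF NO ONE on the chosen replica if the primary rejects the command.

```c
/* Sketch only: command-level illustration of "FAILOVER TO with fallback".
 * Sentinel itself uses async connections; all addresses are placeholders. */
#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void) {
    const char *replica_ip = "127.0.0.1"; /* placeholder: chosen replica */
    int replica_port = 6380;

    redisContext *primary = redisConnect("127.0.0.1", 6379); /* placeholder: old primary */
    if (primary == NULL || primary->err) {
        fprintf(stderr, "cannot connect to primary\n");
        return 1;
    }

    /* Step 1: ask the primary to coordinate the failover, so writes that
     * arrive before the role change are replicated before promotion. */
    redisReply *reply = redisCommand(primary, "FAILOVER TO %s %d",
                                     replica_ip, replica_port);

    if (reply == NULL || reply->type == REDIS_REPLY_ERROR) {
        /* Fallback: the primary does not support FAILOVER (or it failed),
         * so use the old logic and promote the replica directly. */
        redisContext *replica = redisConnect(replica_ip, replica_port);
        if (replica && !replica->err) {
            redisReply *r2 = redisCommand(replica, "REPLICAOF NO ONE");
            if (r2) freeReplyObject(r2);
            redisFree(replica);
        }
    }

    if (reply) freeReplyObject(reply);
    redisFree(primary);
    return 0;
}
```

In Sentinel itself this would happen on its asynchronous connections inside the existing failover state machine; the sketch only shows the command-level idea.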

enjoy-binbin (Member, Author)

This PR was written very quickly (I wrote it at night and wanted to push it as soon as possible), so there may be problems with the details, or it can be optimized.

Also see redis/redis#6062 for more details on the data loss. The test case also reproduces the data loss locally.

zuiderkwast (Contributor) left a comment


Yes, this looks good.

I'm not familiar enough with Sentinel. @hwware, you are a Sentinel expert; do you want to take a look?

codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 0% with 40 lines in your changes missing coverage. Please review.

Project coverage is 70.64%. Comparing base (c21f1dc) to head (21a6f4c).
Report is 26 commits behind head on unstable.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/sentinel.c | 0.00% | 40 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1238      +/-   ##
============================================
- Coverage     70.72%   70.64%   -0.08%     
============================================
  Files           114      114              
  Lines         63150    63195      +45     
============================================
- Hits          44660    44642      -18     
- Misses        18490    18553      +63     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/sentinel.c | 0.00% <0.00%> (ø) |

... and 12 files with indirect coverage changes

enjoy-binbin added the run-extra-tests label (Run extra tests on this PR: runs all tests from daily except valgrind and RESP) Nov 1, 2024
hwware (Member) commented Nov 1, 2024

Overall, LGTM; just one concern about the test case.

gmbnomis (Contributor) commented Nov 6, 2024

Some time ago, I proposed using FAILOVER with SENTINEL FAILOVER in redis, see redis/redis#13118. I implemented that proposal and later ported my first implementation to Valkey, see gmbnomis@146850a.

The main issues I learned about while doing that were the following (I think this PR does not address them):

  1. We should follow the Sentinel protocol. Clients can expect to be disconnected when the role of the primary changes (see https://valkey.io/topics/sentinel-clients/). A simple FAILOVER command does not achieve that. Thus, clients may find themselves unable to write suddenly.
  2. FAILOVER pauses all client writes during the failover (in Valkey, it even pauses some client reads during the FAILOVER_IN_PROGRESS phase, see #871 (comment): "To avoid bouncing -REDIRECT during FAILOVER"). This means that Sentinels (remember that SENTINEL FAILOVER is performed without an election) will regard the former primary as down during the failover procedure and may initiate a failover. Also, if I remember correctly, the sentinel instance in charge of the failover may block its own connection to the primary.
  3. The FAILOVER command may hang indefinitely in the FAILOVER_IN_PROGRESS phase even if TIMEOUT is given. We should probably have a mechanism to detect such a condition and do a FAILOVER ABORT.

For reference: my proposal addresses 1. It addresses 2 somewhat (probably a solution seeking a leader election would be safer here) and does not address 3 (I remember that I started working on this but did not finish it; the idea was to monitor the INFO field "master_failover_state" and send a FAILOVER ABORT if a node stays in that state for too long.)
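
For illustration, a rough standalone sketch of that monitoring idea with a hiredis client; the polling loop, the 30-second threshold, and the string matching are assumptions for the example, not code from the proposal:

```c
/* Sketch only: poll INFO replication and abort a failover that stays in
 * the in-progress state for too long. Threshold and parsing are illustrative. */
#include <string.h>
#include <unistd.h>
#include <hiredis/hiredis.h>

/* Returns 1 if INFO reports master_failover_state:failover-in-progress. */
static int failover_in_progress(redisContext *c) {
    int in_progress = 0;
    redisReply *reply = redisCommand(c, "INFO replication");
    if (reply && reply->type == REDIS_REPLY_STRING) {
        in_progress = strstr(reply->str,
            "master_failover_state:failover-in-progress") != NULL;
    }
    if (reply) freeReplyObject(reply);
    return in_progress;
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379); /* placeholder node */
    if (c == NULL || c->err) return 1;

    int waited = 0;
    const int abort_after = 30; /* seconds; illustrative threshold */

    while (failover_in_progress(c)) {
        sleep(1);
        if (++waited >= abort_after) {
            /* Stuck in FAILOVER_IN_PROGRESS for too long: abort. */
            redisReply *r = redisCommand(c, "FAILOVER ABORT");
            if (r) freeReplyObject(r);
            break;
        }
    }

    redisFree(c);
    return 0;
}
```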

enjoy-binbin (Member, Author)

Thanks for the detailed comments.

> 1. We should follow the Sentinel protocol. Clients can expect to be disconnected when the role of the primary changes (see https://valkey.io/topics/sentinel-clients/). A simple FAILOVER command does not achieve that. Thus, clients may find themselves unable to write suddenly.

Yes, we should. I remembered this but was never able to find the documentation on it. Thanks for the link.

> 2. FAILOVER pauses all client writes during the failover (in Valkey, it even pauses some client reads during the FAILOVER_IN_PROGRESS phase, see #871 (comment): "To avoid bouncing -REDIRECT during FAILOVER"). This means that Sentinels (remember that SENTINEL FAILOVER is performed without an election) will regard the former primary as down during the failover procedure and may initiate a failover. Also, if I remember correctly, the sentinel instance in charge of the failover may block its own connection to the primary.

I did not read the whole story; I will find some time to catch up. It seems you are right, all my thinking was about the happy path. Maybe we should use the FORCE option?

> 3. The FAILOVER command may hang indefinitely in the FAILOVER_IN_PROGRESS phase even if TIMEOUT is given. We should probably have a mechanism to detect such a condition and do a FAILOVER ABORT.

I support this; a FORCE option is needed. In that case, the worst case is the same as before.

> For reference: my proposal addresses 1. It addresses 2 somewhat (probably a solution seeking a leader election would be safer here) and does not address 3 (I remember that I started working on this but did not finish it; the idea was to monitor the INFO field "master_failover_state" and send a FAILOVER ABORT if a node stays in that state for too long.)

@gmbnomis I had not read your proposal and implementation before; I will read them later. Do you want to pick it up? You have done a lot of work here, and it would be great if you could pick it up, or we can slowly push it forward in the future and make all of this happen.

enjoy-binbin marked this pull request as draft November 7, 2024 02:45
gmbnomis (Contributor) commented Nov 7, 2024

> I did not read the whole story; I will find some time to catch up. It seems you are right, all my thinking was about the happy path. Maybe we should use the FORCE option?

Using FAILOVER with the FORCE option would be more similar to the current SENTINEL FAILOVER in that we don't care about the replication state and just fail over at some point in time.
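
For illustration, a forced request might look roughly like this with a hiredis client (host, port, and the 5000 ms timeout are placeholders; as far as I know, FORCE is only accepted together with TIMEOUT):

```c
/* Sketch only: a forced failover request. Host, port, and timeout are
 * placeholders; FORCE does not wait for the replica to catch up. */
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *primary = redisConnect("127.0.0.1", 6379); /* placeholder */
    if (primary == NULL || primary->err) return 1;

    redisReply *reply = redisCommand(primary,
        "FAILOVER TO %s %d FORCE TIMEOUT %d", "127.0.0.1", 6380, 5000);

    if (reply) freeReplyObject(reply);
    redisFree(primary);
    return 0;
}
```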

> 3. The FAILOVER command may hang indefinitely in the FAILOVER_IN_PROGRESS phase even if TIMEOUT is given. We should probably have a mechanism to detect such a condition and do a FAILOVER ABORT.

> I support this; a FORCE option is needed. In that case, the worst case is the same as before.

The FORCE option of FAILOVER does not apply to the FAILOVER_IN_PROGRESS state, so we might get stuck as well and need some mechanism to get out of it. (My current thinking is that this does not need to be very efficient because the probability of this happening seems rather low, but the system should be able to recover from such a failure.)

> For reference: my proposal addresses 1. It addresses 2 somewhat (probably a solution seeking a leader election would be safer here) and does not address 3 (I remember that I started working on this but did not finish it; the idea was to monitor the INFO field "master_failover_state" and send a FAILOVER ABORT if a node stays in that state for too long.)

> @gmbnomis I had not read your proposal and implementation before; I will read them later. Do you want to pick it up? You have done a lot of work here, and it would be great if you could pick it up, or we can slowly push it forward in the future and make all of this happen.

@enjoy-binbin I am not completely sure, but do you mean picking up my old proposal and restarting work on it? As there seems to be interest in this feature, I'd be happy to do that. (If so, how should I start? Create a new issue with an update of redis/redis#13118 to flesh out the implementation options further?)

I am also happy to support getting another implementation into shape.

gmbnomis (Contributor)

@enjoy-binbin I have now posted an updated description (#1291) and implementation (#1292). I think they address all three topics I mentioned above.

enjoy-binbin (Member, Author)

Thanks for the work. I am closing this one; let's move the discussion there.

enjoy-binbin deleted the sentinel_failover branch November 12, 2024 07:56
Labels
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
4 participants