Check whether to switch to fail when setting the node to pfail in cron #1061

Open
wants to merge 1 commit into base: unstable

Conversation

enjoy-binbin (Member)

This may speed up the transition to the fail state a bit.
Previously we would only check when we received a pfail/fail
report from others via gossip. If myself is the last vote needed,
we can switch to fail directly here without waiting for
the next gossip packet.

Signed-off-by: Binbin <binloveplay1314@qq.com>

codecov bot commented Sep 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.62%. Comparing base (d9c41e9) to head (f294c0b).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1061      +/-   ##
============================================
+ Coverage     70.59%   70.62%   +0.03%     
============================================
  Files           114      114              
  Lines         61673    61673              
============================================
+ Hits          43537    43557      +20     
+ Misses        18136    18116      -20     
Files with missing lines   Coverage Δ
src/cluster_legacy.c       86.10% <100.00%> (-0.04%) ⬇️

... and 16 files with indirect coverage changes

@@ -5066,7 +5066,7 @@ void clusterCron(void) {
         if (!(node->flags & (CLUSTER_NODE_PFAIL | CLUSTER_NODE_FAIL))) {
             node->flags |= CLUSTER_NODE_PFAIL;
             update_state = 1;
-            if (server.cluster->size == 1 && clusterNodeIsVotingPrimary(myself)) {
+            if (clusterNodeIsVotingPrimary(myself)) {
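
For context, the changed branch now enters the shared fail-marking path for any voting primary, not just in single-shard clusters. A hedged sketch of the quorum check performed by markNodeAsFailingIfNeeded(), paraphrased from cluster_legacy.c (exact guards and surrounding details may differ in the current tree):

    /* Paraphrased sketch, not the verbatim source: PFAIL is only promoted
     * to FAIL once a majority of voting primaries agree. */
    void markNodeAsFailingIfNeeded(clusterNode *node) {
        int needed_quorum = (server.cluster->size / 2) + 1;

        if (!nodeTimedOut(node)) return; /* Only a PFAIL node can be failed. */
        if (nodeFailed(node)) return;    /* Already in FAIL state. */

        int failures = clusterNodeFailureReportsCount(node);
        /* Count my own vote too if I am a voting primary. This is why the
         * cron-time call can complete the quorum: when myself is the last
         * missing vote, there is no need to wait for another gossip packet. */
        if (clusterNodeIsVotingPrimary(myself)) failures++;
        if (failures < needed_quorum) return; /* Quorum not reached yet. */

        /* ... switch the node to FAIL and broadcast it to the cluster ... */
    }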
Member
This change makes a lot of sense to me. I actually don't fully understand why we special-cased single-shard clusters in b3aaa0a.

Member Author
Maybe it was just to limit the impact, i.e. to solve a specific (single-shard) problem at the time. @bentotten, would you be able to take a review?

Contributor
For single-shard clusters, there is no need to wait for a quorum of 1 (with a cluster size of 1, the needed quorum is size/2 + 1 = 1, so the primary's own vote suffices), so we can proceed to directly fail the node here rather than wait for the original gossip logic (for reference). Unfortunately, prior to that change, a single-shard cluster would never properly mark a replica as failed.

I believe we still need to wait to achieve a quorum from gossip to mark this node as truly failed, no? If the node uses the information it has from "old gossip" during each cluster cron run, won't this increase the number of calls to this function without increasing the accuracy of the failure report?

Contributor
@bentotten Sep 20, 2024
Though the benefit of the node being able to mark the replica as failed as soon as it knows it can't reach said replica, in the last-needed-vote scenario, also makes sense. @madolson, I know you did the investigation into the original bug, in case you have any insights. This was not a scenario I had considered for the original commit, and I don't see any issue.

Member Author

Right, thanks for the input, that is new to me. clusterNodeFailureReportsCount will clean up old failure reports, so I think (hope) it can avoid the stale-gossip risk. But you do make a good point; let's see if others have new thoughts.
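
For reference, a sketch of the cleanup behavior mentioned above, paraphrased from the upstream source (the exact validity window may differ): clusterNodeFailureReportsCount() purges stale reports before counting, so "old gossip" ages out instead of accumulating.

    /* Paraphrased sketch: failure reports older than a multiple of the node
     * timeout are dropped before they are counted toward the quorum. */
    static void clusterNodeCleanupFailureReports(clusterNode *node) {
        listIter li;
        listNode *ln;
        mstime_t maxtime = server.cluster_node_timeout * CLUSTER_FAIL_REPORT_VALIDITY_MULT;
        mstime_t now = mstime();

        listRewind(node->fail_reports, &li);
        while ((ln = listNext(&li)) != NULL) {
            clusterNodeFailReport *fr = ln->value;
            if (now - fr->time > maxtime) listDelNode(node->fail_reports, ln);
        }
    }

    int clusterNodeFailureReportsCount(clusterNode *node) {
        clusterNodeCleanupFailureReports(node); /* Expire stale reports first. */
        return listLength(node->fail_reports);
    }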

Contributor

Sorry, I don't have any useful input on this scenario. You know it better than I do.

Member Author

No worries, it is always good to have extra eyes on it.

enjoy-binbin requested review from zuiderkwast and madolson and removed the request for madolson on November 14, 2024 09:37