-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check whether to switch to fail when setting the node to pfail in cron #1061
base: unstable
Are you sure you want to change the base?
Conversation
This may speed up the transition to the fail state a bit. Previously we would only check when we received a pfail/fail report from others in gossip. If myself is the last vote, we can directly switch to fail in here without waiting for the next gossip packet. Signed-off-by: Binbin <binloveplay1314@qq.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #1061 +/- ##
============================================
+ Coverage 70.59% 70.62% +0.03%
============================================
Files 114 114
Lines 61673 61673
============================================
+ Hits 43537 43557 +20
+ Misses 18136 18116 -20
|
@@ -5066,7 +5066,7 @@ void clusterCron(void) { | |||
if (!(node->flags & (CLUSTER_NODE_PFAIL | CLUSTER_NODE_FAIL))) { | |||
node->flags |= CLUSTER_NODE_PFAIL; | |||
update_state = 1; | |||
if (server.cluster->size == 1 && clusterNodeIsVotingPrimary(myself)) { | |||
if (clusterNodeIsVotingPrimary(myself)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change makes a lot of sense to me. I actually don't fully understand why we special-cased single-shard clusters in b3aaa0a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it was just to reduce the impact (to solve a specific problem (single shard) at the time). @bentotten would you be able to take a review?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For single-shard clusters, there is no need to wait for a quorum of 1, so we can proceed to directly fail the node here and not wait for the original gossip-logic (for reference). Unfortunately prior to that, a single-shard cluster would never properly mark a replica as failed.
I believe we still need to wait to achieve a quorum from gossip to mark this node as truly failed, no? If the node uses the information it has from "old gossip" during each cluster cron run, wont this increase the number of calls to this function without increasing the accuracy of the failure report?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though the benefit of the node being able to mark the replica as failed as soon as it knows it cant reach said replica in the last-needed-vote scenario also makes sense - @madolson I know you did the investigation into the original bug, if you have any insights. This was not a scenario I had considered for the original commit, and dont see any issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, thanks for the input, that is new to me. clusterNodeFailureReportsCount will cleanup the old failure report, so i think (hope) it can avoid the old gossip information risk. But you do make a good point, let see if others have a new thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't have any useful input about this scenario. You know it better than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no worry, it is always good to have an extra eyes on it.
This may speed up the transition to the fail state a bit.
Previously we would only check when we received a pfail/fail
report from others in gossip. If myself is the last vote,
we can directly switch to fail in here without waiting for
the next gossip packet.