Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve cluster cant failover log conditions #780

Merged

Conversation

enjoy-binbin
Copy link
Member

@enjoy-binbin enjoy-binbin commented Jul 13, 2024

This PR adjusts the logging conditions of clusterLogCantFailover
in this two ways.

  1. For the same cant_failover_reason, we will print the log once
    in CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which
    is a bit long, shorten it to 1s, so we can better track its state.
    We get to see the system making progress by watching the message.
    Using 1s also covers pretty much all cases as i don't see a reason
    for using a <1s node timeout, test or prod.

  2. We will not print logs before the nolog_fail_time, its value
    is cluster-node-timeout+5000. This may casue us to lose some logs,
    for example, if cluster-node-timeout is small, auth_timeout will
    be 2000, and auth_retry_time will be 4000. In this case, we will
    lose all the reasons during the election if the failover is timedout.
    So remove the nolog_fail_time logic, since we still do have the
    CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many
    logs.

This PR adjusts the logging conditions of clusterLogCantFailover
in this two ways.

1. For the same cant_failover_reason, we will print the log once
in CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which
is a bit long, shorten it to 5s, so we can better track its state.

2. We will not print logs before the nolog_fail_time, its value
is cluster-node-timeout+5000. This may casue us to lose some logs,
for example, if cluster-node-timeout is small, auth_timeout will
be 2000, and auth_retry_time will be 4000. In this case, we will
lose all the reasons during the election if the failover is timedout.
So remove the nolog_fail_time logic, since we still do have the
CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, We won't print too many
logs.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Copy link

codecov bot commented Jul 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.05%. Comparing base (a4ee8da) to head (9e050f8).
Report is 72 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #780      +/-   ##
============================================
+ Coverage     70.04%   70.05%   +0.01%     
============================================
  Files           112      112              
  Lines         60602    60587      -15     
============================================
- Hits          42447    42445       -2     
+ Misses        18155    18142      -13     
Files Coverage Δ
src/cluster_legacy.c 85.97% <ø> (+0.18%) ⬆️

... and 15 files with indirect coverage changes

src/cluster_legacy.h Outdated Show resolved Hide resolved
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin merged commit 380f700 into valkey-io:unstable Aug 6, 2024
20 checks passed
@enjoy-binbin enjoy-binbin deleted the remove_nolog_fail_time branch August 6, 2024 13:14
mapleFU pushed a commit to mapleFU/valkey that referenced this pull request Aug 22, 2024
This PR adjusts the logging conditions of clusterLogCantFailover
in this two ways.

1. For the same cant_failover_reason, we will print the log once
in CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which
is a bit long, shorten it to 1s, so we can better track its state.
We get to see the system making progress by watching the message.
Using 1s also covers pretty much all cases as i don't see a reason
for using a <1s node timeout, test or prod.

2. We will not print logs before the nolog_fail_time, its value
is cluster-node-timeout+5000. This may casue us to lose some logs,
for example, if cluster-node-timeout is small, auth_timeout will
be 2000, and auth_retry_time will be 4000. In this case, we will
lose all the reasons during the election if the failover is timedout.
So remove the nolog_fail_time logic, since we still do have the
CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many
logs.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: mwish <maplewish117@gmail.com>
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Aug 22, 2024
madolson pushed a commit that referenced this pull request Sep 2, 2024
This PR adjusts the logging conditions of clusterLogCantFailover
in this two ways.

1. For the same cant_failover_reason, we will print the log once
in CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which
is a bit long, shorten it to 1s, so we can better track its state.
We get to see the system making progress by watching the message.
Using 1s also covers pretty much all cases as i don't see a reason
for using a <1s node timeout, test or prod.

2. We will not print logs before the nolog_fail_time, its value
is cluster-node-timeout+5000. This may casue us to lose some logs,
for example, if cluster-node-timeout is small, auth_timeout will
be 2000, and auth_retry_time will be 4000. In this case, we will
lose all the reasons during the election if the failover is timedout.
So remove the nolog_fail_time logic, since we still do have the
CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many
logs.

Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson pushed a commit that referenced this pull request Sep 3, 2024
This PR adjusts the logging conditions of clusterLogCantFailover
in this two ways.

1. For the same cant_failover_reason, we will print the log once
in CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which
is a bit long, shorten it to 1s, so we can better track its state.
We get to see the system making progress by watching the message.
Using 1s also covers pretty much all cases as i don't see a reason
for using a <1s node timeout, test or prod.

2. We will not print logs before the nolog_fail_time, its value
is cluster-node-timeout+5000. This may casue us to lose some logs,
for example, if cluster-node-timeout is small, auth_timeout will
be 2000, and auth_retry_time will be 4000. In this case, we will
lose all the reasons during the election if the failover is timedout.
So remove the nolog_fail_time logic, since we still do have the
CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many
logs.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes This issue should get a line item in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants