Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix wrong calcuration of retry interval and detecting retry limit #3649

Merged
merged 8 commits into from
Mar 28, 2022

Conversation

ashie
Copy link
Member

@ashie ashie commented Mar 1, 2022

Which issue(s) this PR fixes:
Fixes #3609

What this PR does / why we need it:

The correct total retry time should be:

c + c * b^1 + (...) + c*b^(k - 1)

But the previous implementation was:

c + c * b^0 + c * b^1 + (...) + c*b^(k - 1)    

where:

  • c: constant factor, @retry_wait
  • b: base factor, @retry_exponential_backoff_base
  • k: number of retry times, @max_retry_times

Although the first retry interval is already added at constructor of
ExponentialBackoffRetry, same value is added again unexpectedly on the
first calc_interval call (as c * b^0).

In addition, this PR also fixes wrong detection of retry limit.

Docs Changes:
https://docs.fluentd.org/configuration/buffer-section#retries-parameters is also wrong, should be fixed too.

Release Note:
Same with the title

@ashie ashie marked this pull request as ready for review March 2, 2022 05:19
@ashie ashie requested a review from daipom March 2, 2022 05:20
@ashie
Copy link
Member Author

ashie commented Mar 2, 2022

We are now checking both actual retry process & calculating max retry timeout...

@daipom
Copy link
Contributor

daipom commented Mar 2, 2022

Based on the current behavior, the documentation is wrong, and the previous patch is more correct.
(Strictly speaking, that's not perfectly correct too.)

The current retry behavior is strange and makes this problem more difficult.
The retry specification and implementation may need to be fundamentally reorganized.

I looked at the implementation and checked the behavior.
The behavior is following, which is different from the documentation(buffer, configuration-buffer-section).

  • retry_max_times: k
  • retry_wait: c (default: 1s)
  • retry_exponential_backoff_base: b (default: 2)
fail flush
step: 0, interval: c
fail flush (1th retry)
step: 1, interval: c
fail flush (2th retry)
step: 2, interval: c*b
fail flush (3th retry)
step: 3, interval: c*b^2
...
fail flush (k th retry)
step: k, interval: c*b^(k-1) [s]
fail flush (k+1 th retry)
retry limit!!

So total retry time is c + c + c*b + c*b^2 + (...) + c*b^(k-1).
The calculation of the prev patch is almost correct but only c smaller.
(By default, it is 1sec.)

However, it is questionable whether this retry behavior is correct in the first place.

  • The retry is eventually executed k+1 times. (The total number of flush will be k+2 attempts.)
  • The interval to the first retry and the interval to the second retry are the same time.

@ashie
Copy link
Member Author

ashie commented Mar 2, 2022

Thanks. I'll also check validity of current implementation & behaviour.

@daipom
Copy link
Contributor

daipom commented Mar 2, 2022

I used the following match setting.

<match test.**>
  @type stdout
  <buffer>
    flush_mode interval
    flush_interval 1
    retry_wait 3
    retry_randomize false
    retry_max_times 3
  </buffer>
</match>

I changed out_stdout code for debug.

    def write(chunk)
      $log.info("!!!FLUSH!!!")
      raise "For Debug"
    end

Then I got the following fluentd log.

2022-03-02 15:49:53 +0900 [info]: #0 fluent/log.rb:330:info: !!!FLUSH!!!
2022-03-02 15:49:53 +0900 [debug]: #0 fluent/log.rb:309:debug: taking back chunk for errors. chunk="5d936b031c4d50ef7aaec68b0bfe7a10"
2022-03-02 15:49:53 +0900 [warn]: #0 fluent/log.rb:351:warn: failed to flush the buffer. retry_times=0 next_retry_time=2022-03-02 15:49:56 +0900 chunk="5d936b031c4d50ef7aaec68b0bfe7a10" error_class=RuntimeError error="For Debug"
  ...

2022-03-02 15:49:56 +0900 [info]: #0 fluent/log.rb:330:info: !!!FLUSH!!!
2022-03-02 15:49:56 +0900 [debug]: #0 fluent/log.rb:309:debug: taking back chunk for errors. chunk="5d936b031c4d50ef7aaec68b0bfe7a10"
2022-03-02 15:49:56 +0900 [warn]: #0 fluent/log.rb:351:warn: failed to flush the buffer. retry_times=1 next_retry_time=2022-03-02 15:49:59 +0900 chunk="5d936b031c4d50ef7aaec68b0bfe7a10" error_class=RuntimeError error="For Debug"
  2022-03-02 15:49:56 +0900 [warn]: #0 plugin/output.rb:1235:rescue in try_flush: suppressed same stacktrace

2022-03-02 15:49:59 +0900 [info]: #0 fluent/log.rb:330:info: !!!FLUSH!!!
2022-03-02 15:49:59 +0900 [debug]: #0 fluent/log.rb:309:debug: taking back chunk for errors. chunk="5d936b031c4d50ef7aaec68b0bfe7a10"
2022-03-02 15:49:59 +0900 [warn]: #0 fluent/log.rb:351:warn: failed to flush the buffer. retry_times=2 next_retry_time=2022-03-02 15:50:05 +0900 chunk="5d936b031c4d50ef7aaec68b0bfe7a10" error_class=RuntimeError error="For Debug"
  2022-03-02 15:49:59 +0900 [warn]: #0 plugin/output.rb:1235:rescue in try_flush: suppressed same stacktrace

2022-03-02 15:50:05 +0900 [info]: #0 fluent/log.rb:330:info: !!!FLUSH!!!
2022-03-02 15:50:05 +0900 [debug]: #0 fluent/log.rb:309:debug: taking back chunk for errors. chunk="5d936b031c4d50ef7aaec68b0bfe7a10"
2022-03-02 15:50:05 +0900 [warn]: #0 fluent/log.rb:351:warn: failed to flush the buffer. retry_times=3 next_retry_time=2022-03-02 15:50:17 +0900 chunk="5d936b031c4d50ef7aaec68b0bfe7a10" error_class=RuntimeError error="For Debug"
  2022-03-02 15:50:05 +0900 [warn]: #0 plugin/output.rb:1235:rescue in try_flush: suppressed same stacktrace

2022-03-02 15:50:17 +0900 [info]: #0 fluent/log.rb:330:info: !!!FLUSH!!!
2022-03-02 15:50:17 +0900 [debug]: #0 fluent/log.rb:309:debug: taking back chunk for errors. chunk="5d936b031c4d50ef7aaec68b0bfe7a10"
2022-03-02 15:50:17 +0900 [error]: #0 fluent/log.rb:372:error: failed to flush the buffer, and hit limit for retries. dropping all chunks in the buffer queue. retry_times=3 records=1 error_class=RuntimeError error="For Debug"
  2022-03-02 15:50:17 +0900 [error]: #0 plugin/output.rb:1235:rescue in try_flush: suppressed same stacktrace
2022-03-02 15:50:17 +0900 [debug]: #0 fluent/log.rb:309:debug: buffer queue cleared

@ashie ashie marked this pull request as draft March 2, 2022 08:23
@daipom
Copy link
Contributor

daipom commented Mar 23, 2022

I think we should fix the logic of limit? timing too in order to fix k+1 times retry attempts(to k times).

diff --git a/lib/fluent/plugin/output.rb b/lib/fluent/plugin/output.rb
index 54dbbeb5..4056f7af 100644
--- a/lib/fluent/plugin/output.rb
+++ b/lib/fluent/plugin/output.rb
@@ -1284,20 +1284,6 @@ module Fluent
               end
               return
             end
-          end
-
-          # @retry exists
-
-          if @retry.limit?
-            if error
-              records = @buffer.queued_records
-              msg = "failed to flush the buffer, and hit limit for retries. dropping all chunks in the buffer queue."
-              log.error msg, retry_times: @retry.steps, records: records, error: error
-              log.error_backtrace error.backtrace
-            end
-            @buffer.clear_queue!
-            log.debug "buffer queue cleared"
-            @retry = nil
           else
             # Ensure that the current time is greater than or equal to @retry.next_time to avoid the situation when
             # @retry.step is called almost as many times as the number of flush threads in a short time.
@@ -1306,7 +1292,7 @@ module Fluent
             else
               @retry.recalc_next_time # to prevent all flush threads from retrying at the same time
             end
-            if error
+            if error && !@retry.limit?
               if using_secondary
                 msg = "failed to flush the buffer with secondary output."
                 log.warn msg, retry_times: @retry.steps, next_retry_time: @retry.next_time.round, chunk: chunk_id_hex, error: error
@@ -1318,6 +1304,18 @@ module Fluent
               end
             end
           end
+
+          if @retry.limit?
+            if error
+              records = @buffer.queued_records
+              msg = "failed to flush the buffer, and hit limit for retries. dropping all chunks in the buffer queue."
+              log.error msg, retry_times: @retry.steps, records: records, error: error
+              log.error_backtrace error.backtrace
+            end
+            @buffer.clear_queue!
+            log.debug "buffer queue cleared"
+            @retry = nil
+          end
         end
       end
 
diff --git a/lib/fluent/plugin_helper/retry_state.rb b/lib/fluent/plugin_helper/retry_state.rb
index 131f9eb6..7a7f2b22 100644
--- a/lib/fluent/plugin_helper/retry_state.rb
+++ b/lib/fluent/plugin_helper/retry_state.rb
@@ -158,7 +158,7 @@ module Fluent
 
         def calc_max_retry_timeout(max_steps)
           result = 0
-          (0..max_steps).each { |i|
+          max_steps.times { |i|
             result += calc_interval(i)
           }
           result

@ashie
Copy link
Member Author

ashie commented Mar 23, 2022

Yea, it's also needed. In addition, some comments in my commits and some tests are still wrong. I'm now addressing them too.

The correct total retry time should be:
  c + c * b^1 + (...) + c*b^(k - 1)

But the previous implementation was:
  c + c * b^0 + c * b^1 + (...) + c*b^(k - 1)

where:

  * c: constant factor, @retry_wait
  * b: base factor, @retry_exponential_backoff_base
  * k: number of retry times, @max_retry_times

Although the first retry interval is already added at constructor of
ExponentialBackoffRetry, same value is added again unexpectedly on the
first calc_interval call (as `c * b^0`).

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
@ashie ashie changed the title Fix wrong max retry timeout for exponential backoff again retry_state: Fix wrong calcuration of interval Mar 23, 2022
@ashie
Copy link
Member Author

ashie commented Mar 23, 2022

I think we should fix the logic of limit? timing too in order to fix k+1 times retry attempts(to k times).

Hmm, this patch breaks retry_timeout, further work is needed...

ashie added 3 commits March 23, 2022 17:38
The previous implementation detects @max_retry_times + 1 as limit,
but the limit should be @max_retry_times.

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
Signed-off-by: Takuro Ashie <ashie@clear-code.com>
Signed-off-by: Takuro Ashie <ashie@clear-code.com>
@ashie ashie changed the title retry_state: Fix wrong calcuration of interval Fix wrong calcuration of retry interval and retry limit Mar 23, 2022
@ashie ashie changed the title Fix wrong calcuration of retry interval and retry limit Fix wrong calcuration of retry interval and detecting retry limit Mar 23, 2022
@ashie ashie marked this pull request as ready for review March 23, 2022 09:37
@daipom
Copy link
Contributor

daipom commented Mar 24, 2022

I have created PR to add scenario tests: #3685

RetryStatement behavior is complicated, so scenario tests may be useful.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@daipom
Copy link
Contributor

daipom commented Mar 24, 2022

As for timeout, how about managing the status this way?
The number of states will increase, but the helper will be easier to use for the plugins.

This diff make some tests fail, I'm checking it.

diff --git a/lib/fluent/plugin_helper/retry_state.rb b/lib/fluent/plugin_helper/retry_state.rb
index 29bcf8c1..8816773d 100644
--- a/lib/fluent/plugin_helper/retry_state.rb
+++ b/lib/fluent/plugin_helper/retry_state.rb
@@ -44,6 +44,8 @@ module Fluent
 
           @timeout = timeout
           @timeout_at = @start + timeout
+          @has_reached_timeout = false
+          @has_timeouted = false
           @current = :primary
 
           if randomize_width < 0 || randomize_width > 0.5
@@ -123,7 +125,15 @@ module Fluent
             @current = :secondary
             @secondary_transition_steps = @steps
           end
+
           @next_time = calc_next_time
+
+          unless @has_reached_timeout
+            @has_reached_timeout = @next_time >= @timeout_at
+          else
+            @has_timeouted = @next_time >= @timeout_at
+          end
+
           nil
         end
 
@@ -131,24 +141,11 @@ module Fluent
           @next_time = calc_next_time
         end
 
-        # Use @next_time for time by default to keep backward compatibility
-        def limit?(time: @next_time, steps: @steps)
-          timeout?(time) || limit_step?(steps)
-        end
-
-        def timeout?(time = current_time)
-          if @forever
-            false
-          else
-            time >= @timeout_at
-          end
-        end
-
-        def limit_step?(steps = @steps)
+        def limit?
           if @forever
             false
           else
-            !!(@max_steps && steps >= @max_steps)
+            @has_timeouted || !!(@max_steps && @steps >= @max_steps)
           end
         end
       end

@daipom
Copy link
Contributor

daipom commented Mar 24, 2022

Have to fix scenario test to test this diff.

diff --git a/test/plugin_helper/test_retry_state.rb b/test/plugin_helper/test_retry_state.rb
index 824151c3..f44f069c 100644
--- a/test/plugin_helper/test_retry_state.rb
+++ b/test/plugin_helper/test_retry_state.rb
@@ -988,15 +988,9 @@ class RetryStateHelperTest < Test::Unit::TestCase
         msg << "[#{next_elapsed}s elapsed point] #{retry_count}th-Retry(#{s.secondary? ? "SEC" : "PRI"}) is triggered.\n"
 
         # Update retry statement
-        if s.limit?
-          msg << "--- Reach limit of timeout. ---\n"
-          break
-        end
-
         s.step
-
-        if s.limit_step?
-          msg << "--- Reach limit of max step. ---\n"
+        if s.limit?
+          msg << "--- Reach limit. ---\n"
           break
         end
       end

@daipom
Copy link
Contributor

daipom commented Mar 24, 2022

As for timeout, how about managing the status this way?
The number of states will increase, but the helper will be easier to use for the plugins.

I have created PR: #3686

I have confirmed the following test success.

  • $ bundle exec rake test TESTOPTS="-t'BufferedOutputRetryTest'"
  • $ bundle exec rake test TESTOPTS="-t'RetryStateHelperTest'"

daipom added 2 commits March 24, 2022 14:42
Add 2 states to control timeout.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@daipom
Copy link
Contributor

daipom commented Mar 24, 2022

Thank you for the merge!

@daipom
Copy link
Contributor

daipom commented Mar 25, 2022

It's complicated how the behavior changes in the end.
I have sorted it out.

retry_timeout: 100

commit: 8ced42c

[1s elapsed point] 1th-Retry(PRI) is triggered.
[2s elapsed point] 2th-Retry(PRI) is triggered.
[4s elapsed point] 3th-Retry(PRI) is triggered.
[8s elapsed point] 4th-Retry(PRI) is triggered.
[16s elapsed point] 5th-Retry(PRI) is triggered.
[32s elapsed point] 6th-Retry(PRI) is triggered.
[64s elapsed point] 7th-Retry(PRI) is triggered.
[80s elapsed point] 8th-Retry(SEC) is triggered.
[81s elapsed point] 9th-Retry(SEC) is triggered.
[83s elapsed point] 10th-Retry(SEC) is triggered.
[87s elapsed point] 11th-Retry(SEC) is triggered.
[95s elapsed point] 12th-Retry(SEC) is triggered.
[100s elapsed point] 13th-Retry(SEC) is triggered.

commit: c759ebe

[1s elapsed point] 1th-Retry(PRI) is triggered.
[2s elapsed point] 2th-Retry(PRI) is triggered.
[4s elapsed point] 3th-Retry(PRI) is triggered.
[8s elapsed point] 4th-Retry(PRI) is triggered.
[16s elapsed point] 5th-Retry(PRI) is triggered.
[32s elapsed point] 6th-Retry(PRI) is triggered.
[64s elapsed point] 7th-Retry(PRI) is triggered.
[80s elapsed point] 8th-Retry(SEC) is triggered.
[81s elapsed point] 9th-Retry(SEC) is triggered.
[83s elapsed point] 10th-Retry(SEC) is triggered.
[87s elapsed point] 11th-Retry(SEC) is triggered.
[95s elapsed point] 12th-Retry(SEC) is triggered.
[100s elapsed point] 13th-Retry(SEC) is triggered.

This fix

[1s elapsed point] 1th-Retry(PRI) is triggered.
[3s elapsed point] 2th-Retry(PRI) is triggered.
[7s elapsed point] 3th-Retry(PRI) is triggered.
[15s elapsed point] 4th-Retry(PRI) is triggered.
[31s elapsed point] 5th-Retry(PRI) is triggered.
[63s elapsed point] 6th-Retry(PRI) is triggered.
[80s elapsed point] 7th-Retry(SEC) is triggered.
[81s elapsed point] 8th-Retry(SEC) is triggered.
[83s elapsed point] 9th-Retry(SEC) is triggered.
[87s elapsed point] 10th-Retry(SEC) is triggered.
[95s elapsed point] 11th-Retry(SEC) is triggered.
[100s elapsed point] 12th-Retry(SEC) is triggered.


retry_max_times: 10

commit: 8ced42c

[1s elapsed point] 1th-Retry(PRI) is triggered.
[2s elapsed point] 2th-Retry(PRI) is triggered.
[4s elapsed point] 3th-Retry(PRI) is triggered.
[8s elapsed point] 4th-Retry(PRI) is triggered.
[16s elapsed point] 5th-Retry(PRI) is triggered.
[32s elapsed point] 6th-Retry(PRI) is triggered.
[64s elapsed point] 7th-Retry(PRI) is triggered.
[128s elapsed point] 8th-Retry(PRI) is triggered.
[256s elapsed point] 9th-Retry(PRI) is triggered.
[409s elapsed point] 10th-Retry(SEC) is triggered.
[410s elapsed point] 11th-Retry(SEC) is triggered.

NOTE: secondary: false case is the following.

[1s elapsed point] 1th-Retry(PRI) is triggered.
[2s elapsed point] 2th-Retry(PRI) is triggered.
[4s elapsed point] 3th-Retry(PRI) is triggered.
[8s elapsed point] 4th-Retry(PRI) is triggered.
[16s elapsed point] 5th-Retry(PRI) is triggered.
[32s elapsed point] 6th-Retry(PRI) is triggered.
[64s elapsed point] 7th-Retry(PRI) is triggered.
[128s elapsed point] 8th-Retry(PRI) is triggered.
[256s elapsed point] 9th-Retry(PRI) is triggered.
[512s elapsed point] 10th-Retry(PRI) is triggered.
[1024s elapsed point] 11th-Retry(PRI) is triggered.

commit: c759ebe

[1s elapsed point] 1th-Retry(PRI) is triggered.
[2s elapsed point] 2th-Retry(PRI) is triggered.
[4s elapsed point] 3th-Retry(PRI) is triggered.
[8s elapsed point] 4th-Retry(PRI) is triggered.
[16s elapsed point] 5th-Retry(PRI) is triggered.
[32s elapsed point] 6th-Retry(PRI) is triggered.
[64s elapsed point] 7th-Retry(PRI) is triggered.
[128s elapsed point] 8th-Retry(PRI) is triggered.
[256s elapsed point] 9th-Retry(PRI) is triggered.
[512s elapsed point] 10th-Retry(PRI) is triggered.
[818s elapsed point] 11th-Retry(SEC) is triggered.

This fix

[1s elapsed point] 1th-Retry(PRI) is triggered.
[3s elapsed point] 2th-Retry(PRI) is triggered.
[7s elapsed point] 3th-Retry(PRI) is triggered.
[15s elapsed point] 4th-Retry(PRI) is triggered.
[31s elapsed point] 5th-Retry(PRI) is triggered.
[63s elapsed point] 6th-Retry(PRI) is triggered.
[127s elapsed point] 7th-Retry(PRI) is triggered.
[255s elapsed point] 8th-Retry(PRI) is triggered.
[511s elapsed point] 9th-Retry(PRI) is triggered.
[818s elapsed point] 10th-Retry(SEC) is triggered.

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
daipom added a commit to daipom/fluentd-docs-gitbook that referenced this pull request Mar 25, 2022
See also: fluent/fluentd#3649

Before the fix and fluent/fluentd#3640 ,
the both of retry behavior and the document were wrong.

Actual behavior: `c + c + cb^1 + ... + cb^(k-1)` (k+1 retries totally)
Total calculation: `cb^(-1) + c + cb^1 + ... + cb^(k-2)`

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@daipom
Copy link
Contributor

daipom commented Mar 25, 2022

I have created the document fix: fluent/fluentd-docs-gitbook#395

@daipom
Copy link
Contributor

daipom commented Mar 25, 2022

a4f15f9

Cool! Now the logic of update_retry_state becomes very clear!!

@ashie ashie merged commit 2da6e1b into master Mar 28, 2022
@ashie ashie deleted the issue-3609-2 branch March 28, 2022 00:59
@ashie ashie added this to the v1.14.6 milestone Mar 30, 2022
daipom added a commit to daipom/fluentd-docs-gitbook that referenced this pull request Mar 31, 2022
See also: fluent/fluentd#3649

Before the fix, the number was mistakenly 19.
This fixes it to 18.
(The document was also wrong.)

n-th retry is triggered at `2^n - 1` seconds elapsed since the first flush.
So 18-th retry is at 262143 seconds elapsed and this exceeds 72 hours (259200 seconds).
If the next retry is going to exceed this time limit, the last retry will be
made at exactly this time limit.
So the last time is 18.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exponential backoff is not calculated right
2 participants