
Retry on network failures #4454

Merged (2 commits, Nov 3, 2020)

Conversation

@pierDipi pierDipi (Member) commented Nov 3, 2020

Fixes #4453

Proposed Changes

  • Retry on network failures

Release Note

- 🐛 Fix bug
Retry on network failures

Test execution logs:

2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697
2020/11/03 13:25:27 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697: retrying in 0s (10 left)
2020/11/03 13:25:27 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697: retrying in 100ms (9 left)
2020/11/03 13:25:27 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697: retrying in 200ms (8 left)
2020/11/03 13:25:27 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697: retrying in 300ms (7 left)
2020/11/03 13:25:28 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:28 [DEBUG] POST http://127.0.0.1:21697: retrying in 400ms (6 left)
2020/11/03 13:25:28 [DEBUG] POST http://127.0.0.1:21697 (status: 503): retrying in 500ms (5 left)
2020/11/03 13:25:29 [DEBUG] POST http://127.0.0.1:21697 (status: 503): retrying in 600ms (4 left)
2020/11/03 13:25:29 [DEBUG] POST http://127.0.0.1:21697 (status: 503): retrying in 700ms (3 left)
2020/11/03 13:25:30 [DEBUG] POST http://127.0.0.1:21697 (status: 503): retrying in 800ms (2 left)
--- PASS: TestRetriesOnNetworkErrors (3.61s)
PASS

Process finished with exit code 0
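The log format above appears to come from the hashicorp/go-retryablehttp client. As a rough, standalone illustration (not this PR's actual sender code), here is a minimal sketch of wiring such a client up with the retry predicate this change introduces; the URL, retry count, and wait bounds are illustrative values:

package main

// Sketch only: a retrying POST using github.com/hashicorp/go-retryablehttp.
// The target URL and the retry/backoff settings are illustrative, not the
// values used by the eventing dispatcher.

import (
	"context"
	"net/http"
	"strings"
	"time"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

func main() {
	client := retryablehttp.NewClient()
	client.RetryMax = 10
	client.RetryWaitMin = 100 * time.Millisecond
	client.RetryWaitMax = time.Second

	// Retry on any transport-level error (resp == nil) and on any response
	// that is not 2xx; this is the predicate from the checkRetry change below.
	client.CheckRetry = func(_ context.Context, resp *http.Response, err error) (bool, error) {
		return !(resp != nil && resp.StatusCode < 300), err
	}

	resp, err := client.Post("http://127.0.0.1:21697", "application/json", strings.NewReader("{}"))
	if err != nil {
		// Either a non-retryable error occurred or all retries were exhausted.
		panic(err)
	}
	defer resp.Body.Close()
}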

@knative-prow-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Nov 3, 2020
@knative-prow-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pierDipi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Nov 3, 2020
@google-cla bot added the cla: yes label (indicates the PR's author has signed the CLA) on Nov 3, 2020
@codecov codecov bot commented Nov 3, 2020

Codecov Report

Merging #4454 into master will increase coverage by 0.12%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4454      +/-   ##
==========================================
+ Coverage   81.06%   81.19%   +0.12%     
==========================================
  Files         281      281              
  Lines        7981     7981              
==========================================
+ Hits         6470     6480      +10     
+ Misses       1122     1112      -10     
  Partials      389      389              
Impacted Files Coverage Δ
pkg/kncloudevents/message_sender.go 79.66% <100.00%> (+16.94%) ⬆️


Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
@pierDipi pierDipi (Member Author) commented Nov 3, 2020

/cc @slinkydeveloper

Before:
func checkRetry(_ context.Context, resp *nethttp.Response, _ error) (bool, error) {
	return resp != nil && resp.StatusCode >= 300, nil
}

After:
func checkRetry(_ context.Context, resp *nethttp.Response, err error) (bool, error) {
	return !(resp != nil && resp.StatusCode < 300), err
}
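To make the difference concrete, here is a small standalone comparison of the two predicates above on a network error (nil response), a 2xx response, and a 503; the harness itself is purely illustrative:

package main

import (
	"errors"
	"fmt"
	nethttp "net/http"
)

// oldCheck is the previous predicate: with a nil response (network error)
// it returns false, so transport failures were never retried.
func oldCheck(resp *nethttp.Response, _ error) bool {
	return resp != nil && resp.StatusCode >= 300
}

// newCheck is the predicate from this PR: anything that is not a non-nil
// response with a status below 300 is retried, including network errors.
func newCheck(resp *nethttp.Response, _ error) bool {
	return !(resp != nil && resp.StatusCode < 300)
}

func main() {
	cases := []struct {
		name string
		resp *nethttp.Response
		err  error
	}{
		{"network error", nil, errors.New("dial tcp: connection refused")},
		{"202 Accepted", &nethttp.Response{StatusCode: 202}, nil},
		{"503 Service Unavailable", &nethttp.Response{StatusCode: 503}, nil},
	}
	for _, c := range cases {
		// Prints: network error old=false new=true; 202 old=false new=false;
		// 503 old=true new=true.
		fmt.Printf("%-25s old=%v new=%v\n", c.name, oldCheck(c.resp, c.err), newCheck(c.resp, c.err))
	}
}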
@slinkydeveloper slinkydeveloper (Contributor) commented Nov 3, 2020


IMO we should not retry on every error... For some of them it is reasonable not to retry, like a dial tcp timeout when the TCP address is not reachable, or DNS errors... My feeling is that it should be configurable which errors to retry and which not, because otherwise you bloat the dispatchers with constant retrying when there is no need (because maybe the pointed-to service doesn't exist at all)...

What are people's feelings about that? @matzew, @vaikas, @lionelvillard, @grantr?

Contributor

I think retrying on any error is fine. I'd rather not have the user deal with defining too many of the error types: network, DNS, etc. If they want retry, it seems reasonable to expect any error to be retried, not just errors that happen at the protocol level.

Contributor

On k8s, pods may be started and deleted for many reasons (scale-up/down, node drain, node restart, ...), and not all addressables will implement proper readiness and graceful shutdown, so it may be quite common to get all kinds of errors during normal operation, and IMHO users will expect that setting "retries" will make the dispatcher retry in these kinds of situations...

Another thing to consider is that doing retries on network errors may cause a duplicate event to be delivered (which seems allowed by the cloudevents spec, https://github.com/cloudevents/spec/blob/v1.0/spec.md#id ), so perhaps some users may prefer for events to be lost instead of potentially delivered twice... (of course, they can still set retries to 0 then...)

Contributor

Another thing to consider is that doing retries on network errors may cause a duplicate event to be delivered (which seems allowed by the cloudevents spec, https://github.com/cloudevents/spec/blob/v1.0/spec.md#id ), so perhaps some users may prefer for events to be lost instead of potentially delivered twice... (of course, they can still set retries to 0 then...)

Well, but this ends up in the discussion of "exactly once", which atm is out of scope for eventing.

Member Author

I think retrying is a reasonable default; if there is interest in specific configurations, we can always add them later.

Contributor

I agree with @slinkydeveloper that not every error should be retried. FWIW, this is the logic in the "distributed" KafkaChannel dispatcher for determining when to retry. Also, we only retry some of the following because the broker munges the actual responses (400) and the retry test expects them (429)...

	//
	// Note - Normally we would NOT want to retry 400 responses, BUT the knative-eventing
	//        filter handler (due to CloudEvents SDK V1 usage) is swallowing the actual
	//        status codes from the subscriber and returning 400s instead.  Once this has,
	//        been resolved we can remove 400 from the list of codes to retry.
	//
	if statusCode >= 500 || statusCode == 400 || statusCode == 404 || statusCode == 429 {
		logger.Warn("Failed To Send Message To Subscriber Service - Retrying")
		return true, nil
	} else if statusCode >= 300 && statusCode <= 399 {
		logger.Warn("Failed To Send Message To Subscriber Service - Not Retrying")
		return false, nil
	} else if statusCode == -1 {
		logger.Warn("No StatusCode Detected In Error - Retrying")
		return true, nil
	}

	// Do Not Retry 1XX, 2XX, & Most 4XX StatusCode Responses
	return false, nil

Member Author

We're discussing network errors, not status codes; for status codes there is a separate issue: #2411

Contributor

+1 for retry on error. Network errors may happen for various reasons IMO. If we have concerns about too many retries, we can always use another backoff strategy, e.g. exponential.
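For reference, a minimal generic sketch of the capped exponential backoff strategy suggested here; the function name, base, and cap are illustrative and are not anything from this PR or the library it uses:

package backoff

import "time"

// Exponential returns base doubled attempt times, capped at max.
// Example: Exponential(100*time.Millisecond, 5*time.Second, n) yields
// 100ms, 200ms, 400ms, 800ms, ... and then stays at 5s.
func Exponential(base, max time.Duration, attempt int) time.Duration {
	wait := base
	// Double until either the attempt count is consumed or the cap is reached;
	// stopping at the cap also avoids integer overflow for large attempt values.
	for i := 0; i < attempt && wait < max; i++ {
		wait *= 2
	}
	if wait > max {
		wait = max
	}
	return wait
}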

Member

+1 for retry on error.

Member

I think retrying on any error is fine. I'd rather not have the user deal with defining too many of the error types: network, DNS, etc. If they want retry, it seems reasonable to expect any error to be retried, not just errors that happen at the protocol level.

that would be my thinking here too


pkg/kncloudevents/message_sender_test.go (outdated, resolved review thread)
func TestRetriesOnNetworkErrors(t *testing.T) {
	port := rand.Int31n(math.MaxUint16-1024) + 1024
	n := int32(10)
Contributor

I don't think you need all the int32.

Member Author

I use n to set DeliverySpec.Retry, which is an int32, and then I compare it with nCalls, so instead of a cast when I compare them I just use the same type for both. Does that make sense?

Contributor

Ah, makes sense. I thought you were just doing that for n and nCalls. I personally would do the cast once in Int32Ptr, but that's just me.

Contributor

What's the approximate run time for this test? I think the linear backoff was recently changed to t, 2t, 3t, 4t instead of t, t, t, t for the interval.

Member Author

I attached the test logs in the PR body; it ran in 3.61s.
I use the smallest backoff supported by the library, but we can reduce n if that's too much for a single unit test.
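As a rough, self-contained illustration of the pattern under discussion (nothing listens on the port at first, so early attempts fail at the network level; the server comes up part-way through and a retried request eventually succeeds), here is a sketch. The port, delays, and hand-rolled retry loop are illustrative only; the real test in pkg/kncloudevents/message_sender_test.go uses the package's own sender and retry configuration.

package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	const addr = "127.0.0.1:21697" // illustrative; the test picks its own port

	// Bring the server up only after a delay so the first attempts are refused.
	go func() {
		time.Sleep(400 * time.Millisecond)
		ln, err := net.Listen("tcp", addr)
		if err != nil {
			panic(err)
		}
		_ = http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusAccepted)
		}))
	}()

	// Linear backoff: 0ms, 100ms, 200ms, ... mirroring the intervals in the
	// test log above (which sum to roughly 3.6s over the nine waits shown).
	var lastErr error
	for attempt := 0; attempt <= 10; attempt++ {
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond)
		resp, err := http.Get("http://" + addr)
		if err != nil {
			lastErr = err // connection refused until the listener exists
			continue
		}
		resp.Body.Close()
		fmt.Printf("succeeded on attempt %d with status %d\n", attempt+1, resp.StatusCode)
		return
	}
	fmt.Println("all attempts failed:", lastErr)
}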

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
@knative-metrics-robot
The following is the coverage report on the affected files.
Say /test pull-knative-eventing-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/kncloudevents/message_sender.go 65.9% 82.9% 17.1

@zhongduo zhongduo (Contributor) commented Nov 3, 2020

/lgtm

@knative-prow-robot added the lgtm label (indicates that a PR is ready to be merged) on Nov 3, 2020
@knative-prow-robot merged commit c2bfc44 into knative:master on Nov 3, 2020
@pierDipi pierDipi (Member Author) commented Nov 3, 2020

Should we backport this?

@matzew matzew (Member) commented Nov 3, 2020

Should we backport this?

It would be good to backport to 0.18 and 0.17?

pierDipi added a commit to pierDipi/eventing that referenced this pull request Nov 3, 2020
Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
pierDipi added a commit to pierDipi/eventing that referenced this pull request Nov 3, 2020
* Retry on network failure

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Use a fixed port number

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
@pierDipi pierDipi (Member Author) commented Nov 3, 2020

@matzew backported; please check the linked PRs.

matzew pushed a commit to matzew/eventing that referenced this pull request Nov 3, 2020
* Retry on network failure

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Use a fixed port number

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
knative-prow-robot pushed a commit that referenced this pull request Nov 3, 2020
* Retry on network failures (#4454)

* Retry on network failure

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Use a fixed port number

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
openshift-merge-robot pushed a commit to openshift/knative-eventing that referenced this pull request Nov 4, 2020
* Retry on network failures (knative#4454)

* Retry on network failure

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Use a fixed port number

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* fixing imports

Signed-off-by: Matthias Wessendorf <mwessend@redhat.com>

Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
knative-prow-robot pushed a commit that referenced this pull request Nov 4, 2020
* Retry on network failures (#4454)

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
matzew pushed a commit to matzew/eventing that referenced this pull request Nov 7, 2020
* Retry on network failures (knative#4454)

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
matzew pushed a commit to openshift/knative-eventing that referenced this pull request Nov 7, 2020
* Retry on network failures (knative#4454)

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
openshift-merge-robot pushed a commit to openshift/knative-eventing that referenced this pull request Nov 7, 2020
* Update pingsource-mt-adapter.yaml

* Like on 0.18.3, we skip the tracing tests

Signed-off-by: Matthias Wessendorf <mwessend@redhat.com>

* [release-0.18] Retry on network failures (knative#4454) (knative#4457)

* Retry on network failures (knative#4454)

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Backport knative#4465 (knative#4468)

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* [0.18] Backport knative#4466 (knative#4471)

* Remove double invocations to responseWriter.WriteHeader in filter handler (knative#4466)

* Fix knative#4464

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Docs

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Moar tests

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Linting

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Nit with metrics

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

(cherry picked from commit a6fc540)
Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Nit

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* fixed wrong marshall in apiserversouece which will fix the missing ceOverrides extension  (knative#4477) (knative#4480)

* fixed wrong marshall

* fixed UT

* [0.18] Readyness probe in broker ingress (knative#4483)

* Fix knative#4473

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Massage the filter yaml

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

Co-authored-by: Matthias Wessendorf <mwessend@redhat.com>
Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
Co-authored-by: Francesco Guardiani <francescoguard@gmail.com>
Co-authored-by: capri-xiyue <52932582+capri-xiyue@users.noreply.github.com>
@pierDipi pierDipi deleted the KNATIVE-4453 branch November 25, 2021 14:28

Successfully merging this pull request may close these issues.

imc-dispatcher doesn't retry on connection failures or EOFs
10 participants