Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

imc-dispatcher doesn't retry on connection failures or EOFs #4453

Closed
maschmid opened this issue Nov 3, 2020 · 1 comment · Fixed by #4454
Closed

imc-dispatcher doesn't retry on connection failures or EOFs #4453

maschmid opened this issue Nov 3, 2020 · 1 comment · Fixed by #4454
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@maschmid
Copy link
Contributor

maschmid commented Nov 3, 2020

Describe the bug
Having a Broker with retries

  delivery:
    backoffDelay: PT1S
    backoffPolicy: linear
    retry: 60

Sending random events and randomly deleting mt-broker-filter pods (to simulate failover on node-evictions or upgrades),
events are sometimes not delivered and retries are not attempted, with errors in the inmemorychannel-dispatcher logs like:

{"level":"error","ts":"2020-11-02T20:17:43.621Z","logger":"inmemorychannel-dispatcher","caller":"fanout/fanout_message_handler.go:189","msg":"Fanout had an error","error":"unable to complete request to http://broker-filter.knative-eventing.svc.cluster.local/triggers/imc-broker-retries-4/receiver-0/8fedd9ad-e54f-49fe-a4ac-116bc4fcb891: Post \"http://broker-filter.knative-eventing.svc.cluster.local/triggers/imc-broker-retries-4/receiver-0/8fedd9ad-e54f-49fe-a4ac-116bc4fcb891\": dial tcp 172.30.184.71:80: connect: connection refused","stacktrace":"knative.dev/eventing/pkg/channel/fanout.(*MessageHandler).dispatch\n\t/root/.gvm/pkgsets/go1.14/global/src/knative.dev/eventing/pkg/channel/fanout/fanout_message_handler.go:189\nknative.dev/eventing/pkg/channel/fanout.createMessageReceiverFunction.func1.1\n\t/root/.gvm/pkgsets/go1.14/global/src/knative.dev/eventing/pkg/channel/fanout/fanout_message_handler.go:143"}
{"level":"error","ts":"2020-11-02T18:46:00.131Z","logger":"inmemorychannel-dispatcher","caller":"fanout/fanout_message_handler.go:189","msg":"Fanout had an error","error":"unable to complete request to http://broker-filter.knative-eventing.svc.cluster.local/triggers/imc-broker-retries-2/receiver-0/003b2107-d672-4f01-b07a-f66b2f5c2f5c: Post \"http://broker-filter.knative-eventing.svc.cluster.local/triggers/imc-broker-retries-2/receiver-0/003b2107-d672-4f01-b07a-f66b2f5c2f5c\": EOF","stacktrace":"knative.dev/eventing/pkg/channel/fanout.(*MessageHandler).dispatch\n\t/opt/app-root/src/go/src/knative.dev/eventing/pkg/channel/fanout/fanout_message_handler.go:189\nknative.dev/eventing/pkg/channel/fanout.createMessageReceiverFunction.func1.1\n\t/opt/app-root/src/go/src/knative.dev/eventing/pkg/channel/fanout/fanout_message_handler.go:143"}

The problem is that checkRetry doesn't retry on network errors (and retries only when the HTTP request was successful, see

https://github.com/knative/eventing/blob/master/pkg/kncloudevents/message_sender.go#L192-L194

func checkRetry(_ context.Context, resp *nethttp.Response, _ error) (bool, error) {
	return resp != nil && resp.StatusCode >= 300, nil
}

Expected behavior
connection refused and EOF can commonly happen when the target pod is being restarted due to eviction, node restart or upgrade, Channels/Brokers should retry in these situations when configured to do so.

The current behaviour also defeats the purpose of #4440 , as retrying won't always save events during upgrades.

To Reproduce
Hard to reproduce, seen only in a few of the "broker filter pod delete" events , I'll try to create a concise reproducer...

Knative release version
Reproduced on 0.17.2, but affects any version with the current checkRetry logic.

Additional context
Note that cloudevents spec allows resending the same events in case of network errors (see https://github.com/cloudevents/spec/blob/v1.0/spec.md#id ):

If a duplicate event is re-sent (e.g. due to a network error) it MAY have the same id. Consumers MAY assume that Events with identical source and id are duplicates.
@maschmid maschmid added the kind/bug Categorizes issue or PR as related to a bug. label Nov 3, 2020
@pierDipi
Copy link
Member

pierDipi commented Nov 3, 2020

/assign

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants