Wait for Antrea client to be ready before starting watches #1042

Merged: tnqn merged 1 commit into antrea-io:master from the cert-warning branch on Aug 8, 2020

Member

@tnqn tnqn commented Aug 5, 2020

Fixes #1035

@antrea-bot
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

These commands can only be run by members of the vmware-tanzu organization.

@tnqn
Member Author

tnqn commented Aug 5, 2020

/test-all

Contributor

@antoninbas antoninbas left a comment

question about error handling, otherwise LGTM

Comment on lines 309 to 310
klog.Errorf("Unable to get Antrea client: %v", err)
return err

Contributor

Is this the right way to handle the error here? We will just log one error message and return the error (which will be ignored by the caller). I would argue that this is more of a "fatal" error, that should cause the agent to crash, so that it doesn't go unnoticed. What do you think?

Member Author

Thanks for pointing it out. I just took another look: wait.PollImmediateUntil only returns an error here when stopCh is closed, which happens as part of the process exiting. So perhaps the log should just say that it is no longer waiting for the client, and the goroutine should stop gracefully. We shouldn't log errors for this case in the other controllers either. For comparison, see:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/shared_informer.go#L266
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemon_controller.go#L288
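
To make the point above concrete, here is a minimal sketch of the pattern, not the code from this PR. It uses only the Go standard library: `pollImmediateUntil` is a simplified stand-in for the real polling helper, and `errWaitStopped` and `waitForClient` are hypothetical names. It assumes the condition function itself never returns an error, so an error from the poll can only mean that the stop channel was closed.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errWaitStopped is a hypothetical sentinel for "the stop channel was closed
// before the condition became true".
var errWaitStopped = errors.New("stopped waiting for the condition")

// pollImmediateUntil is a simplified stand-in for the polling helper discussed
// above. It checks condition immediately, then once per interval, until
// condition returns true, condition returns an error, or stopCh is closed.
func pollImmediateUntil(interval time.Duration, condition func() (bool, error), stopCh <-chan struct{}) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := condition()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-stopCh:
			return errWaitStopped
		case <-ticker.C:
		}
	}
}

// waitForClient reports whether the (hypothetical) client became ready.
// The condition never returns an error, so a non-nil error from the poll
// means stopCh was closed: the process is exiting and the caller should
// return quietly rather than log an error.
func waitForClient(ready func() bool, stopCh <-chan struct{}) bool {
	err := pollImmediateUntil(10*time.Millisecond, func() (bool, error) {
		return ready(), nil
	}, stopCh)
	if err != nil {
		fmt.Println("Stopped waiting for the client to be ready")
		return false
	}
	return true
}

func main() {
	stopCh := make(chan struct{})
	close(stopCh) // simulate the process shutting down
	fmt.Println("client ready:", waitForClient(func() bool { return false }, stopCh))
}
```

A caller would start its watches only when `waitForClient` returns true and simply return otherwise, which is the graceful stop described in the comment.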

Contributor

@antoninbas antoninbas left a comment

LGTM. I think my one concern is that if the ConfigMap doesn't become available for a while, e.g. because of a controller issue, we will not get a single message in the logs about it, unless I am missing something. Maybe a log message with an exponential backoff would be nice.

antoninbas previously approved these changes Aug 6, 2020
jianjuns previously approved these changes Aug 6, 2020
@tnqn tnqn dismissed stale reviews from jianjuns and antoninbas via b184ae2 August 7, 2020 15:10
@tnqn
Member Author

tnqn commented Aug 7, 2020

/test-all

@tnqn
Member Author

tnqn commented Aug 7, 2020

> LGTM. I think my one concern is that if the ConfigMap doesn't become available for a while, e.g. because of a controller issue, we will not get a single message in the logs about it, unless I am missing something. Maybe a log message with an exponential backoff would be nice.

@antoninbas I understand your concern now. I have added an info log, emitted every 2 seconds, to indicate that the controller is waiting for the Antrea client to be ready. I didn't use exponential backoff for the whole check because the check still needs to be aggressive: it's important to receive network policies as early as possible. Please see if it addresses your concern.

Contributor

@antoninbas antoninbas left a comment

LGTM

@tnqn tnqn merged commit 2ec3fab into antrea-io:master Aug 8, 2020
@tnqn tnqn deleted the cert-warning branch August 8, 2020 04:43
GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Sep 22, 2020

Successfully merging this pull request may close these issues.

Watches always fail to start once when agent restarts
5 participants