Fix pluginwatcher flake #20212

Merged
merged 1 commit into from
Jul 6, 2018

@vikaschoudhary16 vikaschoudhary16 commented Jul 5, 2018

This is a stopgap (temporary) fix for the flake in #20136.
This part of the code changed recently upstream, and the issue does not occur in upstream kube. kubernetes/kubernetes#65662 cherry-picks the changed part to kube 1.11.

In the meantime, this fix helps avoid the flake.

The example plugin's implementation of the NotifyRegistrationStatus RPC writes the registration status to a channel that the unit test listens on. After receiving data on this channel, the unit test stops the example plugin server. This is where the race is: sometimes the plugin server is closed before NotifyRegistrationStatus can return, which shows up as an error on the client (plugin watcher), and the unit test checks for that error.

Fix: With this 1 ms delay after the status is received, the plugin server is not closed before the RPC returns.
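
For illustration only, here is a minimal sketch of the race described above, together with one way to close it without a sleep. It is not the pluginwatcher code: `fakePlugin`, `notifyRegistrationStatus`, `stop` and `runOnce` are hypothetical names, and swapping the sleep for a `sync.WaitGroup` is an assumption about what a deterministic fix could look like, not what this PR or the upstream change does.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fakePlugin is a hypothetical stand-in for the example plugin server.
type fakePlugin struct {
	statusCh chan bool      // registration status observed by the test
	inFlight sync.WaitGroup // handlers that have started but not yet returned
	mu       sync.Mutex
	stopped  bool
}

// notifyRegistrationStatus mimics the RPC handler: it publishes the status
// first and returns to its caller only afterwards.
func (p *fakePlugin) notifyRegistrationStatus(registered bool) error {
	p.inFlight.Add(1)
	defer p.inFlight.Done()

	p.statusCh <- registered
	// Race window: the test has already seen the status at this point,
	// but the handler has not returned yet.
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.stopped {
		return errors.New("server stopped before the RPC returned")
	}
	return nil
}

// stop waits for in-flight handlers to return before marking the server
// as stopped, instead of relying on a fixed delay.
func (p *fakePlugin) stop() {
	p.inFlight.Wait()
	p.mu.Lock()
	p.stopped = true
	p.mu.Unlock()
}

// runOnce plays the role of the unit test: receive the status, then stop.
func runOnce() error {
	p := &fakePlugin{statusCh: make(chan bool)}
	errCh := make(chan error, 1)
	go func() { errCh <- p.notifyRegistrationStatus(true) }()

	<-p.statusCh // the test observes the registration status
	p.stop()     // returns only after the handler has returned
	return <-errCh
}

func main() {
	for i := 0; i < 1000; i++ {
		if err := runOnce(); err != nil {
			fmt.Println("iteration", i, "failed:", err)
			return
		}
	}
	fmt.Println("no handler saw a stopped server")
}
```

In this sketch, removing the `p.inFlight.Wait()` call brings the race back: `stop` could then set `stopped` while the handler is still between the channel send and its return.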

Without this fix, the issue reproduces on my local machine with this command:

go test -race -count 10 k8s.io/kubernetes/pkg/kubelet/util/pluginwatcher -run TestExamplePlugin

With this fix, the following also passes:

go test -race -count 500 k8s.io/kubernetes/pkg/kubelet/util/pluginwatcher -run TestExamplePlugin

UPDATE: Increased sleep time to 1 sec

/cc @sjenning @liggitt

@vikaschoudhary16 vikaschoudhary16 requested a review from deads2k July 5, 2018 12:21
@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 5, 2018
@@ -212,6 +212,7 @@ func TestExamplePlugin(t *testing.T) {
func waitForPluginRegistrationStatus(t *testing.T, statusCh chan registerapi.RegistrationStatus) bool {
	select {
	case status := <-statusCh:
		time.Sleep(time.Millisecond)

This is still flake-prone under load. If a sleep is needed, sleep for longer.

Contributor Author

Increased to 1 sec; I think that should be enough.

sjenning commented Jul 5, 2018

@vikaschoudhary16 please make the commit msg UPSTREAM: <drop>: Fix pluginwatcher flake so that it will pass hack/verify-upstream-commits.sh and be dropped on the next rebase.

sjenning commented Jul 5, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2018

deads2k commented Jul 5, 2018

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sjenning, vikaschoudhary16

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 5, 2018
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@smarterclayton smarterclayton merged commit 44a6971 into openshift:master Jul 6, 2018
@smarterclayton

Package is now flaking with:

=== RUN   TestExamplePlugin
E0706 05:58:32.056187   21679 example_plugin.go:129] example server stopped serving: accept unix /tmp/plugin_test285297923/plugin.sock: use of closed network connection
E0706 05:58:33.056929   21679 example_plugin.go:147] Timed out on waiting for stop completion
--- FAIL: TestExamplePlugin (2.20s)
	assertions.go:254: 
			Error Trace:	plugin_watcher_test.go:138
			Error:      	Received unexpected error:
			            	Timed out on waiting for stop completion
			Test:       	TestExamplePlugin

in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/20231/pull-ci-origin-unit/965/

@vikaschoudhary16
Contributor Author

Sorry, my mistake. I did not retest after increasing the sleep from 1 ms to 1 sec. I failed to notice that the timeout for plugin stop is also 1 sec, so the test started flaking, which is to be expected.
Fixing this mistake here: #20237

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
queue/critical-fix
size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.

7 participants