Windows kubelet stats timeout updates #87730

marosset · 2020-01-31T19:07:00Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR addresses an issue where kubelet metrics calls take a very long time on Windows nodes if more than a handful of containers are running.

Which issue(s) this PR fixes:

Fixes Stats performance is slow on Windows #74991

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Kubelet metrics gathered through metrics-server or Prometheus should no longer time out for Windows nodes running more than 3 pods.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

marosset · 2020-01-31T19:07:23Z

/sig windows
cc @PatrickLang

/test pull-kubernetes-e2e-aks-engine-azure-windows

test/e2e/windows/kubelet_metrics.go

PatrickLang · 2020-01-31T23:29:52Z

/retest pull-kubernetes-integration

PatrickLang · 2020-02-01T00:54:14Z

/milestone v1.18

PatrickLang · 2020-02-01T00:55:26Z

I'm testing a custom build with these changes plus #86101 since it also depends on metrics :)

PatrickLang · 2020-02-01T01:04:59Z

/assign @yliaog

PatrickLang · 2020-02-01T01:05:31Z

/assign @benmoss

pkg/kubelet/dockershim/docker_stats_windows.go

PatrickLang · 2020-02-01T01:12:38Z

/lgtm
I applied both this metrics fix and the limits fix, built, and metrics & limits tests all pass.

liggitt · 2020-02-06T17:15:21Z

test/e2e/windows/kubelet_stats.go

+	}
+
+	if foundNode == false {
+		framework.Skipf("Could not find and ready and schedulable Windows nodes")


This is failing the batch merge (see https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/batch/pull-kubernetes-bazel-build/1225465355036528642).

Looks like `Skipf` got moved from `framework` to `framework/skipper` in commit 641321c.

I'll rebase and push an update.
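As a rough sketch of the check in the diff above: the test looks for a ready, schedulable Windows node and skips when none exists. The `node` type and `findWindowsNode` helper below are hypothetical stand-ins, and the skip is only printed here; in the e2e suite the skip call would come from the `framework/skipper` package after the move mentioned above.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a Kubernetes node object.
type node struct {
	Name          string
	OS            string
	Ready         bool
	Unschedulable bool
}

// findWindowsNode returns the first ready and schedulable Windows node.
// The boolean result is false when no such node exists.
func findWindowsNode(nodes []node) (node, bool) {
	for _, n := range nodes {
		if n.OS == "windows" && n.Ready && !n.Unschedulable {
			return n, true
		}
	}
	return node{}, false
}

func main() {
	nodes := []node{
		{Name: "linux-0", OS: "linux", Ready: true},
		{Name: "win-0", OS: "windows", Ready: false},
		{Name: "win-1", OS: "windows", Ready: true},
	}

	target, found := findWindowsNode(nodes)
	if !found {
		// A real e2e test would call the skip helper here instead of printing.
		fmt.Println("skipping: could not find any ready and schedulable Windows nodes")
		return
	}
	fmt.Println("running stats test against node", target.Name)
}
```

Writing the condition as `!found` rather than `found == false` is the more idiomatic Go form.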

liggitt · 2020-02-06T17:16:05Z

/test pull-kubernetes-bazel-build

k8s-ci-robot · 2020-02-06T19:56:16Z

@marosset: The following test failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
pull-kubernetes-e2e-aks-engine-azure-windows	e5f6c6b5e205f369895d47d5858ce7eb2c2d5165	link	`/test pull-kubernetes-e2e-aks-engine-azure-windows`

marosset · 2020-02-06T20:24:49Z

/retest

PatrickLang · 2020-02-06T21:35:59Z

/lgtm

…s are present on the node

Following the changes in kubernetes#87730, the kubelet uses hcsshim directly to gather stats. However, unlike the `docker stats` API that was used before, hcsshim does not keep information about exited containers. When the kubelet lists containers (`docker_container.go:ListContainers()`), it sets `All: true`, which also retrieves non-running containers. When `docker stats` is called with the ID of such a container, it returns valid JSON with all values set to 0, and the non-running containers are filtered out later in the process. When hcsshim is called with the ID of such a container, it returns an error, which effectively stops stats retrieval for all containers.

k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Jan 31, 2020

k8s-ci-robot requested review from mtaufen and yujuhong January 31, 2020 19:07

PatrickLang reviewed Jan 31, 2020

test/e2e/windows/kubelet_metrics.go (outdated)

k8s-ci-robot added this to the v1.18 milestone Feb 1, 2020

k8s-ci-robot assigned yliaog Feb 1, 2020

k8s-ci-robot assigned benmoss Feb 1, 2020

yliaog reviewed Feb 1, 2020

pkg/kubelet/dockershim/docker_stats_windows.go (outdated)

k8s-ci-robot assigned PatrickLang Feb 1, 2020

liggitt reviewed Feb 6, 2020

marosset added 2 commits February 6, 2020 17:59

adding e2e test to ensure it takes less than 10 seconds to query kubelet stats for windows nodes (e8f1269)

Calling hcsshim instead of docker api to get stats for windows to greatly reduce latency (999fdfa)

marosset force-pushed the windows-kubelet-stats-timeout-updates branch from e54fd36 to 999fdfa Compare February 6, 2020 18:17

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2020

k8s-ci-robot merged commit 6a92f19 into kubernetes:master Feb 6, 2020

marosset deleted the windows-kubelet-stats-timeout-updates branch February 6, 2020 22:19

PatrickLang mentioned this pull request Feb 17, 2020

CAdvisor endpoint taking too long to return metrics for windows nodes #75752

Closed

This was referenced Feb 24, 2020

Automated cherry pick of #87730 upstream release 1.17 #88490

Merged

Automated cherry pick of #87730 upstream release 1.16 #88491

Merged

Automated cherry pick of #87730 upstream release 1.15 #88492

Merged

This was referenced Mar 2, 2020

Stats performance is slow on Windows #74991

Closed

The kubelet /stats/summary is too expensive on windows #82522

Closed

k8s-ci-robot added a commit that referenced this pull request Apr 7, 2020

Merge pull request #88492 from marosset/automated-cherry-pick-of-#87730-upstream-release-1.15 (2566b40)

Automated cherry pick of #87730 upstream release 1.15

k8s-ci-robot added a commit that referenced this pull request Apr 7, 2020

Merge pull request #88490 from marosset/automated-cherry-pick-of-#87730-upstream-release-1.17 (4a13222)

Automated cherry pick of #87730 upstream release 1.17

k8s-ci-robot added a commit that referenced this pull request Apr 7, 2020

Merge pull request #88491 from marosset/automated-cherry-pick-of-#87730-upstream-release-1.16 (33357ef)

Automated cherry pick of #87730 upstream release 1.16

vboulineau mentioned this pull request Apr 28, 2020

kubelet: fix /stats/summary endpoint on Windows when init-containers are present on the node #90554

Merged

mikkelhegn mentioned this pull request Jun 23, 2020

Common Horizontal Pod Autoscaler events in AKS cluster Azure/AKS#1137

Closed
