Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flaky Test] In-tree Volumes: Error getting c-m metrics : the server could not find the requested resource #86318

Closed
spiffxp opened this issue Dec 16, 2019 · 24 comments · Fixed by #88982
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Milestone

Comments

@spiffxp
Copy link
Member

spiffxp commented Dec 16, 2019

Which jobs are flaking:

  • pull-kubernetes-e2e-gce
  • ci-kubernetes-e2e-gci-gce

Which test(s) are flaking:

  • [sig-storage] In-tree Volumes [Driver: hostPath] [Testpattern: Inline-volume (default fs)] subPath should support readOnly file specified in the volumeMount [LinuxOnly]
  • [sig-storage] In-tree Volumes [Driver: emptydir] [Testpattern: Inline-volume (default fs)] subPath should support file as subpath [LinuxOnly]
  • [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-link] [Testpattern: Pre-provisioned PV (default fs)] volumes should allow exec of files on the volume
  • [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: blockfs] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing single file [LinuxOnly]
  • [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] volumes should store data
  • [sig-storage] In-tree Volumes [Driver: gcepd] [Testpattern: Inline-volume (default fs)] volumes should store data
  • [sig-storage] In-tree Volumes [Driver: gluster] [Testpattern: Inline-volume (default fs)] subPath should be able to unmount after the subpath directory is deleted
  • [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Pre-provisioned PV (default fs)] subPath should be able to unmount after the subpath directory is deleted
  • [sig-storage] In-tree Volumes [Driver: gcepd] [Testpattern: Dynamic PV (default fs)] subPath should support existing directory
  • [sig-storage] In-tree Volumes [Driver: gluster] [Testpattern: Inline-volume (default fs)] subPath should support non-existent path
  • [sig-storage] In-tree Volumes [Driver: hostPathSymlink] [Testpattern: Inline-volume (default fs)] subPath should support existing directories when readOnly specified in the volumeSource
  • [sig-storage] In-tree Volumes [Driver: hostPath] [Testpattern: Inline-volume (default fs)] subPath should be able to unmount after the subpath directory is deleted
  • [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing directories when readOnly specified in the volumeSource
  • [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-link-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing single file [LinuxOnly]
  • [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: block] [Testpattern: Pre-provisioned PV (ext3)] volumes should allow exec of files on the volume
  • [sig-storage] In-tree Volumes [Driver: gcepd] [Testpattern: Dynamic PV (default fs)(allowExpansion)] volume-expand Verify if offline PVC expansion works
  • [sig-storage] In-tree Volumes [Driver: gcepd] [Testpattern: Dynamic PV (ext3)] volumes should store data

Testgrid link:

Reason for failure:
Failure fetching metrics:

Dec 15 22:01:17.942: Error getting c-m metrics : the server rejected our request for an unknown reason (get pods kube-controller-manager-e2e-4fc8c5bbfe-674b9-master:10252)
...
Causes: [
                    {
                        Type: "UnexpectedServerResponse",
                        Message: "unknown",
                        Field: "",
                    },
                ],
Dec  4 20:06:05.021: Error getting Kubelet e2e-f19735d0af-674b9-minion-group-l4ln metrics: the server is currently unable to handle the request (get nodes e2e-f19735d0af-674b9-minion-group-l4ln:10250)
...
Causes: [
                    {
                        Type: "UnexpectedServerResponse",
                        Message: "Error trying to reach service: 'dial tcp 10.40.0.4:10250: i/o timeout'",
                        Field: "",
                    },
                ],

Anything else we need to know:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&text=Error%20getting%20c-m%20metrics&job=pull-kubernetes-e2e-gce%24&test=sig-storage

This appears to have restarted abruptly around 2/11-2/12

/sig storage
/priority important-soon

@spiffxp spiffxp added the kind/flake Categorizes issue or PR as related to a flaky test. label Dec 16, 2019
@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Dec 16, 2019
@spiffxp spiffxp added this to the v1.18 milestone Dec 16, 2019
@liggitt
Copy link
Member

liggitt commented Dec 17, 2019

@kubernetes/sig-node-test-failures

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Dec 17, 2019
@liggitt
Copy link
Member

liggitt commented Dec 17, 2019

after sweeping a bunch of the in-tree test failures, almost all seemed to have #86312 as a root cause, triggered by very short-lived containers used to check file content or permissions.

The only test I saw failing that didn't seem to fit that was:

[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should support non-existent path

test/e2e/storage/testsuites/subpath.go:191
Dec 17 19:12:19.924: Unexpected error:
    <*errors.errorString | 0xc0027e44b0>: {
        s: "expected \"content of file \\\"/test-volume/provisioning-9844/test-file\\\": mount-tester new file\" in container output: Expected\n    <string>: content of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-9844/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retrying\nto contain substring\n    <string>: content of file \"/test-volume/provisioning-9844/test-file\": mount-tester new file",
    }
    expected "content of file \"/test-volume/provisioning-9844/test-file\": mount-tester new file" in container output: Expected
        <string>: content of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-9844/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retrying
    to contain substring
        <string>: content of file "/test-volume/provisioning-9844/test-file": mount-tester new file
occurred
test/e2e/framework/util.go:829

[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should support existing directory

test/e2e/storage/testsuites/subpath.go:202
Dec 17 14:08:37.147: Unexpected error:
    <*errors.errorString | 0xc0021bf240>: {
        s: "expected \"content of file \\\"/test-volume/provisioning-5457/test-file\\\": mount-tester new file\" in container output: Expected\n    <string>: content of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retryingcontent of file \"/test-volume/provisioning-5457/test-file\": \n    Unexpected content. Expected: mount-tester new file\n    . Retrying\nto contain substring\n    <string>: content of file \"/test-volume/provisioning-5457/test-file\": mount-tester new file",
    }
    expected "content of file \"/test-volume/provisioning-5457/test-file\": mount-tester new file" in container output: Expected
        <string>: content of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retryingcontent of file "/test-volume/provisioning-5457/test-file": 
        Unexpected content. Expected: mount-tester new file
        . Retrying
    to contain substring
        <string>: content of file "/test-volume/provisioning-5457/test-file": mount-tester new file
occurred
test/e2e/framework/util.go:829

examples:

I'd recommend repurposing this issue to address that specifically and leave the OCI issue for the short-lived container issue

@MorrisLaw
Copy link
Member

MorrisLaw commented Jan 30, 2020

Here's a PR for context:
#87090

So, I think this PR is another example of flakes. Here are some links to previous runs: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/87090/pull-kubernetes-e2e-gce/1222925370781601792/ and https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/87090/pull-kubernetes-e2e-gce/1215759459838595074/

same message:

[sig-storage] In-tree Volumes [Driver: hostPath] [Testpattern: Inline-volume (default fs)] subPath should support readOnly file specified in the volumeMount

This prow: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/87090/pull-kubernetes-e2e-kind-ipv6/1215759459826012161 has the above message and a few more, all [sig-storage]

[sig-storage] PersistentVolumes-local [Volume type: blockfswithoutformat] Two pods mounting a local volume one after the other should be able to write from pod1 and read from pod2

[sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] One pod requesting one prebound PVC should be able to mount volume and read from pod1

[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: block] [Testpattern: Pre-provisioned PV (default fs)] subPath should support non-existent path

[sig-storage] PersistentVolumes-local [Volume type: blockfswithformat] Set fsGroup for local volume should set different fsGroup for second pod if first pod is deleted 

[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: block] [Testpattern: Pre-provisioned PV (ext3)] volumes should store data

A lot of in-tree volume issues, but there were some persistent volume ones too. But, that showed up once so far from what I've seen.

@liggitt
Copy link
Member

liggitt commented Feb 14, 2020

The "error getting c-m metrics" failures appear to have just restarted:

https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&text=Error%20getting%20c-m%20metrics

@liggitt liggitt changed the title [Flaky Test] In-tree Volumes [Flaky Test] In-tree Volumes: Error getting c-m metrics : the server could not find the requested resource Feb 14, 2020
@liggitt
Copy link
Member

liggitt commented Feb 14, 2020

/priority critical-urgent

bumping priority since we're seeing significant numbers of failures that just started in the last two days

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 14, 2020
@liggitt
Copy link
Member

liggitt commented Mar 5, 2020

@gnufied
Copy link
Member

gnufied commented Mar 5, 2020

I haven't had chance to fully debug this yet. @davidz627 looked into this little bit and he suspected that these tests started to flake same time around #85029 was merged. I should be able to take a closer look tomorrow.

@aojea
Copy link
Member

aojea commented Mar 9, 2020

This occurrence https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/88934/pull-kubernetes-e2e-gce/1237026445675466753/ fails because it's trying to get the metrics from the kube-controller-manager pod but it is not ready until one minute later

            Message: "the server could not find the requested resource (get pods kube-controller-manager-e2e-b4cf5dd729-674b9-master:10252)",

if we grep in the kube-apiserver for the controller pod name we can observe that after the pod is up it starts to get 200 instead of 404

I0309 15:05:22.546216       1 httplog.go:90] verb="GET" URI="/api/v1/namespaces/kube-system/pods/kube-controller-manager-e2e-b4cf5dd729-674b9-master:10252/proxy/metrics" latency=396.390669ms resp=404 UserAgent="e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format -- [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: block] [Testpattern: Pre-provisioned PV (default fs)] volumes should allow exec of files on the volume" srcIP="35.188.161.128:38518":
I0309 15:05:25.073291       1 httplog.go:90] verb="GET" URI="/api/v1/namespaces/kube-system/pods/kube-controller-manager-e2e-b4cf5dd729-674b9-master:10252/proxy/metrics" latency=92.126104ms resp=404 UserAgent="e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format -- [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing directory" srcIP="35.188.161.128:38496":
I0309 15:05:29.247571       1 httplog.go:90] verb="GET" URI="/api/v1/namespaces/kube-system/pods/kube-controller-manager-e2e-b4cf5dd729-674b9-master:10252/proxy/metrics" latency=149.299022ms resp=404 UserAgent="e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format -- [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-link-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] subPath should be able to unmount after the subpath directory is deleted" srcIP="35.188.161.128:38472":
I0309 15:05:35.389649       1 admission.go:211] getting pod security policies for pod kube-controller-manager-e2e-b4cf5dd729-674b9-master (generate: )
I0309 15:05:35.394447       1 admission.go:135] pod kube-controller-manager-e2e-b4cf5dd729-674b9-master (generate: ) in namespace kube-system validated against provider gce.privileged
I0309 15:05:35.397801       1 admission.go:211] getting pod security policies for pod kube-controller-manager-e2e-b4cf5dd729-674b9-master (generate: )
I0309 15:05:37.160211       1 graph_populator.go:148] updatePod kube-system/kube-controller-manager-e2e-b4cf5dd729-674b9-master for node e2e-b4cf5dd729-674b9-master
I0309 15:05:39.631917       1 httplog.go:90] verb="GET" URI="/api/v1/namespaces/kube-system/pods/kube-controller-manager-e2e-b4cf5dd729-674b9-master:10252/proxy/metrics" latency=139.78459ms resp=400 UserAgent="e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format -- [sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should support existing directory" srcIP="35.188.161.128:38478":
I0309 15:05:40.646009       1 httplog.go:90] verb="GET" URI="/api/v1/namespaces/kube-system/pods/kube-controller-manager-e2e-b4cf5dd729-674b9-master" latency=250.194561ms resp=200 UserAgent="kubelet/v1.19.0 (linux/amd64) kubernetes/ea63990" srcIP="34.83.30.189:52846":

@liggitt liggitt reopened this Mar 10, 2020
@liggitt
Copy link
Member

liggitt commented Mar 10, 2020

keeping open until https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&text=Error%20getting%20c-m%20metrics&job=pull-kubernetes-e2e-gce%24&test=sig-storage shows this is resolved

@liggitt
Copy link
Member

liggitt commented Mar 10, 2020

seen again in https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/89010/pull-kubernetes-e2e-gce/1237448373842219008

@aojea
Copy link
Member

aojea commented Mar 10, 2020

I0310 18:56:00.070041    1362 prober.go:133] Liveness probe for "kube-controller-manager-e2e-507247306d-674b9-master_kube-system(0c199f3fb8f138daa4d4f41cf004bb41):kube-controller-manager" succeeded

this time the container is running but still it can't find it 🤔

test/e2e/storage/testsuites/subpath.go:216
Mar 10 18:56:03.621: Error getting c-m metrics : the server could not find the requested resource (get pods kube-controller-manager-e2e-507247306d-674b9-master:10252)

@aojea
Copy link
Member

aojea commented Mar 10, 2020

@liggitt the errors comes from the request to the apiserver on

func (g *Grabber) getMetricsFromPod(client clientset.Interface, podName string, namespace string, port int) (string, error) {

How can we check that the pod is found and that the apiserver found it?
https://github.com/kubernetes/kubernetes/pull/88982/files#diff-33cbf1a465edc9c1de6150d5e6748177R170

@liggitt
Copy link
Member

liggitt commented Mar 11, 2020

the first successful get of /metrics logged by the controller manager in that run is at:

I0310 18:56:31.117404       1 httplog.go:90] verb="GET" URI="/metrics" latency=115.951652ms resp=200 UserAgent="e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format -- [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: blockfs] [Testpattern: Pre-provisioned PV (default fs)] subPath should support file as subpath [LinuxOnly]" srcIP="10.40.0.2:54978": 

We're also not checking the error in the once.Do to make sure the pod was running, and once running, we should probably then also check the metrics path on the pod. Maybe something like this:

	var err error
	g.waitForControllerManagerReadyOnce.Do(func() {
		if runningErr := e2epod.WaitForPodNameRunningInNamespace(g.client, podName, metav1.NamespaceSystem); runningErr != nil {
			err = fmt.Errorf("error waiting for controller manager pod to be running: %w", err)
			return
		}

		var lastMetricsFetchErr error
		if metricsWaitErr := wait.PollImmediate(time.Second, time.Minute, func() (bool, error) {
			_, lastMetricsFetchErr = g.getMetricsFromPod(g.client, podName, metav1.NamespaceSystem, ports.InsecureKubeControllerManagerPort)
			return lastMetricsFetchErr == nil, nil
		}); metricsWaitErr != nil {
			err = fmt.Errorf("error waiting for controller manager pod to be running: %v; %v", err, lastMetricsFetchErr)
			return
		}
	})
	if err != nil {
		return ControllerManagerMetrics{}, err
	}

@liggitt
Copy link
Member

liggitt commented Mar 12, 2020

still happening:

Mar 12 18:52:19.595: Error getting c-m metrics : error waiting for controller manager pod to be running: %!w(<nil>)
Unexpected error:
    <*errors.errorString | 0xc000fd4150>: {
        s: "error waiting for controller manager pod to be running: %!w(<nil>)",
    }
    error waiting for controller manager pod to be running: %!w(<nil>)

message logging missed using the correct errors, so the messages aren't helpful:

--- a/test/e2e/framework/metrics/metrics_grabber.go
+++ b/test/e2e/framework/metrics/metrics_grabber.go
@@ -170,7 +170,7 @@ func (g *Grabber) GrabFromControllerManager() (ControllerManagerMetrics, error)
        podName := fmt.Sprintf("%v-%v", "kube-controller-manager", g.masterName)
        g.waitForControllerManagerReadyOnce.Do(func() {
                if runningErr := e2epod.WaitForPodNameRunningInNamespace(g.client, podName, metav1.NamespaceSystem); runningErr != nil {
-                       err = fmt.Errorf("error waiting for controller manager pod to be running: %w", err)
+                       err = fmt.Errorf("error waiting for controller manager pod to be running: %w", runningErr)
                        return
                }

@@ -179,7 +179,7 @@ func (g *Grabber) GrabFromControllerManager() (ControllerManagerMetrics, error)
                        _, lastMetricsFetchErr = g.getMetricsFromPod(g.client, podName, metav1.NamespaceSystem, ports.InsecureKubeControllerManagerPort)
                        return lastMetricsFetchErr == nil, nil
                }); metricsWaitErr != nil {
-                       err = fmt.Errorf("error waiting for controller manager pod to expose metrics: %v; %v", err, lastMetricsFetchErr)
+                       err = fmt.Errorf("error waiting for controller manager pod to expose metrics: %v; %v", metricsWaitErr, lastMetricsFetchErr)
                        return
                }
        })

@aojea
Copy link
Member

aojea commented Mar 13, 2020

@liggitt what's this stack trace in the apiserver? is this related? a 503 shouldn't cause that, right?

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/88893/pull-kubernetes-e2e-gce/1238171676051509250/artifacts/e2e-ccb8822d86-674b9-master/kube-apiserver.log

I0312 18:50:41.270902       1 httplog.go:90] verb="GET" URI="/apis/metrics.k8s.io/v1beta1?timeout=32s" latency=1.300633ms resp=503 UserAgent="kubectl/v1.19.0 (linux/amd64) kubernetes/95e6759" srcIP="[::1]:43516": 
goroutine 22751 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc0078737a0, 0x1f7)
	staging/src/k8s.io/apiserver/pkg/server/httplog/httplog.go:225 +0xc8
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc0078737a0, 0x1f7)
	staging/src/k8s.io/apiserver/pkg/server/httplog/httplog.go:204 +0x35
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc00a3c31a0, 0x1f7)
	staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go:228 +0xb2
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.(*auditResponseWriter).WriteHeader(0xc002fc1900, 0x1f7)
	staging/src/k8s.io/apiserver/pkg/endpoints/filters/audit.go:219 +0x63
net/http.Error(0x54f3fa0, 0xc012a2b2a8, 0x4730bc4, 0x13, 0x1f7)
	GOROOT/src/net/http/server.go:2019 +0x1f4
k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/apiserver.proxyError(0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600, 0x4730bc4, 0x13, 0x1f7)
	staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_proxy.go:97 +0x57
k8s.io/kubernetes/vendor/k8s.io/kube-aggregator/pkg/apiserver.(*proxyHandler).ServeHTTP(0xc007eb3420, 0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_proxy.go:126 +0xc2b
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux.(*pathHandler).ServeHTTP(0xc010687580, 0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	staging/src/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:241 +0x544
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/mux.(*PathRecorderMux).ServeHTTP(0xc00073e5b0, 0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	staging/src/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:234 +0x84
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.director.ServeHTTP(0x47213e2, 0xf, 0xc005413ef0, 0xc00073e5b0, 0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	staging/src/k8s.io/apiserver/pkg/server/handler.go:154 +0x6b1
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1(0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	staging/src/k8s.io/apiserver/pkg/endpoints/filters/authorization.go:64 +0x512
net/http.HandlerFunc.ServeHTTP(0xc004ef8540, 0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	GOROOT/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1(0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	staging/src/k8s.io/apiserver/pkg/server/filters/maxinflight.go:160 +0x4de
net/http.HandlerFunc.ServeHTTP(0xc0062e3500, 0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	GOROOT/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1(0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	staging/src/k8s.io/apiserver/pkg/endpoints/filters/impersonation.go:50 +0x1fe6
net/http.HandlerFunc.ServeHTTP(0xc004ef8580, 0x54f3fa0, 0xc012a2b2a8, 0xc00a40c600)
	GOROOT/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1(0x7f984bd99158, 0xc012a2b2a0, 0xc00a40c500)
	staging/src/k8s.io/apiserver/pkg/endpoints/filters/audit.go:110 +0x4b9
net/http.HandlerFunc.ServeHTTP(0xc004ef85c0, 0x7f984bd99158, 0xc012a2b2a0, 0xc00a40c500)
	GOROOT/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuthentication.func1(0x7f984bd99158, 0xc012a2b2a0, 0xc00a40c300)
	staging/src/k8s.io/apiserver/pkg/endpoints/filters/authentication.go:70 +0x5ce
net/http.HandlerFunc.ServeHTTP(0xc003038c80, 0x7f984bd99158, 0xc012a2b2a0, 0xc00a40c300)
	GOROOT/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1(0xc014085980, 0xc002006680, 0x55046e0, 0xc012a2b2a0, 0xc00a40c300)
	staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go:113 +0xd0
created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP
	staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go:99 +0x1cb

logging error output: "service unavailable\n"

@aojea
Copy link
Member

aojea commented Mar 14, 2020

again 👀
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/85564/pull-kubernetes-e2e-gce/1238630773502775296

�[1mSTEP�[0m: Creating a kubernetes client
Mar 14 01:17:14.922: INFO: >>> kubeConfig: /workspace/.kube/config
�[1mSTEP�[0m: Building a namespace api object, basename provisioning
Mar 14 01:17:15.245: INFO: Found PodSecurityPolicies; assuming PodSecurityPolicy is enabled.
Mar 14 01:17:15.477: INFO: Found ClusterRoles; assuming RBAC is enabled.
�[1mSTEP�[0m: Binding the e2e-test-privileged-psp PodSecurityPolicy to the default service account in provisioning-8245
�[1mSTEP�[0m: Waiting for a default service account to be provisioned in namespace
[It] should be able to unmount after the subpath directory is deleted
  test/e2e/storage/testsuites/subpath.go:439
Mar 14 01:17:15.986: INFO: Could not find CSI Name for in-tree plugin kubernetes.io/local-volume
Mar 14 01:17:16.166: FAIL: Error getting c-m metrics : error waiting for controller manager pod to be running: pods "kube-controller-manager-e2e-1a7f46e1b2-674b9-master" not found
Unexpected error:

it seems is not really waiting because it fails just 200 ms after the INFO msg, I think is because it checks if the pod is running, but the message is that the pod doesn't really exist.

The fact that there are 9 tests failing with the same error and we are using Sync.Once has to be related to the way Ginkgo use parallel.

@aojea
Copy link
Member

aojea commented Mar 14, 2020

It turns out that the method used to wait for the controller manager pod fails if the pod doesn't exist

func podRunning(c clientset.Interface, podName, namespace string) wait.ConditionFunc {
return func() (bool, error) {
pod, err := c.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
if err != nil {
return false, err
}

@liggitt
Copy link
Member

liggitt commented Mar 14, 2020

Ah, so first we should wait for the pod to exist, then be ready, then return metrics

@liggitt
Copy link
Member

liggitt commented Mar 16, 2020

yes, looks good. thanks for all the work

/close

@k8s-ci-robot
Copy link
Contributor

@liggitt: Closing this issue.

In response to this:

yes, looks good. thanks for all the work

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp
Copy link
Member Author

spiffxp commented Mar 16, 2020

Fixed by #89123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants