[Flaky Test] In-tree Volumes: Error getting c-m metrics : the server could not find the requested resource #86318
xref #86312. I think we have a general failure to start containers reliably; almost every instance of these test failures contained the pod status error "OCI runtime start failed: container process is already dead: unknown".
@kubernetes/sig-node-test-failures
After sweeping a bunch of the in-tree test failures, almost all seemed to have #86312 as a root cause, triggered by very short-lived containers used to check file content or permissions. The only tests I saw failing that didn't seem to fit that were:
[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should support non-existent path
[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should support existing directory
examples:
I'd recommend repurposing this issue to address those specifically and leaving the OCI issue to cover the short-lived container failures.
Here's a PR for context. I think this PR is another example of these flakes. Here are some links to previous runs, both with the same message: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/87090/pull-kubernetes-e2e-gce/1222925370781601792/ and https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/87090/pull-kubernetes-e2e-gce/1215759459838595074/
This prow run: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/87090/pull-kubernetes-e2e-kind-ipv6/1215759459826012161 has the above message and a few more, all
Mostly in-tree volume issues, but there were some persistent volume failures too; those have only shown up once so far from what I've seen.
The "error getting c-m metrics" failures appear to have just restarted: |
/priority critical-urgent
Bumping priority since we're seeing significant numbers of failures that just started in the last two days.
@gnufied this is failing many test runs per day; do you have an updated status on this? https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&text=Error%20getting%20c-m%20metrics&job=pull-kubernetes-e2e-gce%24&test=sig-storage
I haven't had a chance to fully debug this yet. @davidz627 looked into this a little bit and suspected that these tests started to flake around the same time #85029 was merged. I should be able to take a closer look tomorrow.
This occurrence https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/88934/pull-kubernetes-e2e-gce/1237026445675466753/ fails because it's trying to get the metrics from the kube-controller-manager pod, but the pod isn't ready until one minute later.
If we grep the kube-apiserver logs for the controller manager pod name, we can observe that once the pod is up the requests start to get 200 instead of 404.
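For context on why this surfaces as "the server could not find the requested resource": the grabber fetches /metrics through the apiserver (see the getMetricsFromPod discussion below), so if the controller manager's mirror pod isn't registered yet, the apiserver itself answers 404. A rough sketch of that request path, assuming a recent client-go where Request.Do takes a context; grabMetricsViaProxy and the package name are made up for illustration and are not the framework's code:

package e2eflake // hypothetical package for these sketches

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// grabMetricsViaProxy fetches a pod's /metrics through the apiserver's pod
// proxy subresource. If the pod object doesn't exist yet, the apiserver
// returns 404 and client-go reports a NotFound error.
func grabMetricsViaProxy(c clientset.Interface, podName string, port int) (string, error) {
	data, err := c.CoreV1().RESTClient().Get().
		Namespace(metav1.NamespaceSystem).
		Resource("pods").
		SubResource("proxy").
		Name(fmt.Sprintf("%s:%d", podName, port)).
		Suffix("metrics").
		Do(context.TODO()).
		Raw()
	if apierrors.IsNotFound(err) {
		// The static pod's mirror pod hasn't been created yet, hence
		// "the server could not find the requested resource".
		return "", fmt.Errorf("pod %s not registered yet: %w", podName, err)
	}
	return string(data), err
}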
keeping open until https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&text=Error%20getting%20c-m%20metrics&job=pull-kubernetes-e2e-gce%24&test=sig-storage shows this is resolved
This time the container is running, but it still can't find it 🤔
@liggitt the errors come from the request to the apiserver in func (g *Grabber) getMetricsFromPod(client clientset.Interface, podName string, namespace string, port int) (string, error). How can we check that the pod exists and that the apiserver has found it?
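One way to answer that, sketched in the same hypothetical package as the snippet above (controllerManagerPodExists is a made-up helper, not framework code): ask the apiserver for the pod object directly and treat NotFound as "not there yet" rather than as a hard failure.

// controllerManagerPodExists reports whether the apiserver knows about the
// controller manager pod at all, separating "not found yet" from real errors.
// (Same hypothetical package and imports as the sketch above.)
func controllerManagerPodExists(c clientset.Interface, podName string) (bool, error) {
	_, err := c.CoreV1().Pods(metav1.NamespaceSystem).Get(context.TODO(), podName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil // mirror pod not registered yet; worth retrying
	}
	if err != nil {
		return false, err // some other apiserver or network problem
	}
	return true, nil
}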
the first successful get of /metrics logged by the controller manager in that run is at:
We're also not checking the error in the once.Do to make sure the pod was running, and once running, we should probably then also check the metrics path on the pod. Maybe something like this:
var err error
g.waitForControllerManagerReadyOnce.Do(func() {
	if runningErr := e2epod.WaitForPodNameRunningInNamespace(g.client, podName, metav1.NamespaceSystem); runningErr != nil {
		err = fmt.Errorf("error waiting for controller manager pod to be running: %w", err)
		return
	}
	var lastMetricsFetchErr error
	if metricsWaitErr := wait.PollImmediate(time.Second, time.Minute, func() (bool, error) {
		_, lastMetricsFetchErr = g.getMetricsFromPod(g.client, podName, metav1.NamespaceSystem, ports.InsecureKubeControllerManagerPort)
		return lastMetricsFetchErr == nil, nil
	}); metricsWaitErr != nil {
		err = fmt.Errorf("error waiting for controller manager pod to be running: %v; %v", err, lastMetricsFetchErr)
		return
	}
})
if err != nil {
	return ControllerManagerMetrics{}, err
}
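A side note on the shape of that suggestion: sync.Once.Do takes a func() with no return value, so any failure has to escape through the outer err variable that the closure captures. A standalone toy illustration of the pattern (not the real Grabber):

package main

import (
	"errors"
	"fmt"
	"sync"
)

type grabber struct {
	readyOnce sync.Once
}

func (g *grabber) ensureReady() error {
	var err error
	g.readyOnce.Do(func() {
		// Pretend the readiness wait failed; the failure escapes via `err`.
		err = errors.New("controller manager pod never became ready")
	})
	// The closure only ever runs once, so a later call returns nil even if
	// the first attempt failed; the suggestion above has the same property,
	// which is worth being aware of.
	return err
}

func main() {
	g := &grabber{}
	fmt.Println(g.ensureReady()) // first call reports the error
	fmt.Println(g.ensureReady()) // subsequent calls return <nil>
}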
still happening:
message logging missed using the correct errors, so the messages aren't helpful:
--- a/test/e2e/framework/metrics/metrics_grabber.go
+++ b/test/e2e/framework/metrics/metrics_grabber.go
@@ -170,7 +170,7 @@ func (g *Grabber) GrabFromControllerManager() (ControllerManagerMetrics, error)
 	podName := fmt.Sprintf("%v-%v", "kube-controller-manager", g.masterName)
 	g.waitForControllerManagerReadyOnce.Do(func() {
 		if runningErr := e2epod.WaitForPodNameRunningInNamespace(g.client, podName, metav1.NamespaceSystem); runningErr != nil {
-			err = fmt.Errorf("error waiting for controller manager pod to be running: %w", err)
+			err = fmt.Errorf("error waiting for controller manager pod to be running: %w", runningErr)
 			return
 		}
@@ -179,7 +179,7 @@ func (g *Grabber) GrabFromControllerManager() (ControllerManagerMetrics, error)
 			_, lastMetricsFetchErr = g.getMetricsFromPod(g.client, podName, metav1.NamespaceSystem, ports.InsecureKubeControllerManagerPort)
 			return lastMetricsFetchErr == nil, nil
 		}); metricsWaitErr != nil {
-			err = fmt.Errorf("error waiting for controller manager pod to expose metrics: %v; %v", err, lastMetricsFetchErr)
+			err = fmt.Errorf("error waiting for controller manager pod to expose metrics: %v; %v", metricsWaitErr, lastMetricsFetchErr)
 			return
 		}
 	})
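For anyone puzzled why the pre-fix messages were unhelpful: at the point those fmt.Errorf calls ran, the outer err was still nil, so the %w/%v verbs formatted a nil error instead of runningErr or metricsWaitErr. A tiny standalone demonstration:

package main

import (
	"errors"
	"fmt"
)

func main() {
	var err error // nil, like the outer err inside the once.Do closure
	runningErr := errors.New("pods \"kube-controller-manager-...\" not found")

	// Wrapping the wrong (nil) variable loses the real failure entirely.
	wrong := fmt.Errorf("error waiting for controller manager pod to be running: %w", err)
	// Wrapping runningErr keeps the useful detail, as in the fix above.
	right := fmt.Errorf("error waiting for controller manager pod to be running: %w", runningErr)

	fmt.Println(wrong) // the %w operand is nil, so no real cause is shown
	fmt.Println(right) // ... pods "kube-controller-manager-..." not found
}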
@liggitt what's this stack trace in the apiserver? Is it related? A 503 shouldn't cause that, right?
It seems it's not really waiting, because it fails just 200 ms after the INFO message. I think that's because it checks whether the pod is running, but the error says that the pod doesn't exist at all. The fact that there are 9 tests failing with the same error and we are using
It turns out that the method used to wait for the controller manager pod fails if the pod doesn't exist: kubernetes/test/e2e/framework/pod/resource.go, lines 148 to 153 in b290e0b
Ah, so first we should wait for the pod to exist, then for it to be ready, then return the metrics.
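A rough sketch of that ordering, in the same hypothetical package as the earlier snippets and not the code that eventually landed in #89123: poll until the pod object exists and is Running, and only then start asking for metrics.

// Additional assumed imports: "time", v1 "k8s.io/api/core/v1",
// "k8s.io/apimachinery/pkg/util/wait".
//
// waitForControllerManagerMetrics waits in order: pod exists -> pod running
// -> metrics endpoint answers, reusing the grabMetricsViaProxy sketch above.
func waitForControllerManagerMetrics(c clientset.Interface, podName string, port int) (string, error) {
	var response string
	err := wait.PollImmediate(time.Second, 5*time.Minute, func() (bool, error) {
		pod, getErr := c.CoreV1().Pods(metav1.NamespaceSystem).Get(context.TODO(), podName, metav1.GetOptions{})
		if apierrors.IsNotFound(getErr) {
			return false, nil // pod object not created yet; keep waiting
		}
		if getErr != nil {
			return false, getErr // unexpected error; stop polling
		}
		if pod.Status.Phase != v1.PodRunning {
			return false, nil // pod exists but isn't running yet
		}
		metrics, metricsErr := grabMetricsViaProxy(c, podName, port)
		if metricsErr != nil {
			return false, nil // metrics endpoint not serving yet; keep waiting
		}
		response = metrics
		return true, nil
	})
	return response, err
}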
Yes, looks good. Thanks for all the work.
/close
@liggitt: Closing this issue. In response to this:
Fixed by #89123
Which jobs are flaking:
Which test(s) are flaking:
Testgrid link:
https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&include-filter-by-regex=In-tree&width=5&sort-by-flakiness=
Reason for failure:
Failure fetching metrics:
Anything else we need to know:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&text=Error%20getting%20c-m%20metrics&job=pull-kubernetes-e2e-gce%24&test=sig-storage
This appears to have restarted abruptly around 2/11-2/12
/sig storage
/priority important-soon