[k8scluster] add k8s.container.status_last_terminated_reason metric #31282

Closed
povilasv opened this issue Feb 15, 2024 · 14 comments

@povilasv
Contributor

Component(s)

receiver/k8scluster

Is your feature request related to a problem? Please describe.

I would like container state metrics about termination. One use case is knowing whether a container was terminated by an OOM kill or by an application error.

Example from a pod:

kubectl get pod X -o yaml

...
apiVersion: v1
kind: Pod
...
status:
  containerStatuses:
  - containerID: containerd://07ff7db2706d20ddd26c6257c1c1dbf176917f855c7feef0e05db88159b1584d
    image: image
    imageID:  imageId
    lastState:
      terminated:
        containerID: containerid
        exitCode: 2
        finishedAt: "2024-02-05T23:17:38Z"
        reason: Error
        startedAt: "2024-02-05T23:17:33Z"
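
A minimal sketch of reading that field programmatically, assuming the pod has already been parsed into a plain dict (for example from `kubectl get pod X -o json`). The container name `app` and the helper `last_terminated_reasons` are hypothetical and used only for illustration.

```python
# Pod status trimmed to the fields shown above; values are placeholders.
pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",  # hypothetical container name, not in the output above
                "lastState": {
                    "terminated": {
                        "exitCode": 2,
                        "reason": "Error",
                        "finishedAt": "2024-02-05T23:17:38Z",
                    }
                },
            }
        ]
    }
}


def last_terminated_reasons(pod: dict) -> dict:
    """Map container name -> reason of its last termination, if one is recorded."""
    reasons = {}
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty for containers that have never terminated.
        terminated = (cs.get("lastState") or {}).get("terminated")
        if terminated and terminated.get("reason"):
            reasons[cs.get("name", "")] = terminated["reason"]
    return reasons


print(last_terminated_reasons(pod))
```

Containers that have never been terminated are simply absent from the result, so a caller can tell "no previous termination" apart from a concrete reason.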

kube-state-metrics models this as the following Prometheus metric:

kube_pod_container_status_last_terminated_reason
labels:
container=
pod=
namespace=
reason=
uid=

Ref: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/pod-metrics.md

It would be great to have a similar metric.
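
To make the shape of that series concrete, here is a rough sketch that renders one sample in Prometheus text format using the label names listed above. It assumes the sample value is 1 for the observed reason; the helper `ksm_style_sample` and all label values are made up for illustration and are not taken from kube-state-metrics itself.

```python
def ksm_style_sample(container: str, pod: str, namespace: str, reason: str, uid: str) -> str:
    """Render one kube-state-metrics style sample line (illustrative only)."""
    labels = {
        "container": container,
        "pod": pod,
        "namespace": namespace,
        "reason": reason,
        "uid": uid,
    }
    label_str = ",".join(f'{key}="{value}"' for key, value in labels.items())
    # Assumption: value 1 marks the reason that was actually observed.
    return f"kube_pod_container_status_last_terminated_reason{{{label_str}}} 1"


# Placeholder label values, not from a real cluster.
print(ksm_style_sample("app", "X", "default", "OOMKilled", "example-uid"))
```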

Describe the solution you'd like

Not sure how to model it in OTel correctly, but I'm thinking of something like this:

  k8s.container.status_last_terminated_reason:
    enabled: false
    description: Last terminated reason of container. The unit is always 1.
    unit: ""
    attributes:
      - reason
    gauge:
      value_type: int

Describe alternatives you've considered

No response

Additional context

No response

Contributor

Pinging code owners for receiver/k8scluster: @dmitryax @TylerHelmuth @povilasv. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@TylerHelmuth
Member

@povilasv reason can be any value right? Does k8s have a specific set of reasons? This metric feels similar to Phase, should we try to follow the same pattern?

@povilasv
Contributor Author

I've grepped through the Kubernetes code base and couldn't find a fixed set of possible Reason values. It seems the value is set dynamically, which makes it really hard to find the actual possible values.

I think this is why kube-state-metrics also did it in a similar way -> https://github.com/kubernetes/kube-state-metrics/blob/122e5e899943eb78eaf3e366733d5dbec6613ac0/internal/store/pod.go#L339

@TylerHelmuth
Member

In that case an attribute makes sense to me

@povilasv
Contributor Author

Opened a PR for it #31281 :)

@avanish-vaghela

Would adding a restart_reason label to the k8s.container.restart metric make sense?

@TylerHelmuth
Member

@avanish-vaghela maybe. Please open another issue for that request.

@dmitryax
Member

I don't get why another metric is needed for this. Can it be an optional resource attribute instead?

@avanish-vaghela

@dmitryax I believe you meant a metric attribute, as the attribute's value would change based on the state at the time the metric value is recorded.

@povilasv
Contributor Author

I don't get why another metric is needed for this. Can it be an optional resource attribute instead?

Thanks for the feedback. I think you are right, this should be a resource attribute. I'm still used to modelling everything as a metric. OTel has a Resource model, and this fits it better: the container is the resource and status_last_terminated_reason is an attribute of it.
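
The two modelling options discussed here can be contrasted with a small sketch. It uses plain dicts rather than any OpenTelemetry SDK types; the container name, the values, and the `terminated_reason` helper are hypothetical. The attribute name `k8s.container.status.last_terminated_reason` is the one used by the commit referenced in this thread, and `k8s.container.restarts` stands in for any existing metric of the container.

```python
# Option 1 (the original proposal): a dedicated gauge whose data point
# carries the reason as a metric attribute.
metric_style = {
    "name": "k8s.container.status_last_terminated_reason",
    "value": 1,
    "attributes": {"reason": "OOMKilled"},
}

# Option 2 (resource attribute): the reason describes the container resource,
# and every metric reported for that container inherits it.
resource_style = {
    "resource": {
        "k8s.container.name": "app",  # placeholder value
        "k8s.container.status.last_terminated_reason": "OOMKilled",
    },
    "metrics": [
        {"name": "k8s.container.restarts", "value": 3, "attributes": {}},
    ],
}


def terminated_reason(entry: dict) -> str:
    """Read the last terminated reason from a resource-style entry ('' if unset)."""
    return entry["resource"].get("k8s.container.status.last_terminated_reason", "")


print(terminated_reason(resource_style))
```

With the resource shape no extra time series is introduced: the reason is available alongside whatever metrics are already emitted for the container.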

dmitryax pushed a commit that referenced this issue Mar 12, 2024
… resource attribute (#31505)

**Description:** 
Add k8s.container.status.last_terminated_reason resource attribute

**Link to tracking Issue:** #31282
@atoulme
Contributor

atoulme commented Mar 26, 2024

Can this issue be closed now that #31505 is merged?

@ElfoLiNk

@povilasv, thank you for enabling us to detect OOMKilled errors. However, which metric should we utilize for detecting CrashLoopBackOff?

kube-state-metrics is utilizing kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}.

Thank you.

@povilasv
Contributor Author

@ElfoLiNk you need a different resource attribute (status_waiting_reason) for this. I suggest you file a new issue.
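
For context, CrashLoopBackOff is reported under the container's current `state.waiting.reason` in the pod status, not under `lastState.terminated`. A rough sketch of reading it from a pod parsed into a dict follows; the helper `waiting_reasons` and the sample pod are hypothetical and unrelated to the receiver's implementation.

```python
def waiting_reasons(pod: dict) -> dict:
    """Map container name -> current waiting reason, if the container is waiting."""
    reasons = {}
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = (cs.get("state") or {}).get("waiting")
        if waiting and waiting.get("reason"):
            reasons[cs.get("name", "")] = waiting["reason"]
    return reasons


# Placeholder pod status with one crash-looping container.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "app", "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            {"name": "sidecar", "state": {"running": {}}},
        ]
    }
}

crashlooping = [
    name for name, reason in waiting_reasons(sample_pod).items()
    if reason == "CrashLoopBackOff"
]
print(crashlooping)
```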

@ElfoLiNk

@ElfoLiNk you need a different resource attribute (status_waiting_reason) for this. I suggest you file a new issue.

Created #32457
