[k8scluster] add k8s.container.status_waiting_reason metric #32457

Open
ElfoLiNk opened this issue Apr 16, 2024 · 9 comments · May be fixed by #35668

@ElfoLiNk

ElfoLiNk commented Apr 16, 2024

Component(s)

receiver/k8scluster

Is your feature request related to a problem? Please describe.

I would like to get container state metrics about the waiting reason. One use case is knowing whether a container is in CrashLoopBackOff.

Example of this happening in a pod:

kubectl get pod X -o yaml

...
apiVersion: v1
kind: Pod
...
status:
  conditions:
  containerStatuses:
  - containerID: containerd://e7d1583c9d91178c1f649d5d5a4d38f10decbd4a2d921976909e9d6ab5f3ac23
    image: docker.io/otel/opentelemetry-collector-contrib:0.97.0
    imageID: docker.io/otel/opentelemetry-collector-contrib@sha256:42a27d048c35720cf590243223543671e9d9f1ad8537d5a35c4b748fc8ebe873
    lastState:
      terminated:
        containerID: containerd://e7d1583c9d91178c1f649d5d5a4d38f10decbd4a2d921976909e9d6ab5f3ac23
        exitCode: 2
        finishedAt: "2024-04-16T17:30:04Z"
        reason: Error
        startedAt: "2024-04-16T17:29:35Z"
    name: opentelemetry-collector
    ready: false
    restartCount: 11
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=opentelemetry-collector
          pod=opentelemetry-obs-col-2_obs(58012348-343b-4895-a39e-27e49f014ae8)
        reason: CrashLoopBackOff

Kube State Metrics models this as the following Prometheus metric:

kube_pod_container_status_waiting_reason{container=<container-name>, pod=<pod-name>, namespace=<pod-namespace>, reason=<container-waiting-reason>, uid=<pod-uid>}

Ref: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md
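
For illustration, the container from the example above would be reported roughly like this (sample values reconstructed from the pod status shown earlier, so treat them as illustrative):

kube_pod_container_status_waiting_reason{container="opentelemetry-collector", namespace="obs", pod="opentelemetry-obs-col-2", reason="CrashLoopBackOff", uid="58012348-343b-4895-a39e-27e49f014ae8"} 1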

So it would be great to have a similar metric.

Describe the solution you'd like

  k8s.container.status_waiting_reason:
    enabled: false
    description: Describes the reason the container is currently in waiting state.
    unit: ""
    attributes:
      - reason
    gauge:
      value_type: int

https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/pod.go#L554-L578
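
A minimal sketch of where the reason would come from, using client-go's corev1.ContainerStatus type (the helper name is illustrative, not an existing function in the receiver):

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// waitingReason returns the container's waiting reason (for example CrashLoopBackOff),
// or "" when the container is not in the waiting state.
func waitingReason(cs corev1.ContainerStatus) string {
    if cs.State.Waiting != nil {
        return cs.State.Waiting.Reason
    }
    return ""
}

func main() {
    // Hypothetical status, mirroring the pod shown above.
    cs := corev1.ContainerStatus{
        Name: "opentelemetry-collector",
        State: corev1.ContainerState{
            Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"},
        },
    }
    if reason := waitingReason(cs); reason != "" {
        // The receiver would emit k8s.container.status_waiting_reason with value 1
        // and a `reason` attribute carrying this string.
        fmt.Printf("k8s.container.status_waiting_reason{reason=%q} = 1\n", reason)
    }
}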

Describe alternatives you've considered

No response

Additional context

No response

ElfoLiNk added the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Apr 16, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@povilasv

povilasv commented Apr 24, 2024

FYI I've opened a PR on semconv for the last terminated reason -> open-telemetry/semantic-conventions#922, and it looks like some refactorings are needed on my PR. So this time let's first agree whether we want this and then make a PR to semconv.


github-actions bot commented Jul 2, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@Bhogayata-Keval

I see that the k8s.container.status.current_waiting_reason property has been added in Semantic Conventions.
Do we need to wait for any more checks before drafting a PR?

I am happy to contribute, if required.

@povilasv

FYI this was reverted in open-telemetry/semantic-conventions#1115

See the discussion in the original PR open-telemetry/semantic-conventions#997


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Sep 30, 2024
@povilasv

povilasv commented Oct 7, 2024

People keep asking me about this issue, so I think we should solve it somehow in OTEL.

I'm thinking of proposing a simple 0/1 state metric to track whether a container is waiting for something. This is what Kube State Metrics does with the kube_pod_container_status_waiting metric.

My proposal is this:

k8s.container.status.waiting:
    enabled: false
    description: Whether the container is in the waiting state (0 for no, 1 for yes)
    gauge:
      value_type: int

@TylerHelmuth / @dmitryax thoughts?

I think we already have similar metrics in the Cluster Receiver, so it should fit our current model. Example:

  k8s.container.ready:
    enabled: true
    description: Whether a container has passed its readiness probe (0 for no, 1 for yes)
    unit: ""
    gauge:
      value_type: int
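
As a minimal illustrative sketch (not part of the proposal above), the 0/1 value could be derived from client-go's corev1.ContainerStatus the same way the existing k8s.container.ready value is; boolToInt64 is a hypothetical helper:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// boolToInt64 maps a condition onto the 0/1 gauge value used by these metrics.
func boolToInt64(b bool) int64 {
    if b {
        return 1
    }
    return 0
}

func main() {
    // Hypothetical status for a container stuck in CrashLoopBackOff.
    cs := corev1.ContainerStatus{
        Ready: false,
        State: corev1.ContainerState{
            Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"},
        },
    }
    fmt.Println("k8s.container.ready =", boolToInt64(cs.Ready))                         // existing metric
    fmt.Println("k8s.container.status.waiting =", boolToInt64(cs.State.Waiting != nil)) // proposed metric
}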

@TylerHelmuth

I actually ran into this the other week as well and would like a solution. I thought the semantic convention SIG was blocking us on entities?

@povilasv

povilasv commented Oct 9, 2024

Initially I wanted to add a resource attribute k8s.container.status.current_waiting_reason holding the actual reason why the container is in the waiting state, e.g. k8s.container.status.current_waiting_reason=CrashLoopBackOff.

This didn't work due to Resource Attribute immutability.

This new PR actually does a different thing: I'm adding an enum metric which checks whether the container is in the waiting state or not.
So it's a metric that tracks container state, but doesn't tell you the reason.

Given the current OTEL model, the actual reason will probably go to Entities as a non-identifying attribute 🤔 Having a waiting state metric IMO still makes sense and is useful.
