[k8scluster] add k8s.container.status_waiting_reason metric #32457

Open
ElfoLiNk opened this issue Apr 16, 2024 · 9 comments · May be fixed by #35668

@ElfoLiNk

ElfoLiNk commented Apr 16, 2024

Component(s)

receiver/k8scluster

Is your feature request related to a problem? Please describe.

I would like to get container state metrics about the waiting reason. One use case is knowing whether a container is in CrashLoopBackOff.

Example of this happening in a pod:

kubectl get pod X -o yaml

...
apiVersion: v1
kind: Pod
...
status:
  conditions:
  containerStatuses:
  - containerID: containerd://e7d1583c9d91178c1f649d5d5a4d38f10decbd4a2d921976909e9d6ab5f3ac23
    image: docker.io/otel/opentelemetry-collector-contrib:0.97.0
    imageID: docker.io/otel/opentelemetry-collector-contrib@sha256:42a27d048c35720cf590243223543671e9d9f1ad8537d5a35c4b748fc8ebe873
    lastState:
      terminated:
        containerID: containerd://e7d1583c9d91178c1f649d5d5a4d38f10decbd4a2d921976909e9d6ab5f3ac23
        exitCode: 2
        finishedAt: "2024-04-16T17:30:04Z"
        reason: Error
        startedAt: "2024-04-16T17:29:35Z"
    name: opentelemetry-collector
    ready: false
    restartCount: 11
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=opentelemetry-collector
          pod=opentelemetry-obs-col-2_obs(58012348-343b-4895-a39e-27e49f014ae8)
        reason: CrashLoopBackOff

Kube State Metrics models this as the following Prometheus metric:

kube_pod_container_status_waiting_reason{container=<container-name>, pod=<pod-name>, namespace=<pod-namespace>, reason=<container-waiting-reason>, uid=<pod-uid>}

Ref: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md
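
For illustration, the container from the example above would be reported roughly like this (sample values reconstructed from the pod status shown earlier, so treat them as illustrative):

kube_pod_container_status_waiting_reason{container="opentelemetry-collector", namespace="obs", pod="opentelemetry-obs-col-2", reason="CrashLoopBackOff", uid="58012348-343b-4895-a39e-27e49f014ae8"} 1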

So it would be great to have a similar metric.

Describe the solution you'd like

  k8s.container.status_waiting_reason:
    enabled: false
    description: Describes the reason the container is currently in waiting state.
    unit: ""
    attributes:
      - reason
    gauge:
      value_type: int

https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/pod.go#L554-L578
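
A minimal sketch of where the reason would come from, using client-go's corev1.ContainerStatus type (the helper name is illustrative, not an existing function in the receiver):

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// waitingReason returns the container's waiting reason (for example CrashLoopBackOff),
// or "" when the container is not in the waiting state.
func waitingReason(cs corev1.ContainerStatus) string {
    if cs.State.Waiting != nil {
        return cs.State.Waiting.Reason
    }
    return ""
}

func main() {
    // Hypothetical status, mirroring the pod shown above.
    cs := corev1.ContainerStatus{
        Name: "opentelemetry-collector",
        State: corev1.ContainerState{
            Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"},
        },
    }
    if reason := waitingReason(cs); reason != "" {
        // The receiver would emit k8s.container.status_waiting_reason with value 1
        // and a `reason` attribute carrying this string.
        fmt.Printf("k8s.container.status_waiting_reason{reason=%q} = 1\n", reason)
    }
}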

Describe alternatives you've considered

No response

Additional context

No response

ElfoLiNk added the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Apr 16, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@povilasv

povilasv commented Apr 24, 2024

FYI I've opened a PR on semconv for the last terminated reason -> open-telemetry/semantic-conventions#922, and it looks like some refactorings are needed on my PR. So this time let's first agree whether we want this and then make a PR to semconv.


github-actions bot commented Jul 2, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@Bhogayata-Keval

I see that the k8s.container.status.current_waiting_reason property has been added in Semantic Conventions.
Do we need to wait for any more checks before drafting a PR?

I am happy to contribute, if required.

@povilasv

FYI this was reverted in open-telemetry/semantic-conventions#1115

See the discussion in the original PR open-telemetry/semantic-conventions#997


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Sep 30, 2024
@povilasv

povilasv commented Oct 7, 2024

People keep asking me about this issue, so I think we should solve it somehow in OTEL.

I'm thinking of proposing a simple 0/1 state metric to track whether a container is waiting for something. This is what Kube State Metrics does with the kube_pod_container_status_waiting metric.

My proposal is this:

k8s.container.status.waiting:
    enabled: false
    description: Whether the container is in the waiting state (0 for no, 1 for yes)
    gauge:
      value_type: int

@TylerHelmuth / @dmitryax thoughts?

I think we already have similar metrics in the Cluster Receiver, so it should fit our current model. Example:

  k8s.container.ready:
    enabled: true
    description: Whether a container has passed its readiness probe (0 for no, 1 for yes)
    unit: ""
    gauge:
      value_type: int
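
As a minimal illustrative sketch (not part of the proposal above), the 0/1 value could be derived from client-go's corev1.ContainerStatus the same way the existing k8s.container.ready value is; boolToInt64 is a hypothetical helper:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// boolToInt64 maps a condition onto the 0/1 gauge value used by these metrics.
func boolToInt64(b bool) int64 {
    if b {
        return 1
    }
    return 0
}

func main() {
    // Hypothetical status for a container stuck in CrashLoopBackOff.
    cs := corev1.ContainerStatus{
        Ready: false,
        State: corev1.ContainerState{
            Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"},
        },
    }
    fmt.Println("k8s.container.ready =", boolToInt64(cs.Ready))                         // existing metric
    fmt.Println("k8s.container.status.waiting =", boolToInt64(cs.State.Waiting != nil)) // proposed metric
}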

@TylerHelmuth

I actually ran into this the other week as well and would like a solution. I thought the semantic convention SIG was blocking us on entities?

@povilasv

povilasv commented Oct 9, 2024

Initially I wanted to add a resource attribute k8s.container.status.current_waiting_reason holding the actual reason why the container is in the waiting state, e.g. k8s.container.status.current_waiting_reason=CrashLoopBackOff.

This didn't work due to Resource Attribute immutability.

This new PR actually does a different thing: I'm adding an enum metric which checks whether the container is in the waiting state or not.
So it's a metric that tracks container state, but doesn't tell you the reason.

Given the current OTEL model, the actual reason will probably go to Entities as a non-identifying attribute 🤔 Having a waiting state metric IMO still makes sense and is useful.
