Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CustomResourceMetrics info and stateset are now broken in main when scraping with prometheus #1973

Closed
primeroz opened this issue Feb 2, 2023 · 14 comments · Fixed by #1974
Closed
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@primeroz
Copy link

primeroz commented Feb 2, 2023

What happened:
before #1930 all metrics type for CustomResourceMetrics were reported as type gauge but now ( i do run a build from main for some testing ) the type defined in the CR is respected and a metrics are reported as type info and stateset

for example

 λ curl localhost:8000/metrics 2>/dev/null | egrep "*_info" | grep TYPE                                                                                                                                            
# TYPE capi_azurecluster_info info 

while those are valid OpenTelemtry types they are not valid Prometheus Types and so Prometheus fail to scrape with invalid metric type "info"

What you expected to happen:

Prometheus be able to scrape those metrics

How to reproduce it (as minimally and precisely as possible):

Run a build of KSM from main with CustomResourceMetrics with type set to Info or StateSet ( i do use the cluster-api example https://github.com/kubernetes-sigs/cluster-api/blob/main/hack/observability/kube-state-metrics/crd-config.yaml )

Anything else we need to know?:

I think it would be great to support both Prometheus and OpenTelemetry , maybe with a field to set the type of metrics to expose and revert to present all metrics as gauges when Prometheus is the intended target

Environment:

  • kube-state-metrics version: main
  • Kubernetes version (use kubectl version): 1.24.10
  • Cloud provider or hardware configuration:
  • Other info:

pinging @bavarianbidi for reference

@primeroz primeroz added the kind/bug Categorizes issue or PR as related to a bug. label Feb 2, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 2, 2023
@mrueg
Copy link
Member

mrueg commented Feb 3, 2023

@primeroz @bavarianbidi thanks for the report! Can you test and confirm if #1974 fixes your issue?

@mrueg mrueg self-assigned this Feb 3, 2023
@mrueg
Copy link
Member

mrueg commented Feb 3, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 3, 2023
@primeroz
Copy link
Author

primeroz commented Feb 3, 2023

I will test on Monday and report back. Thanks for the quick feedback

@primeroz
Copy link
Author

primeroz commented Feb 6, 2023

testing a build out of [49af6b3](https://github.com/mrueg/kube-state-metrics/commit/49af6b3945a071ec63de8b2cadaf4d5eec1bdd57) and is looking good. thank you

subbarao-meduri pushed a commit to stolostron/kube-state-metrics that referenced this issue Jun 1, 2023
* add rolebinding metrics

* fix typo

* fix filename and sort order

* Apply suggestions from code review

Co-authored-by: Manuel Rüger <manuel@rueg.eu>

* fix metrics name

* fix typos

* feat(logging): migrate to structured logging

Signed-off-by: dmpe <John Malc> <cincenko@outlook.com>

* Apply suggestions from code review

Co-authored-by: JUN YANG <yang.jun22@zte.com.cn>

* Apply suggestions from code review

Co-authored-by: JUN YANG <yang.jun22@zte.com.cn>

* Apply suggestions from code review

Co-authored-by: JUN YANG <yang.jun22@zte.com.cn>

* Apply suggestions from code review

Co-authored-by: JUN YANG <yang.jun22@zte.com.cn>

* Replace pkg/errors with stdlib errors

github.com/pkg/errors is archived and not maintained anymore.
Replace it with stdlib functions.

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* pkg/customresourcestate implement info and stateSet metric type and refactor configuration file

* Adds detection of booleans in string format to getNum.
* Refactors configuration file to allow definition of different metric types
  having different configuration variables.
* Refactor order of types and funcs in pkg/customersourcestate.
* Allows info and stateSet metrics to iterate over arrays.
* Adds `nilIsZero` config variable to gauge to indicate non-existing values to tread as 0 value instead of returning an error.
* Skip adding a label instead of setting value to `<nil>`.
* Replace namespace and subsystem by metricsNamePrefix
* Adjust docs for customresourcestate metrics to align with new configuration file

* Do not expose info metric for nil objects

* Update dependencies

go v1.18.3 -> v1.18.5
prometheus v2.35.0 -> v2.37.0
golangci-lint v1.46.2 -> v1.48.0

Several go dependencies, among them:

k8s v1.24.2 -> v1.24.4

cloudbuild image
v20211118-2f2d816b90 -> v20220609-2e4c91eb7e

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* graduate new endpoint metrics to STABLE

* graduate kube_endpoint_ports and kube_endpoint_address to STABLE
* graduate kube_endpoint_address_not_ready and
  kube_endpoint_address_available to DEPRECATED as the information is
precomputed during metrics-scraping

Signed-off-by: Mario Constanti <mario@constanti.de>

* feat: Add local storage labels to kube_persistentvolume_info

Signed-off-by: m.nabokikh <maksim.nabokikh@flant.com>

* Make code inline with the other parts

Signed-off-by: m.nabokikh <maksim.nabokikh@flant.com>

* Remove RBAC resources from default exposed metric set

These resources might create a lot of metrics,
so we won't enable them by default for now.

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* Prevent multiple custom resource configurations for the same resource

* Add host path metrics, fix table formatting

Signed-off-by: m.nabokikh <maksim.nabokikh@flant.com>

* *: Cut v2.6.0

* add exit code

* fix unitests

* create new metric

* unit tests

* remove reason from exitcode

* fix test

* documentation

* Promote two metrics to stable

* Update ci.yml

Signed-off-by: sashashura <93376818+sashashura@users.noreply.github.com>

* add ContainerResourceSourceType for hpa metrics and reduce cyclomatic complexity

* fix/docs: Typo correction in SA metrics

Typo correction in SA metrics documentation in the summary

* Do not expose ingress path metric when service is nil

* Fix return blank string when Service is nil

* Fix none to blank string

* Handle singular labels in allowlist

Handle singular labels in allowlist failing when such a label is
supplied, in order to keep the behaviour in sync with --resources.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* Allow Lease metrics to be exported across all namespaces

* update lease store to export namespace

* go.mod: Upgrade to k8s 1.25

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* Build with go 1.19

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* README.md: Replace Kubernetes compat matrix

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* Represent GVK information as labels

Represent GVK information as labels in the metrics, instead of appending
them to the metric name itself. This would allow users to aggregate
varying GVKs of a CR under the same metric, making operations much more
easier.

* export Lease.Spec.HolderIdentity

* account for nil holderIdentity case

* Import k8s metrics stability framework

* Upgrade dependency component-base to v0.25.2 and add TODO

* Add all stable metrics

* customresourcestate fix type indentation in example

* fixup! Represent GVK information as labels

* fixup! fixup! Represent GVK information as labels

* Harden and add gosec linter

Remediate:
G104: Errors unhandled.
G109: Potential Integer overflow made by strconv.Atoi result conversion to int16/32
G112: Potential Slowloris Attack because ReadHeaderTimeout is not configured in the http.Server
G304: Potential file inclusion via variable
G601: Implicit memory aliasing in for loop.

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* .github: Update actions

Context: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* add myself to OWNERs

* Fix typos

* One more typo fix

* Clarify valueFrom comment

* e2e: Test against k8s 1.25.0

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* Makefile: Build with go 1.19.3

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* go.mod: Bump dependencies

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* Update OWNERS

* Recommend kube-scheduler alternatives

Recommend metrics exposed by the kube-scheduler, in case of the ones below:
* `kube_pod_container_resource_limits`
* `kube_pod_container_resource_requests`

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* Allow `labelFromKey` field

Allow `labelFromKey` field for the following types:
* Gauge: Done.
* Info: Done.
* StateSet: N/A (redundant use case, see doc changes for more info).

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* Address vulns reported by the security checks

As of this moment, there are 2 vulns in the codebase.
* GO-2022-1095: os/exec@go1.19.2
* GO-2022-1095: syscall@go1.19.2

This commit aims to fix that.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* Sharding per node

* Validate options

* Clean

* Move merging fieldselectors into app/server.go and replace namespaceFitler with fieldSelectorFilter

* Refactoring

* Provide scaling example

* Add `govulncheck` cron

Add `govulncheck` cron configuration.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* Update README.md

Co-authored-by: Manuel Rüger <manuel@rueg.eu>

* Refactoring

* introduce custom-resources-only flag ...

... to only monitor all known custom-resource configurations instead of
listing each of them explicitly

Signed-off-by: Mario Constanti <mario@constanti.de>

* Support filtering label allowlist by "*"

Support filtering label allowlist by "*", which will expand to the
enabled resources, while infering their values based on its value(s).

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* Add retention policy metrics for KEP-1847

* fixup! Allow `labelFromKey` field

* fixup! fixup! Allow `labelFromKey` field

* fixup! fixup! fixup! Allow `labelFromKey` field

* Deprecate VPA

Deprecate VPA metrics in v2.9.0.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* fixup! fixup! fixup! fixup! Allow `labelFromKey` field

* go.mod: Bump exporter-toolkit to 0.8.1

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* Incorporate Cobra

s/pflags/cobra/g:
* Use spf13/cobra to handle all flags and sub-commands.
* Remove all spf13/pflag usage, and fallback to the in-build flag
  package if, and when needed.
* Add completion support.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* update kube_node_status_{capacity/allocatable} doc to clarify difference

* Add node deletionTimestamp metric

Adds deletionTimestamp metric, for nodes.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* Add --config flag

--config flag defines the path to the kube-state-metrics options config file.

* Implement hot-reloading based on config changes

Reload KSM on-the-fly when a change is detected in the configuration.

Meta changes (will squash)

* Add end-to-end testing to verify hot-reloading

Add end-to-end testing to verify hot-reloading for event-based changes
stemming from the config file. Also,
* sent in a doc fix that was missed
earlier:
https://github.com/kubernetes/kube-state-metrics/pull/1890/files#diff-380eca5a922c0ddbf67f04daefc6823e7ef0e197434d3a826d39c7063cdfa6d6R15,
* updated fsnotify and viper dependencies (v1.6.0 and v1.14.0
  respectively).

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* Add rexagod as reviewer

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* add ingress class metrics

* add ingress classs as nondefault resource

* add stability experimental

* autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+

Issue: kubernetes#1711

Problem: kube-state-metrics supports up to (latest k8s release - 3) k8s
version. Since v1.25 has been release we can update it to start using
autoscaler/v2.

Solution: update packages to start using autoscaler/v2

Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>

* Update internal/store/builder.go

Co-authored-by: Damien Grisonnet <damien.grisonnet@epita.fr>

* fix linter error

* go.mod: Update dependencies

* *: Cut v2.7.0

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* Add Metrics for EndpointSlices

Implements https://pkg.go.dev/k8s.io/api/discovery/v1#EndpointSlice

This resourcetype is disabled by default as they are very verbose and
have a high cardinality.
Metrics from endpointslices can be used to identify if specific pods are
part of an endpoint and thus discoverable through a service.

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* Add CatherineF-dev as reviewer

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* Update github.com/prometheus/exporter-toolkit

The latest version (v0.8.2) fixes CVE-2022-46146.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Add metrics for config file changes

This uses code pieces from prometheus/alertmanager in https://github.com/prometheus/alertmanager/blob/main/config/coordinator.go#LL56C26-L56C26
licensed under Apache-2.0.

kube_state_metrics_config_hash{type="config", filename="config.yml"} 4.0061079457904e+13
kube_state_metrics_config_last_reload_success_timestamp_seconds{type="config", filename="config.yml"} 1.6697483049487052e+09
kube_state_metrics_config_last_reload_successful{type="config",
filename="config.yml"} 1

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* Replace "<none>" with empty string for "owner_kind", "owner_name" and "owner_is_controller" dimensions.

Returning empty string instead of "<none>" value for "owner_kind", "owner_name" and "owner_is_controller" dimensions when no metadata.ownerReferences exists in Kubernetes resoures.

* Replace special string "<none>" with empty string for "storageclass" dimension.

Returning empty string instead of special string "<none>" for "storageclass" dimensions of "kube_persistentvolumeclaim_info" metric.

* Update internal/store/persistentvolumeclaim.go

Co-authored-by: Manuel Rüger <manuel@rueg.eu>

* Remove duplicated MetricsWriter implementation

Simplify the implementation of the MetricsWriter to avoid code
duplication between single and multi stores scenarios.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* pkg/metrics_store: add error handling to WriteAll

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* Add metrics for CustomResourceConfig file

* docs: add status badge for the periodic cronjob

Additionally, rename it to a more fitting name.

* Rename references from master to main branch

* Add kube_pod_status_qos_class gauge to pod metrics

Signed-off-by: frezes <zhangjunhao@kubesphere.io>

* Makefile: Bump to go 1.19.4

* Makefile: Bump to prometheus 2.40.6

* go.mod: Bump to kubernetes 1.26.0

* .github: Bump golangci-lint to 1.50.1

* Make CRS metrics type dynamic

All CRS metrics are hardcoded to "gauge" type, this patch addresses
that.

* Add develop doc on adding new metrics

* Update docs/developer/guide.md

Co-authored-by: Pranshu Srivastava <prasriva@redhat.com>

* Change kube_pod_status_qos_class to experimental metric

Signed-off-by: frezes <zhangjunhao@kubesphere.io>

* docs: Fix typo in kube_node_deletion_timestamp description

Signed-off-by: Grzegorz Głąb <grzesuav@gmail.com>

* docs:fix link

* prefix GVK labels in CustomResourceMonitoring

This will prefix the auto-generated GVK labels for CustomResources with
customresource_ to make it a bit more clear that these labels got generated.

Signed-off-by: Mario Constanti <mario@constanti.de>

* Reload CustomResourceState Config File on Change

This change adds hot reloading support for the customresourcestate
config file.

It also resolves a bug in which the customresourcestate config file was
included in the ksm config file, in which it did not get detected.

It also resolves a bug in which customresourcestatemetrics were not
added when set resources were non-default resources.

Fixes: kubernetes#1892

* replace kube_crd with kube_customresource in docs

Signed-off-by: Mario Constanti <mario@constanti.de>

* fix vpa crd metric names

* update crd monitoring and mention new flags

* Support pod_ready_time and pod_container_ready_time

Co-authored-by: Szymon Grzemski <sz.grzemski@gmail.com>
Signed-off-by: Lan Liang <gcslyp@gmail.com>

* Addressed feedback

* Fixing bad merge in rebase

* Update development guide

* fix --version flag

* Added a test for kube_pod_status_container_ready_time

* tools: Use own go.mod

This reduces the number of deps that will be fetched for building ksm

* clean broken --version flag

* remove unwanted change

* update doc

* Enhance UserAgent with more information

Before:
"User-Agent: v2.7.0"

After:
"User-Agent: kube-state-metrics/v2.7.0 (linux/amd64) kubernetes/1cda0bf9"

* Handle unit length `valueFrom` values

Handle unit length `valueFrom` values and skip strings where we expect
them to be type-cast-able to `float64`, instead of erroring, since that
is the expected behavior, and what's being done for other types.

* support "True" and "False" as string in custom-resource-state for operator status conditions

* lowercase string and simplify logic

Co-authored-by: Manuel Rüger <manuel@rueg.eu>

* Improve command to install tools

* Document how gauges convert types to float

* Improve formating

* Use structured logging

* Fixing emitting of ready time metrics when condition is false

* Set OpenMetrics content header

See: https://github.com/prometheus/common/blob/main/expfmt/encode.go#L86

See: kubernetes#1973

* Migrate all NewFamilyGenerator to NewFamilyGeneratorWithStability

* Clean up NewFamilyGenerator in comments

* Update docs/customresourcestate-metrics.md

* fixup! Handle unit length `valueFrom` values

* Update docs/customresourcestate-metrics.md

* go.mod: Update dependencies

* *: Cut v2.8.0

* fix: public Builder compatibility with the BuilderInterface

* Don't crash on non-existent path values

Don't crash on non-existent path values in CRS.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

* fixup! Don't crash on non-existent path values

* build: Bump to go 1.19.6

* *: Cut v2.8.1

* Only use OpenMetrics and Text in contentType

* Update pkg/metricshandler/metrics_handler.go

Co-authored-by: Manuel Rüger <manuel@rueg.eu>

* Update golangci-lint version

* Update golang and go deps

Signed-off-by: Manuel Rüger <manuel@rueg.eu>

* *: Cut v2.8.2

* checked out upstream upgrade tag and replayed custom changes on top

Signed-off-by: Nathaniel Graham <[nathaniel.graham@protonmail.com, ngraham@redhat.com]>

* removed travis CI yaml file

Signed-off-by: Nathaniel Graham <[nathaniel.graham@protonmail.com, ngraham@redhat.com]>

* updated Dockerfile.prow to try go1.19

Signed-off-by: Nathaniel Graham <[nathaniel.graham@protonmail.com, ngraham@redhat.com]>

* updated go.mod to specify go1.20

Signed-off-by: Nathaniel Graham <[nathaniel.graham@protonmail.com, ngraham@redhat.com]>

---------

Signed-off-by: dmpe <John Malc> <cincenko@outlook.com>
Signed-off-by: Manuel Rüger <manuel@rueg.eu>
Signed-off-by: Mario Constanti <mario@constanti.de>
Signed-off-by: m.nabokikh <maksim.nabokikh@flant.com>
Signed-off-by: sashashura <93376818+sashashura@users.noreply.github.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: frezes <zhangjunhao@kubesphere.io>
Signed-off-by: Grzegorz Głąb <grzesuav@gmail.com>
Signed-off-by: Lan Liang <gcslyp@gmail.com>
Signed-off-by: Nathaniel Graham <[nathaniel.graham@protonmail.com, ngraham@redhat.com]>
Co-authored-by: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Co-authored-by: Kaito Ii <kaitoii1111@gmail.com>
Co-authored-by: Manuel Rüger <manuel@rueg.eu>
Co-authored-by: dmpe <cincenko@outlook.com>
Co-authored-by: JUN YANG <yang.jun22@zte.com.cn>
Co-authored-by: Christian Schlotter <christi.schlotter@gmail.com>
Co-authored-by: Mario Constanti <mario@constanti.de>
Co-authored-by: m.nabokikh <maksim.nabokikh@flant.com>
Co-authored-by: Shaun Sabo <shaun.sabo@gmail.com>
Co-authored-by: Catherine Fang <yinghongfang@google.com>
Co-authored-by: Alex <93376818+sashashura@users.noreply.github.com>
Co-authored-by: whitebear009 <whitebear009@163.com>
Co-authored-by: Quentin DUPUY <86727342+qdupuy@users.noreply.github.com>
Co-authored-by: evir35 <evir35@gmail.com>
Co-authored-by: Pranshu Srivastava <rexagod@gmail.com>
Co-authored-by: Lanting Chiang <lanting.chiang@robinhood.com>
Co-authored-by: Han Kang <hankang@google.com>
Co-authored-by: Pavel Timofeev <timp87@gmail.com>
Co-authored-by: Matthew Cary <mattcary@google.com>
Co-authored-by: Akshit Tyagi <37214399+exitflynn@users.noreply.github.com>
Co-authored-by: Damien Grisonnet <dgrisonn@redhat.com>
Co-authored-by: JoaoBraveCoding <jmarcal@redhat.com>
Co-authored-by: Damien Grisonnet <damien.grisonnet@epita.fr>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
Co-authored-by: Paweł Kubica <pawel.kubica@comarch.pl>
Co-authored-by: Paweł Kubica <49437476+pawcykca@users.noreply.github.com>
Co-authored-by: frezes <zhangjunhao@kubesphere.io>
Co-authored-by: Pranshu Srivastava <prasriva@redhat.com>
Co-authored-by: Grzegorz Głąb <grzesuav@gmail.com>
Co-authored-by: yosshi825 <syoshimu@zlab.co.jp>
Co-authored-by: Lan Liang <gcslyp@gmail.com>
Co-authored-by: Szymon Grzemski <sz.grzemski@gmail.com>
Co-authored-by: Ryan Olds <ryanrolds@gmail.com>
Co-authored-by: Ryan R. Olds <rolds@squarespace.com>
Co-authored-by: Benjamin Jorand <benjamin.jorand@gmail.com>
Co-authored-by: Jan Kantert <jan-mpf@kantert.net>
Co-authored-by: jabdoa2 <jabdoa2@users.noreply.github.com>
Co-authored-by: Cedric Lamoriniere <cedric.lamoriniere@datadoghq.com>
Co-authored-by: Artur Rodrigues <arturhoo@gmail.com>
Co-authored-by: Nathaniel Graham <[nathaniel.graham@protonmail.com, ngraham@redhat.com]>
@Efrat19
Copy link

Efrat19 commented Mar 28, 2024

Hi! I see this happening again for the new version 2.11.0 (2.10.1 is fine)
A curl to a ksm pod 10.254.30.65 with image registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.11.0:

curl --header "Accept: application/openmetrics-text" -v http://10.254.30.65:8080/metrics 
*   Trying 10.254.30.65:8080...
* Connected to 10.254.30.65 (10.254.30.65) port 8080
> GET /metrics HTTP/1.1
> Host: 10.254.30.65:8080
> User-Agent: curl/8.7.1
> Accept: application/openmetrics-text
> 
* Request completely sent off
< HTTP/1.1 200 OK
< Content-Type: text/plain; version=0.0.4; charset=utf-8
< Date: Thu, 28 Mar 2024 08:52:44 GMT
< Transfer-Encoding: chunked
....
a_stateset_metric{label=value} 1
....
* Connection #0 to host 10.254.30.65 left intact

The returned content type is text/plain and there's no # EOF char, so prometheus fails to scrape it.

@kallaics
Copy link

Hi!

I also recognized a same issue after last upgrade. I got on all of our clusters the invalid metric type "info" message by Prometheus scrape yesterday. I did do a rollback and the older version is fine.

Issue appeared on

  • KSM: v2.11.0
  • Prometheus version: v2.51.0

Last version works fine on

  • KSM: v2.10.1
  • Prometheus: v2.50.1

I tested the output of KSM and looks fine by me.

wget -q -O - http://10.244.4.148:8080/metrics

Sample output:

# HELP app_build_info <description>
# TYPE app_build_info info
app_build_info{customresource_group="apps",customresource_kind="Deployment",customresource_version="v1",exported_namespace="<namespace name>",instance="<instance name>"} 1
...

I checked the KSM and the Prometheus release notes, changelog yesterday, but I didn't found anything regarding to this on KSM or Prometheus side.

I found yesterday and I was not sure it is related to the KSM or the Prometheus itself.

@fischerman
Copy link

I checked the KSM and the Prometheus release notes, changelog yesterday, but I didn't found anything regarding to this on KSM or Prometheus side.

I also can't find anything in the change log. I can however confirm that only reverting KMS is sufficient. I tested both v2.11.0 and v2.10.1 with Prometheus 2.51.1. So something must have changed in KMS.

@rexagod You mentioned in #2270 (comment) that #2270 should fix this issue, too. From what I understand #2270 only applies to protobuf expositions. We only negotiate text-based formats in Prometheus:

  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4

Can the two issues still be related?

@rexagod
Copy link
Member

rexagod commented Apr 3, 2024

Hello, this should be fixed in the latest v2.12.0 release. While the manifests haven't been merged yet, to test things out, you can checkout the latest tag.

PTAL at all three cases below. These are expected results.

┌[rexagod@nebuchadnezzar] [/dev/ttys004] 
└[~/repositories/misc]> curl --header "Accept: application/openmetrics-text" -v localhost:8080/metrics           
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /metrics HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.4.0
> Accept: application/openmetrics-text
> 
< HTTP/1.1 200 OK
< Content-Type: application/openmetrics-text; version=0.0.1; charset=utf-8; escaping=values
< Date: Wed, 03 Apr 2024 08:46:31 GMT
< Content-Length: 580
< 
# HELP kube_customresource_foo_info foo_help
# TYPE kube_customresource_foo_info gauge
kube_customresource_foo_info{customresource_group="contoso.com",customresource_kind="MyPlatform",customresource_version="v1alpha1",name="test-dotnet-app"} ...
...
# EOF
* Connection #0 to host localhost left intact
┌[rexagod@nebuchadnezzar] [/dev/ttys004] 
└[~/repositories/misc]> curl --header "Accept: application/vnd.google.protobuf" -v localhost:8080/metrics
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /metrics HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.4.0
> Accept: application/vnd.google.protobuf
> 
< HTTP/1.1 200 OK
< Content-Type: text/plain; version=0.0.4; charset=utf-8
< Date: Wed, 03 Apr 2024 08:49:15 GMT
< Content-Length: 574
< 
# HELP kube_customresource_foo_info foo_help
# TYPE kube_customresource_foo_info gauge
kube_customresource_foo_info{customresource_group="contoso.com",customresource_kind="MyPlatform",customresource_version="v1alpha1",name="test-dotnet-app"} ...
...
* Connection #0 to host localhost left intact
┌[rexagod@nebuchadnezzar] [/dev/ttys004] 
└[~/repositories/misc]> curl --header "Accept: application/plain-text" -v localhost:8080/metrics
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /metrics HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.4.0
> Accept: application/plain-text
> 
< HTTP/1.1 200 OK
< Content-Type: text/plain; version=0.0.4; charset=utf-8
< Date: Wed, 03 Apr 2024 08:48:52 GMT
< Content-Length: 574
< 
# HELP kube_customresource_foo_info foo_help
# TYPE kube_customresource_foo_info gauge
kube_customresource_foo_info{customresource_group="contoso.com",customresource_kind="MyPlatform",customresource_version="v1alpha1",name="test-dotnet-app"} ...
...
* Connection #0 to host localhost left intact

@rexagod
Copy link
Member

rexagod commented Apr 3, 2024

Feel free to ping this thread if things are still breaking after v2.12.

@fischerman
Copy link

Hi @rexagod, good news first: v2.12 fixes the issue for us and probably most users with a standard Prometheus configuration.

However I could not reproduce your example and the Prometheus exposition format of KMS is broken. This impacts everyone who uses experimental features. I've tested multiple versions of KMS and different exposition formats. For this I used Prometheus (v2.51.1) with the following config:

global:
  scrape_interval: 10s
# scrape_protocols:
# - PrometheusText0.0.4
scrape_configs:
  - job_name: "kms"
    static_configs:
      - targets: ["localhost:8080"]

The default scrape_protocols are:

scrape_protocols:
- OpenMetricsText1.0.0
- OpenMetricsText0.0.1
- PrometheusText0.0.4

Here are the results:

  • v2.10.1
    • Prometheus default scrape protocols: KMS returns OM 1.0 + #TYPE info. WORKS
    • Prometheus asks for PrometheusText0.0.4 only: KMS returns text/plain (which means Prometheus text) + #TYPE info. invalid metric type "info"
    • Prometheus asks for PrometheusProto only: KMS returns text/plain (which means Prometheus text) + #TYPE info. invalid metric type "info"
  • v2.11.0
    • Prometheus default scrape protocols: KMS returns text/plain (which means Prometheus text) + #TYPE info. invalid metric type "info"
  • v2.12.0
    • Prometheus default scrape protocols: KMS returns OM 1.0 + #TYPE info. WORKS but yields a different result then your curl example above.
    • Prometheus asks for PrometheusText0.0.4 only: KMS returns text/plain (which means Prometheus text) + #TYPE info. invalid metric type "info"
    • Prometheus asks for PrometheusProto only: KMS returns text/plain (which means Prometheus text) + #TYPE info. invalid metric type "info"

So as far as I can tell v2.12.0 only works if Prometheus asks for OM. KMS should not return type info if the exposition format is Prometheus (in this case text).

It's weird. It looks a little bit like v2.10.1 and v2.12.0 behave the same, almost like I didn't change the image properly. But I checked three times. I've used registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.12.0.

@rexagod
Copy link
Member

rexagod commented Apr 3, 2024

@fischerman That's weird. Can you checkout to the latest tag on your local (v2.12.0) and see if this is still reproduced (preferably using curl, so we can be sure any issues arising reside within KSM)? This would also ensure that this wasn't due an image replacement issue.

@kallaics
Copy link

kallaics commented Apr 8, 2024

I tested on v2.12.0 and I have an issue with custom resource state definitions. It looks like not iterating over the items correctly, when the metric name is same. I suppose the code probably override with the last item instead of append to the result, when the metric name is same.

I have Kustomization, HelmRelease, GitRepository, HelmRepository, Bucket, HelmChart, ImageRepository, ImagePolicy, ImageUpdateAutomation, Provider, Receiver and Alert resources with same metrics name and I just got back one, the HelmRelease.

This definition works well:

metric1
  - resource1

This not works well:

metric1:
  - resource1
  - resource2
  ...
  - resourceX

Same config worked well with version v2.10.1.

@mrueg
Copy link
Member

mrueg commented Apr 8, 2024

I tested on v2.12.0 and I have an issue with custom resource state definitions. It looks like not iterating over the items correctly, when the metric name is same. I suppose the code probably override with the last item instead of append to the result, when the metric name is same.

I have Kustomization, HelmRelease, GitRepository, HelmRepository, Bucket, HelmChart, ImageRepository, ImagePolicy, ImageUpdateAutomation, Provider, Receiver and Alert resources with same metrics name and I just got back one, the HelmRelease.

This definition works well:

metric1
  - resource1

This not works well:

metric1:
  - resource1
  - resource2
  ...
  - resourceX

Same config worked well with version v2.10.1.

Can you create a new issue and provide a real config? I believe that this is unrelated to this issue.

@kallaics
Copy link

kallaics commented Apr 8, 2024

I tested on v2.12.0 and I have an issue with custom resource state definitions. It looks like not iterating over the items correctly, when the metric name is same. I suppose the code probably override with the last item instead of append to the result, when the metric name is same.
I have Kustomization, HelmRelease, GitRepository, HelmRepository, Bucket, HelmChart, ImageRepository, ImagePolicy, ImageUpdateAutomation, Provider, Receiver and Alert resources with same metrics name and I just got back one, the HelmRelease.
This definition works well:

metric1
  - resource1

This not works well:

metric1:
  - resource1
  - resource2
  ...
  - resourceX

Same config worked well with version v2.10.1.

Can you create a new issue and provide a real config? I believe that this is unrelated to this issue.

I can create an issue, but it happened on time with it.
It worked with v2.10.1 well.
The Prometheus provided the "invalid metric type" error message in v2.11.0 as I written above (#1973 (comment))
The current state is in my last commit. The configuration not changed since v2.10.1 as I saw it well.
This is the current state.

Issue created #2366

Many thanks for the help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants