Prometheus ServiceMonitor failing to scrape operator metrics served through kube-proxy HTTPS 8443 port #4764

Closed
slopezz opened this issue Apr 14, 2021 · 8 comments
Labels
kind/documentation Categorizes issue or PR as related to documentation. language/ansible Issue is related to an Ansible operator project lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/needs-information Indicates an issue needs more information in order to work on it.

slopezz commented Apr 14, 2021

Bug Report

I'm using operator-sdk 1.5.0 and I'm trying to gather operator metrics without success.

What did you do?

I deployed a default operator-sdk v1.5.0 project with Prometheus metrics enabled in the kustomize config (config/default/kustomization.yaml). I'm using kube-rbac-proxy:v0.5.0 because of issue #4684, but I don't think that affects this problem.

  • The expected controller-manager Deployment is created, with kube-rbac-proxy serving metrics on port 8443 (only the relevant parts of the deployed YAML are shown):
kind: Deployment
apiVersion: apps/v1
metadata:
  name: prometheus-exporter-operator-controller-manager
  namespace: prometheus-exporter
spec:
  template:
    spec:
      containers:
        - name: kube-rbac-proxy
          image: 'gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0'
          args:
            - '--secure-listen-address=0.0.0.0:8443'
            - '--upstream=http://127.0.0.1:8080/'
            - '--logtostderr=true'
            - '--v=10'
          ports:
            - name: https
              containerPort: 8443
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
        - name: manager
...
          env:
            - name: ANSIBLE_GATHERING
              value: explicit
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: 'metadata.annotations[''olm.targetNamespaces'']'
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          image: 'quay.io/3scale/prometheus-exporter-operator:v0.3.0'
          args:
            - '--metrics-addr=127.0.0.1:8080'
            - '--enable-leader-election'
            - '--leader-election-id=prometheus-exporter-operator'
...   
  • The expected metrics Service is created:
kind: Service
apiVersion: v1
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-service
  namespace: prometheus-exporter
  labels:
    control-plane: controller-manager
    operators.coreos.com/prometheus-exporter-operator.prometheus-exporter: ''
spec:
  ports:
    - name: https
      protocol: TCP
      port: 8443
      targetPort: https
  selector:
    control-plane: controller-manager
  clusterIP: 172.30.117.225
  type: ClusterIP
  sessionAffinity: None
  • The expected ServiceMonitor is created:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-monitor
  namespace: prometheus-exporter
  labels:
    control-plane: controller-manager
spec:
  endpoints:
    - path: /metrics
      port: https
  selector:
    matchLabels:
      control-plane: controller-manager

What did you expect to see?

The ServiceMonitor scrapes the operator metrics successfully (and so the metric up=1).

What did you see instead? Under which circumstances?

The ServiceMonitor scrape fails (metric up=0):

up{container="kube-rbac-proxy",endpoint="https",instance="10.129.2.246:8443",job="prometheus-exporter-operator-controller-manager-metrics-service",namespace="prometheus-exporter",pod="prometheus-exporter-operator-controller-manager-669f6fbdcc2jbm7",prometheus="openshift-user-workload-monitoring/user-workload",service="prometheus-exporter-operator-controller-manager-metrics-service"} | 0

Environment

Operator type:

/language ansible

Kubernetes cluster type: Openshift v4.6

$ operator-sdk version

operator-sdk version: "v1.5.0", commit: "98f30d59ade2d911a7a8c76f0169a7de0dec37a0", kubernetes version: "1.19.4", go version: "go1.15.5", GOOS: "linux", GOARCH: "amd64"

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-dispatcher", GitCommit:"fd22db44e150011eccc8729db223945384460143", GitTreeState:"clean", BuildDate:"2020-07-24T07:27:52Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0+bd9e442", GitCommit:"bd9e4421804c212e6ac7ee074050096f08dda543", GitTreeState:"clean", BuildDate:"2021-02-11T23:05:38Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

N/A

Additional context

If I connect to the manager container of the controller-manager pod, I can read the metrics served on the manager's protected port 8080 (only reachable at 127.0.0.1):

$ kubectl exec -it prometheus-exporter-operator-controller-manager-669f6fbdcc2jbm7 -c manager -- /bin/bash

bash-4.4$ curl 127.0.0.1:8080/metrics
# HELP ansible_operator_build_info Build information for the ansible-operator binary
# TYPE ansible_operator_build_info gauge
ansible_operator_build_info{commit="98f30d59ade2d911a7a8c76f0169a7de0dec37a0",version="v1.4.0+git"} 1
# HELP ansible_operator_reconcile_result Gauge of reconciles and their results.
# TYPE ansible_operator_reconcile_result gauge
ansible_operator_reconcile_result{GVK="monitoring.3scale.net/v1alpha1, Kind=PrometheusExporter",result="succeeded"} 6
# HELP ansible_operator_reconciles How long in seconds a reconcile takes.
# TYPE ansible_operator_reconciles histogram
ansible_operator_reconciles_bucket{GVK="monitoring.3scale.net/v1alpha1, Kind=PrometheusExporter",le="0.005"} 6
...

However, if I try to access the port published by kube-rbac-proxy, it fails with both the http and https schemes. I guess this is what Prometheus is trying to do with the deployed ServiceMonitor, hence the failure:

bash-4.4$ curl 127.0.0.1:8443/metrics
Client sent an HTTP request to an HTTPS server.


bash-4.4$ curl https://127.0.0.1:8443/metrics
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

Maybe there are 2 problems?

  • The current ServiceMonitor tries to scrape using the HTTP scheme, but the port serves HTTPS?
  • In addition, even if the ServiceMonitor used the HTTPS scheme (which is not the case), the certificate is self-signed, so maybe Prometheus would refuse it anyway?
@openshift-ci-robot openshift-ci-robot added the language/ansible Issue is related to an Ansible operator project label Apr 14, 2021
criscola commented Apr 15, 2021

Hello, I had the exact same problem and it gave me quite a headache. Try the following:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-monitor
  namespace: prometheus-exporter
  labels:
    control-plane: controller-manager
spec:
  endpoints:
    - path: /metrics
      port: https
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: controller-manager

slopezz commented Apr 15, 2021

Thanks for posting that solution @criscola, actually your suggestion makes total sense.

I have applied that change and deployed it:

$ git diff
diff --git a/config/prometheus/monitor.yaml b/config/prometheus/monitor.yaml
index 1b44d4f..a5bd8b1 100644
--- a/config/prometheus/monitor.yaml
+++ b/config/prometheus/monitor.yaml
@@ -11,6 +11,10 @@ spec:
   endpoints:
     - path: /metrics
       port: https
+      scheme: https
+      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
+      tlsConfig:
+        insecureSkipVerify: true
   selector:
     matchLabels:
       control-plane: controller-manager



$ make deploy 
cd config/manager && /home/slopez/bin/kustomize edit set image controller=quay.io/3scale/prometheus-exporter-operator:v0.3.0
/home/slopez/bin/kustomize build config/manual | kubectl apply -f -
namespace/prometheus-exporter-operator-system created
customresourcedefinition.apiextensions.k8s.io/prometheusexporters.monitoring.3scale.net created
serviceaccount/prometheus-exporter-operator-controller-manager created
role.rbac.authorization.k8s.io/prometheus-exporter-operator-leader-election-role created
role.rbac.authorization.k8s.io/prometheus-exporter-operator-manager-role created
clusterrole.rbac.authorization.k8s.io/prometheus-exporter-operator-metrics-reader created
clusterrole.rbac.authorization.k8s.io/prometheus-exporter-operator-proxy-role created
rolebinding.rbac.authorization.k8s.io/prometheus-exporter-operator-leader-election-rolebinding created
rolebinding.rbac.authorization.k8s.io/prometheus-exporter-operator-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-exporter-operator-proxy-rolebinding created
service/prometheus-exporter-operator-controller-manager-metrics-service created
deployment.apps/prometheus-exporter-operator-controller-manager created
servicemonitor.monitoring.coreos.com/prometheus-exporter-operator-controller-manager-metrics-monitor created

But now Prometheus does not scrape the operator at all. Before, I had up=0 because the target was down, but now there is nothing, as if Prometheus were ignoring the ServiceMonitor after applying those 3 changes.

It might be caused by a totally unrelated problem with the monitoring stack I'm using, which is the OpenShift user-workload-monitoring stack (let's say, the official way of monitoring user workloads on OpenShift).

If I get into a Prometheus pod from the OpenShift user-workload-monitoring stack (for example, the config-reloader container) and run the curl request that Prometheus should be making with the configured ServiceMonitor, it works fine and I get the metrics:

$ oc project openshift-user-workload-monitoring
Now using project "openshift-user-workload-monitoring" on server "https://api.....net:6443".

$ oc get pods
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-849fdfdcb5-ktqjd   2/2     Running   0          29d
prometheus-user-workload-0             4/4     Running   1          29d
prometheus-user-workload-1             4/4     Running   1          2d22h
thanos-ruler-user-workload-0           3/3     Running   3          29d
thanos-ruler-user-workload-1           3/3     Running   3          29d


$ kubectl exec -it prometheus-user-workload-1 -c config-reloader -- /bin/bash

bash-4.4$ cat /var/run/secrets/kubernetes.io/serviceaccount/token
qy......xAba
 
bash-4.4$ curl --insecure https://prometheus-exporter-operator-controller-manager-metrics-service.prometheus-exporter-operator-system.svc.cluster.local:8443/metrics -H "Authorization: Bearer qy......xAba"
# HELP ansible_operator_build_info Build information for the ansible-operator binary
# TYPE ansible_operator_build_info gauge
ansible_operator_build_info{commit="98f30d59ade2d911a7a8c76f0169a7de0dec37a0",version="v1.4.0+git"} 1
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="prometheusexporter-controller"} 0
....

However, it seems that Prometheus is ignoring the ServiceMonitor because it has the bearerTokenFile field defined. If I remove that field from the ServiceMonitor, the Prometheus scrape fails (as expected), but at least I can see that Prometheus is trying to scrape it and reporting up=0. Once I add bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token, there is nothing.

slopezz commented Apr 16, 2021

I have checked that the new operator-sdk v1.6.1 (released yesterday) already includes the suggested ServiceMonitor modifications, from #4680:

+      scheme: https
+      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
+      tlsConfig:
+        insecureSkipVerify: true

And I have found that my particular problem is caused by using the OpenShift User Workload Monitoring stack. If I look at the prometheus-operator pod logs, I can see the following warning:

level=warn ts=2021-04-15T14:55:59.541363474Z caller=operator.go:1636 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=prometheus-exporter-operator-system/prometheus-exporter-operator-controller-manager-metrics-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload 

So the ServiceMonitor with the bearerTokenFile field is ignored (skipped), because the Prometheus configuration prohibits it.

After discussing it with the OpenShift monitoring team, I learned that it is skipped because of the arbitraryFSAccessThroughSMs setting, which is configured to disallow file system access in order to limit potential security issues (so that scraped targets cannot get access to the Prometheus service account's token).
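For context, here is a minimal sketch of how this restriction is expressed on a Prometheus custom resource managed by the Prometheus Operator. The resource name and namespace are taken from the log line above; the actual manifest that OpenShift generates is not shown in this thread, so everything else here is an assumption.

```yaml
# Sketch only: the real user-workload Prometheus resource is managed by
# OpenShift and contains many more fields than shown here.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: user-workload
  namespace: openshift-user-workload-monitoring
spec:
  # When deny is true, the operator skips any ServiceMonitor whose endpoints
  # read files from the Prometheus container, such as bearerTokenFile or
  # file-based tlsConfig entries (caFile, certFile, keyFile).
  arbitraryFSAccessThroughSMs:
    deny: true
```

With a setting like this in place, the skip happens in the operator before any scrape configuration is generated, which would explain why no target shows up at all instead of up=0.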

They suggested that I could use bearerTokenSecret instead of bearerTokenFile, but after running a couple of tests myself (and without being an expert on the matter), I see a couple of issues:

  • On the one hand, bearerTokenSecret requires the name of a secret, and the secrets holding ServiceAccount tokens have random object names (so there is no way to know the name of the secret before creating the ServiceAccount, and operator-sdk requires predictable object names for the scaffolding).
  • In addition, operators normally use Roles (not ClusterRoles), and it seems that the required permission can only be added to ClusterRoles (a rule with nonResourceURLs: /metrics and verbs: get).
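One possible way around the random secret names, not verified in this thread: Kubernetes populates a token into a manually created Secret of type kubernetes.io/service-account-token that carries the kubernetes.io/service-account.name annotation, so the Secret name can be fixed up front. The sketch below assumes that approach. The names metrics-reader, metrics-reader-token and metrics-reader-binding are hypothetical, and the prometheus-exporter-operator-metrics-reader ClusterRole is assumed to grant get on /metrics as in the default scaffold.

```yaml
# Hypothetical names: metrics-reader (ServiceAccount), metrics-reader-token
# (Secret) and metrics-reader-binding are illustrative, not from the scaffold.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-reader
  namespace: prometheus-exporter
---
# Kubernetes fills in data.token for a Secret of this type that references
# an existing ServiceAccount, so the Secret name is known in advance.
apiVersion: v1
kind: Secret
metadata:
  name: metrics-reader-token
  namespace: prometheus-exporter
  annotations:
    kubernetes.io/service-account.name: metrics-reader
type: kubernetes.io/service-account-token
---
# The permission on /metrics is a non-resource URL rule, so it still has to
# come from a ClusterRole bound with a ClusterRoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-exporter-operator-metrics-reader
subjects:
  - kind: ServiceAccount
    name: metrics-reader
    namespace: prometheus-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-monitor
  namespace: prometheus-exporter
  labels:
    control-plane: controller-manager
spec:
  endpoints:
    - path: /metrics
      port: https
      scheme: https
      # References a Secret in the ServiceMonitor's own namespace instead of
      # a file inside the Prometheus container.
      bearerTokenSecret:
        name: metrics-reader-token
        key: token
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: controller-manager
```

This would only address the first point. The second one remains, because the binding above is still cluster-scoped.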

In addition, the OpenShift monitoring team told me that we should bear in mind that using bearer tokens for metrics authn puts additional load on the API server, and that they are looking at replacing this with client TLS auth in the future (it's being discussed here: openshift/enhancements#701).

For the moment I will just remove the proxy in front of the operator (to be able to access operator metrics without any problem using the OpenShift UWM).
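A rough sketch of what the proxy-less setup could look like, reusing the object names from the manifests above. It assumes the kube-rbac-proxy container is removed from the Deployment and the manager is started with --metrics-addr=0.0.0.0:8080 instead of 127.0.0.1:8080; the port name metrics is illustrative.

```yaml
# Sketch, assuming the manager listens on 0.0.0.0:8080 and there is no
# kube-rbac-proxy sidecar. The port name "metrics" is illustrative.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-service
  namespace: prometheus-exporter
  labels:
    control-plane: controller-manager
spec:
  ports:
    - name: metrics
      protocol: TCP
      port: 8080
      targetPort: 8080
  selector:
    control-plane: controller-manager
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-monitor
  namespace: prometheus-exporter
  labels:
    control-plane: controller-manager
spec:
  endpoints:
    # Plain HTTP with no bearerTokenFile, so nothing triggers the
    # file system access check.
    - path: /metrics
      port: metrics
      scheme: http
  selector:
    matchLabels:
      control-plane: controller-manager
```

The trade-off is that the metrics become readable by anything that can reach the pod on port 8080, so this only makes sense when the metrics contain nothing sensitive.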

So from my point of view the issue can be closed now (there is no problem with operator-sdk itself), but I will let the operator-sdk team decide what to do, because the current ServiceMonitor definition won't work with OpenShift User Workload Monitoring (the official monitoring stack on OpenShift). Maybe I'm missing something: can you think of a way to authenticate to the metrics endpoint that does not require a ClusterRole or access to a generated secret?

@kensipe kensipe added kind/documentation Categorizes issue or PR as related to documentation. triage/needs-information Indicates an issue needs more information in order to work on it. labels Apr 19, 2021
@kensipe kensipe added this to the Backlog milestone Apr 19, 2021
@camilamacedo86

Just sharing this for whoever is able to look into it and help here: the PR kubernetes-sigs/kubebuilder#2065 changes the related scaffolds, so it might be worth checking this with the latest scaffold as well.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 18, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this as completed Sep 17, 2021
openshift-ci bot commented Sep 17, 2021

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

redhatHameed added a commit to redhatHameed/dbaas-operator that referenced this issue Oct 4, 2021
…etrics see more detail in issue operator-framework/operator-sdk#4764

Signed-off-by: Abdul Hameed <ahameed@redhat.com>
redhatHameed added a commit to redhatHameed/dbaas-operator that referenced this issue Oct 6, 2021
…etrics see more detail in issue operator-framework/operator-sdk#4764

Signed-off-by: Abdul Hameed <ahameed@redhat.com>
joelddiaz pushed a commit to mondoohq/mondoo-operator that referenced this issue Mar 18, 2022
Using a ServiceMonitor with the bearerTokenFile parameter set causes the
ServiceMonitor to be rejected by the OpenShift user monitoring stack (
operator-framework/operator-sdk#4764 ).

As there is nothing sensitive in the mondoo-operator metrics, just
expose them directly to allow metrics to work under the built-in
OpenShift user metrics monitoring stack.

Add the ability to set some labels on the ServiceMonitor to allow a
functional metrics collection with an out-of-the-box prometheus deployed
as configured in
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
.

Change the kustomize generation so that the kube-rbac-proxy sidecar
container is no longer defined. It really only exists to protect
metrics. Introduce new Service to expose the new metrics ports. Patch
the default Deployment to expose the metrics port. A side benefit of
this is that you don't need to specify the container name when
displaying logs for mondoo-operator as there is now only a single
container.

Signed-off-by: Joel Diaz <joel@mondoo.com>
chris-rock pushed a commit to mondoohq/mondoo-operator that referenced this issue Mar 19, 2022
* Expose metrics for prometheus
* Added Status

Signed-off-by: Harsha <harshaisgud@gmail.com>

* migrate to using new MondooOperatorConfig for metrics

Rather than put the metrics config into the MondooAuditConfig (which is
really for configuring monitoring-specific settings), create a new
MondooOperatorConfig CRD which is cluster-scoped which can be used to
configure operator-wide behavior of the mondoo-operator.

In a cluster with multiple MondooAuditConfigs, it makes no sense to have
one resource with metrics.enabled = true and a different one with
metrics.enabled = false. So just allow a single MondooOperatorConfig to
hold the cluster-wide metrics configuration for the mondoo-operator.

Take the existing ServiceMonitor handling code and call it from the new
mondoooperatorconfig controller.

Extend the MondooOperatorConfig status to hold a list of conditions, and
use this to communicate status for when metrics is enabled, but we
couldn't find Prometheus installed on the cluster.

The conditions handling is written so that a Condition only appears
initially if the Condition.Status is set to True. This means that if you
enable metrics, and Prometheus is found, there will be no
Condition[].Type = PrometheusMissing with .Status = False. Only when
Prometheus is missing will the condition be populated, and of course if
Prometheus transitions from Missing to Found, then the Condition will be
updated to show .Type = PrometheusMissing .Status = False.

Signed-off-by: Joel Diaz <joel@mondoo.com>

* move to http metrics

Using a ServiceMonitor with the bearerTokenFile parameter set causes the
ServiceMonitor to be rejected by the OpenShift user monitoring stack (
operator-framework/operator-sdk#4764 ).

As there is nothing sensitive in the mondoo-operator metrics, just
expose them directly to allow metrics to work under the built-in
OpenShift user metrics monitoring stack.

Add the ability to set some labels on the ServiceMonitor to allow a
functional metrics collection with an out-of-the-box prometheus deployed
as configured in
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
.

Change the kustomize generation so that the kube-rbac-proxy sidecar
container is no longer defined. It really only exists to protect
metrics. Introduce new Service to expose the new metrics ports. Patch
the default Deployment to expose the metrics port. A side benefit of
this is that you don't need to specify the container name when
displaying logs for mondoo-operator as there is now only a single
container.

Signed-off-by: Joel Diaz <joel@mondoo.com>

Co-authored-by: Joel Diaz <joel@mondoo.com>
redhatHameed added a commit to redhatHameed/dbaas-operator that referenced this issue Apr 11, 2022
…etrics see more detail in issue operator-framework/operator-sdk#4764

Signed-off-by: Abdul Hameed <ahameed@redhat.com>
redhatHameed added a commit to redhatHameed/dbaas-operator that referenced this issue Apr 12, 2022
…etrics see more detail in issue operator-framework/operator-sdk#4764

Signed-off-by: Abdul Hameed <ahameed@redhat.com>

6 participants