Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

📖 document metrics scraping and enable metrics services via annotations #4247

Closed

Conversation

bavarianbidi
Copy link
Contributor

What this PR does / why we need it:

Oriented on the metrics documentation from kubebuilder, this PR will add the required annotations on the existing *-metrics-service objects and describe how to configure Prometheus with required ClusterRoles/ClusterRoleBindings to get valid scrape targets.

Mario Constanti mario.constanti@daimler.com, Daimler TSS GmbH, legal info/Impressum

Constanti, Mario added 2 commits March 3, 2021 16:23
To scrape metrics from capi containers, prometheus is mostly configured
to scrape from targets, when the well common prometheus.io annotations
are used.

Signed-off-by: Constanti, Mario <mario.constanti@daimler.com>
Add some more details, how a kubebuilder bootstraped application protect
their metrics endpoint and how prometheus must be configured to scrape
these metrics.

Signed-off-by: Constanti, Mario <mario.constanti@daimler.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 3, 2021
@k8s-ci-robot
Copy link
Contributor

Welcome @bavarianbidi!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 3, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @bavarianbidi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 3, 2021
@bavarianbidi bavarianbidi changed the title 📖 enable metrics scraping 📖 document metrics scraping and enable metrics services via annotations Mar 3, 2021
@fabriziopandini
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 3, 2021
@@ -1,6 +1,10 @@
apiVersion: v1
kind: Service
metadata:
annotations:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these changes also be applied in each infra provider?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will go through the other infra provider and add the missing annotation.

Copy link
Contributor Author

@bavarianbidi bavarianbidi Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @CecileRobertMichon

just created three PRs for providers under kubernetes-sigs-Org, because CLA already signed. Have to check the legal stuff on the remaining others first. Hope this doesn't block this PR ;-)

Summary:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about Azure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

annotation was already there but was removed last year.

I created Issue #1222 to get in contact with Azure team.

Copy link
Contributor Author

@bavarianbidi bavarianbidi Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Azure PR #1320 created and merged

@CecileRobertMichon
Copy link
Contributor

cc @devigned - this might be interesting to you

@devigned
Copy link
Contributor

devigned commented Mar 4, 2021

Another way we have it configured when you bring up CAPZ in Tilt is to specify the scraping information via a ServiceMonitor.

---
# Prometheus Monitor Service (Metrics)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    control-plane: capz-controller-manager
  name: capz-controller-manager-metrics-monitor
spec:
  endpoints:
    - path: /metrics
      port: https
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: capz-controller-manager

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/451d034dc0c0995ab2fe88aa04c69274e8e18f2e/hack/observability/prometheus/resources/prometheus.yaml#L91-L108

I haven't experimented with it, but I bet by adding the annotations it would allow folks to specify less in the endpoints configuration.

+1 to prom annotations

@bavarianbidi
Copy link
Contributor Author

Another way we have it configured when you bring up CAPZ in Tilt is to specify the scraping information via a ServiceMonitor.

---
# Prometheus Monitor Service (Metrics)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    control-plane: capz-controller-manager
  name: capz-controller-manager-metrics-monitor
spec:
  endpoints:
    - path: /metrics
      port: https
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: capz-controller-manager

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/451d034dc0c0995ab2fe88aa04c69274e8e18f2e/hack/observability/prometheus/resources/prometheus.yaml#L91-L108

I haven't experimented with it, but I bet by adding the annotations it would allow folks to specify less in the endpoints configuration.

+1 to prom annotations

Yes, but this require the prometheus-operator in place. The annotation-way is much more generic (imho)

@fabriziopandini
Copy link
Member

/milestone v0.4.0

Comment on lines +18 to +24
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: capi-metrics-reader
rules:
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may as well add this to the infra components yaml and remove this step for the end user.

Probably makes sense to leave out the cluster role binding as we don't know the namespace Prometheus may be deployed into.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the reason i documented it this way

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@randomvariable so did I get it right that you suggest removing this YAML here and adding it to the infra component YAMLs of all providers? Maybe I'm missing something but in case we really want to add this to our deployments, it looks to me like we need this ClusterRole only once.

But I'm really not sure if we should add this to our YAMLs so that it's deployed everywhere. Imho Prometheus and RBAC setups can vary and (as far as I'm aware) there was no recurring demand for this in Slack. I assume nobody is really missing this in our YAMLs at the moment.

I have no strong opinion against adding the ClusteRole to our YAMLs, but if we do we should do it right and adding it to all infra providers seems redundant to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if #4640 will be merged, there is no need to add these objects to the infra components yaml. Let's wait what happens to #4640 and discuss again

Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. Here are some minor nits about documentation style.

docs/book/src/reference/metrics.md Outdated Show resolved Hide resolved
docs/book/src/reference/metrics.md Outdated Show resolved Hide resolved
docs/book/src/reference/metrics.md Outdated Show resolved Hide resolved
docs/book/src/reference/metrics.md Outdated Show resolved Hide resolved
docs/book/src/reference/metrics.md Outdated Show resolved Hide resolved
Signed-off-by: Constanti, Mario <mario.constanti@daimler.com>
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign fabriziopandini after the PR has been reviewed.
You can assign the PR to them by writing /assign @fabriziopandini in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@sbueringer sbueringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few nits. Apart from that we should clarify if we want to move the ClusterRole to our YAMLs. I think we should start with this documentation and with more data / user feedback we can always move it into the YAMLs later on.

docs/book/src/reference/metrics.md Outdated Show resolved Hide resolved
Comment on lines +18 to +24
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: capi-metrics-reader
rules:
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@randomvariable so did I get it right that you suggest removing this YAML here and adding it to the infra component YAMLs of all providers? Maybe I'm missing something but in case we really want to add this to our deployments, it looks to me like we need this ClusterRole only once.

But I'm really not sure if we should add this to our YAMLs so that it's deployed everywhere. Imho Prometheus and RBAC setups can vary and (as far as I'm aware) there was no recurring demand for this in Slack. I assume nobody is really missing this in our YAMLs at the moment.

I have no strong opinion against adding the ClusteRole to our YAMLs, but if we do we should do it right and adding it to all infra providers seems redundant to me.

docs/book/src/reference/metrics.md Outdated Show resolved Hide resolved
docs/book/src/reference/metrics.md Outdated Show resolved Hide resolved
@sbueringer
Copy link
Member

@bavarianbidi fyi. The metrics port will change through: https://github.com/kubernetes-sigs/cluster-api/pull/4640/files

@sbueringer
Copy link
Member

sbueringer commented May 21, 2021

@bavarianbidi fyi. The metrics port will change through: https://github.com/kubernetes-sigs/cluster-api/pull/4640/files

thanks for this hint. I will stand still until #4640 is merged because we don't need the additional documentation regarding clusterRoles, clusterRoleBindings and serviceAccounts for metrics scraping. The annotations (with adjusted metrics port and scheme) are still valid.

Yup I somehow didn't even think about that this makes this PR also a lot easier :)

@sbueringer
Copy link
Member

/hold

until #4640 is merged

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2021
@k8s-ci-robot
Copy link
Contributor

@bavarianbidi: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@randomvariable
Copy link
Member

i don't think there's any metrics to expose now.

@sbueringer
Copy link
Member

sbueringer commented Jun 10, 2021

i don't think there's any metrics to expose now.

How do you mean that?

As far as I'm aware there should be metrics exposed via the metrics endpoint right now. Although, after the latest changes they are only exposed on localhost.

@vincepri
Copy link
Member

Folks, what's the status of this PR?

@sbueringer
Copy link
Member

sbueringer commented Jul 29, 2021

/hold cancel
(as the referenced PR above has been merged)

Since the PR has been initially opened the situation has been changed a bit. Previously, we had kube-rbac-proxy so we needed to:

  • Add service annotations
  • Deploy a ClusterRole
  • Extend the Prometheus config

In the meantime we dropped kube-rbac-proxy but we also are binding the metrics port to localhost so the metrics cannot be scraped at all (per default). I think one way to scrape the metrics now would be:

  • Set "--metrics-bind-addr=0.0.0.0:8080" on all managers
  • Extend the Prometheus config
    (no ClusterRole needed)

But as we didn't want to merge the variant with the metrics port binded to 0.0.0.0 because of security concerns (* No ClusterRole needed), I'm not sure if we want to document that variant?

An alternative would be to document how to

  • Add kube-rbac-proxy sidecar to every manager deployment
  • Add service annotations
  • Deploy a ClusterRole
  • Extend the Prometheus config

Afaik we're now in a situation where it doesn't make sense to adjust any of our release manifests and only document the steps a user has to do based on our manifests.

@randomvariable Regarding the metrics we have. We don't have CAPI specific metrics, but we still have the ones from controller-runtime and go (link). In my experience they can already be used for some basic monitoring and alerting.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 29, 2021
@sbueringer
Copy link
Member

Forgot cc :)
/cc @vincepri @bavarianbidi

@k8s-ci-robot k8s-ci-robot requested a review from vincepri July 29, 2021 05:56
@k8s-ci-robot
Copy link
Contributor

@sbueringer: GitHub didn't allow me to request PR reviews from the following users: bavarianbidi.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Forgot cc :)
/cc @vincepri @bavarianbidi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bavarianbidi
Copy link
Contributor Author

bavarianbidi commented Aug 26, 2021

/hold cancel
(as the referenced PR above has been merged)

Since the PR has been initially opened the situation has been changed a bit. Previously, we had kube-rbac-proxy so we needed to:

  • Add service annotations
  • Deploy a ClusterRole
  • Extend the Prometheus config

In the meantime we dropped kube-rbac-proxy but we also are binding the metrics port to localhost so the metrics cannot be scraped at all (per default). I think one way to scrape the metrics now would be:

  • Set "--metrics-bind-addr=0.0.0.0:8080" on all managers
  • Extend the Prometheus config
    (no ClusterRole needed)

But as we didn't want to merge the variant with the metrics port binded to 0.0.0.0 because of security concerns (* No ClusterRole needed), I'm not sure if we want to document that variant?

An alternative would be to document how to

  • Add kube-rbac-proxy sidecar to every manager deployment
  • Add service annotations
  • Deploy a ClusterRole
  • Extend the Prometheus config

Afaik we're now in a situation where it doesn't make sense to adjust any of our release manifests and only document the steps a user has to do based on our manifests.

@randomvariable Regarding the metrics we have. We don't have CAPI specific metrics, but we still have the ones from controller-runtime and go (link). In my experience they can already be used for some basic monitoring and alerting.

why not document the --metrics-bind-addr=0.0.0.0:8080 variant and add a note about possible security concerns. The variant with kube-rbac-proxy is definitely much more secure, but then i would propose to create an additional section in the docs how to improve the security on CAPI in general by using kube-rbac-proxy.

WDYT @sbueringer / @vincepri

@sbueringer
Copy link
Member

@CoMario Fine for me.

@vincepri
Copy link
Member

Folks, what's the status of this PR?

@sbueringer
Copy link
Member

sbueringer commented Sep 16, 2021

Folks, what's the status of this PR?

@vincepri I think we're waiting from a response from you if the proposed documentation in #4247 (comment) would be okay

@killianmuldoon
Copy link
Contributor

/area health

@k8s-ci-robot
Copy link
Contributor

@bavarianbidi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-test-main-mink8s cb247b9 link /test pull-cluster-api-test-main-mink8s
pull-cluster-api-verify-main cb247b9 link true /test pull-cluster-api-verify-main
pull-cluster-api-test-mink8s-main cb247b9 link true /test pull-cluster-api-test-mink8s-main
pull-cluster-api-e2e-main cb247b9 link true /test pull-cluster-api-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fabriziopandini fabriziopandini modified the milestones: v1.1, v1.2 Feb 3, 2022
@enxebre
Copy link
Member

enxebre commented Mar 1, 2022

@bavarianbidi are you still pursuing this PR?

@bavarianbidi
Copy link
Contributor Author

pursuing

@enxebre as i will leave Daimler within the next couple of weeks i will loose access to the mercedes-benz github organization and i'm not able to update this PR anymore.

I'm fine if we close this PR and we create several other PRs as cluster-api-state-metrics will make it into CAPI-Repo and then the monitoring-section in the gitbook needs a general refactoring. WDYT?

@vincepri
Copy link
Member

/close

Closing based on the above comment, if folks want to still pursue documentation later please feel free to reopen different PRs

@k8s-ci-robot
Copy link
Contributor

@vincepri: Closed this PR.

In response to this:

/close

Closing based on the above comment, if folks want to still pursue documentation later please feel free to reopen different PRs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.