Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation for pod/service monitors for TargetAllocator and skip invalid ones #2328

Merged

Conversation

rashmichandrashekar
Copy link
Contributor

Description:
Fixing a bug - When a podmonitor/svcmonitor CRD (with Targetallocator enabled)has an erroneous field for regex in the relabeling/metric relabeling fields, the crd creation succeeds since it can be a valid string but not a valid regex( ex- .*( ). This causes any subsequent valid CRDs being created from getting picked up.

Link to tracking Issue: #2309

Testing: Tested this out with otel operator setup

@rashmichandrashekar rashmichandrashekar requested a review from a team November 9, 2023 02:17
// based on the service monitor and pod metrics endpoints specs.
// This code borrows from
// https://github.com/prometheus-operator/prometheus-operator/blob/ba536405154d18f3a6f312818283d671182af6f3/pkg/prometheus/resource_selector.go#L237
func validateRelabelConfig(rc monitoringv1.RelabelConfig) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful for us to reach out to the prometheus operator folks to see if they would expose this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jaronoff97. I have reached out to them, waiting for a response,(prometheus-operator/prometheus-operator#6057 (comment)) based on that I can add the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also created an issue here - prometheus-operator/prometheus-operator#6076

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaronoff97 - There is a question about breaking changes. I have answered it from what I understand and what I think the way forward would be. Could you please respond to provide your perspective and correct me if i am wrong?
prometheus-operator/prometheus-operator#6076 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, i responded there :) thanks for making the issue. Let's wait to hear from them before merging this – i would love to just use their method rather than having to write it ourselves.

Copy link
Contributor

@jaronoff97 jaronoff97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some tests in to the promOperator_test.go file to ensure this works how we expect?

@jaronoff97
Copy link
Contributor

@rashmichandrashekar could you re-implement this PR to use the interfaces recommended by the prometheus operator?

@rashmichandrashekar
Copy link
Contributor Author

@rashmichandrashekar could you re-implement this PR to use the interfaces recommended by the prometheus operator?

@jaronoff97 - For sure, I have the changes all ready. Just cleaning some things up and I will create a PR today. Thanks for helping with this!

@rashmichandrashekar rashmichandrashekar requested a review from a team November 29, 2023 19:30
@rashmichandrashekar
Copy link
Contributor Author

@rashmichandrashekar could you re-implement this PR to use the interfaces recommended by the prometheus operator?

@jaronoff97 - For sure, I have the changes all ready. Just cleaning some things up and I will create a PR today. Thanks for helping with this!

@jaronoff97 - I have reimplemented this and added relevant tests. Please take a look and lmk if you have any feedback. Also a couple tests seem to be timing out, please requeue them if possible. Thanks!

@jaronoff97
Copy link
Contributor

Thank you! I'll try to take look today or monday.

Copy link
Contributor

@jaronoff97 jaronoff97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits/ questions. Does this change how our existing label selector works for the target allocator? I'm wondering if we should also write a new e2e or two to verify this works as expected. cc @swiatekm-sumo

Comment on lines 54 to 55
ServiceMonitorNamespaceSelector map[string]string `yaml:"service_monitor_namespace_selector,omitempty"`
PodMonitorNamespaceSelector map[string]string `yaml:"pod_monitor_namespace_selector,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should be defaulting these to empty so that we can keep the current behavior (link)

Copy link
Contributor Author

@rashmichandrashekar rashmichandrashekar Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jaronoff97. Since the common prometheus fields have the ServiceMonitorNamespaceSelector and PodMonitorNamespaceSelector already initialized, even if the targetallocator config is not setting these selectors (or have nil selectors), the default behavior will be to match all namespaces which is the current behavior. But I can add defaults, in case you think otherwise. Please let me know.
The following tests also depict the behavior with no selectors in the config vs with selectors.
https://github.com/open-telemetry/opentelemetry-operator/blob/883b583c4c93038d6fddc55b192659c4aa90f0fe/cmd/otel-allocator/watcher/promOperator_test.go#L91C4-L91C34
https://github.com/open-telemetry/opentelemetry-operator/blob/883b583c4c93038d6fddc55b192659c4aa90f0fe/cmd/otel-allocator/watcher/promOperator_test.go#L505C3-L509C6

"ServiceMonitorNamespaceSelector": w.serviceMonitorNamespaceSelector,
} {

sync, err := k8sutil.LabelSelectionHasChanged(old.Labels, cur.Labels, selector)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you confirm my understanding here: this event handler waits for any namespace changes in which case we check if one of our namespace selectors has changed as well in which case we send an event to the notifyEvents channel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This event handler watches for any namespace changes and then calls this method - LabelSelectionHasChanged to basically make sure that the TA's selector yields the same results with old and new labels, if it does then an event is not sent and the new labels lead to a change in behavior then an event is sent.
For ex- If the cfg.podMonitorNamespaceSelector was set to match namespaces with labels "testns" : "testns" and the initial set of namespace labels on for example - namespace1 were "testns" : "testns" and later if the labels were removed, then an event would be sent. But if later, the labels on the namespace were updated to
{
"testns" : "testns",
"testns1": "testns1",
}
it wouldnt send an event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks for the great explanation! 🙇

@rashmichandrashekar
Copy link
Contributor Author

A few nits/ questions. Does this change how our existing label selector works for the target allocator? I'm wondering if we should also write a new e2e or two to verify this works as expected. cc @swiatekm-sumo

Sure @jaronoff97. This doesnt change the existing behavior of the selectors. This just adds a couple other selectors just like in prometheus-operator. I did add unit tests for all these selectors to make sure that the selectors result in the right configuration and select the right pod and svc monitors as well as selectors on namespaces result in picking up pod and svc monitors in the matching namespaces. please lmk if e2e are also necessary.

Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this change, I really like the watcher cleanup. I do have a concern about this being a breaking change in its current state though - see my comment on the namespace informer.

cmd/otel-allocator/watcher/promOperator.go Outdated Show resolved Hide resolved
@jaronoff97
Copy link
Contributor

@rashmichandrashekar we just talked about this at the operator SIG and want to attempt to get this in prior to our next release. In order to do so, I opened up #2426 to see if we can ensure that we are not going to break users by checking they have the new permission for the TA to function. We're going to pause merging this until we have an answer there.

@rashmichandrashekar
Copy link
Contributor Author

rashmichandrashekar commented Dec 7, 2023 via email

@jaronoff97
Copy link
Contributor

@rashmichandrashekar i've merged in my PR so now you should be able to fix the conflicts and update the necessary TA permissions :)

@rashmichandrashekar
Copy link
Contributor Author

@rashmichandrashekar i've merged in my PR so now you should be able to fix the conflicts and update the necessary TA permissions :)

Thanks so much @jaronoff97 for unblocking this PR. I will get the latest and test it out and update this PR.

@rashmichandrashekar
Copy link
Contributor Author

@rashmichandrashekar i've merged in my PR so now you should be able to fix the conflicts and update the necessary TA permissions :)

Thanks so much @jaronoff97 for unblocking this PR. I will get the latest and test it out and update this PR.

@jaronoff97 - I have updated the PR by fixing conflicts and adding permissions check for namespaces in the webhook. Please take a look. Thanks!

Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me overall, modulo some relatively minor problems.

Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'd like to use LabelSelectors directly in configuration instead of repeating the mistake we made with existing ones. @jaronoff97 can you also have a final look?

cmd/otel-allocator/config/config.go Outdated Show resolved Hide resolved
Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks a lot for doing this, and your patience in accepting our feedback.

@swiatekm swiatekm requested a review from jaronoff97 January 20, 2024 10:36
Copy link
Contributor

@jaronoff97 jaronoff97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely fantastic work here. Thank you so much for your patience and contribution. 🎉 the added tests are also great to see.

@jaronoff97 jaronoff97 merged commit 3ee6aad into open-telemetry:main Jan 22, 2024
27 checks passed
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
…invalid ones (open-telemetry#2328)

* fix for adding validation

* removing unused references

* adding tests

* adding some changes

* cleaning up

* adding change log

* running goimports and adding return value check code for namespace informer

* fixing lint error

* fixing tests and comment

* adding permissions for e2e tests

* adding cluster roles instead of roles

* updaintg readme

* fixing comments

* adding contant to same block

* fixing lint errors

* running go import

* adding namespaces since that is required for informer

* adding extected warnings

* addressing comments

* adding test for namespace label update

* fixing goimports

* making namespaceselectores as labelselectors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants