Fix resetting service type to default when not specified #8165
…LoadBalancer type to ClusterIP
I see another bug. cloud-on-k8s/pkg/controller/common/reconciler/reconciler.go Lines 100 to 107 in 645f750
This results in trying to create a
2024-10-29T23:46:48.632+0100 INFO elasticsearch-controller Deleting resource as it cannot be updated, it will be recreated {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "46", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:48.827+0100 INFO elasticsearch-controller Deleted resource successfully {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "46", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:48.827+0100 INFO elasticsearch-controller Creating resource {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "46", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:49.186+0100 INFO elasticsearch-controller Ending reconciliation run {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "46", "namespace": "lab", "es_name": "c0", "took": "554.2575ms"}
2024-10-29T23:46:49.186+0100 ERROR manager.eck-operator Reconciler error {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "controller": "elasticsearch-controller", "object": {"name":"c0","namespace":"lab"}, "namespace": "lab", "name": "c0", "reconcileID": "f456219b-cdab-4138-945f-00fc1da75b87", "error": "Service \"c0-es-http\" is invalid: spec.externalTrafficPolicy: Invalid value: \"Cluster\": may only be set for externally-accessible services", "errorCauses": [{"error": "Service \"c0-es-http\" is invalid: spec.externalTrafficPolicy: Invalid value: \"Cluster\": may only be set for externally-accessible services"}]}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
/Users/krkr/dev/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
/Users/krkr/dev/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
/Users/krkr/dev/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224
2024-10-29T23:46:49.187+0100 INFO elasticsearch-controller Starting reconciliation run {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0"}
2024-10-29T23:46:49.188+0100 INFO elasticsearch-controller Creating resource {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:49.390+0100 INFO elasticsearch-controller Created resource successfully {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:50.223+0100 INFO elasticsearch-controller Ensuring no voting exclusions are set {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0", "namespace": "lab", "es_name": "c0"}
2024-10-29T23:46:51.098+0100 INFO elasticsearch-controller Ending reconciliation run {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0", "took": "1.911611416s"}
2024-10-29T23:46:51.099+0100 INFO elasticsearch-controller Starting reconciliation run {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "48", "namespace": "lab", "es_name": "c0"}
We could return after the deletion instead of creating directly but it causes a downtime of the service until the next reconciliation happens to do the creation, not great. I think this gives a good reason to skip applying server side values when the type changes, no?
diff --git a/pkg/controller/common/service_control.go b/pkg/controller/common/service_control.go
index 551552b23..12a957888 100644
--- a/pkg/controller/common/service_control.go
+++ b/pkg/controller/common/service_control.go
@@ -88,6 +88,11 @@ func needsUpdate(expected *corev1.Service, reconciled *corev1.Service) bool {
// applyServerSideValues applies any default that may have been set from the reconciled version.
func applyServerSideValues(expected, reconciled *corev1.Service) {
+ // skip if the service type changes from something different to the default ClusterIP value.
+ if reconciled.Spec.Type != corev1.ServiceTypeClusterIP && expected.Spec.Type != reconciled.Spec.Type {
+ return
+ }
+
return err
}
if !params.Reconciled.GetDeletionTimestamp().IsZero() {
log.Info("Waiting for resource to be created because the old one is being deleted")
"Waiting" seems a bit misleading to me. With this change it is up to the caller to detect that the resource has actually not been reconciled, and try again later. By "swallowing"/"hiding" the conflict and returning nil
here, we are assuming that the caller is going to make another attempt, but we can't be sure of that. This makes me feel that it should be up to the caller to decide what to do/log in case of a conflict, by using apierrors.IsAlreadyExists().
You are right, we can't be 100% sure that the caller will retry. Your proposal looks like a good idea.
This means we will delete and (try to) create until the deletion is effective.
The only thing that tickles me is that with the current log messages it can be a little confusing.
From LoadBalancer to ClusterIP:
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Creating resource
Created resource successfully
Ensuring no voting exclusions are set
Ensuring no voting exclusions are set
Ensuring no voting exclusions are set
The only thing that tickles me is that with the current log messages it can be a little confusing.
Is it when we requeue immediately, or when we just ignore the error in the driver? Maybe we could add an additional log in that specific case.
It is when we ignore the error and then requeue in the driver (at the caller level), see 0218fd6.
Yes, I think it makes sense.
…or at the caller level
… different to the default ClusterIP value
LGTM
These changes ensure that when deleting the type of an http(s) external service declaration in an Elastic resource definition, the service type is properly reset to the default ClusterIP value. This can be broken down into 3 changes:
- When the type of an external service is not set, it defaults to ClusterIP. That seems mandatory to me to be able to detect that the service must be reset to ClusterIP when no service is configured while a version of the service typed LoadBalancer still exists on the server side. It's ok to make this change because ClusterIP is already the default value of the Type field (v1/types.go), so it will not change the current type for existing clusters that have not set the type.
- When an external service is reconciled, we requeue if we get an alreadyExists error. This is because the deletion of the service can take many seconds (e.g. GCP LB deletion takes >30s), so the creation can be attempted while the resource is still being deleted.
- When the type of a service changes from something different to the default ClusterIP value, server side changes are not applied. We may lose some cases where we could update instead of recreate, but it avoids cases where certain fields make no sense with the new type.
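To make the first and third changes concrete, here is a small sketch. It assumes simplified local stand-ins for the `corev1` types (`ServiceType`, `ServiceSpec`) and a hypothetical `setDefaultType` helper; the guard in `applyServerSideValues` mirrors the diff quoted earlier in the conversation, and `ExternalTrafficPolicy` is used as the example server-side value because it is the field rejected in the error log above.

```go
package main

import "fmt"

// ServiceType and ServiceSpec are local stand-ins for the corev1 types,
// so the sketch runs without Kubernetes dependencies.
type ServiceType string

const (
	ServiceTypeClusterIP    ServiceType = "ClusterIP"
	ServiceTypeLoadBalancer ServiceType = "LoadBalancer"
)

type ServiceSpec struct {
	Type                  ServiceType
	ExternalTrafficPolicy string
}

// setDefaultType is a hypothetical helper: an unset type becomes ClusterIP,
// which makes a LoadBalancer -> (unset) change detectable.
func setDefaultType(spec *ServiceSpec) {
	if spec.Type == "" {
		spec.Type = ServiceTypeClusterIP
	}
}

// applyServerSideValues copies server-populated defaults into the expected
// spec, except when the type changes away from a non-ClusterIP type.
func applyServerSideValues(expected, reconciled *ServiceSpec) {
	if reconciled.Type != ServiceTypeClusterIP && expected.Type != reconciled.Type {
		return
	}
	if expected.ExternalTrafficPolicy == "" {
		expected.ExternalTrafficPolicy = reconciled.ExternalTrafficPolicy
	}
}

func main() {
	// Server side: the service is still a LoadBalancer with a defaulted policy.
	reconciled := ServiceSpec{Type: ServiceTypeLoadBalancer, ExternalTrafficPolicy: "Cluster"}
	// Desired state: the user removed the service type from the resource.
	expected := ServiceSpec{}
	setDefaultType(&expected)
	applyServerSideValues(&expected, &reconciled)
	fmt.Printf("%s %q\n", expected.Type, expected.ExternalTrafficPolicy) // prints: ClusterIP ""
}
```

Without the guard, the expected ClusterIP service would inherit `ExternalTrafficPolicy: "Cluster"` from the LoadBalancer version, which is the combination the API server rejects as "may only be set for externally-accessible services".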
These changes ensure that when deleting the type of an http(s) external service declaration in an elastic resource definition, the service type is properly reset to a default value.
To test: after creating a LoadBalancer service, then reapplying without the service type, the http service should be reset to ClusterIP.
Relates to #8161.
This can be broken down into 2 changes:
- When the type of an external service is not set, it defaults to ClusterIP. That seems mandatory to me to be able to detect that the service must be reset to ClusterIP when no service is configured while a version of the service typed LoadBalancer still exists on the server side. It's ok to make this change because ClusterIP is already the default value of the Type field (v1/types.go), so it will not change the current type for existing clusters that have not set the type.
- … object is being deleted: services "mycluster-es-http" already exists many times (10 in 30s with the exponential backoff) until the resource is deleted.
Edit 1: I reverted (d38519e):
Not sure that we need it. In fact, a service is already recreated when type changes from LB->CIP and CIP->LB because we don't apply ClusterIP coming from the server side when the type changes.
Edit 2: Still, if you switch CIP->LB->CIP quickly enough, before getting a ClusterIP (ok, this is stupid, but), and the recreation is skipped, you could see some weird transient error like: