Fix resetting service type to default when not specified #8165

Merged
merged 13 commits into main on Oct 31, 2024

Conversation

@thbkrkr (Contributor) commented Oct 29, 2024

These changes ensure that when the type of an HTTP(S) external service declaration is deleted from an Elastic resource definition, the service type is properly reset to its default value.

To test: create a LoadBalancer service, then reapply the manifest without the service type; the HTTP service should be reset to ClusterIP.

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: c0
spec:
  version: 8.15.3
  http:
    service:
      metadata:
        annotations:
          answer: "42"
      spec:
        #type: LoadBalancer
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false

Relates to #8161.


This can be broken down into 2 changes (originally 3; see Edit 1 below):

  • When the type of an external service is not set, it defaults to ClusterIP (see the Go sketch after the error log below). This seems mandatory to me to be able to detect that the type must be reset to ClusterIP when no service type is configured while the server-side version of the service is still typed LoadBalancer. It's OK to make this change because ClusterIP is already the default value of the Type field (v1/types.go), so it will not change the current type for existing clusters that have not set the type.
  • When a resource is reconciled and needs to be recreated, reconciliation is skipped if the resource is being deleted. This is because the deletion of the service can take many seconds (e.g. GCP LB deletion takes >30s). Without this, the reconciliation fails many times (10 in 30s with the exponential backoff) with the error object is being deleted: services "mycluster-es-http" already exists, until the resource is deleted.
2024-10-29T20:25:33.611+0100    ERROR   manager.eck-operator    Reconciler error       {
    "service.version": "2.16.0-SNAPSHOT+1dbb7572",  "controller": "elasticsearch-controller",
    "object": {"name": "c0", "namespace": "lab"},
    "namespace": "lab", "name": "c0", "reconcileID": "bb3f9bf3-0ef4-45f6-be9d-a73c94f01d93",
    "error": "object is being deleted: services \"c0-es-http\" already exists",
    "errorCauses": [{"error": "object is being deleted: services \"c0-es-http\" already exists"}]
}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
        /Users/krkr/dev/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
        /Users/krkr/dev/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
        /Users/krkr/dev/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224
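
For illustration, a minimal sketch of the defaulting described in the first change (the helper name defaultServiceType is hypothetical, not the actual code):

package common

import corev1 "k8s.io/api/core/v1"

// defaultServiceType makes the ClusterIP default explicit: when no type is set
// in the expected service spec, fall back to ClusterIP, which is also the
// Kubernetes API default for Service.Spec.Type (v1/types.go). With the default
// explicit, comparing the expected service against the reconciled one detects
// the LoadBalancer -> ClusterIP reset when the user removes the type.
func defaultServiceType(svc *corev1.Service) {
    if svc.Spec.Type == "" {
        svc.Spec.Type = corev1.ServiceTypeClusterIP
    }
}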

Edit 1: I reverted (d38519e):

  • When the type of a service changes, server-side changes are not applied. We may lose some cases where we could update instead of recreate, but I think it avoids cases where certain fields make no sense with the new type.

Not sure that we need it. In fact, a service is already recreated when the type changes from LB->CIP and from CIP->LB, because we don't apply the ClusterIP coming from the server side when the type changes.

Edit 2: Still, if you do the switch CIP->LB->CIP quickly enough before getting a ClusterIP (OK, this is contrived, but) and skip the recreation, you could see a weird transient error like:

    "error": "Service \"c0-es-http\" is invalid: [
        spec.allocateLoadBalancerNodePorts: Forbidden: may only be used when `type` is 'LoadBalancer',
        spec.externalTrafficPolicy: Invalid value: \"Cluster\": may only be set for externally-accessible services]"

@thbkrkr added the >bug Something isn't working label Oct 29, 2024
@thbkrkr marked this pull request as ready for review October 29, 2024 22:30
@thbkrkr requested review from pebrc and barkbay October 29, 2024 22:34
@thbkrkr (Contributor, Author) commented Oct 29, 2024

I see another bug with NodePort -> ClusterIP. We detect that the service needs to be recreated, but we apply the server-side values to the expected service, which is then copied into the object to be created:

// Copy the content of params.Expected into params.Reconciled.
// Unfortunately it's not straightforward to change the value of an interface underlying pointer,
// so we need a small bit of reflection here.
// This will panic if params.Expected and params.Reconciled don't have the same underlying type.
expectedCopyValue := reflect.ValueOf(params.Expected.DeepCopyObject()).Elem()
reflect.ValueOf(params.Reconciled).Elem().Set(expectedCopyValue)
// Create the object, which modifies params.Reconciled in-place
err = params.Client.Create(params.Context, params.Reconciled)

This results in trying to create a ClusterIP service with externalTrafficPolicy set, and boom:

2024-10-29T23:46:48.632+0100    INFO    elasticsearch-controller        Deleting resource as it cannot be updated, it will be recreated {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "46", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:48.827+0100    INFO    elasticsearch-controller        Deleted resource successfully   {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "46", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:48.827+0100    INFO    elasticsearch-controller        Creating resource       {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "46", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:49.186+0100    INFO    elasticsearch-controller        Ending reconciliation run       {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "46", "namespace": "lab", "es_name": "c0", "took": "554.2575ms"}
2024-10-29T23:46:49.186+0100    ERROR   manager.eck-operator    Reconciler error        {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "controller": "elasticsearch-controller", "object": {"name":"c0","namespace":"lab"}, "namespace": "lab", "name": "c0", "reconcileID": "f456219b-cdab-4138-945f-00fc1da75b87", "error": "Service \"c0-es-http\" is invalid: spec.externalTrafficPolicy: Invalid value: \"Cluster\": may only be set for externally-accessible services", "errorCauses": [{"error": "Service \"c0-es-http\" is invalid: spec.externalTrafficPolicy: Invalid value: \"Cluster\": may only be set for externally-accessible services"}]}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
        /Users/krkr/dev/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
        /Users/krkr/dev/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
        /Users/krkr/dev/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224
2024-10-29T23:46:49.187+0100    INFO    elasticsearch-controller        Starting reconciliation run     {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0"}
2024-10-29T23:46:49.188+0100    INFO    elasticsearch-controller        Creating resource       {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:49.390+0100    INFO    elasticsearch-controller        Created resource successfully   {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0", "kind": "Service", "namespace": "lab", "name": "c0-es-http"}
2024-10-29T23:46:50.223+0100    INFO    elasticsearch-controller        Ensuring no voting exclusions are set   {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0", "namespace": "lab", "es_name": "c0"}
2024-10-29T23:46:51.098+0100    INFO    elasticsearch-controller        Ending reconciliation run       {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "47", "namespace": "lab", "es_name": "c0", "took": "1.911611416s"}
2024-10-29T23:46:51.099+0100    INFO    elasticsearch-controller        Starting reconciliation run     {"service.version": "2.16.0-SNAPSHOT+ecaf23ef", "iteration": "48", "namespace": "lab", "es_name": "c0"}

We could return after the deletion instead of creating directly, but that causes downtime for the service until the next reconciliation performs the creation, which is not great.

I think this gives a good reason to skip applying server side values when the type changes, no?

diff --git a/pkg/controller/common/service_control.go b/pkg/controller/common/service_control.go
index 551552b23..12a957888 100644
--- a/pkg/controller/common/service_control.go
+++ b/pkg/controller/common/service_control.go
@@ -88,6 +88,11 @@ func needsUpdate(expected *corev1.Service, reconciled *corev1.Service) bool {

 // applyServerSideValues applies any default that may have been set from the reconciled version.
 func applyServerSideValues(expected, reconciled *corev1.Service) {
+       // skip if the service type changes from something different to the default ClusterIP value.
+       if reconciled.Spec.Type != corev1.ServiceTypeClusterIP && expected.Spec.Type != reconciled.Spec.Type {
+               return
+       }
+
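
Restated as a standalone predicate (a sketch using the same names as the diff above; shouldApplyServerSideValues is a hypothetical helper, not the actual code):

package common

import corev1 "k8s.io/api/core/v1"

// shouldApplyServerSideValues restates the guard from the diff: merge
// server-side defaults into the expected service only when the reconciled
// type is the default ClusterIP or the types already match. Otherwise,
// fields tied to the old type (e.g. externalTrafficPolicy set for a
// LoadBalancer) would leak into the recreated service.
func shouldApplyServerSideValues(expected, reconciled *corev1.Service) bool {
    return reconciled.Spec.Type == corev1.ServiceTypeClusterIP ||
        expected.Spec.Type == reconciled.Spec.Type
}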

return err
}
if !params.Reconciled.GetDeletionTimestamp().IsZero() {
log.Info("Waiting for resource to be created because the old one is being deleted")
Contributor:

"Waiting" seems a bit misleading to me. With this change, it is up to the caller to detect that the resource has actually not been reconciled and to try again later. By "swallowing"/"hiding" the conflict and returning nil here, we are assuming that the caller is going to make another attempt, but we can't be sure of that. This makes me feel that it should be up to the caller to decide what to do/log in case of a conflict, by using apierrors.IsAlreadyExists().

Contributor Author:

You are right, we can't be 100% sure that the caller will retry. Your proposal looks like a good idea.

This means we will delete and (try to) create until the deletion is effective.

The only thing that bothers me is that, with the current log messages, it can be a little confusing.

From LoadBalancer to ClusterIP:

Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Deleting resource as it cannot be updated, it will be recreated
Deleted resource successfully
Creating resource
Creating resource
Created resource successfully
Ensuring no voting exclusions are set
Ensuring no voting exclusions are set
Ensuring no voting exclusions are set

Contributor:

The only thing that bothers me is that, with the current log messages, it can be a little confusing.

Is it when we requeue immediately, or when we just ignore the error in the driver? Maybe we could add an additional log in that specific case.

Contributor Author:

It is when we ignore the error and then requeue in the driver (at the caller level), see 0218fd6.
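
A hedged sketch of that caller-side handling (requeueIfAlreadyExists is a hypothetical stand-in for the actual call site in 0218fd6):

package driver

import (
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// requeueIfAlreadyExists sketches the agreed approach: instead of swallowing
// the conflict inside the resource reconciler, the caller detects it with
// apierrors.IsAlreadyExists and requeues, retrying until the old service is
// fully deleted and the new one can be created.
func requeueIfAlreadyExists(err error) (reconcile.Result, error) {
    if apierrors.IsAlreadyExists(err) {
        // The old service is still being deleted; retry shortly.
        return reconcile.Result{Requeue: true}, nil
    }
    return reconcile.Result{}, err
}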

@barkbay (Contributor) commented Oct 30, 2024

I think this gives a good reason to skip applying server side values when the type changes, no?

Yes, I think it makes sense.

@barkbay (Contributor) left a review comment:

LGTM

@thbkrkr merged commit e011888 into main Oct 31, 2024
5 checks passed
@thbkrkr deleted the fix-reset-service-type branch October 31, 2024 11:13
@thbkrkr changed the title from "Fix reset service type" to "Fix resetting service type to default when not specified" Oct 31, 2024
thbkrkr added a commit that referenced this pull request Oct 31, 2024
These changes ensure that when the type of an HTTP(S) external service declaration is deleted from an Elastic resource definition, the service type is properly reset to the default ClusterIP value.

This can be broken down into 3 changes:
- When the type of an external service is not set, it defaults to ClusterIP. This seems mandatory in order to detect that the type must be reset to ClusterIP when no service type is configured while the server-side version of the service is still typed LoadBalancer. It's OK to make this change because ClusterIP is already the default value of the Type field (v1/types.go), so it will not change the current type for existing clusters that have not set the type.
- When an external service is reconciled, we requeue if we get an alreadyExists error. This is because the deletion of the service can take many seconds (e.g. GCP LB deletion takes >30s), resulting in a creation attempt while the resource is still being deleted.
- When the type of a service changes and the current server-side type is not the default ClusterIP value, server-side changes are not applied. We may lose some cases where we could update instead of recreate, but it avoids cases where certain fields make no sense with the new type.
thbkrkr added a commit that referenced this pull request Oct 31, 2024 (same commit message as above).
Labels
>bug Something isn't working, v2.15.0

3 participants