
serverSideDiff error when removing block from spec.containers[] in Helm chart #20792

Closed
thecosmicfrog opened this issue Nov 13, 2024 · 40 comments · Fixed by #20842
Labels
bug Something isn't working component:helm component:server-side-apply version:2.13 Latest confirmed affected version is 2.13

Comments

@thecosmicfrog

thecosmicfrog commented Nov 13, 2024

Checklist:

  • [y] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [y] I've included steps to reproduce the bug.
  • [y] I've pasted the output of argocd version.

Describe the bug

I am seeing an error in Argo CD when upgrading a Helm chart from one version to another. The only difference between the Helm chart versions is that the new version removes the resources block from spec.template.spec.containers[0] in the Deployment object. I have noticed that removing other blocks (e.g. env) results in the same issue (so it is not just a resources problem).
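For reference, the removed block looks like this (values taken from the live manifest further down in this issue, shown as a diff against the 0.0.1 container spec):

```
       containers:
         - name: http-echo
-          resources:
-            requests:
-              cpu: 100m
-              memory: 32Mi
```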

The specific error is:

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource Deployment/test-app: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

Additional details:

  • The only sync option we have enabled is ServerSideApply=true.
  • For compare options, ServerSideDiff is enforced on the server side by setting controller.diff.server.side: true in the argo-helm chart values (see the values sketch after this list).
  • After an indeterminate period of time (a few minutes?), the app status goes to Healthy/Synced/Sync OK, which initially suggests it "fixed itself". However, the "Last Sync" version is still the old Helm chart version, and clicking Refresh re-triggers the error. Triggering a manual Sync appears to actually fix the issue.
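For reference, this is roughly how server-side diff is enabled in our argo-helm values (a sketch - the exact nesting depends on the chart version; it ultimately sets controller.diff.server.side: "true" in the argocd-cmd-params-cm ConfigMap):

```
configs:
  params:
    controller.diff.server.side: true
```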

To Reproduce

I have built a Helm chart to reproduce this, with two versions (0.0.1 and 0.0.2). Updating will not work, as you will see.

  • Prerequisites:

    • Argo CD installed to the kube-system namespace.
    • Server-side diff enabled (controller.diff.server.side: true).
    • Another namespace created called bug (to host the k8s objects).
  • Create a file called application.yaml to represent our Argo CD Application object:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: test-app
    spec:
      project: default
      source:
        repoURL: https://thecosmicfrog.github.io/helm-charts/
        path: .
        targetRevision: 0.0.1  # Baseline chart version:
                               # https://github.com/thecosmicfrog/helm-charts/tree/0.0.1/charts/test-app
        helm:
          releaseName: test-app
        chart: test-app
      destination:
        server: https://kubernetes.default.svc
        namespace: bug
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - ServerSideApply=true
      revisionHistoryLimit: 10
  • Apply this to the cluster:
    kubectl apply -n kube-system -f application.yaml

  • Pods should come up without issue and everything should be correctly synced - as expected for a simple Deployment.

  • Update application.yaml to bump the Helm chart to the new version:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: test-app
    spec:
      project: default
      source:
        repoURL: https://thecosmicfrog.github.io/helm-charts/
        path: .
    -   targetRevision: 0.0.1  # Baseline chart version:
    -                          # https://github.com/thecosmicfrog/helm-charts/tree/0.0.1/charts/test-app
    +   targetRevision: 0.0.2  # New chart with `spec.template.spec.containers[0].resources` block removed:
    +                          # https://github.com/thecosmicfrog/helm-charts/compare/0.0.1...0.0.2
        helm:
          releaseName: test-app
        chart: test-app
      destination:
        server: https://kubernetes.default.svc
        namespace: bug
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - ServerSideApply=true
      revisionHistoryLimit: 10
  • Apply this to the cluster:
    kubectl apply -n kube-system -f application.yaml


Expected behavior

The new chart version should install without error, as it is such a straightforward change (resources block removed from containers[0] block).


Actual behavior

Sync Status enters an Unknown state with the new chart version, and App Conditions displays 1 Error. That error is:

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource Deployment/test-app: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

Seemingly, the only way to complete the update is to sync manually, which works without issue. We have Auto Sync enabled, so I'm not sure why that does not resolve the issue.
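For completeness, the manual sync that clears this can also be triggered from the CLI instead of the UI (a minimal sketch, assuming the argocd CLI is logged into this instance):

```
argocd app sync test-app
```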


Screenshots

  • Argo CD UI after applying chart 0.0.2:

    Screenshot 2024-11-13 at 17 31 24


  • Several (5-ish?) minutes later - with no intervention from me - the error "appears" to be resolved... but note the versions do not match between the two sync fields:

    Screenshot 2024-11-13 at 17 38 01


  • Then, clicking Refresh...

    Screenshot 2024-11-13 at 17 40 17


  • ...results in the same 1 Error outcome as before...

    Screenshot 2024-11-13 at 17 41 49


  • ...and the Pods present on the cluster are still from the "old" (0.0.1) chart version:

    Screenshot 2024-11-13 at 17 42 29


  • The only way to fix it is to manually Sync:

    Screenshot 2024-11-13 at 17 59 30


  • Which finally brings the app into sync at 0.0.2:

    Screenshot 2024-11-13 at 18 02 42


Version

2024/11/13 14:23:55 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
argocd: v2.13.0+347f221
  BuildDate: 2024-11-04T15:31:13Z
  GitCommit: 347f221adba5599ef4d5f12ee572b2c17d01db4d
  GitTreeState: clean
  GoVersion: go1.23.2
  Compiler: gc
  Platform: darwin/arm64
WARN[0001] Failed to invoke grpc call. Use flag --grpc-web in grpc calls. To avoid this warning message, use flag --grpc-web.
argocd-server: v2.13.0+347f221
  BuildDate: 2024-11-04T12:09:06Z
  GitCommit: 347f221adba5599ef4d5f12ee572b2c17d01db4d
  GitTreeState: clean
  GoVersion: go1.23.1
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.4.3 2024-07-19T16:40:33Z
  Helm Version: v3.15.4+gfa9efb0
  Kubectl Version: v0.31.0
  Jsonnet Version: v0.20.0

Logs

Failed to compare desired state to live state: failed to calculate diff: error calculating server side diff: serverSideDiff error: error removing non config mutations for resource Deployment/test-app: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)

Let me know if the above is enough information to reproduce the issue.

Thanks for your time - Aaron

@thecosmicfrog thecosmicfrog added the bug Something isn't working label Nov 13, 2024
@andrii-korotkov-verkada
Contributor

Can you share the Helm charts, please? In particular, the container spec.

@andrii-korotkov-verkada andrii-korotkov-verkada added the more-information-needed Further information is requested label Nov 14, 2024
@thecosmicfrog
Author

@andrii-korotkov-verkada Yes, of course. You can find the chart here:
https://github.com/thecosmicfrog/helm-charts/tree/main/charts%2Ftest-app

There are also corresponding Git tags for 0.0.1 and 0.0.2.

@andrii-korotkov-verkada andrii-korotkov-verkada removed the more-information-needed Further information is requested label Nov 14, 2024
@andrii-korotkov-verkada
Contributor

When this happens, can you look at the current vs desired manifest, please?
It's probably a bug in removing webhook fields for comparison (unless you actually have a webhook in the cluster that does it), but worth a shot.

@andrii-korotkov-verkada andrii-korotkov-verkada added the more-information-needed Further information is requested label Nov 14, 2024
@thecosmicfrog
Author

@andrii-korotkov-verkada The diff seems to be in a bit of a "confused" state from what I can see. See screenshots below.

  • Live Manifest for the Deployment object shows the resources block, as expected.
Screenshot 2024-11-14 at 14 49 02



  • Desired Manifest is missing the resources block (again, as expected).
Screenshot 2024-11-14 at 14 49 32



  • But, the Diff tab is completely blank.
Screenshot 2024-11-14 at 14 46 44



I hope that helps. Let me know what additional information I can provide.

@andrii-korotkov-verkada
Contributor

Can you copy-paste the whole manifests, please? Sorry for bugging you - I'm looking for anything out of line and need the full manifests to check.

@thecosmicfrog
Author

@andrii-korotkov-verkada No problem at all! Here are the Deployment manifests.

Live Manifest (managed fields unhidden):

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '3'
  creationTimestamp: '2024-11-13T17:14:51Z'
  generation: 3
  labels:
    app.kubernetes.io/instance: test-app
    argocd.argoproj.io/instance: test-app
  managedFields:
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            f:app.kubernetes.io/instance: {}
            f:argocd.argoproj.io/instance: {}
        f:spec:
          f:minReadySeconds: {}
          f:progressDeadlineSeconds: {}
          f:replicas: {}
          f:revisionHistoryLimit: {}
          f:selector: {}
          f:template:
            f:metadata:
              f:labels:
                f:app.kubernetes.io/instance: {}
            f:spec:
              f:containers:
                k:{"name":"http-echo"}:
                  .: {}
                  f:env:
                    k:{"name":"PORT"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"VERSION"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:livenessProbe:
                    f:httpGet:
                      f:path: {}
                      f:port: {}
                  f:name: {}
                  f:ports:
                    k:{"containerPort":5678,"protocol":"TCP"}:
                      .: {}
                      f:containerPort: {}
                      f:name: {}
                      f:protocol: {}
                  f:readinessProbe:
                    f:httpGet:
                      f:path: {}
                      f:port: {}
                  f:resources:
                    f:requests:
                      f:cpu: {}
                      f:memory: {}
                  f:securityContext:
                    f:allowPrivilegeEscalation: {}
                    f:capabilities:
                      f:drop: {}
                    f:privileged: {}
                  f:startupProbe:
                    f:httpGet:
                      f:path: {}
                      f:port: {}
              f:hostIPC: {}
              f:hostNetwork: {}
              f:hostPID: {}
              f:securityContext:
                f:fsGroup: {}
                f:runAsGroup: {}
                f:runAsNonRoot: {}
                f:runAsUser: {}
                f:seccompProfile:
                  f:type: {}
              f:terminationGracePeriodSeconds: {}
      manager: argocd-controller
      operation: Apply
      time: '2024-11-14T14:43:51Z'
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:deployment.kubernetes.io/revision: {}
        f:status:
          f:availableReplicas: {}
          f:conditions:
            .: {}
            k:{"type":"Available"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Progressing"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:observedGeneration: {}
          f:readyReplicas: {}
          f:replicas: {}
          f:updatedReplicas: {}
      manager: kube-controller-manager
      operation: Update
      subresource: status
      time: '2024-11-14T14:44:34Z'
  name: test-app
  namespace: sandbox-aaron
  resourceVersion: '200923886'
  uid: 7b7349f2-9000-4fe3-a443-9eb4e1a1a659
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 300
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: test-app
    spec:
      containers:
        - env:
            - name: VERSION
              value: 0.0.1
            - name: PORT
              value: '5678'
          image: hashicorp/http-echo
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 5678
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: http-echo
          ports:
            - containerPort: 5678
              name: http
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 5678
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            requests:
              cpu: 100m
              memory: 32Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            privileged: false
          startupProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 5678
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      terminationGracePeriodSeconds: 80
status:
  availableReplicas: 2
  conditions:
    - lastTransitionTime: '2024-11-14T06:03:11Z'
      lastUpdateTime: '2024-11-14T06:03:11Z'
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: 'True'
      type: Available
    - lastTransitionTime: '2024-11-13T17:14:51Z'
      lastUpdateTime: '2024-11-14T14:44:34Z'
      message: ReplicaSet "test-app-74dfc69c76" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
  observedGeneration: 3
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2



Desired Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: test-app
    argocd.argoproj.io/instance: test-app
  name: test-app
  namespace: sandbox-aaron
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 300
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-app
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: test-app
    spec:
      containers:
        - env:
            - name: VERSION
              value: 0.0.2
            - name: PORT
              value: '5678'
          image: hashicorp/http-echo
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /
              port: 5678
          name: http-echo
          ports:
            - containerPort: 5678
              name: http
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /
              port: 5678
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            privileged: false
          startupProbe:
            httpGet:
              path: /
              port: 5678
      hostIPC: false
      hostNetwork: false
      hostPID: false
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      terminationGracePeriodSeconds: 80

Let me know if you'd like me to re-post with the managed fields hidden, or anything else. Thanks!

@andrii-korotkov-verkada andrii-korotkov-verkada removed the more-information-needed Further information is requested label Nov 14, 2024
@andrii-korotkov-verkada
Contributor

Hm, I don't see anything obviously wrong at the moment.

@thecosmicfrog
Author

Hm, I don't see anything obviously wrong at the moment.

Indeed. Notably, setting IncludeMutationWebhook=true in the SyncOptions appears to resolve the issue, but this doesn't seem like it should be necessary for such a simple change (removal of a block from containers[])? Hence my hesitation to proceed with setting that flag.
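For reference, my understanding from the diff-strategies docs is that this workaround is applied as an annotation on the Application, roughly like below (note the docs appear to spell it IncludeMutationWebhooks):

```
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test-app
  annotations:
    argocd.argoproj.io/compare-options: IncludeMutationWebhooks=true
```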

Are you able to reproduce on your side? I believe the instructions and charts I provided should be enough to do so, but please advise if I can provide anything else.

@andrii-korotkov-verkada
Contributor

One more thing - do you have mutating webhooks set up in the cluster?

@thecosmicfrog
Author

We have three MutatingWebhooks:

  • aws-load-balancer-webhook
  • pod-identity-webhook
  • vpc-resource-mutating-webhook

As I understand it, all are part of Amazon EKS and the AWS Load Balancer Controller.

@thecosmicfrog
Author

thecosmicfrog commented Nov 15, 2024

Hi @andrii-korotkov-verkada. I have some additional information which should help you in finding the root cause of this.

I use Argo Rollouts for most of my applications (thus I use Rollout objects instead of Deployment). But the original error was triggered for an app using a Deployment, which is why that is what I used in my reproduction Helm charts.

Out of curiosity, I decided to see if the same error would trigger when using a Rollout. I figured it would, since that is mostly a drop-in replacement for Deployment, but, to my surprise, it seems to work without issue!

Please see my latest chart versions:

  • 0.3.1: This chart is essentially the same as 0.0.1, but using a Rollout instead of a Deployment.
  • 0.3.2: This is 0.3.1 with the resources block removed from spec.template.spec.containers[0].

See the code for 0.3.1 and 0.3.2 here. I had to add two very basic Service objects, as these are required by Argo Rollouts, but you can likely just ignore them.

The chart artifacts are built and uploaded as before, so you can simply update spec.source.targetRevision in the application.yaml file I provided in the original post and kubectl apply.
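i.e. against the application.yaml from the original post, the only change needed is the targetRevision (sketch):

```
     source:
       repoURL: https://thecosmicfrog.github.io/helm-charts/
-      targetRevision: 0.0.2
+      targetRevision: 0.3.1  # sync, then bump to 0.3.2 to reproduce the resources-block removal
```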

I hope this helps.

Thanks - Aaron

@andrii-korotkov-verkada
Contributor

I can see that various probes are being modified as well, e.g. the readiness probe - i.e. the live manifest has extra fields compared to the target one.

Also, terminationMessagePath: /dev/termination-log and terminationMessagePolicy: File are in the live state, but not in the target state.

Do you know what adds those?

@andrii-korotkov-verkada andrii-korotkov-verkada added the more-information-needed Further information is requested label Nov 17, 2024
@thecosmicfrog
Author

thecosmicfrog commented Nov 17, 2024

@andrii-korotkov-verkada The probe changes are coming from Kubernetes defaults:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes

i.e. the default periodSeconds is 10; the default successThreshold is 1, etc.

The same goes for terminationMessagePath and terminationMessagePolicy - these are Kubernetes defaults, per:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#deployment-v1-apps
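For example, these are the fields that Kubernetes defaulting adds to the container in the Live Manifest above (none of them come from the chart):

```
livenessProbe:
  failureThreshold: 3    # default
  httpGet:
    path: /
    port: 5678
    scheme: HTTP         # default
  periodSeconds: 10      # default
  successThreshold: 1    # default
  timeoutSeconds: 1      # default
terminationMessagePath: /dev/termination-log   # default
terminationMessagePolicy: File                 # default
```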

@andrii-korotkov-verkada
Contributor

I see. At this point, I don't know what the issue could be. I'd suggest trying to get a minimal repro, i.e. removing fields from the spec until it starts working normally.

@andrii-korotkov-verkada andrii-korotkov-verkada removed the more-information-needed Further information is requested label Nov 17, 2024
@thecosmicfrog
Author

thecosmicfrog commented Nov 17, 2024

@andrii-korotkov-verkada I'm not sure what else I can remove here - I feel the chart I provided (a single Deployment object) was pretty minimal?

Our real applications are significantly more complex, so I already trimmed down as much as I could, going for this "single Deployment echo-server" example.

One thing that I know fixes the issue is setting IncludeMutationWebhook=true in the Application's compare-options, per:
https://argo-cd.readthedocs.io/en/latest/user-guide/diff-strategies/#mutation-webhooks

But it's unclear to me why this fixes the issue. From a pure and simple perspective, why would removing a field from a Deployment break Argo's sync engine? It seems like a simple diff to resolve?

I note many similarities with this open issue: #17568. The reporter is seeing the same error, word-for-word. The error itself doesn't even seem to make sense (it's complaining that containers[0] doesn't have a name field, but it does).

Thanks for your help!

@andrii-korotkov-verkada
Contributor

andrii-korotkov-verkada commented Nov 17, 2024

I see. I'll take a shot at reading the code for removing webhook mutations and try to find bugs there. I'll get back with the results of my investigation.

@thecosmicfrog
Author

Thanks @andrii-korotkov-verkada - hugely appreciated. As always, let me know if there's any additional information I can provide.

@andrii-korotkov-verkada
Contributor

I'll try upgrading the dependency lib to see if that helps: argoproj/gitops-engine#637.

andrii-korotkov-verkada added a commit to andrii-korotkov-verkada/argo-cd that referenced this issue Nov 19, 2024
Fixes argoproj#20792

Use a new version of gitops engine with a fix.

This needs to be cherry-picked in v2.11-2.13.

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
andrii-korotkov-verkada added a commit to andrii-korotkov-verkada/gitops-engine that referenced this issue Nov 19, 2024
Helps with argoproj/argo-cd#20792

Removed and modified sets may only contain the fields that changed, not including key fields like "name". This can cause merge to fail, since it expects those fields to be present if they are present in the predicted live.
Fortunately, we can inspect the set and derive the key fields necessary. Then they can be added to the set and used during a merge.
Also, have a new test which fails before the fix, but passes now.

Failure of the new test before the fix
```
            	Error:      	Received unexpected error:
            	            	error removing non config mutations for resource Deployment/nginx-deployment: error reverting webhook removed fields in predicted live resource: .spec.template.spec.containers: element 0: associative list with keys has an element that omits key field "name" (and doesn't have default value)
            	Test:       	TestServerSideDiff/will_test_removing_some_field_with_undoing_changes_done_by_webhook
```

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
@andrii-korotkov-verkada
Contributor

In theory, the structured-merge-diff library should be able to merge using key values from paths. I've opened kubernetes-sigs/structured-merge-diff#273 to follow up with them.
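To make the failure mode concrete, here is a rough illustration (my own sketch, not actual library output) of the shape involved: the "removed fields" set addresses the container only via the list key in the path, k:{"name":"http-echo"}, but does not itself contain f:name - which is what the 'omits key field "name"' part of the error refers to:

```
# Illustrative FieldsV1-style fragment (sketch only)
f:spec:
  f:template:
    f:spec:
      f:containers:
        k:{"name":"http-echo"}:
          f:resources:
            f:requests:
              f:cpu: {}
              f:memory: {}
```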

@andrii-korotkov-verkada
Contributor

I've also posted on Kubernetes Slack in the respective channel and tagged the person who seems to be the maintainer based on commits.

andrii-korotkov-verkada added a commit to andrii-korotkov-verkada/gitops-engine that referenced this issue Nov 22, 2024
Helps with argoproj/argo-cd#20792

@thecosmicfrog
Author

Hi @andrii-korotkov-verkada. Thanks for the discussion on today's contributor call.

Am I correct in my understanding that you are waiting on further discussion with the structured-merge-diff maintainers? And, failing any progress on that side, you will aim to merge your fix on the Argo CD side as a workaround? Just so I'm clear.

Thanks again.

@andrii-korotkov-verkada
Contributor

We are waiting for them, yes. There's a volunteer looking into this at the moment. I'm not sure a workaround would get merged unless they really push back hard.

@leoluz
Collaborator

leoluz commented Dec 11, 2024

@thecosmicfrog @andrii-korotkov-verkada The structured-merge-diff PR with the fix is now merged:
kubernetes-sigs/structured-merge-diff#275

andrii-korotkov-verkada added a commit to andrii-korotkov-verkada/gitops-engine that referenced this issue Dec 11, 2024
Helps with argoproj/argo-cd#20792

leoluz pushed a commit to argoproj/gitops-engine that referenced this issue Dec 11, 2024
* fix: Server side diff now works correctly with some fields removal

Helps with argoproj/argo-cd#20792

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

* Use new version of structured merge diff with a new option

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

* Add DCO

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

* Try to fix sonar exclusions config

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

---------

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
@leoluz
Collaborator

leoluz commented Dec 11, 2024

@thecosmicfrog The fix is in this PR: #20842 and auto-merge is enabled.

Will you be able to validate this build in your environment once merged?

leoluz pushed a commit that referenced this issue Dec 11, 2024
* fix: Fix server side diff with fields removal (#20792)

Fixes #20792

Use a new version of gitops engine with a fix.

This needs to be cherry-picked in v2.11-2.13.

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

* Use newer versions of gitops-engine and structured-merge-diff

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

---------

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
@thecosmicfrog
Author

thecosmicfrog commented Dec 11, 2024

@leoluz Great news, thanks! I'm using argo-helm to install and configure Argo CD. Are there any releases of that chart I can use, or instructions on how to build my own? Apologies - I'm away from my dev machine at the moment.

@andrii-korotkov-verkada
Contributor

@thecosmicfrog, I usually use make image and create Docker manifests as well, and then upload them to whatever registry the company uses (AWS ECR in our case).
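For example, something along these lines (a rough sketch - the image name produced by make image, the registry, and the tag are placeholders to adapt to your setup):

```
# Build the Argo CD image from a checkout of master (with the fix merged)
make image

# Re-tag and push to your own registry (ECR as an example)
docker tag <image-built-by-make> <account>.dkr.ecr.<region>.amazonaws.com/argocd:ssd-fix
docker push <account>.dkr.ecr.<region>.amazonaws.com/argocd:ssd-fix
```

You can then point the argo-helm chart at that image (e.g. via the global.image.repository and global.image.tag values).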

@leoluz leoluz reopened this Dec 11, 2024
@leoluz
Collaborator

leoluz commented Dec 11, 2024

Reopening and waiting for @thecosmicfrog to confirm whether the issue is addressed in the current master.

@thecosmicfrog
Author

Thanks @leoluz @andrii-korotkov-verkada - it will probably be tomorrow before I can check. Will report back ASAP.

@thecosmicfrog
Author

Hi @andrii-korotkov-verkada @leoluz. Great news. I have tested three previously-failing applications and they are now working as expected! No more errors related to server-side diff and the applications sync up correctly.

I just wanted to express my thanks to you for working on this. It's been a great experience and I hugely appreciate it.

Side question - do you know when this new image will be released in argo-helm?

I'll go ahead and close the issue now. All the best! 🙂

@andrii-korotkov-verkada
Contributor

You are welcome!

GuySaar8 pushed a commit to GuySaar8/argo-cd that referenced this issue Dec 12, 2024 (argoproj#20842)

* fix: Fix server side diff with fields removal (argoproj#20792)

@leoluz leoluz added this to the v2.14 milestone Dec 12, 2024
@leoluz
Collaborator

leoluz commented Dec 12, 2024

Side question - do you know when this new image will be released in argo-helm?

@thecosmicfrog This fix is going to be part of Argo CD 2.14.

@thecosmicfrog
Author

@leoluz Fantastic, thanks!
