
Upgrade from v0.17.2 to v0.18.0 causes changes on each reconcile cycle #450

Closed
davidkarlsen opened this issue Oct 10, 2021 · 19 comments · Fixed by #451

Comments

@davidkarlsen (Contributor) commented Oct 10, 2021

Before the upgrade we only got events on Slack when there were actual changes checked into git; after the upgrade we get events every 10 minutes:

CustomResourceDefinition/volumesnapshots.snapshot.storage.k8s.io configured
Secret/flux-system/slack-url configured

The slack-url is a SOPS-encrypted secret.

@kingdonb (Member) commented Oct 10, 2021

If you have creationTimestamp: null in your secret's metadata, try removing it.

(If you have the .sops.yaml config set up as in the tutorial and sops in PGP mode, with read-write access to the secrets in your keychain, you can re-encrypt the secret with sops <encfile.yaml>; it may not be so easy with KMS or other methods.)

I thought we resolved all of these issues before 0.18.0 was released, but apparently/unfortunately there are still a few things like this.

I had this issue in pre-release testing and it went away for me when I removed the unnecessary creationTimestamp setting from my secrets. Please note that if your secrets are encrypted, the data is the encrypted part, but I'm pretty sure the whole secret is hashed and signed. So you may get errors from SOPS if you try to edit the secret by hand, rather than re-encrypting it with the SOPS cli.

I do not think Flux will currently balk at secrets with incorrect hashes or signatures though, I believe it only cares about the encrypted parts (and whether or not the expected keys are available to decrypt them.)
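For reference, a minimal sketch of the re-encryption step with the sops CLI (the file name is hypothetical; this assumes PGP mode with the private key available in your local keychain):

# Opens the decrypted content in $EDITOR; saving re-encrypts it and recomputes the MAC:
sops secret.enc.yaml
# Verify it still decrypts cleanly afterwards:
sops -d secret.enc.yaml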

@davidkarlsen (Contributor, Author) commented Oct 10, 2021

There's no creationTimestamp in it.
BTW: I use stringData in the secret - could that cause it to be detected as a change?
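A note on the stringData question: the Kubernetes API server folds stringData into data (base64-encoded) on write and never returns stringData on reads, so a naive comparison of the manifest against the live object would always see a difference. A minimal illustration, with the value redacted:

# What is applied:
stringData:
  address: redacted
# What the API server stores and returns:
data:
  address: cmVkYWN0ZWQ=   # base64 of "redacted"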

@davidkarlsen (Contributor, Author)

This resource seems to cause "loops" as well:

---
# Possible Template Parameters:
#
# kube-system
# altinity/clickhouse-operator:0.15.0
# etc-clickhouse-operator-confd-files
#
apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-clickhouse-operator-confd-files
  namespace: clickhouse-operator
  labels:
    app: clickhouse-operator
data:

(yes it's empty)

@stefanprodan (Member) commented Oct 11, 2021

@davidkarlsen I can't reproduce the empty ConfigMap issue; I tried it on Kubernetes 1.20.2 and 1.21.2. Can you please post the output of flux check here? Is there some other controller that updates that ConfigMap in the cluster?

@stefanprodan (Member)

> If you have creationTimestamp: null in your secret's metadata, try removing it.

@kingdonb kustomize-controller v0.15 does not take this field into consideration when detecting drift; are you sure you're using v0.15?

@bergemalm commented Oct 11, 2021

I can confirm that secrets with creationTimestamp: null cause this issue. For each cycle there is a configured event and a notification. Removing creationTimestamp solves it.

Events each cycle before removal:

0s          Normal   info     kustomization/flux-system   Secret/service/my-tls configured
0s          Normal   info     kustomization/flux-system   Reconciliation finished in 1.796202531s, next run in 10m0s

Events each cycle after removal:

0s          Normal   info     kustomization/flux-system   Reconciliation finished in 1.500361335s, next run in 10m0s

This example's interval is .spec.interval from a HelmRelease.
kustomize-controller v0.15.2.

Edit:
The diff for the Kustomization between cycles, before the removal, showed only an updated resourceVersion and lastTransitionTime.
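One way to watch for these per-cycle events (a sketch, not taken from the original report):

kubectl -n flux-system get events --field-selector involvedObject.kind=Kustomization -w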

@stefanprodan (Member) commented Oct 11, 2021

@bergemalm thanks for the report! Can you please say which Kubernetes version you are using? Is that secret SOPS-encrypted? Where was the creationTimestamp set?

@bergemalm commented Oct 11, 2021

@stefanprodan Kubernetes on GKE, version v1.20.9-gke.1001. The secret holds a certificate and is not encrypted.

apiVersion: v1
kind: Secret
data:
  my.ca: <base64 encoded data>
metadata:
  creationTimestamp: null <---- line removed
  name: my-tls
  namespace: service-namespace

@davidkarlsen (Contributor, Author) commented Oct 12, 2021

@stefanprodan I'm still seeing this on my sops secrets:

kustomize-controllerAPP  18:38
kustomization/flux-system.flux-system
Secret/flux-system/slack-url configured
revision
os-sandbox/b36c3c1c481cce60d83f4192cab60b84dfc328ad
18:43
kustomization/flux-system.flux-system
Secret/flux-system/slack-url configured
revision
os-sandbox/b36c3c1c481cce60d83f4192cab60b84dfc328ad

kustomize-controllerAPP  18:48
kustomization/flux-system.flux-system
Secret/flux-system/slack-url configured
revision
os-sandbox/b36c3c1c481cce60d83f4192cab60b84dfc328ad

kustomize-controllerAPP  18:53
kustomization/flux-system.flux-system
Secret/flux-system/slack-url configured
revision
os-sandbox/b36c3c1c481cce60d83f4192cab60b84dfc328ad

with image: ghcr.io/fluxcd/kustomize-controller:v0.15.4

apiVersion: v1
stringData:
    address: ENC[AES256_GCM,data:redacted==,type:str]
kind: Secret
metadata:
    name: slack-url
    namespace: flux-system
sops:
    kms: []
    gcp_kms: []
    azure_kv: []
    hc_vault: []
    age: []
    lastmodified: "2021-10-11T12:36:17Z"
    mac: ENC[AES256_GCM,data:redacted==,type:str]
    pgp:
        - created_at: "2021-02-21T18:51:29Z"
          enc: |-
            redacted
          fp: redacted
    encrypted_regex: ^(data|stringData)$
    version: 3.6.1

cleartext structure:

apiVersion: v1
stringData:
    address: redacted
kind: Secret
metadata:
    name: slack-url
    namespace: flux-system

@seh (Contributor) commented Oct 12, 2021

I assume there's a special case for Secret, but like fluxcd/flux2#1934, there's no top-level "spec" field here.

@stefanprodan (Member)

This is really strange, as I can't reproduce the SOPS spam on my cluster nor in CI. I've added tests to make sure this doesn't happen; the tests run on Kubernetes 1.20 and 1.21.

@davidkarlsen (Contributor, Author)

@stefanprodan - sorry, false alarm: the new kustomize-controller hadn't rolled out yet, so that one is gone now.

But I still struggle with this one:

# Possible Template Parameters:
#
# kube-system
# altinity/clickhouse-operator:0.15.0
# etc-clickhouse-operator-confd-files
#
apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-clickhouse-operator-confd-files
  namespace: clickhouse-operator
  labels:
    app: clickhouse-operator
data:

which causes an endless reconcile loop.

@davidkarlsen (Contributor, Author) commented Oct 12, 2021

This one too (the version checked into git vs. the one read from the cluster, separated by ---):

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: vault-restricted
  annotations:
    kubernetes.io/description: This is the least privileged SCC and it is used by vault users.
allowHostIPC: true
allowHostDirVolumePlugin: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
defaultAddCapabilities: null
allowedCapabilities:
- IPC_LOCK
- SETFCAP
allowedUnsafeSysctls: null
fsGroup:
  type: RunAsAny
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret

---

allowHostDirVolumePlugin: false
allowHostIPC: true
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities:
- IPC_LOCK
- SETFCAP
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups: []
kind: SecurityContextConstraints
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"allowHostDirVolumePlugin":false,"allowHostIPC":true,"allowHostNetwork":false,"allowHostPID":false,"allowHostPorts":false,"allowPrivilegeEscalation":true,"allowPrivilegedContainer":true,"allowedCapabilities":["IPC_LOCK","SETFCAP"],"allowedUnsafeSysctls":null,"apiVersion":"security.openshift.io/v1","defaultAddCapabilities":null,"fsGroup":{"type":"RunAsAny"},"kind":"SecurityContextConstraints","metadata":{"annotations":{"kubernetes.io/description":"This is the least privileged SCC and it is used by vault users."},"name":"vault-restricted"},"readOnlyRootFilesystem":false,"requiredDropCapabilities":["KILL","MKNOD"],"runAsUser":{"type":"RunAsAny"},"seLinuxContext":{"type":"MustRunAs"},"supplementalGroups":{"type":"RunAsAny"},"volumes":["configMap","downwardAPI","emptyDir","persistentVolumeClaim","projected","secret"]}
    kubernetes.io/description: This is the least privileged SCC and it is used by
      vault users.
  creationTimestamp: "2021-02-13T11:50:56Z"
  generation: 7
  labels:
    kustomize.toolkit.fluxcd.io/name: flux-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: vault-restricted
  resourceVersion: "1417667107"
  uid: 916b0536-ee09-4488-b714-0ce1991db0dd
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret

I'll try to drop the XXX: null ones.

@stefanprodan (Member)

@davidkarlsen thanks for posting both manifests. I'm going to work on a fix tomorrow that should cover all CRs that don't have a spec. My assumption was that every custom resource keeps its user-facing fields in .spec, but it seems that many controllers don't follow this convention.

PS: Please post your Kubernetes version here; I have added tests for empty ConfigMaps and still have no way to replicate the issue.
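To illustrate the convention in question, here is a hypothetical Widget resource (made up for contrast) next to the shape the SCC above actually uses:

# Most controllers keep user-facing fields under .spec:
apiVersion: example.com/v1
kind: Widget
metadata:
  name: demo
spec:
  replicas: 3

# SecurityContextConstraints keeps its fields at the root instead:
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: demo
allowHostIPC: true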

@davidkarlsen (Contributor, Author) commented Oct 12, 2021

Removing the XXX: null fields from the SCC fixed that one. The ConfigMap (#450 (comment)) is still being applied all the time.
Version:

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1+9807387", GitCommit:"98073871f173baaa04dc2bafab50effd62c308a6", GitTreeState:"clean", BuildDate:"2021-08-06T14:06:43Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

which is OpenShift 4.8.11, though this resource (a ConfigMap) should behave the same as on vanilla Kubernetes.

@stefanprodan (Member)

@davidkarlsen can you please do a test for me: set data: {} and see if that changes anything?

@davidkarlsen (Contributor, Author)

> @davidkarlsen can you please do a test for me: set data: {} and see if that changes anything?

Yes, that seems to work.
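Worth noting: in YAML a bare data: key parses as null, while data: {} is an empty map, so the two manifests describe different objects (a general YAML observation, not specific to Flux):

# A bare key parses as null:
data:

# An explicit empty map:
data: {}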

@davidkarlsen (Contributor, Author)

Naaah, too soon to conclude - same as before:

kustomize-controllerAPP  20:59
kustomization/flux-system.flux-system
ConfigMap/clickhouse-operator/etc-clickhouse-operator-confd-files configured
revision
os-global/d122b15af00016a93ca1a2164cd05e4f0eddae1b

kustomize-controllerAPP  21:12
kustomization/flux-system.flux-system
ConfigMap/clickhouse-operator/etc-clickhouse-operator-confd-files configured
revision
os-global/d122b15af00016a93ca1a2164cd05e4f0eddae1b

@stefanprodan (Member)

Ok, please try to remove the data: key entirely, like you did with the SCC fields.
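Following that suggestion, the manifest would presumably end up as:

apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-clickhouse-operator-confd-files
  namespace: clickhouse-operator
  labels:
    app: clickhouse-operator
# no data key at all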
