Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't set an ownerRef on secrets users are susceptible to copy around #3992

Merged
merged 14 commits into from
Dec 4, 2020

Conversation

sebgl
Copy link
Contributor

@sebgl sebgl commented Nov 30, 2020

A k8s bug (kubernetes/kubernetes#65200) may cause the
k8s garbage collection to delete undesired resources in case users manually
copy an operator-managed secret to another namespace.

To avoid that situation, this commit ensures no ownerRef is set on a subset of
managed secrets users are susceptible to copy around:

  • the elastic user password secret
  • elasticsearch public transport certs
  • elasticsearch, kibana, enterprise search, apm server public http certs

Existing ownerReferences set with earlier ECK versions will be removed when
reconciled.

Since they do not have an ownerRef anymore, those secrets are not automatically
deleted when the Elasticsearch resource is deleted. To work around that situation,
the secret reconciliation logic adds an additional set of labels to the reconciled
secrets that don't have an ownerRef specified. These labels reference the "soft"
owner ("soft" as in handled through some custom code and not through k8s builtin
garbage collection logic).
Once a controller receives a deletion event for the resource it manages, it will
automatically remove the soft-owned secrets.

Additionally, garbage collection runs at operator startup to cover the case where
owners were deleted while the operator was down.

TODO:

  • Unit tests
  • E2E test
  • GC secrets on operator startup to cover the case where the operator is down during the owner deletion
  • Major bug to fix: the soft owner reference needs to include the object Kind, otherwise we may end up mismatching e.g. Elasticsearch stuff with Kibana stuff if they have the same name.

Fixes #3986.

A k8s bug (kubernetes/kubernetes#65200) may cause the
k8s garbage collection to delete undesired resources in case users manually
copy an operator-managed secret to another namespace.

To avoid that situation, this commit ensures no ownerRef is set on a subset of
managed secrets users are susceptible to copy around:
- the elastic user password secret
- elasticsearch public transport certs
- elasticsearch, kibana, enterprise search, apm server public http certs

Existing ownerReferences set with earlier ECK versions will be removed when
reconciled.

Since they do not have an ownerRef anymore, those secrets are not automatically
deleted when the Elasticsearch resource is deleted. To work around that situation,
the secret reconciliation logic adds an additional set of labels to the reconciled
secrets that don't have an ownerRef specified. These labels reference the "soft"
owner ("soft" as in handled through some custom code and not through k8s builtin
garbage collection logic).
Once a controller receives a deletion event for the resource it manages, it will
automatically remove the soft-owned secrets.
This is done as best-effort. Secrets will remain orphan if:
- the operator is not running when the owner is deleted
- an error happens while deleting the soft-owned secrets
@sebgl sebgl changed the title Don't set an ownerRef on secrets users are susceptible to copy around (wip) Don't set an ownerRef on secrets users are susceptible to copy around Nov 30, 2020
Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just gave it a try, lgtm so far. I left a few nits and one question about the use of the UID instead of the object name.

pkg/controller/common/reconciler/secret.go Outdated Show resolved Hide resolved
pkg/controller/common/reconciler/secret.go Outdated Show resolved Hide resolved
pkg/controller/common/reconciler/secret.go Outdated Show resolved Hide resolved
pkg/controller/common/reconciler/secret.go Outdated Show resolved Hide resolved
// - we remove the existing owner
// - we set additional labels to perform garbage collection on owner deletion (best-effort)
expected.Labels[SoftOwnerNamespaceLabel] = theoreticalOwner.GetNamespace()
expected.Labels[SoftOwnerNameLabel] = theoreticalOwner.GetName()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it makes sense to rely on the owner's UID, and not on the object name, to be consistent with what is done by the K8S GC ? It would require for the object to have been stored first but I think it is a fair assumption ?

Suggested change
expected.Labels[SoftOwnerNameLabel] = theoreticalOwner.GetName()
expected.Labels[SoftOwnerUID] = string(theoreticalOwner.GetUID())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I think the problem is that the garbage collection part of the code only knows about the deleted resource name + namespace (as part of the reconcile.Request). We don't know anything about the parent ID since it doesn't exist anymore? (Maybe I'm misunderstanding what you mean).

pkg/controller/elasticsearch/elasticsearch_controller.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/elasticsearch_controller.go Outdated Show resolved Hide resolved
@sebgl
Copy link
Contributor Author

sebgl commented Dec 1, 2020

  • Major bug to fix: the soft owner reference needs to include the object Kind, otherwise we may end up mismatching e.g. Elasticsearch stuff with Kibana stuff if they have the same name.

cmd/manager/main.go Outdated Show resolved Hide resolved
pkg/apis/apm/v1/apmserver_types.go Show resolved Hide resolved
@sebgl
Copy link
Contributor Author

sebgl commented Dec 1, 2020

Current status: this PR is missing unit tests, but otherwise I don't expect any code change.

@sebgl sebgl changed the title (wip) Don't set an ownerRef on secrets users are susceptible to copy around Don't set an ownerRef on secrets users are susceptible to copy around Dec 2, 2020
Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, did a few simple tests that all seemed to work. Quirk: If you delete your es cluster while the operator is not running and recreate another cluster with the same name it adopts the left over resources from the previous incarnation which means password stays the same etc.

cmd/manager/main.go Outdated Show resolved Hide resolved
cmd/manager/main_test.go Show resolved Hide resolved
pkg/controller/common/reconciler/secret.go Outdated Show resolved Hide resolved
pkg/controller/common/reconciler/secret.go Outdated Show resolved Hide resolved
@sebgl
Copy link
Contributor Author

sebgl commented Dec 3, 2020

run full pr build

@sebgl
Copy link
Contributor Author

sebgl commented Dec 3, 2020

We're missing a watch on soft-owned secrets to automatically trigger reconciliations on any change (e.g. deleting the elastic user secret).
I'll add another commit.

@pebrc
Copy link
Collaborator

pebrc commented Dec 3, 2020

We're missing a watch on soft-owned secrets to automatically trigger reconciliations on any change (e.g. deleting the elastic user secret).
I'll add another commit.

Good catch! I totally forgot that we are basing our watches on the owner reference :-(

@sebgl sebgl requested a review from pebrc December 3, 2020 12:01
Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, so happy we have these e2e tests...

pkg/controller/common/watches/secrets_test.go Show resolved Hide resolved
pkg/controller/common/watches/secrets_test.go Outdated Show resolved Hide resolved
@pebrc
Copy link
Collaborator

pebrc commented Dec 3, 2020

Jenkins test this please

My bad still not fixed #3617

@sebgl
Copy link
Contributor Author

sebgl commented Dec 3, 2020

run full pr build

@sebgl sebgl merged commit 34a1209 into elastic:master Dec 4, 2020
sebgl added a commit to sebgl/cloud-on-k8s that referenced this pull request Dec 4, 2020
…elastic#3992)

* Don't set an ownerRef on secrets users are susceptible to copy around

A k8s bug (kubernetes/kubernetes#65200) may cause the
k8s garbage collection to delete undesired resources in case users manually
copy an operator-managed secret to another namespace.

To avoid that situation, this commit ensures no ownerRef is set on a subset of
managed secrets users are susceptible to copy around:
- the elastic user password secret
- elasticsearch public transport certs
- elasticsearch, kibana, enterprise search, apm server public http certs

Existing ownerReferences set with earlier ECK versions will be removed when
reconciled.

Since they do not have an ownerRef anymore, those secrets are not automatically
deleted when the Elasticsearch resource is deleted. To work around that situation,
the secret reconciliation logic adds an additional set of labels to the reconciled
secrets that don't have an ownerRef specified. These labels reference the "soft"
owner ("soft" as in handled through some custom code and not through k8s builtin
garbage collection logic).
Once a controller receives a deletion event for the resource it manages, it will
automatically remove the soft-owned secrets.
This is done as best-effort. Secrets will remain orphan if:
- the operator is not running when the owner is deleted
- an error happens while deleting the soft-owned secrets

* Error out the reconciliation on garbage collection errors

* Improvements from PR review

* Label soft-owned secrets with the owner Kind

* Reuse Kind constants instead of hardcoded string

* Fix existing unit tests

* Garbage collect orphan soft-owned secrets on operator startup

* Fix linter warnings

* Add unit tests

* Cover soft owned secrets gc in e2e tests

* Fix linter warning (typo)

* Improvements from PR review

* Trigger reconciliations on soft-owned secrets events

* Fix linter stuff
sebgl added a commit that referenced this pull request Dec 4, 2020
…#3992) (#4008)

* Don't set an ownerRef on secrets users are susceptible to copy around

A k8s bug (kubernetes/kubernetes#65200) may cause the
k8s garbage collection to delete undesired resources in case users manually
copy an operator-managed secret to another namespace.

To avoid that situation, this commit ensures no ownerRef is set on a subset of
managed secrets users are susceptible to copy around:
- the elastic user password secret
- elasticsearch public transport certs
- elasticsearch, kibana, enterprise search, apm server public http certs

Existing ownerReferences set with earlier ECK versions will be removed when
reconciled.

Since they do not have an ownerRef anymore, those secrets are not automatically
deleted when the Elasticsearch resource is deleted. To work around that situation,
the secret reconciliation logic adds an additional set of labels to the reconciled
secrets that don't have an ownerRef specified. These labels reference the "soft"
owner ("soft" as in handled through some custom code and not through k8s builtin
garbage collection logic).
Once a controller receives a deletion event for the resource it manages, it will
automatically remove the soft-owned secrets.
This is done as best-effort. Secrets will remain orphan if:
- the operator is not running when the owner is deleted
- an error happens while deleting the soft-owned secrets

* Error out the reconciliation on garbage collection errors

* Improvements from PR review

* Label soft-owned secrets with the owner Kind

* Reuse Kind constants instead of hardcoded string

* Fix existing unit tests

* Garbage collect orphan soft-owned secrets on operator startup

* Fix linter warnings

* Add unit tests

* Cover soft owned secrets gc in e2e tests

* Fix linter warning (typo)

* Improvements from PR review

* Trigger reconciliations on soft-owned secrets events

* Fix linter stuff
@jberger
Copy link

jberger commented Apr 28, 2022

I guess I understand why this is done however it does mean that those generated resources do not show up in tools like Argo. When lots of resources are generated being visible from tooling is helpful so as to understand what has even been generated. Would there be any appetite to revisit this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug Something isn't working v1.3.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Work around Kubernetes bug causing garbage collection across namespaces to happen when ownerReferences are set
4 participants