Ensure disks removal after removing cluster in GKE #1163

Merged
artemnikitin merged 6 commits into elastic:master from 28-eck-snapshot-build on Jul 8, 2019

Conversation

artemnikitin
Member

Thanks to @barkbay, it was found that after deleting clusters in GKE, the disks from those instances are not removed automatically. This change introduces a step for all CI jobs running in GKE to remove those disks properly.

close #1162
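For context, unattached disks can be spotted with a gcloud listing like the one below. This is a generic sketch of the detection step, not necessarily the exact command used by this change:

# List persistent disks that are not attached to any instance
gcloud compute disks list --filter="-users:*"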

@artemnikitin artemnikitin requested a review from barkbay June 28, 2019 04:22
@artemnikitin artemnikitin changed the title from "28 eck snapshot build" to "Ensure disks removal after removing cluster in GKE" on Jun 28, 2019
@artemnikitin artemnikitin requested a review from sebgl July 1, 2019 06:23
@artemnikitin artemnikitin requested a review from sebgl July 1, 2019 10:09
@artemnikitin
Member Author

Looks like infra is doing something similar to this. I will contact them to sync our efforts; maybe they can do that on their side.

@artemnikitin
Member Author

The infra solution only covers their side.

@@ -144,3 +144,15 @@ ci-e2e-delete-cluster: vault-gke-creds
-e "GKE_SERVICE_ACCOUNT_KEY_FILE=$(GO_MOUNT_PATH)/build/ci/$(GKE_CREDS_FILE)" \
cloud-on-k8s-ci-e2e \
bash -c "make -C operators set-context-gke delete-gke"
Contributor

IIUC, every time we delete a GKE cluster, the underlying disks are not deleted. This applies to e2e tests but also to our own clusters, right?
Should we do this every time, as part of the delete-gke target?

Member Author

This fix applies to all disks, not only those from e2e test clusters. There is no need to explicitly run delete-gke.

Contributor

OK. But isn't there a bug in the delete-gke target (that we use ourselves) if the disks are not deleted?

Member Author

Yeah, sure, if it runs and some unused disks are left, then it's a bug. I only ran it before submitting the original PR, and it cleaned up everything at that moment.

@artemnikitin
Member Author

jenkins test this please

@thbkrkr
Contributor

thbkrkr commented Jul 4, 2019

I would like to understand the cause of this issue.
I just deleted a GKE cluster while watching the disks at the same time, and the disks were automatically deleted. Their status immediately changes to DELETING.

> gcloud compute disks list  | grep thb
gke-thb-e2e-cluster-default-pool-5b7e16fd-ppr0                   europe-west1-d  zone            30       pd-ssd       DELETING
gke-thb-e2e-cluster-default-pool-f3c9866f-rw75                   europe-west1-b  zone            30       pd-ssd       DELETING
gke-thb-e2e-cluster-default-pool-106593c7-3r5n                   europe-west1-c  zone            30       pd-ssd       DELETING

What is "killing clusters in GKE" compared to "deleting clusters"?

@artemnikitin
Member Author

@thbkrkr yeah, you're right. Right now the disks start being deleted right after the instances are deleted, but that wasn't the case in the past. Either something changed on the GCP side, or it's the infra job for removing disks (mentioned in https://github.com/elastic/infra/pull/12703). I will try to figure it out.

What is "killing clusters in GKE" compared to "deleting clusters"?

It's the same 😄

@thbkrkr
Contributor

thbkrkr commented Jul 4, 2019

I would like to understand the cause of this issue.

I think the reason is that PVs were not deleted before the cluster was deleted.
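If that is the cause, deleting the PVCs before tearing down the cluster would let the dynamically provisioned disks be released and removed by GCP. A minimal sketch, assuming the PVs use the default Delete reclaim policy:

# Remove all PVCs in every namespace before deleting the GKE cluster,
# so the bound PVs and their backing disks are cleaned up by the provisioner.
for ns in $(kubectl get ns -o name | cut -d/ -f2); do
  kubectl delete pvc --all -n "${ns}"
done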

@artemnikitin
Member Author

jenkins test this please

@artemnikitin artemnikitin merged commit 899ee31 into elastic:master Jul 8, 2019
@artemnikitin artemnikitin deleted the 28-eck-snapshot-build branch July 8, 2019 09:02
sebgl added a commit that referenced this pull request Jul 12, 2019
* Support for APM server configuration (#1181)

* Add a config section to the APM server configuration

* APM: Add support for keystore

* Factorize ElasticsearchAuthSettings

* Update dev setup doc + fix GKE bootstrap script (#1203)

* Update dev setup doc + fix GKE bootstrap script

* Update wording of container registry authentication

* Ensure disks removal after removing cluster in GKE (#1163)

* Update gke-cluster.sh

* Implement cleanup for unused disks in GCP

* Update Makefile

* Update CI jobs to do proper cleanup

* Normalize the raw config when creating canonical configs (#1208)

This aims at counteracting the difference between JSON-centric serialization and the use of YAML as the serialization format in canonical configs. Without normalization, numeric values like 1 will differ when comparing configs, because JSON deserializes integer numbers to float64 while YAML deserializes them to uint64.

* Homogenize logs (#1168)

* Don't run tests if only docs are changed (#1216)

* Update Jenkinsfile

* Simplify notOnlyDocs()

* Update Jenkinsfile

* Push snapshot ECK release on successful PR build (#1184)

* Update makefile's to support snapshots

* Add snapshot releases to Jenkins pipelines

* Cleanup

* Rename RELEASE to USE_ELASTIC_DOCKER_REGISTRY

* Update Jenkinsfile

* Add a note on EKS inbound traffic & validating webhook (#1211)

EKS users must explicitly enable communication from the k8s control
plane to the nodes on port 443 in order for the control plane to reach the
validating webhook.

Should help with #896.

* Update PodSpec with Hostname from PVC when re-using (#1204)

* Bind the Debug HTTP server to localhost by default (#1220)

* Run e2e tests against custom Docker image (#1135)

* Add implementation

* Update makefile's

* Update Makefile

* Rename Jenkinsfile

* Fix review comments

* Update e2e-custom.yml

* Update e2e-custom.yml

* Return deploy-all-in-one to normal

* Delete GKE cluster only if changes not in docs (#1223)

* Add operator version to resources (#1224)

* Warn if unsupported distribution (#1228)

The operator only works with the official ES distributions to enable the security features
available with the basic (free), gold and platinum licenses, in order to ensure that
all launched clusters are secured by default.

A check is done in the prepare-fs script by looking for the existence of the
Elastic License. If it is not present, the script exits with a custom exit code.

Then the ES reconciliation loop sends an event of type Warning if it detects that
a prepare-fs init container terminated with this exit code.

* Document Elasticsearch update strategy change budget & groups (#1210)

Add documentation for the `updateStrategy` section of the Elasticsearch
spec.

It documents how (and why) `changeBudget` and `groups` are used by ECK,
and how both settings can be specified by the user.
Linked issue that merging this pull request may close: Proper removal of disks after deletion of cluster in GKE (#1162)