Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix76: Scale STS / PVC vertically to support resizing of druid nodes #97

Merged
merged 6 commits into from
Oct 13, 2021

Conversation

AdheipSingh
Copy link
Contributor

@AdheipSingh AdheipSingh commented Dec 1, 2020

Fixes #76

Description

Sequence Flow

- For every node in array of nodes do the following if the node is statefulset
1. Get sts set
2. Get pvc referenced by that sts on label
3. GetDesiredPVCSize from CR
4. GetCurrentPVCSize from sts 
5. Validate the request, shrinkage of pvc not possible, so currentSize cannot be greater than desiredSize, reject request here
5. if GetDesiredPVCSize != GetCurrentPVCSize {
    then delete sts with cascade false
    }
6. Operator on reconcile created the sts, since it sees their is no sts present
7. GetPVCSizeofSts from pvc
8. If GetDesiredPVCSize != GetPVCSizeofSts {
      using deepcopy patch the pvc, dont delete the pvc, no use of update CRUD just patch 
   }

Know Issues

  • This function to be called before STS Creation starts, so on a new druid where no sts is created this function will log saying it cannot find any sts for that nodeType ( cold, hot, mm etc )
  • The program does not return here, continues to other flow, after sts is created on next reconcile this error should go away
  • 10 sec reconcile being a bit fast, so at times patch may take few more sec, so their might be a error log saying sts could not be updated, but goes away quickly

TODO:

  • the operator should not allow shrinkage of pvc claims, this should be validated by the operator and logged/sendEvents.
    Here's a sample log which shall be thrown
{"level":"error","ts":1606744048.5905652,"logger":"druid_operator_handler","msg":"Failed to Patch [PVC:cold-data-master-in-druid-historicals-cold-0] due to [PersistentVolumeClaim \"cold-data-master-in-druid-historicals-cold-0\" is invalid: spec.resources.requests.storage: Forbidden: field can not be less than previous value]","error":"Failed to Patch [PVC:cold-data-master-in-druid-historicals-cold-0] due to [PersistentVolumeClaim \"cold-data-master-in-druid-historicals-cold-0\" is invalid: spec.resources.requests.storage: Forbidden: field can not be less than previous value]"
  • To validate that no other change apart from pvc size in condition

This PR has:

  • been tested on a real K8S cluster to ensure creation of a brand new Druid cluster works.
  • been tested for backward compatibility on a real K*S cluster by applying the changes introduced here on an existing Druid cluster. If there are any backward incompatible changes then they have been noted in the PR description.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Key changed/added files in this PR
  • handler.go

@himanshug @nishantmonu51

@AdheipSingh
Copy link
Contributor Author

Here is a sequence of logs

{"level":"info","ts":1606914406.0898578,"logger":"druid_operator_handler","msg":"Detected Change in VolumeClaimTemplate Sizes for Statefuleset [master-in-druid-historicals-cold] in Namespace [master-in], desVolumeClaimTemplateSize: [74Gi], currVolumeClaimTemplateSize: [73Gi]\n, deleteing STS [master-in-druid-historicals-cold] with casacde=false]"}
{"level":"info","ts":1606914407.2999537,"logger":"druid_operator_handler","msg":"[StatefuleSet:master-in-druid-historicals-cold] successfully deleted with casacde=false","name":"master-in","namespace":"master-in"}
{"level":"info","ts":1606914407.3025851,"logger":"druid_operator_handler","msg":"[PVC:cold-data-master-in-druid-historicals-cold-0] successfully Patched with [Size:73Gi]","name":"master-in","namespace":"master-in"}
{"level":"error","ts":1606914408.527233,"logger":"druid_operator_handler","msg":"Failed to update [StatefulSet:master-in-druid-historicals-cold] due to [StatefulSet.apps \"master-in-druid-historicals-cold\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden]

So the last log is fine, since it takes a bit time for reconcile to run and create new sts with the desired spec.

@AdheipSingh
Copy link
Contributor Author

image

@AdheipSingh AdheipSingh changed the title [WIP] Fix76: scalePvc seamless Fix76: scalePvc seamless Dec 31, 2020
@AdheipSingh
Copy link
Contributor Author

@himanshug can you pls review this

@AdheipSingh
Copy link
Contributor Author

@himanshug
Copy link
Member

just updating that I am aware of this one ... I did take a quick look today and I maybe wrong but it felt somewhat complex on first look for handling a very special case. I couldn't spend enough time to think through about how best to simplify it though or if that is even possible.

@AdheipSingh
Copy link
Contributor Author

just updating that I am aware of this one ... I did take a quick look today and I maybe wrong but it felt somewhat complex on first look for handling a very special case. I couldn't spend enough time to think through about how best to simplify it though or if that is even possible.

:)
i am pretty sure this is the best way to upgrade sts, we have been using this in-house and haven't faced any downtime or disruptions during updates. The only function which i am not using is the checkIfStsHasEverythingElseSame since for me i really dont care if anything is updated or not. So thats why i requested your review/suggestion on the best way to check that ( open to change )

The only hack which i used is to sleep , this was just used so that the operator does not throw an error log since after deletion it takes some time to upgrade, regardless of the error log their is no issue, just wanted to keep logs clean :)

time.Sleep(10 * time.Second)

ill be happy to explain any issues/queries

@AdheipSingh
Copy link
Contributor Author

@himanshug any update on this here ?

controllers/druid/handler.go Outdated Show resolved Hide resolved
controllers/druid/handler.go Outdated Show resolved Hide resolved
controllers/druid/handler.go Outdated Show resolved Hide resolved
controllers/druid/handler.go Outdated Show resolved Hide resolved
controllers/druid/handler.go Outdated Show resolved Hide resolved
controllers/druid/handler.go Outdated Show resolved Hide resolved
controllers/druid/handler.go Outdated Show resolved Hide resolved
controllers/druid/handler.go Outdated Show resolved Hide resolved
controllers/druid/handler.go Outdated Show resolved Hide resolved
controllers/druid/handler.go Outdated Show resolved Hide resolved
@AdheipSingh AdheipSingh marked this pull request as draft January 24, 2021 17:40
@AdheipSingh AdheipSingh marked this pull request as ready for review June 7, 2021 03:22
@AdheipSingh
Copy link
Contributor Author

@himanshug over to you for review.

controllers/druid/handler.go Outdated Show resolved Hide resolved
@AdheipSingh AdheipSingh changed the title Fix76: scalePvc seamless Fix76: Scale STS / PVC vertically to support size resizing of druid nodes Oct 13, 2021
Copy link
Member

@nishantmonu51 nishantmonu51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nishantmonu51 nishantmonu51 merged commit e2e1db9 into druid-io:master Oct 13, 2021
@AdheipSingh AdheipSingh changed the title Fix76: Scale STS / PVC vertically to support size resizing of druid nodes Fix76: Scale STS / PVC vertically to support resizing of druid nodes Oct 13, 2021
CodingParsley added a commit to confluentinc/druid-operator that referenced this pull request Jan 5, 2023
* Restrict K8s role definitions to actually needed roles (druid-io#226)

* PK | druid-io#255 | Restrict K8s role definitions to actually needed roles

* PK | druid-io#255 | Restrict K8s role definitions to actually needed roles - templates/

* PK | druid-io#255 | Add patch to event and newline at the end of the file

* PK | druid-io#255 | Add patch to event to templates

Co-authored-by: Piotr Kmita <piotr.kmita@zalando.de>

* Introduce State Driven Status, use record pkg for event and clean logging (druid-io#223)

* status v1, records

* use druid constructor in main.go

* fix smoke tests

* fix comments

* update delete event, status

* push emitter changes

Co-authored-by: AdheipSingh <adheip.singh@rilldata.com>

* update to go.mod to k8s 1.22 (druid-io#237)

* update to k8s 1.22

* update go.mod

* use 1.19.2 for suit_test

* revert yaml change

* update identation/spaces in tiny

* update identation/spaces in tiny

Co-authored-by: AdheipSingh <adheip.singh@rilldata.com>

* Create watched_namespace.yaml (druid-io#248)

* Create watched_namespace.yaml

Create namespace that is being watched by Druid when declared and not being the default

* Update watched_namespace.yaml

Add some clarifications on comments

* Update watched_namespace.yaml

Fix to create all required namespaces by WATCH_NAMESPACE env var (comma separated list. This is intended to handle this future feature) only when they are not created yet

* Fix Helm Chart type error  (druid-io#253)

* run helm lint github actions

* add helm lint, template in CI

* Fix76: Scale STS / PVC vertically to support size resizing of druid nodes (druid-io#97)

* add storage class expansion check, generation check, refactor

* add flag for scalePvcSts

* add component label, update crd

* fix unit tests

* identation in testdata

* Fixing the image tag to be pulled for druid-operator (druid-io#259)

* the image tag fixed to druidio/druid-operator:latest

* Update Druid Operator Documentation (druid-io#255)

* update docs

* update typos

* update docs

* udpate docs

* Adding support for custom metricDimensions.json for statsd metrics emitter (druid-io#250)

* Adding support for custom metricDimensions.json.

* Addressing Adheip's comments.

* Fixing test failure.

* add patch to pvc rbac (druid-io#261)

* add support for druid 0.22.1 and zookeeper 3.7.0 (druid-io#267)

* add support for druid 0.22.1 and zookeeper 3.7.0

* Update tiny-cluster-zk.yaml

change ZooKeeper version from 3.7 to 3.7.0

* update appVersion to 0.0.8-release (druid-io#238)

Co-authored-by: AdheipSingh <adheip.singh@rilldata.com>

* Fix-214-Merge-Labels (druid-io#273)

* add health checks to operator manage (druid-io#278)

* safe guard pod pending deletion of pvc (druid-io#289)

* make nodeselector consistent (druid-io#284)

* make nodeselector consitent

* update handler

* fix revert, scale down pods replicas to zero (druid-io#283)

* fix typo for startupProbe (druid-io#282)

* Bump up chart version to 0.1.2 (druid-io#271) (druid-io#272)

* Watch Multiple Namespaces (druid-io#240)

* WIP: on druid k8s extension, do, and watch ns

* support for multiple namespaces to watch

* support for multiple namespaces to watch

* fix typo

* fix identation

* Release 0.0.9 - Go Mod and Dockerfile Update (druid-io#291)

* go modules update

* add icon

* update dockerfile

* fix tests

* update helm version

* max reconciles workerqueue (druid-io#299)

* emit detect underlying object type on CRUD (druid-io#300)

* make crd optional (druid-io#307)

* multiple watched namespaces (druid-io#309)

* Sidecar container spec for druid (druid-io#296)

* Sidecar container spec for druid

* Fix PR comments on additionalcontainer

* Add envFrom and add comments

* Add example spec to example.md

* Add multi container deployment to features.md

Co-authored-by: cinto <cinto@apple.com>

* update makefile, tests, controller-gen (druid-io#315)

* update make

* update makefile

* Fix-312-Run Test on CI, Controller Gen Version Update (druid-io#319)

* update fixes

* add more tests

* fix test kubebuilder

* seperate build

* revert build

Co-authored-by: AdheipSingh <adheips1222@gmail.com>

Co-authored-by: piotrkmita <43473995+piotrkmita@users.noreply.github.com>
Co-authored-by: Piotr Kmita <piotr.kmita@zalando.de>
Co-authored-by: AdheipSingh <34169002+AdheipSingh@users.noreply.github.com>
Co-authored-by: AdheipSingh <adheip.singh@rilldata.com>
Co-authored-by: Alby Hernández <61636487+achetronic@users.noreply.github.com>
Co-authored-by: shrutimantri <shruti1810@gmail.com>
Co-authored-by: Harini Rajendran <harini.rajendran@yahoo.com>
Co-authored-by: RoelofKuijpers <roelof@datanetic.com>
Co-authored-by: Youngwoo Kim <ywkim@apache.org>
Co-authored-by: Vladislav <vladislavPV@users.noreply.github.com>
Co-authored-by: Zhang Lu <91473238+zhangluva@users.noreply.github.com>
Co-authored-by: cintoSunny <67714887+cintoSunny@users.noreply.github.com>
Co-authored-by: cinto <cinto@apple.com>
Co-authored-by: AdheipSingh <adheips1222@gmail.com>
CodingParsley added a commit to confluentinc/druid-operator that referenced this pull request Jan 6, 2023
* Restrict K8s role definitions to actually needed roles (druid-io#226)

* PK | druid-io#255 | Restrict K8s role definitions to actually needed roles

* PK | druid-io#255 | Restrict K8s role definitions to actually needed roles - templates/

* PK | druid-io#255 | Add patch to event and newline at the end of the file

* PK | druid-io#255 | Add patch to event to templates

Co-authored-by: Piotr Kmita <piotr.kmita@zalando.de>

* Introduce State Driven Status, use record pkg for event and clean logging (druid-io#223)

* status v1, records

* use druid constructor in main.go

* fix smoke tests

* fix comments

* update delete event, status

* push emitter changes

Co-authored-by: AdheipSingh <adheip.singh@rilldata.com>

* update to go.mod to k8s 1.22 (druid-io#237)

* update to k8s 1.22

* update go.mod

* use 1.19.2 for suit_test

* revert yaml change

* update identation/spaces in tiny

* update identation/spaces in tiny

Co-authored-by: AdheipSingh <adheip.singh@rilldata.com>

* Create watched_namespace.yaml (druid-io#248)

* Create watched_namespace.yaml

Create namespace that is being watched by Druid when declared and not being the default

* Update watched_namespace.yaml

Add some clarifications on comments

* Update watched_namespace.yaml

Fix to create all required namespaces by WATCH_NAMESPACE env var (comma separated list. This is intended to handle this future feature) only when they are not created yet

* Fix Helm Chart type error  (druid-io#253)

* run helm lint github actions

* add helm lint, template in CI

* Fix76: Scale STS / PVC vertically to support size resizing of druid nodes (druid-io#97)

* add storage class expansion check, generation check, refactor

* add flag for scalePvcSts

* add component label, update crd

* fix unit tests

* identation in testdata

* Fixing the image tag to be pulled for druid-operator (druid-io#259)

* the image tag fixed to druidio/druid-operator:latest

* Update Druid Operator Documentation (druid-io#255)

* update docs

* update typos

* update docs

* udpate docs

* Adding support for custom metricDimensions.json for statsd metrics emitter (druid-io#250)

* Adding support for custom metricDimensions.json.

* Addressing Adheip's comments.

* Fixing test failure.

* add patch to pvc rbac (druid-io#261)

* add support for druid 0.22.1 and zookeeper 3.7.0 (druid-io#267)

* add support for druid 0.22.1 and zookeeper 3.7.0

* Update tiny-cluster-zk.yaml

change ZooKeeper version from 3.7 to 3.7.0

* update appVersion to 0.0.8-release (druid-io#238)

Co-authored-by: AdheipSingh <adheip.singh@rilldata.com>

* Fix-214-Merge-Labels (druid-io#273)

* add health checks to operator manage (druid-io#278)

* safe guard pod pending deletion of pvc (druid-io#289)

* make nodeselector consistent (druid-io#284)

* make nodeselector consitent

* update handler

* fix revert, scale down pods replicas to zero (druid-io#283)

* fix typo for startupProbe (druid-io#282)

* Bump up chart version to 0.1.2 (druid-io#271) (druid-io#272)

* Watch Multiple Namespaces (druid-io#240)

* WIP: on druid k8s extension, do, and watch ns

* support for multiple namespaces to watch

* support for multiple namespaces to watch

* fix typo

* fix identation

* Release 0.0.9 - Go Mod and Dockerfile Update (druid-io#291)

* go modules update

* add icon

* update dockerfile

* fix tests

* update helm version

* max reconciles workerqueue (druid-io#299)

* emit detect underlying object type on CRUD (druid-io#300)

* make crd optional (druid-io#307)

* multiple watched namespaces (druid-io#309)

* Sidecar container spec for druid (druid-io#296)

* Sidecar container spec for druid

* Fix PR comments on additionalcontainer

* Add envFrom and add comments

* Add example spec to example.md

* Add multi container deployment to features.md

Co-authored-by: cinto <cinto@apple.com>

* update makefile, tests, controller-gen (druid-io#315)

* update make

* update makefile

* Fix-312-Run Test on CI, Controller Gen Version Update (druid-io#319)

* update fixes

* add more tests

* fix test kubebuilder

* seperate build

* revert build

Co-authored-by: AdheipSingh <adheips1222@gmail.com>

* Adding minikube-setup instructions.

Update tiny-cluster.yaml to make it work in minikube.

Fixing operator startup bug.

Miscellaneous fixes to make local minikube setup work.

Fixing MM readiness probe and steps for minikube ingress-dns issue for local minikube setup.

* semaphore

path

generate

fix format

init-ci

add ci bin

path

kubebuilder move

dummy change

* Add command args to container creation (#6)

* go/codeowners: Generate CODEOWNERS [ci skip] (#7)

* [METRICS-4348] update obs-data team as codeownderswq (#8)

* [METRICS-4487] add obs-oncall as codeowners (#10)

* DP-8085 - Migrate to Sempahore self-hosted agent (#9)

* DP-9370 - Migrate to Semaphore self-hosted agent (#15)

* chore: update repo semaphore project

* Update service.yml (#17)

Update semaphore job commands and go version

update build triggers

update semaphore whitelist

Update project.yml

* update CI 

* nodespec should take precedence (#13)

Co-authored-by: piotrkmita <43473995+piotrkmita@users.noreply.github.com>
Co-authored-by: Piotr Kmita <piotr.kmita@zalando.de>
Co-authored-by: AdheipSingh <34169002+AdheipSingh@users.noreply.github.com>
Co-authored-by: AdheipSingh <adheip.singh@rilldata.com>
Co-authored-by: Alby Hernández <61636487+achetronic@users.noreply.github.com>
Co-authored-by: shrutimantri <shruti1810@gmail.com>
Co-authored-by: Harini Rajendran <harini.rajendran@yahoo.com>
Co-authored-by: RoelofKuijpers <roelof@datanetic.com>
Co-authored-by: Youngwoo Kim <ywkim@apache.org>
Co-authored-by: Vladislav <vladislavPV@users.noreply.github.com>
Co-authored-by: Zhang Lu <91473238+zhangluva@users.noreply.github.com>
Co-authored-by: cintoSunny <67714887+cintoSunny@users.noreply.github.com>
Co-authored-by: cinto <cinto@apple.com>
Co-authored-by: AdheipSingh <adheips1222@gmail.com>
Co-authored-by: Harini Rajendran <hrajendran@confluent.io>
Co-authored-by: Luke Young <91491244+lyoung-confluent@users.noreply.github.com>
Co-authored-by: Yun Fu <fuyun12345@gmail.com>
Co-authored-by: nlou9 <39046184+nlou9@users.noreply.github.com>
Co-authored-by: Corey Christous <cchristous@gmail.com>
Co-authored-by: Confluent Jenkins Bot <jenkins@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Operator Should Re-Create Sts when k8s API blocks Sts updates
3 participants