[Regression] Decommission is broken #542

Open · keith-mcclellan opened this issue May 28, 2021 · 18 comments

keith-mcclellan commented May 28, 2021

The StatefulSet is stopping the pod before cockroach node decommission is executed, so the node shows up as failed instead of as decommissioned.

The PVC also gets deleted, so if the scale-down caused a loss of quorum there is no way to recover from this state because the data is destroyed.

Steps to reproduce:
1. Create a 4-node cluster.
2. Change the node count to 3 nodes.

Screenshot: Screen Shot 2021-05-28 at 12 26 31 PM
Attached log: decommission-regression.log

@keith-mcclellan added the bug and release-blocker labels on May 28, 2021
@keith-mcclellan (Contributor, Author) commented:

the PVC gets deleted as well so there is no way to recover from this state.

@alinadonisa self-assigned this on May 31, 2021
@alinadonisa (Contributor) commented:

> the PVC gets deleted as well so there is no way to recover from this state.

@keith-mcclellan so we keep the PVCs and delete them only if the decommission command is successful?
@chrisseto I moved the Prune call to after the downscale code. The reason the decommission returned an error is still to be determined.

// Before doing any scaling, prune any PVCs that are not currently in use.
	// This only needs to be done when scaling up but the operation is a noop
	// if there are no PVCs not currently in use.
	// As of v20.2.0, CRDB nodes may not be recommissioned. To account for
	// this, PVCs must be removed (pruned) before scaling up to avoid reusing a
	// previously decommissioned node.
	// Prune MUST be called before scaling as older clusters may have dangling
	// PVCs.
	// All underlying PVs and the storageclasses they were created with should
	// make use of reclaim policy = delete. A reclaim policy of retain is fine
	// but will result in wasted money, recycle should be considered unsafe and
	// is officially deprecated by kubernetes.
	if err := s.PVCPruner.Prune(ctx); err != nil {
		return errors.Wrap(err, "initial PVC pruning")
	}
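
For illustration, a minimal sketch of that reordering, with hypothetical Scaler/Pruner interfaces standing in for the operator's real types (this is a sketch, not the actual implementation):

package scale

import (
	"context"

	"github.com/cockroachdb/errors"
)

// Scaler and Pruner are stand-ins for the operator's real scale/prune types.
type Scaler interface{ EnsureScale(ctx context.Context) error }
type Pruner interface{ Prune(ctx context.Context) error }

// scaleThenPrune runs the downscale (which performs the node decommission)
// first and prunes unused PVCs only if it succeeded, so a failed
// decommission never deletes a volume that may still be needed for recovery.
func scaleThenPrune(ctx context.Context, s Scaler, p Pruner) error {
	if err := s.EnsureScale(ctx); err != nil {
		return errors.Wrap(err, "ensure scale")
	}
	return errors.Wrap(p.Prune(ctx), "prune unused PVCs")
}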

keith-mcclellan commented Jun 1, 2021

Decommission should run as follows functionally (a rough sketch follows this list):

  1. Validate that the node count after decommission is still >= 3 (node decommissions CAN be run in parallel).
  2. Run cockroach node decommission.
  3. If it fails, annotate the CR to stop further runs without user input, log an error, and reset the node count to the original amount. (Optionally, we should roll back the decommission with a recommission command.)
  4. If successful, wait 60 seconds and run cockroach node status --decommission (or optionally cockroach node status --all) to validate that the node is decommissioned and the database is ready for it to exit the cluster.
  5. If cockroach node status --decommission does not show the node as decommissioned, do the same as step 3.
  6. Stop the decommissioned pod gracefully (pre-stop hook etc., the same as a rolling restart).
  7. Run the health checker to validate that we have 0 under-replicated ranges, using the same pattern as a rolling restart.
  8. Delete the SS (StatefulSet) and the PVC.
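
A rough Go sketch of that flow; every field on the steps struct is a hypothetical stand-in for logic that already lives elsewhere in the operator (drainer, health checker, StatefulSet/PVC handling), not the real API:

package decommission

import (
	"context"
	"fmt"
	"time"
)

// steps bundles the operations the flow needs; all fields are illustrative.
type steps struct {
	Decommission         func(ctx context.Context, nodeID int) error
	IsDecommissioned     func(ctx context.Context, nodeID int) (bool, error)
	AnnotatePause        func(ctx context.Context, reason error)
	RestoreNodeCount     func(ctx context.Context, n int32)
	StopPodGracefully    func(ctx context.Context, nodeID int) error
	UnderReplicated      func(ctx context.Context) (int, error)
	DeleteStatefulSetPVC func(ctx context.Context, nodeID int) error
}

func (s steps) run(ctx context.Context, desired, current int32, nodeID int) error {
	// Step 1: never scale below a 3-node cluster.
	if desired < 3 {
		return fmt.Errorf("refusing to scale below 3 nodes (requested %d)", desired)
	}
	// Steps 2-3: run the decommission; on failure, pause further runs and
	// restore the original node count.
	if err := s.Decommission(ctx, nodeID); err != nil {
		s.AnnotatePause(ctx, err)
		s.RestoreNodeCount(ctx, current)
		return err
	}
	// Steps 4-5: wait, then confirm via `cockroach node status --decommission`;
	// if the node is still not decommissioned, handle it like step 3.
	time.Sleep(60 * time.Second)
	ok, err := s.IsDecommissioned(ctx, nodeID)
	if err != nil || !ok {
		s.AnnotatePause(ctx, err)
		s.RestoreNodeCount(ctx, current)
		return fmt.Errorf("node %d did not reach decommissioned state", nodeID)
	}
	// Steps 6-7: stop the pod gracefully, then verify 0 under-replicated ranges.
	if err := s.StopPodGracefully(ctx, nodeID); err != nil {
		return err
	}
	if n, err := s.UnderReplicated(ctx); err != nil || n != 0 {
		return fmt.Errorf("under-replicated ranges remain: %d (err: %v)", n, err)
	}
	// Step 8: only now remove the StatefulSet replica's PVC.
	return s.DeleteStatefulSetPVC(ctx, nodeID)
}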

Decommission tests:

Positive case 1 -
After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as decommissioned.

Positive case 2 -
Annotate the CR and validate that the decommission actor doesn't run on a CR update

Negative case 1 -
While decommission is running, stop the pod being decommissioned. This should cause decommission to fail. We should verify that the rest of the decommission process doesn't proceed and that the annotation is set.

Negative case 2 -
Change the node count after the annotation is set; the operator should throw an error.

@udnay


udnay commented Jun 1, 2021

> Positive case 1 -
> After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as decommissioned.

Is the command run on the node itself?

> Positive case 2 -
> Annotate the CR and validate that the decommission actor doesn't run on a CR update

This should be a unit test; I think we just need to test the actor's handle command.

> Negative case 2 -
> Change node count after annotation is set, operator should throw an error

This should also be a unit test.
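
A minimal table-driven unit test sketch for that gating check; the annotation key and the shouldRunDecommission helper are hypothetical stand-ins, not the operator's actual code:

package actor

import "testing"

const pauseAnnotation = "crdb.io/decommission-paused" // hypothetical key

// shouldRunDecommission stands in for the actor's real gating logic:
// skip the decommission actor whenever the pause annotation is present.
func shouldRunDecommission(annotations map[string]string, current, desired int32) bool {
	if _, paused := annotations[pauseAnnotation]; paused {
		return false
	}
	return desired < current
}

func TestDecommissionActorGating(t *testing.T) {
	tests := []struct {
		name        string
		annotations map[string]string
		current     int32
		desired     int32
		want        bool
	}{
		{"scale down triggers the actor", nil, 4, 3, true},
		{"paused CR skips the actor", map[string]string{pauseAnnotation: "true"}, 4, 3, false},
		{"node count change while paused is still skipped", map[string]string{pauseAnnotation: "true"}, 4, 2, false},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := shouldRunDecommission(tc.annotations, tc.current, tc.desired); got != tc.want {
				t.Fatalf("shouldRunDecommission() = %v, want %v", got, tc.want)
			}
		})
	}
}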

@keith-mcclellan (Contributor, Author) commented:

> Positive case 1 -
> After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as decommissioned.

> Is the command run on the node itself?

You can, but it's got an interface so you can run it from anywhere that you have the database binary.

keith-mcclellan commented Jun 1, 2021

cockroach node status --decommission --certs-dir=certs --host=<address of any live node>

 id |        address         |  build  |            started_at            |            updated_at            | is_available | is_live | gossiped_replicas | is_decommissioning | is_draining  
+---+------------------------+---------+----------------------------------+----------------------------------+--------------+---------+-------------------+--------------------+-------------+
  1 | 165.227.60.76:26257    | 91a299d | 2018-10-01 16:53:10.946245+00:00 | 2018-10-02 14:04:39.280249+00:00 |         true |  true   |                26 |       false        |    false     
  2 | 192.241.239.201:26257  | 91a299d | 2018-10-01 16:53:24.22346+00:00  | 2018-10-02 14:04:39.415235+00:00 |         true |  true   |                26 |       false        |    false     
  3 | 67.207.91.36:26257     | 91a299d | 2018-10-01 17:34:21.041926+00:00 | 2018-10-02 14:04:39.233882+00:00 |         true |  true   |                25 |       false        |    false     
  4 | 138.197.12.74:26257    | 91a299d | 2018-10-01 17:09:11.734093+00:00 | 2018-10-02 14:04:37.558204+00:00 |         true |  true   |                25 |       false        |    false     
  5 | 174.138.50.192:26257   | 91a299d | 2018-10-01 17:14:01.480725+00:00 | 2018-10-02 14:04:39.293121+00:00 |         true |  true   |                 0 |        true        |    false   

This is an example of a decommissioned node that is ready to be stopped: is_decommissioning is true and gossiped_replicas = 0 means it's done. We can then gracefully stop the pod.
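
For illustration, a sketch of checking that condition programmatically by shelling out to the cockroach binary; the column names are taken from the output above, and the binary, certs directory, and host are assumed to be reachable from wherever this runs:

package main

import (
	"encoding/csv"
	"fmt"
	"os/exec"
	"strings"
)

// nodeReadyToStop runs `cockroach node status --decommission` in CSV format
// and reports whether the given node shows is_decommissioning=true and
// gossiped_replicas=0, per the interpretation above.
func nodeReadyToStop(host, certsDir, nodeID string) (bool, error) {
	out, err := exec.Command("cockroach", "node", "status",
		"--decommission", "--format=csv",
		"--certs-dir="+certsDir, "--host="+host).Output()
	if err != nil {
		return false, fmt.Errorf("node status: %w", err)
	}
	rows, err := csv.NewReader(strings.NewReader(string(out))).ReadAll()
	if err != nil || len(rows) < 2 {
		return false, fmt.Errorf("parsing node status output: %w", err)
	}
	// Map column names to indexes so the check survives column reordering.
	col := map[string]int{}
	for i, name := range rows[0] {
		col[name] = i
	}
	for _, r := range rows[1:] {
		if r[col["id"]] != nodeID {
			continue
		}
		return r[col["is_decommissioning"]] == "true" &&
			r[col["gossiped_replicas"]] == "0", nil
	}
	return false, fmt.Errorf("node %s not found in status output", nodeID)
}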

@chrislovecnm (Contributor) commented:

@udnay Please document what the correct workflow is for decommission. We are getting differing opinions.

@chrisseto (Contributor) commented:

Weighing in on behalf of @udnay and at the request of @alinadonisa.

The logic in EnsureScale is the same logic that we use in Cockroach Cloud, which has been pretty well battle tested at this point. The one notable difference is that we remove Kubernetes nodes in the CC version, but that doesn't affect the core logic.

From what I can tell, that logic simply isn't running to completion or isn't running at all. Does anyone have the logs of a failed decommission available? The decommissioner is very verbose; it should be pretty easy to tell where something is going wrong based on the logs.

The PVC pruner will only remove the volumes of pods that are not currently running and have an ordinal less than the number of desired replicas. It sounds like something is changing the desired number of replicas outside of the call to EnsureScale.

It does everything that Keith has suggested aside from the under-replicated ranges check, but that could easily be plugged into the WaitUntilHealthy function.
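
For reference, one way such a check could look if plugged into WaitUntilHealthy, querying the per-store ranges.underreplicated metric over SQL; the exact query and driver are assumptions, not necessarily what the operator does today:

package main

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // any postgres-compatible driver works with CRDB
)

// underReplicatedRanges sums the per-store ranges.underreplicated metric
// exposed through crdb_internal; a non-zero result means the cluster is
// not yet safe to continue the scale-down.
func underReplicatedRanges(ctx context.Context, db *sql.DB) (int, error) {
	const q = `SELECT coalesce(sum((metrics->>'ranges.underreplicated')::DECIMAL), 0)::INT
	           FROM crdb_internal.kv_store_status`
	var n int
	if err := db.QueryRowContext(ctx, q).Scan(&n); err != nil {
		return 0, fmt.Errorf("querying under-replicated ranges: %w", err)
	}
	return n, nil
}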


udnay commented Jun 1, 2021

Looking at the logs attached I see

{"level":"error","ts":1622218735.5067515,"logger":"action","msg":"decomission failed","action":"decommission","CrdbCluster":"default/crdb-tls-example","error":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1","errorVerbose":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1\n(1) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:208\n  | [...repeated from below...]\nWraps: (2) failed to start draining node 4\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.CockroachExecutor.Exec\n  | \tpkg/scale/executor.go:57\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:207\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).Decommission\n  | \tpkg/scale/drainer.go:86\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*Scaler).EnsureScale\n  | \tpkg/scale/scale.go:87\n  | github.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n  | \tpkg/actor/decommission.go:169\n  | github.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n  | \tpkg/controller/cluster_controller.go:130\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.UntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\n  | runtime.goexit\n  | \tsrc/runtime/asm_amd64.s:1371\nWraps: (4) failed to stream execution results back\nWraps: (5) command terminated with exit code 1\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) 
exec.CodeExitError","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngh.neting.cc/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n\tpkg/actor/decommission.go:171\ngh.neting.cc/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1622218735.5077403,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218735.5077748,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5079463,"logger":"action","msg":"no version changes needed","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5079763,"logger":"controller.CrdbCluster","msg":"Running action with index: 4 and  name: ResizePVC","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5095117,"logger":"action","msg":"Skipping PVC resize as sizes match","action":"resize_pvc","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5095623,"logger":"controller.CrdbCluster","msg":"Running action with index: 5 and  name: Deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218735.5095706,"logger":"action","msg":"reconciling resources on deploy action","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
26257

I see a failure in the decommission code, maybe due to port forwarding or something because the operator isn't running in-cluster: "failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1".

Then the deploy actor runs and I see

{"level":"info","ts":1622218735.5339012,"logger":"action","msg":"created/updated statefulset, stopping request processing","action":"deploy","CrdbCluster":"default/crdb-tls-example"}

which seems to be Lines 116-134.

What does the reconciler do? Could it be shutting down the extra pod after the decommission failed?


udnay commented Jun 1, 2021

After this, the logs show decommission failing because not all replicas are up. I believe @chrisseto is probably correct, or at least onto something.

@chrisseto (Contributor) commented:

Seems like the error handling for failed decommissioning is busted? Though I'm not sure why the decommission command would fail...

keith-mcclellan commented Jun 1, 2021

I'm not questioning that the CC drainer works properly; I'm questioning whether we implemented it properly. Something is stopping the pod before the decommission is complete. See the logs:

{"level":"warn","ts":1622218688.9847136,"logger":"action","msg":"reconciling resources on deploy action","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
26257
{"level":"info","ts":1622218689.0102832,"logger":"action","msg":"deployed database","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.010324,"logger":"controller.CrdbCluster","msg":"Running action with index: 7 and  name: ClusterRestart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0103347,"logger":"action","msg":"starting cluster restart action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.010339,"logger":"action","msg":"No restart cluster action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.015668,"logger":"controller.CrdbCluster","msg":"reconciliation completed","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0157423,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0158072,"logger":"controller.CrdbCluster","msg":"Running action with index: 0 and  name: Decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.015832,"logger":"action","msg":"check decommission oportunities","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0158622,"logger":"action","msg":"replicas decommisioning","action":"decommission","CrdbCluster":"default/crdb-tls-example","status.CurrentReplicas":4,"expected":4}
{"level":"info","ts":1622218689.0158727,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0158768,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0158906,"logger":"action","msg":"no version changes needed","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0159137,"logger":"controller.CrdbCluster","msg":"Running action with index: 4 and  name: ResizePVC","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0167642,"logger":"action","msg":"Skipping PVC resize as sizes match","action":"resize_pvc","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0167942,"logger":"controller.CrdbCluster","msg":"Running action with index: 5 and  name: Deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0168018,"logger":"action","msg":"reconciling resources on deploy action","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
26257
{"level":"info","ts":1622218689.0354652,"logger":"action","msg":"deployed database","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0355067,"logger":"controller.CrdbCluster","msg":"Running action with index: 7 and  name: ClusterRestart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0355196,"logger":"action","msg":"starting cluster restart action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0355241,"logger":"action","msg":"No restart cluster action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0407238,"logger":"controller.CrdbCluster","msg":"reconciliation completed","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.8642936,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.8644145,"logger":"controller.CrdbCluster","msg":"Running action with index: 0 and  name: Decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218733.8644257,"logger":"action","msg":"check decommission oportunities","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.8644555,"logger":"action","msg":"replicas decommisioning","action":"decommission","CrdbCluster":"default/crdb-tls-example","status.CurrentReplicas":4,"expected":3}
{"level":"warn","ts":1622218733.8682542,"logger":"action","msg":"operator is running inside of kubernetes, connecting to service for db connection","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218733.892141,"logger":"action","msg":"opened db connection","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.9547343,"logger":"action","msg":"established statefulset watch","action":"decommission","name":"crdb-tls-example","namespace":"default"}
{"level":"warn","ts":1622218733.9649646,"logger":"action","msg":"scaling down stateful set","action":"decommission","have":4,"want":3}
{"level":"info","ts":1622218734.250925,"logger":"action","msg":"draining node","action":"decommission","NodeID":4}
{"level":"error","ts":1622218735.5067515,"logger":"action","msg":"decomission failed","action":"decommission","CrdbCluster":"default/crdb-tls-example","error":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1","errorVerbose":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1\n(1) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:208\n  | [...repeated from below...]\nWraps: (2) failed to start draining node 4\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.CockroachExecutor.Exec\n  | \tpkg/scale/executor.go:57\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:207\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).Decommission\n  | \tpkg/scale/drainer.go:86\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*Scaler).EnsureScale\n  | \tpkg/scale/scale.go:87\n  | github.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n  | \tpkg/actor/decommission.go:169\n  | github.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n  | \tpkg/controller/cluster_controller.go:130\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.UntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\n  | runtime.goexit\n  | \tsrc/runtime/asm_amd64.s:1371\nWraps: (4) failed to stream execution results back\nWraps: (5) command terminated with exit code 1\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) 
exec.CodeExitError","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngh.neting.cc/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n\tpkg/actor/decommission.go:171\ngh.neting.cc/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1622218735.5077403,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218735.5077748,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}

I think the problem is the opposite of what @chrisseto describes: another actor is running, and that actor is stopping the pod because it sees nodes = 3 in the CR instead of nodes = 4 and thinks it should shut it down. But because decommission is in flight, the decommission fails and never gets started back up, which is why the node shows up as failed.

If I'm reading this right, it's this that's stopping the pod that we're waiting to decommission:

{"level":"info","ts":1622218689.0158727,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0158768,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}

Decommission should be a blocking operation, i.e. the operator should not do any other work until the decommission is complete. And if the decommission fails, we shouldn't allow the PVC pruner to run.
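
A hedged sketch of what "blocking" could look like in the reconcile loop; the actor interface, action name, and in-progress flag here are illustrative, not the operator's real types:

package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// actor is a stand-in for the operator's action abstraction.
type actor interface {
	Name() string
	Act(ctx context.Context) error
}

// reconcileActors skips every actor except Decommission while a
// decommission is in flight, so PartialUpdate, Deploy, PVC pruning, etc.
// cannot touch the pod until the decommission finishes or is rolled back.
func reconcileActors(ctx context.Context, actors []actor, decommissionInProgress bool) (ctrl.Result, error) {
	for _, a := range actors {
		if decommissionInProgress && a.Name() != "Decommission" {
			continue
		}
		if err := a.Act(ctx); err != nil {
			return ctrl.Result{}, err
		}
	}
	if decommissionInProgress {
		// Requeue so the decommission is re-checked promptly.
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, nil
}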

@chrislovecnm (Contributor) commented:

@udnay this needs manual testing, but I'm removing the release-blocker label.

sabuhigr commented Oct 5, 2022

Any updates on this? I'm still facing the same issue in v22.1.2.

udnay commented Aug 9, 2023

@prafull01 @himanshu-cockroach can one of you take a look?

@prafull01 (Collaborator) commented:

I will take a look

@prafull01 (Collaborator) commented:

I have tried this on the latest version and I am not able to reproduce the issue:

The UI shows 3 live nodes and 1 decommissioned node.
Screenshot: Screenshot 2023-08-14 at 2 49 14 PM

I can also see the additional PVC on the cluster:
Screenshot: Screenshot 2023-08-14 at 2 51 52 PM

I have tested this on cockroach operator version v2.11.0.
