[Regression] Decommission is broken #542

keith-mcclellan · 2021-05-28T16:24:56Z

The statefulset is stopping the pod before cockroach node decommission is being executed, so the node is showing up as failed instead of as decommissioned.

The PVC also gets deleted so there is no way to recover from this state if the scale down caused a loss of quorum because the data gets destroyed.

Steps to reproduce:
create 4 node cluster
change node count to 3 nodes

decommission-regression.log

The text was updated successfully, but these errors were encountered:

keith-mcclellan · 2021-05-28T16:45:04Z

the PVC gets deleted as well so there is no way to recover from this state.

alinadonisa · 2021-05-31T18:41:49Z

the PVC gets deleted as well so there is no way to recover from this state.

@keith-mcclellan so we keep the PVC, and we delete them only if the decommission command it is successfull?
@chrisseto I moved the Prune after after the downscale code. Now the reason that the decommission gave an error is still to be determined.

// Before doing any scaling, prune any PVCs that are not currently in use.
	// This only needs to be done when scaling up but the operation is a noop
	// if there are no PVCs not currently in use.
	// As of v20.2.0, CRDB nodes may not be recommissioned. To account for
	// this, PVCs must be removed (pruned) before scaling up to avoid reusing a
	// previously decommissioned node.
	// Prune MUST be called before scaling as older clusters may have dangling
	// PVCs.
	// All underlying PVs and the storageclasses they were created with should
	// make use of reclaim policy = delete. A reclaim policy of retain is fine
	// but will result in wasted money, recycle should be considered unsafe and
	// is officially deprecated by kubernetes.
	if err := s.PVCPruner.Prune(ctx); err != nil {
		return errors.Wrap(err, "initial PVC pruning")
	}

keith-mcclellan · 2021-06-01T15:47:54Z

Decommission should run as follows functionally:

Validate that the node count after decommission is still >= 3 (node decommissions CAN be run in parallel)
Run cockroach node decommission
If fails, annotate the CR to stop further runs without user input, log an error, and reset the node count to the original amount
*** optionally, we should roll back the decommission with a recommission command ***
If successful, wait 60 seconds and run cockroach node status --decommission (or optionally cockroach node status --all) to validate that the node is decommissioned and the database is ready for it to exit the cluster
If cockroach node status --decommission does not show the node as decommissioned, do same as Enable basic configuration of CRDB pods #3
Stop the decommissioned pod gracefully (pre-stop hook et al same as a rolling restart)
Run health checker to validate that we have 0 under-replicated ranges using the same pattern as a rolling restart
Delete the SS and PVC

Decommission tests:

Positive case 1 -
After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as decommissioned.

Positive case 2 -
Annotate the CR and validate that the decommission actor doesn't run on a CR update

Negative case 1 -
While decommission is running, stop the pod being decommissioned. This should cause decommission to fail. We should verify that the rest of the decommission process doesn't proceed and that the annotation is set.

Negative case 2 -
Change node count after annotation is set, operator should throw an error

@udnay

keith-mcclellan · 2021-06-01T15:49:01Z

ref: https://www.cockroachlabs.com/docs/v21.1/cockroach-node.html
ref 2: https://www.cockroachlabs.com/docs/v21.1/cockroach-node.html#flags
ref 3: https://www.cockroachlabs.com/docs/v21.1/remove-nodes.html

udnay · 2021-06-01T16:04:58Z

Positive case 1 -
After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as decommissioned.

Is the command run on the node itself?

Positive case 2 -
Annotate the CR and validate that the decommission actor doesn't run on a CR update

This should be a unit-test, I think we just need to test the handle command for the actor.

Negative case 2 -
Change node count after annotation is set, operator should throw an error

This should also be a unit-test

keith-mcclellan · 2021-06-01T16:47:52Z

Positive case 1 -
After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as decommissioned.

Is the command run on the node itself?

You can, but it's got an interface so you can run it from anywhere that you have the database binary.

keith-mcclellan · 2021-06-01T17:49:11Z

cockroach node status --decommission --certs-dir=certs --host=<address of any live node>

 id |        address         |  build  |            started_at            |            updated_at            | is_available | is_live | gossiped_replicas | is_decommissioning | is_draining  
+---+------------------------+---------+----------------------------------+----------------------------------+--------------+---------+-------------------+--------------------+-------------+
  1 | 165.227.60.76:26257    | 91a299d | 2018-10-01 16:53:10.946245+00:00 | 2018-10-02 14:04:39.280249+00:00 |         true |  true   |                26 |       false        |    false     
  2 | 192.241.239.201:26257  | 91a299d | 2018-10-01 16:53:24.22346+00:00  | 2018-10-02 14:04:39.415235+00:00 |         true |  true   |                26 |       false        |    false     
  3 | 67.207.91.36:26257     | 91a299d | 2018-10-01 17:34:21.041926+00:00 | 2018-10-02 14:04:39.233882+00:00 |         true |  true   |                25 |       false        |    false     
  4 | 138.197.12.74:26257    | 91a299d | 2018-10-01 17:09:11.734093+00:00 | 2018-10-02 14:04:37.558204+00:00 |         true |  true   |                25 |       false        |    false     
  5 | 174.138.50.192:26257   | 91a299d | 2018-10-01 17:14:01.480725+00:00 | 2018-10-02 14:04:39.293121+00:00 |         true |  true   |                 0 |        true        |    false

This is an example of a decommissioned node that is ready to be stopped - is_decommissioning is true and gossiped_replicas = 0 means its done. We can then gracefully stop the pod.

chrislovecnm · 2021-06-01T18:15:16Z

@udnay Please document what the correct workflow is for decommision. We are getting differing opinions.

chrisseto · 2021-06-01T19:18:40Z

Weighing in on behalf of @udnay and at the request of @alinadonisa.

The logic in EnsureScale is the same logic that we use in cockroach cloud, which has been pretty well battled tested at this point. The one notable difference is that we remove Kubernetes nodes in the CC version, but that doesn't affect to core logic.

From what I can tell, that logic simply isn't running to completion or isn't running at all. Does anyone have the logs available of a failed decommission? The decommissioner is very verbose, it should be pretty easy to tell where something is going wrong based on the logs.

The PVC pruner will only remove the volumes of pods that are not currently running and have an ordinal less than the number of desired replicas. It sounds like something is changing the desired number of replicas out side of the call to EnsureScale.

It does everything that Keith has suggested sans the under replicated system check but that could be easily plugged into WaitUntilHealthy function.

udnay · 2021-06-01T20:03:35Z

Looking at the logs attached I see

{"level":"error","ts":1622218735.5067515,"logger":"action","msg":"decomission failed","action":"decommission","CrdbCluster":"default/crdb-tls-example","error":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1","errorVerbose":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1\n(1) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:208\n  | [...repeated from below...]\nWraps: (2) failed to start draining node 4\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.CockroachExecutor.Exec\n  | \tpkg/scale/executor.go:57\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:207\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).Decommission\n  | \tpkg/scale/drainer.go:86\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*Scaler).EnsureScale\n  | \tpkg/scale/scale.go:87\n  | github.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n  | \tpkg/actor/decommission.go:169\n  | github.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n  | \tpkg/controller/cluster_controller.go:130\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.UntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\n  | runtime.goexit\n  | \tsrc/runtime/asm_amd64.s:1371\nWraps: (4) failed to stream execution results back\nWraps: (5) command terminated with exit code 1\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) exec.CodeExitError","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngh.neting.cc/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n\tpkg/actor/decommission.go:171\ngh.neting.cc/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1622218735.5077403,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218735.5077748,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5079463,"logger":"action","msg":"no version changes needed","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5079763,"logger":"controller.CrdbCluster","msg":"Running action with index: 4 and  name: ResizePVC","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5095117,"logger":"action","msg":"Skipping PVC resize as sizes match","action":"resize_pvc","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5095623,"logger":"controller.CrdbCluster","msg":"Running action with index: 5 and  name: Deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218735.5095706,"logger":"action","msg":"reconciling resources on deploy action","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
26257

I see a failure in the decomission code due to maybe port forwarding or something because the operator isn't running in cluster ailed to start draining node 4: failed to stream execution results back: command terminated with exit code 1",

Then we start to run the deploy actor runs and I see
{"level":"info","ts":1622218735.5339012,"logger":"action","msg":"created/updated statefulset, stopping request processing","action":"deploy","CrdbCluster":"default/crdb-tls-example"} Which seems to be Lines 116-134

What does reconciler do? Can it be shutting down the extra pod after decommsion failed?

udnay · 2021-06-01T20:05:02Z

After this the logs show decommission failing because not all replicas are up, I believe @chrisseto is probably correct or at least onto something.

chrisseto · 2021-06-01T20:36:37Z

Seems like the error handling for failed decommissioning is busted? Though I'm not sure why the decommission command would fail...

keith-mcclellan · 2021-06-01T21:05:19Z

I'm not questioning that the cc drainer works properly, I'm questioning whether we implemented it properly. Something is stopping the pod before the decommission is complete...see

{"level":"warn","ts":1622218688.9847136,"logger":"action","msg":"reconciling resources on deploy action","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
26257
{"level":"info","ts":1622218689.0102832,"logger":"action","msg":"deployed database","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.010324,"logger":"controller.CrdbCluster","msg":"Running action with index: 7 and  name: ClusterRestart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0103347,"logger":"action","msg":"starting cluster restart action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.010339,"logger":"action","msg":"No restart cluster action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.015668,"logger":"controller.CrdbCluster","msg":"reconciliation completed","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0157423,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0158072,"logger":"controller.CrdbCluster","msg":"Running action with index: 0 and  name: Decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.015832,"logger":"action","msg":"check decommission oportunities","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0158622,"logger":"action","msg":"replicas decommisioning","action":"decommission","CrdbCluster":"default/crdb-tls-example","status.CurrentReplicas":4,"expected":4}
{"level":"info","ts":1622218689.0158727,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0158768,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0158906,"logger":"action","msg":"no version changes needed","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0159137,"logger":"controller.CrdbCluster","msg":"Running action with index: 4 and  name: ResizePVC","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0167642,"logger":"action","msg":"Skipping PVC resize as sizes match","action":"resize_pvc","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0167942,"logger":"controller.CrdbCluster","msg":"Running action with index: 5 and  name: Deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0168018,"logger":"action","msg":"reconciling resources on deploy action","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
26257
{"level":"info","ts":1622218689.0354652,"logger":"action","msg":"deployed database","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0355067,"logger":"controller.CrdbCluster","msg":"Running action with index: 7 and  name: ClusterRestart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0355196,"logger":"action","msg":"starting cluster restart action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0355241,"logger":"action","msg":"No restart cluster action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0407238,"logger":"controller.CrdbCluster","msg":"reconciliation completed","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.8642936,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.8644145,"logger":"controller.CrdbCluster","msg":"Running action with index: 0 and  name: Decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218733.8644257,"logger":"action","msg":"check decommission oportunities","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.8644555,"logger":"action","msg":"replicas decommisioning","action":"decommission","CrdbCluster":"default/crdb-tls-example","status.CurrentReplicas":4,"expected":3}
{"level":"warn","ts":1622218733.8682542,"logger":"action","msg":"operator is running inside of kubernetes, connecting to service for db connection","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218733.892141,"logger":"action","msg":"opened db connection","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.9547343,"logger":"action","msg":"established statefulset watch","action":"decommission","name":"crdb-tls-example","namespace":"default"}
{"level":"warn","ts":1622218733.9649646,"logger":"action","msg":"scaling down stateful set","action":"decommission","have":4,"want":3}
{"level":"info","ts":1622218734.250925,"logger":"action","msg":"draining node","action":"decommission","NodeID":4}
{"level":"error","ts":1622218735.5067515,"logger":"action","msg":"decomission failed","action":"decommission","CrdbCluster":"default/crdb-tls-example","error":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1","errorVerbose":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1\n(1) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:208\n  | [...repeated from below...]\nWraps: (2) failed to start draining node 4\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.CockroachExecutor.Exec\n  | \tpkg/scale/executor.go:57\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:207\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).Decommission\n  | \tpkg/scale/drainer.go:86\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*Scaler).EnsureScale\n  | \tpkg/scale/scale.go:87\n  | github.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n  | \tpkg/actor/decommission.go:169\n  | github.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n  | \tpkg/controller/cluster_controller.go:130\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.UntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\n  | runtime.goexit\n  | \tsrc/runtime/asm_amd64.s:1371\nWraps: (4) failed to stream execution results back\nWraps: (5) command terminated with exit code 1\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) exec.CodeExitError","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngh.neting.cc/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n\tpkg/actor/decommission.go:171\ngh.neting.cc/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1622218735.5077403,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218735.5077748,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}

I think the problem is the opposite to @chrisseto - I think another actor is running and that other actor is stopping the pod because it sees that nodes = 3 in the CR instead of nodes =4 and thinks it should shut it down. But because decommission is running, the decommission fails and the decommission never gets started back up. Which is then having the node showing up as failed.

If I'm reading this right, it's this that's stopping the pod that we're waiting to decommission:

{"level":"info","ts":1622218689.0158727,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0158768,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}

Decommission should be a blocking operation - ie the operator should not to any other work until the decommission is complete. And if the decommission fails we shouldn't allow the PVC pruner to run.

chrislovecnm · 2021-06-17T15:40:45Z

@udnay this needs manual testing, but removing release blocker

sabuhigr · 2022-10-05T22:34:46Z

Any updates about it? I'm still faced the same issue in v22.1.2

udnay · 2023-08-09T13:35:27Z

@prafull01 @himanshu-cockroach can one of you take a look?

prafull01 · 2023-08-09T16:51:57Z

I will take a look

prafull01 · 2023-08-14T09:56:11Z

I have tried this and on the latest version I am not able to reproduce this issue:

When you see the UI it shows 3 Live nodes and 1 Decommissioned node.

I can see the additional PVC as well on the cluster:

I have tested this on cockroach operator version v2.11.0.

keith-mcclellan added bug Something isn't working release-blocker labels May 28, 2021

alinadonisa self-assigned this May 31, 2021

This was referenced May 31, 2021

Prune after decommission of the node #553

Closed

Upgrading using cockroachDBVersion field and image in format sha256 on RELATED_IMAGE fails #552

Merged

chrislovecnm removed the release-blocker label Jun 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Regression] Decommission is broken #542

[Regression] Decommission is broken #542

keith-mcclellan commented May 28, 2021 •

edited

Loading

keith-mcclellan commented May 28, 2021

alinadonisa commented May 31, 2021

keith-mcclellan commented Jun 1, 2021 •

edited

Loading

keith-mcclellan commented Jun 1, 2021

udnay commented Jun 1, 2021

keith-mcclellan commented Jun 1, 2021

keith-mcclellan commented Jun 1, 2021 •

edited

Loading

chrislovecnm commented Jun 1, 2021

chrisseto commented Jun 1, 2021

udnay commented Jun 1, 2021

udnay commented Jun 1, 2021

chrisseto commented Jun 1, 2021

keith-mcclellan commented Jun 1, 2021 •

edited

Loading

chrislovecnm commented Jun 17, 2021

sabuhigr commented Oct 5, 2022

udnay commented Aug 9, 2023

prafull01 commented Aug 9, 2023

prafull01 commented Aug 14, 2023

[Regression] Decommission is broken #542

[Regression] Decommission is broken #542

Comments

keith-mcclellan commented May 28, 2021 • edited Loading

keith-mcclellan commented May 28, 2021

alinadonisa commented May 31, 2021

keith-mcclellan commented Jun 1, 2021 • edited Loading

keith-mcclellan commented Jun 1, 2021

udnay commented Jun 1, 2021

keith-mcclellan commented Jun 1, 2021

keith-mcclellan commented Jun 1, 2021 • edited Loading

chrislovecnm commented Jun 1, 2021

chrisseto commented Jun 1, 2021

udnay commented Jun 1, 2021

udnay commented Jun 1, 2021

chrisseto commented Jun 1, 2021

keith-mcclellan commented Jun 1, 2021 • edited Loading

chrislovecnm commented Jun 17, 2021

sabuhigr commented Oct 5, 2022

udnay commented Aug 9, 2023

prafull01 commented Aug 9, 2023

prafull01 commented Aug 14, 2023

keith-mcclellan commented May 28, 2021 •

edited

Loading

keith-mcclellan commented Jun 1, 2021 •

edited

Loading

keith-mcclellan commented Jun 1, 2021 •

edited

Loading

keith-mcclellan commented Jun 1, 2021 •

edited

Loading