Stop automating cleanup operations by default when scaling #496

Closed
adejanovski opened this issue Feb 15, 2023 · 4 comments · Fixed by #502

@adejanovski (Contributor)

Currently, cass-operator triggers a cleanup CassandraTask after each scale operation (by that I mean after reaching the desired number of replicas, not after each node addition).

Cleanup is a non-trivial operation that can take a long time and that impacts both the performance and the available disk space of the nodes in a cluster.
The automation also affects the CassandraDatacenter progress: the datacenter is only considered ready once the cleanup operation has finished (which can take hours or even days), preventing any other scale operation (and possibly any update to the cassdc at all?).
This can catch ops off guard when they perform scale operations in multiple edits to the cassdc object (adding one replica, waiting for the expansion to complete, and adding one or more replicas again as soon as the first expansion is done).

I think we should stop running cleanups by default and make the automation an opt-in setting.
By default, cleanups should be performed through CassandraTasks created by users/operators themselves, which also allows fine-tuning the concurrency level of the operation (and possibly, if we add that option, the number of compactors it uses).
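
For reference, a user-created cleanup can be expressed as a plain CassandraTask. A minimal sketch (task name, datacenter name and namespace are placeholders):

apiVersion: control.k8ssandra.io/v1alpha1
kind: CassandraTask
metadata:
  name: cleanup-dc1            # placeholder task name
  namespace: cass-operator     # placeholder namespace
spec:
  datacenter:
    name: dc1                  # the CassandraDatacenter to clean up
    namespace: cass-operator
  jobs:
    - name: cleanup-dc1-job
      command: cleanup         # run cleanup across the datacenter's nodes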

@burmanm (Contributor) commented Feb 15, 2023

Just some small comments here. First of all, this has been discussed earlier and it was decided to keep the automatic cleanup as the default, since the cleanup needs to be done anyway, and it boils down to this:

The automation behind this also impacts the CassandraDatacenter progress which will be ready only once the cleanup operation is finished (which can take hours or even days), preventing any other scale operation (and possibly any update to the cassdc at all?).

This is the correct behavior. One should not rescale before the cleanup has finished. Scaling up is a non-trivial operation that takes time, and it should be done correctly. Otherwise, if the cleanup is forgotten, the next complaint will be that compactions no longer work because the nodes ran out of disk space despite the increased node count.

If the ops don't know that a scale-up takes time, perhaps they also don't know that a cleanup needs to be done. We should strive to maximize the automation. Why is someone trying to scale up if they don't have the capability to do it? Why can't they perform the scale-up when the cluster is able to handle it?

If the cleanup doesn't succeed, then the scale-up hasn't succeeded. Why would that suddenly be the wrong status? I think it's exactly the correct status: the scale-up has failed, and the user should not run more operations that could mess up their cluster before handling the issue.

I did originally propose an opt-out possibility, but I was told it wasn't necessary (by the same party that this request to change the default for everyone now comes from).

As for this:

This can catch ops off guard as they perform scale operations in multiple edits to the cassdc object (adding one replica, waiting for the expansion to complete and adding one or more replicas again as soon as the first expansion is done).

Why on earth would they do that? cass-operator already takes care of this: just set the final number of nodes and it will scale up one node at a time.

@alexandrpaliy

Hi. Sorry for bringing up this old topic, but I believe it's more appropriate than creating a new one, since my questions are closely related.

  1. The OP's suggestion was to "Stop automating cleanup operations by default ...", and I disagree with him on this part. I acknowledge the counter-arguments mentioned and have nothing against them as long as we are talking about what should happen by default. But since some users (myself included) would like a way to prevent the automatic cleanup in some cases (regardless of whether that is the most correct approach), wouldn't it make sense to keep the auto-cleanup feature enabled by default but still make it optional? Or am I missing something and this feature already exists?

  2. Whether or not the feature becomes optional, is there currently any way to stop an already running cleanup? According to https://docs.k8ssandra.io/tasks/cluster-tasks/ , I don't see any mention of stopping/disabling an already existing cluster-level task. I haven't tested it myself [yet], but what happens if I:

  • simply delete an existing cassandratask via kubectl?
  • manually run nodetool stop -- CLEANUP on the node that is currently running a cleanup process (while the cassandratask is still present)?

  I assume that, for both cases, the answers are "unsupported feature" and "undefined behaviour"? :)

@burmanm (Contributor) commented Jul 10, 2024

For the first one, you can disable the creation of the automated cleanup by adding an annotation to the CassandraDatacenter: cassandra.datastax.com/no-cleanup: "true"
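
As a sketch, the annotation sits in the CassandraDatacenter metadata (the datacenter name and the rest of the spec are placeholders):

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1                                    # placeholder datacenter name
  annotations:
    cassandra.datastax.com/no-cleanup: "true"  # skip the automatic cleanup task after scaling
spec:
  # ... rest of the datacenter spec unchanged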

Right now, deleting a CassandraTask will not stop the currently targeted pod from completing the cleanup (or any other process), as there isn't necessarily a cancel mechanism inside Cassandra (for some processes there is, but not for all). It would, however, stop the task from targeting the next pod.
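
For illustration, removing the task itself is an ordinary resource deletion (task name and namespace here are placeholders):

kubectl delete cassandratask cleanup-dc1 -n cass-operator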

Implementing a stop/cancel of the running pod should be done as a separate ticket so it can be scoped out where it is possible and where it isn't. It would also require changes in the management-api to provide such endpoints.

@alexandrpaliy

Thanks for such a fast and detailed reply.

Right now, deleting a CassandraTask will not stop the current targeted pod from completing the cleanup (or any other process) as there's not necessarily any cancel process inside Cassandra (for some processes there are, but not all). But it would stop targeting the next pod.

I'd call it an expected behaviour and I'm definitely not asking you to change it. Thanks for the clarification.
