Stop automating cleanup operations by default when scaling #496
Just some small comments here. First of all, this has been discussed earlier and it was decided to keep it as the default, since it should be done, and it boils down to this:

This is the correct behavior. One should not rescale if the cleanup has not been finished. Scaling up is a non-trivial operation which takes time, and it should be done correctly. Otherwise, if this operation is forgotten, the next complaint will be that compactions no longer work because the nodes ran out of disk space despite the increased node count. If the ops do not know that scaling up takes time, perhaps they also don't know that cleanup needs to be done. We should strive to maximize the automation.

Why is someone trying to scale up if they don't have the capability of doing it? Why can't they make that scale up happen when the cluster is able to do it? If the cleanup doesn't succeed, then the scale up hasn't succeeded. Why is that suddenly a wrong status? I think that's exactly the correct status: the scale up has failed, and the user should not run more operations that mess up their cluster before handling the issue.

I did originally propose the opt-out possibility, but I was told it wasn't necessary (by the same party this request to change the default for everyone now comes from).

As for this:
Why on earth would they do that? cass-operator takes care of that: just set the final number of nodes and it will scale up one node at a time.
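For illustration, a minimal sketch of what that looks like, assuming a datacenter named dc1 and an abbreviated spec (only the relevant fields are shown; the version and names are placeholders):

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: "4.1.2"   # placeholder version
  # Bump size once to the final target; the operator adds the nodes one by one.
  size: 6                  # e.g. previously 3
```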
Hi. Sorry for bringing up this old topic, but I believe it's more appropriate than creating a new one since my questions are pretty much related.
For the first one, you can disable the creation of the automated cleanup by adding an annotation to the CassandraDatacenter (a sketch follows below).

Regarding stopping a running task: right now, deleting a CassandraTask will not stop the currently targeted pod from completing the cleanup (or any other process), as there isn't necessarily a cancel mechanism inside Cassandra (some processes have one, but not all). It would, however, stop the task from targeting the next pod. Implementing a stop/cancel of the running pod should be done as a separate ticket, so it can be scoped out where it is possible and where it isn't. It would also require changes in the management-api to provide such endpoints.
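As a hedged sketch of where such an annotation would go (the key below is a placeholder, not the actual cass-operator annotation name; check the operator documentation for the real key):

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
  annotations:
    # Placeholder key; the actual annotation name is not given in this thread.
    cassandra.datastax.com/example-disable-cleanup: "true"
spec:
  # ... rest of the datacenter spec unchanged
```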
Thanks for such a fast and detailed reply.
I'd call it an expected behaviour and I'm definitely not asking you to change it. Thanks for the clarification.
Currently cass-operator triggers a cleanup CassandraTask after each scale operation (by that, I mean after reaching the desired number of replicas, not after each node addition).
Cleanup is a non-trivial operation that can take a lot of time, and it impacts both the performance and the available disk space of the nodes in a cluster.
The automation behind this also affects the CassandraDatacenter progress: the datacenter is only considered ready once the cleanup operation has finished (which can take hours or even days), preventing any other scale operation (and possibly any update to the cassdc at all?).
This can catch ops off guard when they perform scale operations through multiple edits to the cassdc object (adding one replica, waiting for the expansion to complete, and adding one or more replicas again as soon as the first expansion is done).
I think we should stop running cleanups by default, and make the automation an opt-in setting.
Cleanups should, by default, be performed using CassandraTasks created by the users/operators themselves, which would also allow fine-tuning the level of concurrency of the operation (and possibly, if we add that capability, setting the number of compactors used by the operation). A sketch of such a task follows below.
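As a rough sketch of what such a manually created task could look like, assuming a datacenter named dc1 in the cass-operator namespace (the concurrencyPolicy field is an assumption about how concurrency might be tuned, not a confirmed part of this proposal):

```yaml
apiVersion: control.k8ssandra.io/v1alpha1
kind: CassandraTask
metadata:
  name: cleanup-after-scale-up
spec:
  datacenter:
    name: dc1
    namespace: cass-operator
  # Assumed knob for tuning how the task runs alongside other tasks.
  concurrencyPolicy: Forbid
  jobs:
    - name: cleanup-dc1
      command: cleanup
```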