Manage global configs using ScyllaOperatorConfig, add unsupported overrides and extract defaults to a common config #2081

tnozicka · 2024-08-16T14:55:35Z

Description of your changes:
This PR extends the ScyllaOperatorConfig controller to properly report its status and the images in use, and wires the rest of the controllers to consume that status. It also wires unsupported overrides for those images, which can be used for testing or as a temporary aid for disconnected installs. (Users are responsible for making sure the override points to the same image version when touching these settings.)
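
As a rough illustration of the override behavior described above, a minimal sketch might look like the following. It assumes an optional pointer field in the spirit of `UnsupportedBashToolsImageOverride`; the helper name `resolveImage` and the image references are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// resolveImage is a hypothetical helper: it returns the override when one is
// set and non-empty, and falls back to the built-in default otherwise.
func resolveImage(override *string, defaultImage string) string {
	if override != nil && len(*override) > 0 {
		return *override
	}
	return defaultImage
}

func main() {
	// Placeholder image references, not the operator's real defaults.
	defaultImage := "registry.example.com/bash-tools:1.0"
	mirrored := "mirror.internal.example/bash-tools:1.0"

	// No override set: the default is reported.
	fmt.Println(resolveImage(nil, defaultImage))
	// Override set (e.g. a mirror for a disconnected install): it wins.
	fmt.Println(resolveImage(&mirrored, defaultImage))
}
```

In this sketch the resolved value is what a controller would publish in its status, so other controllers only ever read the resolved image and never the raw override.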

It also extracts all image and version settings to a common place in assets/config/config.yaml.

The only place left to manually update versions is the documentation. Maybe in the future we can let it include real examples and sync the versions there automatically as well.

Which issue is resolved by this Pull Request:
Resolves #1589 #2071

Requires

Collect, snapshot and gracefully restore global objects in e2e test #2091

scylla-operator-bot · 2024-08-16T14:55:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [tnozicka]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tnozicka · 2024-08-30T06:32:52Z

@zimnx @rzetelskik rebased to pick up the prerequisites and ready, ptal

examples/eks/cluster.yaml

pkg/api/scylla/validation/scyllaoperatorconfig_test.go

pkg/api/scylla/v1alpha1/types_operatorconfig.go

rzetelskik · 2024-08-30T09:44:28Z

This place should probably also take the version from config:

scylla-operator/test/e2e/set/scyllacluster/scyllacluster_replace.go

Line 173 in 80c490c

scyllaVersion: "5.2.6",

We also have two (that I found) tests that are using the enterprise image:

1. scylla-operator/test/e2e/set/scyllacluster/scyllacluster_replace.go

   Line 178 in 80c490c

   scyllaVersion: "2023.1.0",

2. scylla-operator/test/e2e/set/scyllacluster/scyllamanager_object_storage.go

   Line 458 in 80c490c

   scyllaVersion: "2024.1.7",

The second one is somewhat specific because it needs a particular minor version, but did you consider adding a config field for a general enterprise version (I suppose in operatorTests)? Or is the general plan to have a separate suite for enterprise, in which case the ScyllaDBVersion field would be enough?

tnozicka · 2024-08-30T10:03:44Z

> "5.2.6"

Looks like that was being left out in the past quite often, I'll wire it and update

> enterprise

I think our goal should be to have a second version of the suite and run all the tests with enterprise (using a different / replaced config.yaml)
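
To make the idea concrete, here is a minimal sketch of tests deriving their version from a shared config value instead of a string literal. Every name in it (`operatorTestsConfig`, `replaceCase`, `replaceCases`) is a hypothetical stand-in, and the version strings are only illustrative values; the real config layout may differ.

```go
package main

import "fmt"

// operatorTestsConfig is a hypothetical, trimmed-down view of the test
// section of a shared config; the real field names may differ.
type operatorTestsConfig struct {
	ScyllaDBVersion string
}

// replaceCase is a hypothetical test-table entry.
type replaceCase struct {
	name          string
	scyllaVersion string
}

// replaceCases builds the test table from the config, so no version literal
// is hard-coded in the test itself.
func replaceCases(cfg operatorTestsConfig) []replaceCase {
	return []replaceCase{
		{name: "replace a node", scyllaVersion: cfg.ScyllaDBVersion},
	}
}

func main() {
	// Swapping the config value switches every derived test case at once.
	configs := []operatorTestsConfig{
		{ScyllaDBVersion: "6.1.1"},    // illustrative open-source version
		{ScyllaDBVersion: "2024.1.8"}, // illustrative enterprise version
	}
	for _, cfg := range configs {
		for _, tc := range replaceCases(cfg) {
			fmt.Printf("%s with ScyllaDB %s\n", tc.name, tc.scyllaVersion)
		}
	}
}
```

The point of the sketch is only that a replaced config propagates to all tests, which is what makes a second, enterprise run of the same suite cheap.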

zimnx · 2024-08-30T10:34:34Z

pkg/api/scylla/v1alpha1/types_operatorconfig.go

+	// for auxiliary purposes.
+	// Setting this field renders your cluster unsupported. Use at your own risk.
+	// +optional
+	UnsupportedBashToolsImageOverride *string `json:"unsupportedBashToolsImageOverride,omitempty"`


The Unsupported prefix is confusing. In the past, users of fields prefixed with it (unsupportedOptions in NodeConfig mounts) didn't understand whether the field works or what it does, because of this prefix. I think it's a mistake to continue to use it.

I think it's enough to make it explicit in the field description that using it should be restricted to advanced users who know what they are doing. We can explain the consequences of changing it there.
Looking at Kubernetes resources, none of them have this confusing prefix, yet there are fields which users shouldn't set (the Job selector, for example).

Users don't read the docs and the field name is the only thing that they see, so I'd stick with it. It's very much the same as UnsupportedScyllaDBArgs.

Then why are we getting complaints about the quality of the documentation? They do, and it's the right place to write longer descriptions.

We don't have UnsupportedScyllaDBArgs in the new API, we have AdditionalScyllaDBArgs + description.

> Then why are we getting complaints about the quality of the documentation? They do

This is a probabilistic case. Just a few users/customers who don't are enough: they manage to build something on top of the unsupported options, and you'll get all the fires when it finally breaks, no matter that the docs said unsupported. At OpenShift, we have been there, see e.g.
https://github.com/openshift/api/blob/01b3675ba7b364e312ac5da6e632251fbaefdda0/operator/v1/types_ingress.go#L254
https://github.com/search?q=repo%3Aopenshift%2Fapi+unsupported&type=code&p=2
It's better to be safe than sorry. I don't intend to get paged over that.

I'm sure they will build on top of it anyway. Having a confusing prefix or not is not going to stop them. That's why we should improve documentation and user understanding. Having a confusing prefix doesn't help with that.

> I'm sure they will build on top of it anyway. Having a confusing prefix or not is not going to stop them.

I think this is more about having a stronger case for not caring about such cases than it is about preventing it entirely. I do however agree with @zimnx that the Unsupported prefix is confusing. I myself didn't understand what unsupportedOptions in NodeConfig were supposed to stand for at first, but that's something that actually encourages you to go and check, instead of just using it. Perhaps we could find a better prefix, but I don't have a better idea myself. Between unsupported and dropping the prefix altogether, I'd go with the former.

> That's why we should improve documentation and user understanding. Having a confusing prefix doesn't help with that.

It does, the prefix is meant for people who don't reach the docs. I have a hard time understanding how enhancing the docs helps with the group of people who don't read them.

We shouldn't optimize for slackers, it's entirely their fault they don't read the documentation.
I can imagine someone trying to deploy Scylla 1.0 without checking the supported versions, which are part of the documentation. Following your logic, shouldn't we add a confusing prefix to the version field to encourage them to read the docs?

Sounds like security by obscurity, but for an API.

Without any new input I am afraid we have to agree to disagree on this one.

pkg/controller/scylladbmonitoring/controller.go

tnozicka · 2024-08-30T16:20:04Z

#2061
/retest

rzetelskik · 2024-09-02T08:31:41Z

> We also have two (that I found) tests that are using the enterprise image:
>
> scylla-operator/test/e2e/set/scyllacluster/scyllamanager_object_storage.go
>
> Line 458 in 80c490c
>
> scyllaVersion: "2024.1.7",

> I think our goal should be to have a second version of the suite and run all the tests with enterprise (using different / replaced config.yaml)

So what about this one? It's a special case that won't be dependent on the general version being tested - should it stay defined in the test itself? Imo leaving it there, as opposed to moving it to the test config, makes it more prone to being forgotten about, but I don't feel strongly about this.

Other than that lgtm, no further comments.
/assign zimnx

tnozicka · 2024-09-02T09:29:34Z

> So what about this one? It's a special case that won't be dependent on the general version being tested - should it stay defined in the test itself? Imo leaving it there, as opposed to moving it to the test config, makes it more prone to being forgotten about, but I don't feel strongly about this.

I think this sucks both ways, as this shouldn't be in the config if we want to switch all regular versions in the enterprise suite. But I guess, to address the concerns for the interim, I can put it there with the intention of removing it soon.

tnozicka · 2024-09-02T09:50:15Z

> So what about this one? It's a special case that won't be dependent on the general version being tested - should it stay defined in the test itself? Imo leaving it there, as opposed to moving it to the test config, makes it more prone to being forgotten about, but I don't feel strongly about this.

> I think this sucks both ways, as this shouldn't be in the config if we want to switch all regular versions in the enterprise suite. But I guess, to address the concerns for the interim, I can put it there with the intention of removing it soon.

done

pkg/controller/nodeconfig/controller.go

assets/config/config.go

zimnx · 2024-09-02T14:16:58Z

pkg/controller/nodeconfig/sync_daemonsets.go

-	// FIXME: check that its not empty, emit event
-	// FIXME: add webhook validation for the format
+	if soc.Status.ScyllaDBUtilsImage == nil || len(*soc.Status.ScyllaDBUtilsImage) == 0 {
+		ncc.eventRecorder.Event(nc, corev1.EventTypeNormal, "MissingScyllaUtilsImage", "ScyllaOperatorConfig doesn't yet have scyllaUtilsImage available in the status.")


Shouldn't this result in a Progressing condition instead? It's expected to be available eventually.

I think for a different controller (NCC vs. SOC) this is a degraded state - it doesn't know whether the other controller can eventually make it, or is failing.
Effectively, it should be there from the operator start time, so emitting an event seems adequate.

I added the comment to the wrong line. I meant adding a Progressing condition instead of returning an error, which results in a Degraded state.
I wouldn't call it degraded because you don't know if anything errored out, maybe it's just a stale cache. Is a stale cache a cause of an object being degraded? I don't think so. In similar cases, our existing controllers wait with the Progressing condition set, so this is inconsistent behavior.

> maybe it's just a stale cache. Is a stale cache a cause of an object being degraded? I don't think so

Right after you install the operator, this field should get set on the first sync loop of SOC. Or on this one upgrade. If it's missing in other cases, something is broken - not a stale cache.

> In similar cases, our existing controllers wait with the Progressing condition set, so this is inconsistent behavior.

That's not exactly the same - the case you provide w.r.t. our other resources is within a single controller and dependent objects, while these are 2 independent controllers (also without any parent / child relationship).

A comparable case would be, say, wanting to read a kube-root-ca.crt provisioned by Kubernetes controllers. If it's not there, you are Degraded, not Progressing, as you are not making any progress toward fixing that state (either directly or indirectly). (And you don't know whether the Kubernetes controllers are Progressing toward the state, or Degraded.)

I guess it's also a thin boundary so I can see a bit of point going both ways here.

> That's not exactly the same - the case you provide w.r.t. our other resources is within a single controller and dependent objects, while these are 2 independent controllers

What I meant, for example, were cleanup Jobs waiting for the HostID annotation reconciled by the sidecar controller. These are two independent controllers.

> (also without any parent / child relationship).

I don't see how the relationship matters. Nevertheless, ScyllaOperatorConfig is an indirect parent of every resource reconciled by the Operator, so they are related. Especially NodeConfigs, since as of this PR SOC is part of its outputs.

> A comparable case would be, say, wanting to read a kube-root-ca.crt provisioned by Kubernetes controllers. If it's not there, you are Degraded, not Progressing, as you are not making any progress toward fixing that state (either directly or indirectly). (And you don't know whether the Kubernetes controllers are Progressing toward the state, or Degraded.)

We don't do anything when tuning is running, nor do we create cleanup jobs when the HostID annotation is missing, yet we don't consider ScyllaCluster degraded but progressing.

There's an artificial delay between when objects are created/updated and when they need to be synchronized. That is expected and it's not an error. In your case, the ConfigMap has a Namespace dependency, so not having it immediately after Namespace creation is expected.
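
The two positions in this thread can be contrasted with a small sketch. It uses simplified, hypothetical stand-ins (`socStatus`, `condition`, and both sync functions) rather than the operator's real types, and it assumes, as the thread suggests, that a returned sync error ends up surfaced as Degraded.

```go
package main

import (
	"errors"
	"fmt"
)

// socStatus is a hypothetical stand-in for the ScyllaOperatorConfig status.
type socStatus struct {
	ScyllaDBUtilsImage *string
}

// condition is a hypothetical, simplified status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// hasUtilsImage reports whether the image is set and non-empty.
func hasUtilsImage(st socStatus) bool {
	return st.ScyllaDBUtilsImage != nil && len(*st.ScyllaDBUtilsImage) > 0
}

// syncReturningError models the first position: a missing image fails the
// sync, and the caller is assumed to surface the error as Degraded.
func syncReturningError(st socStatus) error {
	if !hasUtilsImage(st) {
		return errors.New("scyllaUtilsImage is not yet available in the ScyllaOperatorConfig status")
	}
	return nil
}

// syncReportingProgress models the second position: a missing image is
// treated as "not there yet" and reported as a Progressing condition.
func syncReportingProgress(st socStatus) []condition {
	if !hasUtilsImage(st) {
		return []condition{{
			Type:    "Progressing",
			Status:  "True",
			Reason:  "AwaitingScyllaUtilsImage", // hypothetical reason string
			Message: "Waiting for ScyllaOperatorConfig to report scyllaUtilsImage in its status.",
		}}
	}
	return nil
}

func main() {
	pending := socStatus{} // image not reported yet
	fmt.Println(syncReturningError(pending))
	fmt.Println(syncReportingProgress(pending))
}
```

Both variants detect the same state. They differ only in how it is reported, which is the whole disagreement: whether a value owned by another controller being absent counts as a failure or as work still in flight.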

pkg/controller/nodeconfig/sync_daemonsets.go

test/e2e/set/scyllaoperatorconfig/config.go

zimnx · 2024-09-02T17:06:18Z

The comment threads where we don't agree are not important enough to block the release.
/lgtm

tnozicka added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 16, 2024

scylla-operator-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/dependency Issues or PRs related to dependency changes labels Aug 16, 2024

scylla-operator-bot bot requested review from rzetelskik and zimnx August 16, 2024 14:55

scylla-operator-bot bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2024

tnozicka force-pushed the image-config branch 10 times, most recently from b5b6f2d to 173d439 Compare August 22, 2024 10:12

scylla-operator-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 23, 2024

tnozicka force-pushed the image-config branch 3 times, most recently from 2a8ef53 to f1286c9 Compare August 28, 2024 18:02

scylla-operator-bot bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 28, 2024

Fix make test-unit to wire test files everywhere

b79e8de

tnozicka force-pushed the image-config branch from f1286c9 to 32e1f01 Compare August 30, 2024 06:09

scylla-operator-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2024

tnozicka changed the title ~~[WIP] Manage global configs using ScyllaOperatorConfig, add unsupported overrides and extract defaults to a common config~~ Manage global configs using ScyllaOperatorConfig, add unsupported overrides and extract defaults to a common config Aug 30, 2024

scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 30, 2024

tnozicka mentioned this pull request Aug 30, 2024

Update ScyllaDB versions to 6.1.1 and 2024.1.8 #2103

Merged

rzetelskik reviewed Aug 30, 2024

examples/eks/cluster.yaml

rzetelskik reviewed Aug 30, 2024

tnozicka force-pushed the image-config branch from 32e1f01 to 0ee442a Compare August 30, 2024 10:14

zimnx reviewed Aug 30, 2024

pkg/controller/scylladbmonitoring/controller.go (outdated)

tnozicka force-pushed the image-config branch from 0ee442a to d1c707d Compare August 30, 2024 12:47

scylla-operator-bot bot assigned zimnx Sep 2, 2024

tnozicka force-pushed the image-config branch from d1c707d to e33d4fc Compare September 2, 2024 09:50

zimnx reviewed Sep 2, 2024

pkg/controller/nodeconfig/controller.go (outdated)

rzetelskik reviewed Sep 2, 2024

assets/config/config.go

tnozicka force-pushed the image-config branch from e33d4fc to cdbfa35 Compare September 2, 2024 12:55

zimnx reviewed Sep 2, 2024

tnozicka force-pushed the image-config branch from cdbfa35 to a5941ff Compare September 2, 2024 15:47

tnozicka added 3 commits September 2, 2024 18:27

Manage global configs on with ScyllaOperatorConfig and extract defaults to a common config

4bbd1c4

Update dependencies

702f4ea

Update generated

4b932db

tnozicka force-pushed the image-config branch from a5941ff to 4b932db Compare September 2, 2024 16:27

scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Sep 2, 2024

scylla-operator-bot bot merged commit 878aa17 into scylladb:master Sep 2, 2024
13 of 14 checks passed

tnozicka deleted the image-config branch September 3, 2024 12:31

tnozicka mentioned this pull request Sep 3, 2024

Allow overriding embedded image references for testing #2071

Closed
