Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manage global configs using ScyllaOperatorConfig, add unsupported overrides and extract defaults to a common config #2081

Merged
merged 4 commits into from
Sep 2, 2024

Conversation

tnozicka
Copy link
Member

@tnozicka tnozicka commented Aug 16, 2024

Description of your changes:
This PR extends the ScyllaOperatorConfig controller to properly report status and the images that are used and wires the rest of the controllers to consume its status. It also wires unsupported overrides for the images used which can be used for testing or temporarily to help disconnected installs. (Users are responsible to make sure it's the same image version when touching the settings.)

It also extracts all image and version settings to a common place in assets/config/config.yaml.

The only place left to manually update versions is the documentation. Maybe in the future we can let it include real examples and sync the versions there automatically as well.

Which issue is resolved by this Pull Request:
Resolves #1589 #2071

Requires

@tnozicka tnozicka added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 16, 2024
@scylla-operator-bot scylla-operator-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/dependency Issues or PRs related to dependency changes labels Aug 16, 2024
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@scylla-operator-bot scylla-operator-bot bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2024
@tnozicka tnozicka force-pushed the image-config branch 10 times, most recently from b5b6f2d to 173d439 Compare August 22, 2024 10:12
@scylla-operator-bot scylla-operator-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 23, 2024
@tnozicka tnozicka force-pushed the image-config branch 3 times, most recently from 2a8ef53 to f1286c9 Compare August 28, 2024 18:02
@scylla-operator-bot scylla-operator-bot bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 28, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2024
@tnozicka tnozicka changed the title [WIP] Manage global configs using ScyllaOperatorConfig, add unsupported overrides and extract defaults to a common config Manage global configs using ScyllaOperatorConfig, add unsupported overrides and extract defaults to a common config Aug 30, 2024
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 30, 2024
@tnozicka
Copy link
Member Author

tnozicka commented Aug 30, 2024

@zimnx @rzetelskik rebased to pick up the prerequisites and ready, ptal

@rzetelskik
Copy link
Member

rzetelskik commented Aug 30, 2024

This place should probably also take the version from config:

We also have two (that I found) tests that are using the enterprise image:

2 is somewhat specific because it needs a particular minor version, but did you consider adding a config field for a general enterprise version (I suppose in operatorTests)? Or is the general plan to have a separate suite for enterprise, in which case the ScyllaDBVersion field would be enough?

@tnozicka
Copy link
Member Author

"5.2.6"

Looks like that was being left out in the past quite often, I'll wire it and update

enterprise

I think our goal should be to have a second version of the suite and run all the tests with enterprise (using different / replaced config.yaml)

// for auxiliary purposes.
// Setting this field renders your cluster unsupported. Use at your own risk.
// +optional
UnsupportedBashToolsImageOverride *string `json:"unsupportedBashToolsImageOverride,omitempty"`
Copy link
Collaborator

@zimnx zimnx Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsupported prefix is confusing. In the past, users of fields prefixed with it (unsupportedOptions in NodeConfig mounts) didn't understand whether this field works, what it does, because of this prefix. I think it's a mistake to continue to use it.

I think it's enough to make it explicit in the field description, that using it should be restricted to advanced users knowing that they are doing. We can explain consequences of changing it there.
Looking at Kubernetes resources, there're none having this confusing prefix, yet there're field which users shouldn't set (Job selector for example).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

users don't read the docs and the field name is the only thing that they see so I'd stick with it. it's very much the same like UnsupportedScyllaDBArgs

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why we are getting complains about quality of documentation? They do, and it's the right place to write longer descriptions.

We don't have UnsupportedScyllaDBArgs in new API, we have AdditionalScyllaDBArgs + description.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why we are getting complains about quality of documentation? They do

This is a probabilistic case. Just a few users/customers, who don't, are enough and manage to build something on top of the unsupported options and you'll get all the fires when it finally breaks no matter the docs said unsupported. At OpenShift, we have been there, see e.g.
https://github.com/openshift/api/blob/01b3675ba7b364e312ac5da6e632251fbaefdda0/operator/v1/types_ingress.go#L254
https://github.com/search?q=repo%3Aopenshift%2Fapi+unsupported&type=code&p=2
It's better to be safe then sorry. I don't intent to get paged over that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure they will build on top of it anyway. Having confusing prefix or not is not going to stop them. That's why we should improve documentation and user understanding. Having confusing prefix doesn't help with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure they will build on top of it anyway. Having confusing prefix or not is not going to stop them.

I think this is more about having a stronger case for not carting about such cases than it is about preventing it entirely. I do however agree with @zimnx that Unsupported prefix is confusing, I myself didn't understand what unsupportedOptions in NodeConfig were supposed to stand for at first, but that's something that actually encourages you to go and check, instead of just using it. Perhaps we could find a better prefix, but I don't have a better idea myself. Between unsupported and dropping the prefix altogether, I'd go with the former.

Copy link
Member Author

@tnozicka tnozicka Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why we should improve documentation and user understanding. Having confusing prefix doesn't help with that.

It does, the prefix is meant for people who don't reach the docs. I have a hard time undestanding how enhancing the docs helps with the group of people who don't read it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't optimize for slackers, it's entirely their fault they don't read the documentation.
I can imagine someone trying to deploy 1.0 Scylla without checking supported versions - part of the documentation. Following your logic, shouldn't we add confusing prefix to version field to encourage them to read the docs?

Sounds like a security by obscurity but for API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without any new input I am afraid we have to agree to disagree on this one.

@tnozicka
Copy link
Member Author

#2061
/retest

@rzetelskik
Copy link
Member

We also have two (that I found) tests that are using the enterprise image:

I think our goal should be to have a second version of the suite and run all the tests with enterprise (using different / replaced config.yaml)

So what about this one? It's a special case that won't be dependent on the general version being tested - should it stay defined in the test itself? Imo leaving it there, as opposed to moving to the test config, makes it more prone to being forgotten about, but I don't feel strong about this.

Other than that lgtm, no further comments.
/assign zimnx

@tnozicka
Copy link
Member Author

tnozicka commented Sep 2, 2024

So what about this one? It's a special case that won't be dependent on the general version being tested - should it stay defined in the test itself? Imo leaving it there, as opposed to moving to the test config, makes it more prone to being forgotten about, but I don't feel strong about this.

I think this sucks both ways as this shouldn't be in the config, if we want to switch all regular versions in the enterprise but I guess to address the concerns for the interim, I can put the there with a notion of removing it soon

@tnozicka
Copy link
Member Author

tnozicka commented Sep 2, 2024

So what about this one? It's a special case that won't be dependent on the general version being tested - should it stay defined in the test itself? Imo leaving it there, as opposed to moving to the test config, makes it more prone to being forgotten about, but I don't feel strong about this.

I think this sucks both ways as this shouldn't be in the config, if we want to switch all regular versions in the enterprise but I guess to address the concerns for the interim, I can put the there with a notion of removing it soon

done

// FIXME: check that its not empty, emit event
// FIXME: add webhook validation for the format
if soc.Status.ScyllaDBUtilsImage == nil || len(*soc.Status.ScyllaDBUtilsImage) == 0 {
ncc.eventRecorder.Event(nc, corev1.EventTypeNormal, "MissingScyllaUtilsImage", "ScyllaOperatorConfig doesn't yet have scyllaUtilsImage available in the status.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this result in Progressing condition instead? It's expected to be available eventually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for a different controller (NCC vs. SOC) this is a degraded state - it doesn't know whether the other controller can eventually make it, or is failing.
Effectively, it should be there from the operator start time, so emitting an event seems adequate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comment to wrong line, I meant to add Progressing condition instead returning an error which results in Degraded state.
I wouldn't call it degraded because you don't know if anything errored out, maybe it's just stale cache. Is stale cache a cause of object being degraded, I don't think so. In similar cases, our existing controllers are waiting with Progressing conditiong having set so this is inconsistent behavior.

Copy link
Member Author

@tnozicka tnozicka Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's just stale cache. Is stale cache a cause of object being degraded, I don't think so

Right after you install the operator, this field should get set on the first sync loop of SOC. Or this one upgrade. If it's missing on other cases, something is broken - not a stale cache.

In similar cases, our existing controllers are waiting with Progressing conditiong having set so this is inconsistent behavior.

That's not exactly the same - the case you provide w.r.t. our other resources is within a single controller and dependent objects while these are 2 independent controllers (also without any parent / child relationship).

A comparable case would be to say wanting to read a kube-root-ca.crt provisioned by Kubernetes controllers. If it's not there, you are Degraded, not Progressing as you are not making any progress to fixing that state (either directly or indirectly). (And you don't know whether the Kubernetes controllers are Progressing toward the state, or Degraded.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's also a thin boundary so I can see a bit of point going both ways here.

Copy link
Collaborator

@zimnx zimnx Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not exactly the same - the case you provide w.r.t. our other resources is within a single controller and dependent objects while these are 2 independent controllers

What I meant for example, were cleanup Jobs waiting for HostID annotation reconciled by sidecar controller. These are two independent controllers.

(also without any parent / child relationship).

I don't see how relationship matters. Nevertheless, ScyllaOperatorConfig is indirect parent of every resource reconciled by Operator, so they are related. Especially NodeConfigs, as since this PR SOC is part of its outputs.

A comparable case would be to say wanting to read a kube-root-ca.crt provisioned by Kubernetes controllers. If it's not there, you are Degraded, not Progressing as you are not making any progress to fixing that state (either directly or indirectly). (And you don't know whether the Kubernetes controllers are Progressing toward the state, or Degraded.)

We don't do anything when tuning is running, nor we create cleanup jobs when HostID annotation is missing, yet we don't consider ScyllaCluster degraded but progressing.

There's artifical delay between when objects are created/updated when they need to be synchronized. That is expected and it's not an error. In your case ConfigMap has a Namespace dependency so not having it immediately after Namespace creation is expected.

pkg/controller/nodeconfig/sync_daemonsets.go Show resolved Hide resolved
test/e2e/set/scyllaoperatorconfig/config.go Outdated Show resolved Hide resolved
@zimnx
Copy link
Collaborator

zimnx commented Sep 2, 2024

Comment threads where we don't agree are not that important to block the release.
/lgtm

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Sep 2, 2024
@scylla-operator-bot scylla-operator-bot bot merged commit 878aa17 into scylladb:master Sep 2, 2024
13 of 14 checks passed
@tnozicka tnozicka deleted the image-config branch September 3, 2024 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/dependency Issues or PRs related to dependency changes kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add configuration file to specify versions and images so it's easy to update
3 participants