-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Manage global configs using ScyllaOperatorConfig, add unsupported overrides and extract defaults to a common config #2081
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tnozicka The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
b5b6f2d
to
173d439
Compare
2a8ef53
to
f1286c9
Compare
@zimnx @rzetelskik rebased to pick up the prerequisites and ready, ptal |
This place should probably also take the version from config:
We also have two (that I found) tests that are using the enterprise image:
2 is somewhat specific because it needs a particular minor version, but did you consider adding a config field for a general enterprise version (I suppose in |
Looks like that was being left out in the past quite often, I'll wire it and update
I think our goal should be to have a second version of the suite and run all the tests with enterprise (using different / replaced config.yaml) |
// for auxiliary purposes. | ||
// Setting this field renders your cluster unsupported. Use at your own risk. | ||
// +optional | ||
UnsupportedBashToolsImageOverride *string `json:"unsupportedBashToolsImageOverride,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsupported
prefix is confusing. In the past, users of fields prefixed with it (unsupportedOptions in NodeConfig mounts) didn't understand whether this field works, what it does, because of this prefix. I think it's a mistake to continue to use it.
I think it's enough to make it explicit in the field description, that using it should be restricted to advanced users knowing that they are doing. We can explain consequences of changing it there.
Looking at Kubernetes resources, there're none having this confusing prefix, yet there're field which users shouldn't set (Job selector for example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
users don't read the docs and the field name is the only thing that they see so I'd stick with it. it's very much the same like UnsupportedScyllaDBArgs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why we are getting complains about quality of documentation? They do, and it's the right place to write longer descriptions.
We don't have UnsupportedScyllaDBArgs in new API, we have AdditionalScyllaDBArgs + description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why we are getting complains about quality of documentation? They do
This is a probabilistic case. Just a few users/customers, who don't, are enough and manage to build something on top of the unsupported options and you'll get all the fires when it finally breaks no matter the docs said unsupported. At OpenShift, we have been there, see e.g.
https://github.com/openshift/api/blob/01b3675ba7b364e312ac5da6e632251fbaefdda0/operator/v1/types_ingress.go#L254
https://github.com/search?q=repo%3Aopenshift%2Fapi+unsupported&type=code&p=2
It's better to be safe then sorry. I don't intent to get paged over that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure they will build on top of it anyway. Having confusing prefix or not is not going to stop them. That's why we should improve documentation and user understanding. Having confusing prefix doesn't help with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure they will build on top of it anyway. Having confusing prefix or not is not going to stop them.
I think this is more about having a stronger case for not carting about such cases than it is about preventing it entirely. I do however agree with @zimnx that Unsupported
prefix is confusing, I myself didn't understand what unsupportedOptions
in NodeConfig were supposed to stand for at first, but that's something that actually encourages you to go and check, instead of just using it. Perhaps we could find a better prefix, but I don't have a better idea myself. Between unsupported
and dropping the prefix altogether, I'd go with the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why we should improve documentation and user understanding. Having confusing prefix doesn't help with that.
It does, the prefix is meant for people who don't reach the docs. I have a hard time undestanding how enhancing the docs helps with the group of people who don't read it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't optimize for slackers, it's entirely their fault they don't read the documentation.
I can imagine someone trying to deploy 1.0 Scylla without checking supported versions - part of the documentation. Following your logic, shouldn't we add confusing prefix to version
field to encourage them to read the docs?
Sounds like a security by obscurity but for API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without any new input I am afraid we have to agree to disagree on this one.
#2061 |
So what about this one? It's a special case that won't be dependent on the general version being tested - should it stay defined in the test itself? Imo leaving it there, as opposed to moving to the test config, makes it more prone to being forgotten about, but I don't feel strong about this. Other than that lgtm, no further comments. |
I think this sucks both ways as this shouldn't be in the config, if we want to switch all regular versions in the enterprise but I guess to address the concerns for the interim, I can put the there with a notion of removing it soon |
done |
d1c707d
to
e33d4fc
Compare
e33d4fc
to
cdbfa35
Compare
// FIXME: check that its not empty, emit event | ||
// FIXME: add webhook validation for the format | ||
if soc.Status.ScyllaDBUtilsImage == nil || len(*soc.Status.ScyllaDBUtilsImage) == 0 { | ||
ncc.eventRecorder.Event(nc, corev1.EventTypeNormal, "MissingScyllaUtilsImage", "ScyllaOperatorConfig doesn't yet have scyllaUtilsImage available in the status.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this result in Progressing condition instead? It's expected to be available eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for a different controller (NCC vs. SOC) this is a degraded state - it doesn't know whether the other controller can eventually make it, or is failing.
Effectively, it should be there from the operator start time, so emitting an event seems adequate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added comment to wrong line, I meant to add Progressing condition instead returning an error which results in Degraded state.
I wouldn't call it degraded because you don't know if anything errored out, maybe it's just stale cache. Is stale cache a cause of object being degraded, I don't think so. In similar cases, our existing controllers are waiting with Progressing conditiong having set so this is inconsistent behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's just stale cache. Is stale cache a cause of object being degraded, I don't think so
Right after you install the operator, this field should get set on the first sync loop of SOC. Or this one upgrade. If it's missing on other cases, something is broken - not a stale cache.
In similar cases, our existing controllers are waiting with Progressing conditiong having set so this is inconsistent behavior.
That's not exactly the same - the case you provide w.r.t. our other resources is within a single controller and dependent objects while these are 2 independent controllers (also without any parent / child relationship).
A comparable case would be to say wanting to read a kube-root-ca.crt
provisioned by Kubernetes controllers. If it's not there, you are Degraded, not Progressing as you are not making any progress to fixing that state (either directly or indirectly). (And you don't know whether the Kubernetes controllers are Progressing toward the state, or Degraded.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's also a thin boundary so I can see a bit of point going both ways here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not exactly the same - the case you provide w.r.t. our other resources is within a single controller and dependent objects while these are 2 independent controllers
What I meant for example, were cleanup Jobs waiting for HostID annotation reconciled by sidecar controller. These are two independent controllers.
(also without any parent / child relationship).
I don't see how relationship matters. Nevertheless, ScyllaOperatorConfig is indirect parent of every resource reconciled by Operator, so they are related. Especially NodeConfigs, as since this PR SOC is part of its outputs.
A comparable case would be to say wanting to read a kube-root-ca.crt provisioned by Kubernetes controllers. If it's not there, you are Degraded, not Progressing as you are not making any progress to fixing that state (either directly or indirectly). (And you don't know whether the Kubernetes controllers are Progressing toward the state, or Degraded.)
We don't do anything when tuning is running, nor we create cleanup jobs when HostID annotation is missing, yet we don't consider ScyllaCluster degraded but progressing.
There's artifical delay between when objects are created/updated when they need to be synchronized. That is expected and it's not an error. In your case ConfigMap has a Namespace dependency so not having it immediately after Namespace creation is expected.
cdbfa35
to
a5941ff
Compare
a5941ff
to
4b932db
Compare
Comment threads where we don't agree are not that important to block the release. |
878aa17
into
scylladb:master
Description of your changes:
This PR extends the ScyllaOperatorConfig controller to properly report status and the images that are used and wires the rest of the controllers to consume its status. It also wires unsupported overrides for the images used which can be used for testing or temporarily to help disconnected installs. (Users are responsible to make sure it's the same image version when touching the settings.)
It also extracts all image and version settings to a common place in
assets/config/config.yaml
.The only place left to manually update versions is the documentation. Maybe in the future we can let it include real examples and sync the versions there automatically as well.
Which issue is resolved by this Pull Request:
Resolves #1589 #2071
Requires