Keystore loading behavior is inconsistent between reload and startup #43722

dimatkach · 2019-06-28T01:46:17Z

If Elasticsearch finds a key it does not recognize in keystore on start up, it refuses to start. However, if keystore is reloaded dynamically, the errors are not reported.

This is a problem in the Cloud, because a user may add a value to keystore, and then, possibly months later the cluster would start bootlooping after an unrelated change.

Making the behavior consistent in either of two ways would solve the problem: either make reload API fail and return an error if settings are invalid, or allow startup process to continue even if a key is not recognized.

My preference would be the latter option: I think there are valid use cases when a user would want to add values to keystore before configuring the rest of functionality.
For example, a domain admin might add ldap secrets to the cluster, without having to coordinate with the cluster owner to configure rest of the settings.

/cc @tvernum

elasticmachine · 2019-06-28T01:48:24Z

Pinging @elastic/es-core-infra

tvernum · 2019-06-28T02:40:48Z

allow startup process to continue even if a key is not recognized.

You will find it hard to make this case.
We have a clear preference to detect problems and fail.

If we have a realm configured in elasticsearch.yml like:

xpack.security.authc.realms:
  ldap.corp_ldap:
    url: "ldaps://some.example.com:636"
    bind_dn: "CN=elasticsearch,OU=service_accounts,DC=example,DC=com"

and then a keystore with

xpack.security.authc.realms.ldap.corp-ldap.secure_bind_password

Then the user has made a mistake. They've named their realm "corp_ldap" in the YAML and "corp-ldap" in the keystore. We want to detect that and fail fast.

We could (and in the past have) make the assumption that the user knows what they're doing, and in this case they might be planing to create a second "corp-ldap" realm in the future, but such leniency leads to far more problems than it solves.

I understand why this causes problems. From an admin point of view, there's no reasonable option to update both files (YAML and Keystore) in a way that guarantees your node can be safely restarted. If you change the yml first, and the node is restarted before you change the keystore it will fail (which is the correct outcome, but not nice from an operations point of view), and likewise if you update them in the reverse.
And obviously, as you have pointed out, reloading the keystore makes all of that more complicated.

make reload API fail and return an error if settings are invalid

I think we can make progress on this (or at least, even if we're not ready to make it a hard failure, it could report back the settings that were incorrect). However, there are 2 complexities:

(1) On our side, it's a little trickier than it sounds.
It's not too hard for us to detect that xpack.security.authc.realms.ldap.corp-ldap.secure_bind_password might be a valid keystore setting (if you have an LDAP realm named "corp-ldap") and xpack.security.authc.realms.ldap.corp-ldap.bind_dn is never a valid keystore setting (because bind_dn is not a secure setting). That's just a matter of checking whether the items in the keystore match up to Setting keys that this node knows about.
But, the more extensive validation, that (a) this realm does exist, and (b) the assigned value is meaningful will take more work. At the moment a lot of that validation is implemented as part of the process by which the components initialise themselves. Extracting it so that we can run it in isolation & validate the settings, without actually re-initialising the components will take some time, and I wouldn't want to give you any expectation that it would happen soon.

(2) On your side, I don't think this would actually do what you want.
If you to make a change that adds a new realm to elasticsearch.yml, and then add the secure settings to the keystore, and then reload the keystore we would fail because the running instance does not know about the new realm.
So, you wouldn't be able to use this API to determine whether the node was going to be able to start up cleanly, because what you really want is a way to test: Is this combination of elasticsearch.yml and elasticsearch.keystore valid for node startup, and the only option we offer for that (now, and in the near future) is to try starting the node.

However, if you were able to differentiate between keystore changes that are part of a larger "config change" (where you change both the yml and the keystore) and those that are intended to modify a running node (e.g. a keystore change to update a password) then we probably could make some progress.
If you know which of those scenarios you're in, then you could:

in the former case, treat the whole change as a unit & apply it. If the node fails to start, then roll back the whole change (both yml and keystore)
in the latter case, call the reload keystore API, and have us validate it (which we don't do now, but potentially could).

Other random thoughts

I wonder whether there's a lot of use case for the dynamically adding a setting to the keystore. Most of the cases that I can think of are for updating a value rather than adding it. Updating an existing setting is alaways a safer change because you know that the key was valid when the node started, so the only potential problem is that the value is wrong (which could still prevent node startup).

How do you deal with the fact that not all keystore settings are dynamic?
e.g. Right now you can't change the secure_bind_password on an ldap realm by reloading the keystore. So if someone changes that setting on cloud, do you detect that it needs to restart the node?

dimatkach · 2019-06-28T15:16:05Z

Then the user has made a mistake. They've named their realm "corp_ldap" in the YAML and "corp-ldap" in the keystore. We want to detect that and fail fast.

Indeed, they have, and we do!
But we don't have to do that (detect and fail) on keystore load - failing to start because configured realm corp_ldap does not have password set would do just as well if not better, and it would not create problems like the one we are discussing here.

I agree, that ES should detect and fail fast situations when configuration is invalid, just not when there is some (yet) unused key in the keystore. I am not sure I understand what is the harm in letting people add things to keystore before the rest of configuration is in place (like in my example above, it could actually be handled by different people).

I wonder whether there's a lot of use case for the dynamically adding a setting to the keystore.

Credentials for snapshot repos is one example I can think of.

So if someone changes that setting on cloud, do you detect that it needs to restart the node?

We don't (seems like that would require too much knowledge from us about every individual setting), it is up to the user to do a rolling restart after they change something that is not reloadable in ES.

tvernum · 2019-07-01T03:38:37Z

I agree, that ES should detect and fail fast situations when configuration is invalid, just not when there is some (yet) unused key in the keystore. I am not sure I understand what is the harm in letting people add things to keystore before the rest of configuration is in place (like in my example above, it could actually be handled by different people).

There is no reasonable heuristic that can tell the difference between a mistake and an intentional pre-configuration well enough to be of benefit to users.

It presumes that a typo in a setting key, will will always cause a hard failure due to the correct key being missing. In some cases this is true, but it is not always true.

Some settings are optional, it is intentional that the node will start without the setting being present. In such cases, a typo in a setting can only be detected because the incorrect setting exists, which is why we enforce that every setting has to be meaningful.

This is not just an academic problem. I have certainly seen far more cases of people putting unsupported entries into the keystore because they made a mistake (either a typo in the key, or misreading the docs and getting the key name wrong, or putting non-secure settings into the keystore) than I have seen people intentionally put settings in there that they knew would be unused (yet).

amitavmohanty01 · 2019-07-02T19:45:04Z

I think you are mixing concerns here.

The concerns at play are:

picking settings from elasticsearch process
storing secure settings in a keystore
detecting incorrect settings

The approach taken by you is that when incorrect settings are added to keystore, you fail the process startup. You think you are failing fast. You are actually violating the design boundary of keystore though. A keystore should have the capability of containing arbitrary keys. It is the process accessing it that should know how to handle them. Thus the keystore is becoming a validator.

If you want to help users detect incorrect settings, give them a script to validate which settings are correct. Here is how the design of the tools should be:

keystore: lets you store whatever keys you want to store
elasticsearch process picks up relevant keys as mentioned in config and ignores others
validation script tells users that the keys mentioned in the config YAML are missing in the keystore and the user should fix that

Putting older entries, redundant entries in a keystore is a perfectly valid scenario.

Imagine a multi-tenant keystore situation. The keys for all processes are added to the keystore for simplicity and each process picks what is necessary.

Another mistake you are doing design wise is that there are two sources of truth for the elasticsearch process. If there are two sources of truth, eventually there will always be conflict (which is kind of why we are having this discussion). To have a single source of truth, it is preferable to rely on the YAML only and refer to values in the keystore.

tvernum · 2019-07-03T02:11:51Z

The elasticsearch keystore is not, and was never intended to be a general purpose keystore.
It is specifically designed as an extension of the YAML file into a encrypted format of our own design, that is cleared from memory as soon as it is processed.

It should not be shared between processes (and cannot be because it must be stored in the same directory as the YAML).

There are other designs that we could have implemented, including the reference approach that you are advocating, however such a design would not have satisfied the constraints that we wanted for the keystore.

amitavmohanty01 · 2019-07-03T09:13:06Z

I understand that we might be coming up with suggestions based on partial knowledge and hence not being able to figure out why something makes sense for you. I am not sure if you would be at the liberty of sharing more design details.

I think the actionable points to know would be the following:

can you consider the reference based design in a future release ?
does it make sense to provide the incorrect settings detection tool again of course in a future release?

rjernst · 2019-07-12T07:19:39Z

To expand on what @tvernum has been saying here, we have long wanted to add validation the keystore cli. The way validation of all settings works (elasticsearch.yml and the keystore) is by the code internally registering every allowed setting, and launching a node validates the type and data of each setting found.

Unfortunately running this validation in the cli is difficult because plugins are allowed to add settings (and in fact much of the system, including all of xpack, is implemented through plugin apis). These plugins are loaded into separate classloaders when a node is launched, and we also are already locked down with a SecurityManager. Additionally, these plugins can do complex initialization to work correctly with the security manager. In the keystore cli tool, we do not run with a security manager, so loading the plugins is not currently an option.

The long term plan is to extract the part of plugins that expose settings into a minimal api that would not trigger loading the complex setup the full plugin initialization needs.

can you consider the reference based design in a future release ?

This was considered when we developed our keystore, but it would have complicated how the keystore settings are ready by the system, especially around validation. It is not something we plan to change.

does it make sense to provide the incorrect settings detection tool again of course in a future release?

As described above, we want the keystore cli itself to do this validation, and we very much want to add it in a future release, but there is not yet any concrete timeline for when that will happen.

rjernst · 2019-08-29T18:41:02Z

I've opened a separate issue, #46148, which details the desire to have keystore validation. I'm closing this issue in favor of that one.

tvernum added the :Core/Infra/Settings Settings infrastructure and APIs label Jun 28, 2019

tvernum added the discuss label Jun 28, 2019

jpountz added team-discuss and removed discuss labels Jun 28, 2019

rjernst closed this as completed Aug 29, 2019

waberc mentioned this issue Oct 4, 2022

resource "ec_deployment_elasticsearch_keystore" does not remove despite "will be destroyed" elastic/terraform-provider-ec#541

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keystore loading behavior is inconsistent between reload and startup #43722

Keystore loading behavior is inconsistent between reload and startup #43722

dimatkach commented Jun 28, 2019

elasticmachine commented Jun 28, 2019

tvernum commented Jun 28, 2019

dimatkach commented Jun 28, 2019 •

edited

Loading

tvernum commented Jul 1, 2019

amitavmohanty01 commented Jul 2, 2019

tvernum commented Jul 3, 2019

amitavmohanty01 commented Jul 3, 2019

rjernst commented Jul 12, 2019

rjernst commented Aug 29, 2019

Keystore loading behavior is inconsistent between reload and startup #43722

Keystore loading behavior is inconsistent between reload and startup #43722

Comments

dimatkach commented Jun 28, 2019

elasticmachine commented Jun 28, 2019

tvernum commented Jun 28, 2019

Other random thoughts

dimatkach commented Jun 28, 2019 • edited Loading

tvernum commented Jul 1, 2019

amitavmohanty01 commented Jul 2, 2019

tvernum commented Jul 3, 2019

amitavmohanty01 commented Jul 3, 2019

rjernst commented Jul 12, 2019

rjernst commented Aug 29, 2019

dimatkach commented Jun 28, 2019 •

edited

Loading