Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLS certs may not be picked up if updated at ES startup #54867

Closed
sebgl opened this issue Apr 7, 2020 · 3 comments · Fixed by #54999
Closed

TLS certs may not be picked up if updated at ES startup #54867

sebgl opened this issue Apr 7, 2020 · 3 comments · Fixed by #54999
Assignees
Labels
>bug :Security/TLS SSL/TLS, Certificates

Comments

@sebgl
Copy link

sebgl commented Apr 7, 2020

Elasticsearch version: any (observed with 6.8.5 and 7.6.1)

I think we spotted a bug where TLS transport certificates are not picked up by Elasticsearch, if they are updated on disk around the same time Elasticsearch starts. See this issue in ECK repository.

It's not easy to reproduce, but we observed it happening several time across multiple rolling upgrade scenarios. The following seems to be happening:

  • ES is running, and configured to use a transport TLS certificate file on disk. Any modification on this file is correctly picked up.
  • We trigger a rolling upgrade on the cluster. When a node is restarted, it may restart with a different IP address (we're running in Kubernetes). As a result, ECK generates a new certificate to match the new IP address, and replaces the existing one on disk.
  • In case we're not lucky, this happens when Elasticsearch (re)starts:
    • It picks the existing certificates on disk: the old ones with the old IP address, that has not been replaced yet
    • At that point the certificate file is updated
    • Elasticsearch starts monitoring the filesystem for any change on certificates file. We missed the certificate update here, and still serve the old certificate that does not exist on the disk anymore. The new cert on disk does not get picked up until it gets modified again.

I can imagine this is unlikely to happen since most people won't manipulate TLS certificates while Elasticsearch is still starting. However this is what happens in ECK, since we cannot know the node IP address until the Pod is created, at which point Elasticsearch is starting already.

More generally, as a user, I would expect Elasticsearch to pick up my TLS certificate file, whether I create/update them before Elasticsearch starts, during the startup, or after Elasticsearch is started. Having a small time window where this is not the case feels like a bug?

A quick chat with @ywangd confirmed, from reading the code, that there's a time window between loading certificates and watching for changes where any update is not accounted for.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-security (:Security/Network)

@ywangd
Copy link
Member

ywangd commented Apr 7, 2020

There are two reasons for it to happen, one major and one minor:

  • The high frequency watcher of ResourceWatcherService has an initial delay of 5 seconds before its 1st poll. If a cert file gets updated before the 1st poll, it won't be picked up. This could be the main contributor to the failure.

  • A minor possibility: Both SSLService and SSLConfigurationReloader are created in XPackPlugin as part of the bootstrap setup process. But ResourceWatcherService (used by SSLConfigurationReloader) does not start and queue its 1st poll until the start process is invoked. This is a much shorter gap. But may still worth considering.

@jaymode
Copy link
Member

jaymode commented Apr 7, 2020

I see this as more of a general issue since not only security could be affected:

  1. Starting of the resource watcher service. It should be one of the first items constructed and started within a node IMO.
  2. Ensuring that when an item is added to be watched we capture the relevant stats so we can detect a change between add and first poll

@jaymode jaymode self-assigned this Apr 8, 2020
sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Apr 8, 2020
A bug in Elasticsearch
(elastic/elasticsearch#54867) leads to TLS
certificates not being picked up during rolling upgrades. Impacted ES
nodes present a certificate with the wrong (old) Pod IP address.

Full TLS verification in those conditions lead to nodes not being able
to trust each other during (and after) rolling upgrades.

Let's switch TLS verification back to "certificate" (IP address is
ignored in the certificate but we still validate the cert signature),
until we figure out the proper way to reintroduce "full" verification.
sebgl added a commit to elastic/cloud-on-k8s that referenced this issue Apr 8, 2020
…2831)

A bug in Elasticsearch
(elastic/elasticsearch#54867) leads to TLS
certificates not being picked up during rolling upgrades. Impacted ES
nodes present a certificate with the wrong (old) Pod IP address.

Full TLS verification in those conditions lead to nodes not being able
to trust each other during (and after) rolling upgrades.

Let's switch TLS verification back to "certificate" (IP address is
ignored in the certificate but we still validate the cert signature),
until we figure out the proper way to reintroduce "full" verification.
sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Apr 8, 2020
…lastic#2831)

A bug in Elasticsearch
(elastic/elasticsearch#54867) leads to TLS
certificates not being picked up during rolling upgrades. Impacted ES
nodes present a certificate with the wrong (old) Pod IP address.

Full TLS verification in those conditions lead to nodes not being able
to trust each other during (and after) rolling upgrades.

Let's switch TLS verification back to "certificate" (IP address is
ignored in the certificate but we still validate the cert signature),
until we figure out the proper way to reintroduce "full" verification.
sebgl added a commit to elastic/cloud-on-k8s that referenced this issue Apr 8, 2020
…2831) (#2834)

A bug in Elasticsearch
(elastic/elasticsearch#54867) leads to TLS
certificates not being picked up during rolling upgrades. Impacted ES
nodes present a certificate with the wrong (old) Pod IP address.

Full TLS verification in those conditions lead to nodes not being able
to trust each other during (and after) rolling upgrades.

Let's switch TLS verification back to "certificate" (IP address is
ignored in the certificate but we still validate the cert signature),
until we figure out the proper way to reintroduce "full" verification.
jaymode added a commit to jaymode/elasticsearch that referenced this issue Apr 8, 2020
The ResourceWatcherService enables watching of files for modifications
and deletions. During startup various consumers register the files that
should be watched by this service. There is behavior that might be
unexpected in that the service may not start polling until later in the
startup process due to the use of lifecycle states to control when the
service actually starts the jobs to monitor resources. This change
removes this unexpected behavior so that upon construction the service
has already registered its tasks to poll resources for changes. In
making this modification, the service no longer extends
AbstractLifecycleComponent and instead implements the Closeable
interface so that the polling jobs can be terminated when the service
is no longer required.

Relates elastic#54867
jaymode added a commit to jaymode/elasticsearch that referenced this issue Apr 9, 2020
This change reworks the loading and monitoring of files that are used
for the construction of SSLContexts so that updates to these files are
not lost if the updates occur during startup. Previously, the
SSLService would parse the settings, build the SSLConfiguration
objects, and construct the SSLContexts prior to the
SSLConfigurationReloader starting to monitor these files for changes.
This allowed for a small window where updates to these files may never
be observed until the node restarted.

To remove the potential miss of a change to these files, the code now
parses the settings and builds SSLConfiguration instances prior to the
construction of the SSLService. The files back the SSLConfiguration
instances are then registered for monitoring and finally the SSLService
is constructed from the previously parse SSLConfiguration instances. As
the SSLService is not constructed when the code starts monitoring the
files for changes, a CompleteableFuture is used to obtain a reference
to the SSLService; this allows for construction of the SSLService to
complete and ensures that we do not miss any file updates during the
construction of the SSLService.

While working on this change, the SSLConfigurationReloader was also
refactored to reflect how it is currently used. When the
SSLConfigurationReloader was originally written the files that it
monitored could change during runtime. This is no longer the case as
we stopped the monitoring of files that back dynamic SSLContext
instances. In order to support the ability for items to change during
runtime, the class made use of concurrent data structures. The use of
these concurrent datastructures has been removed and the class is now
primarily a utility class that no longer needs to be instantiated.

Closes elastic#54867
jaymode added a commit that referenced this issue Apr 16, 2020
The ResourceWatcherService enables watching of files for modifications
and deletions. During startup various consumers register the files that
should be watched by this service. There is behavior that might be
unexpected in that the service may not start polling until later in the
startup process due to the use of lifecycle states to control when the
service actually starts the jobs to monitor resources. This change
removes this unexpected behavior so that upon construction the service
has already registered its tasks to poll resources for changes. In
making this modification, the service no longer extends
AbstractLifecycleComponent and instead implements the Closeable
interface so that the polling jobs can be terminated when the service
is no longer required.

Relates #54867
jaymode added a commit to jaymode/elasticsearch that referenced this issue Apr 16, 2020
The ResourceWatcherService enables watching of files for modifications
and deletions. During startup various consumers register the files that
should be watched by this service. There is behavior that might be
unexpected in that the service may not start polling until later in the
startup process due to the use of lifecycle states to control when the
service actually starts the jobs to monitor resources. This change
removes this unexpected behavior so that upon construction the service
has already registered its tasks to poll resources for changes. In
making this modification, the service no longer extends
AbstractLifecycleComponent and instead implements the Closeable
interface so that the polling jobs can be terminated when the service
is no longer required.

Relates elastic#54867
jaymode added a commit that referenced this issue Apr 16, 2020
The ResourceWatcherService enables watching of files for modifications
and deletions. During startup various consumers register the files that
should be watched by this service. There is behavior that might be
unexpected in that the service may not start polling until later in the
startup process due to the use of lifecycle states to control when the
service actually starts the jobs to monitor resources. This change
removes this unexpected behavior so that upon construction the service
has already registered its tasks to poll resources for changes. In
making this modification, the service no longer extends
AbstractLifecycleComponent and instead implements the Closeable
interface so that the polling jobs can be terminated when the service
is no longer required.

Relates #54867
Backport of #54993
jaymode added a commit that referenced this issue Apr 16, 2020
This change reworks the loading and monitoring of files that are used
for the construction of SSLContexts so that updates to these files are
not lost if the updates occur during startup. Previously, the
SSLService would parse the settings, build the SSLConfiguration
objects, and construct the SSLContexts prior to the
SSLConfigurationReloader starting to monitor these files for changes.
This allowed for a small window where updates to these files may never
be observed until the node restarted.

To remove the potential miss of a change to these files, the code now
parses the settings and builds SSLConfiguration instances prior to the
construction of the SSLService. The files back the SSLConfiguration
instances are then registered for monitoring and finally the SSLService
is constructed from the previously parse SSLConfiguration instances. As
the SSLService is not constructed when the code starts monitoring the
files for changes, a CompleteableFuture is used to obtain a reference
to the SSLService; this allows for construction of the SSLService to
complete and ensures that we do not miss any file updates during the
construction of the SSLService.

While working on this change, the SSLConfigurationReloader was also
refactored to reflect how it is currently used. When the
SSLConfigurationReloader was originally written the files that it
monitored could change during runtime. This is no longer the case as
we stopped the monitoring of files that back dynamic SSLContext
instances. In order to support the ability for items to change during
runtime, the class made use of concurrent data structures. The use of
these concurrent datastructures has been removed.

Closes #54867
jaymode added a commit to jaymode/elasticsearch that referenced this issue Apr 16, 2020
This change reworks the loading and monitoring of files that are used
for the construction of SSLContexts so that updates to these files are
not lost if the updates occur during startup. Previously, the
SSLService would parse the settings, build the SSLConfiguration
objects, and construct the SSLContexts prior to the
SSLConfigurationReloader starting to monitor these files for changes.
This allowed for a small window where updates to these files may never
be observed until the node restarted.

To remove the potential miss of a change to these files, the code now
parses the settings and builds SSLConfiguration instances prior to the
construction of the SSLService. The files back the SSLConfiguration
instances are then registered for monitoring and finally the SSLService
is constructed from the previously parse SSLConfiguration instances. As
the SSLService is not constructed when the code starts monitoring the
files for changes, a CompleteableFuture is used to obtain a reference
to the SSLService; this allows for construction of the SSLService to
complete and ensures that we do not miss any file updates during the
construction of the SSLService.

While working on this change, the SSLConfigurationReloader was also
refactored to reflect how it is currently used. When the
SSLConfigurationReloader was originally written the files that it
monitored could change during runtime. This is no longer the case as
we stopped the monitoring of files that back dynamic SSLContext
instances. In order to support the ability for items to change during
runtime, the class made use of concurrent data structures. The use of
these concurrent datastructures has been removed.

Closes elastic#54867
Backport of elastic#54999
jaymode added a commit to jaymode/elasticsearch that referenced this issue Apr 17, 2020
This change reworks the loading and monitoring of files that are used
for the construction of SSLContexts so that updates to these files are
not lost if the updates occur during startup. Previously, the
SSLService would parse the settings, build the SSLConfiguration
objects, and construct the SSLContexts prior to the
SSLConfigurationReloader starting to monitor these files for changes.
This allowed for a small window where updates to these files may never
be observed until the node restarted.

To remove the potential miss of a change to these files, the code now
parses the settings and builds SSLConfiguration instances prior to the
construction of the SSLService. The files back the SSLConfiguration
instances are then registered for monitoring and finally the SSLService
is constructed from the previously parse SSLConfiguration instances. As
the SSLService is not constructed when the code starts monitoring the
files for changes, a CompleteableFuture is used to obtain a reference
to the SSLService; this allows for construction of the SSLService to
complete and ensures that we do not miss any file updates during the
construction of the SSLService.

While working on this change, the SSLConfigurationReloader was also
refactored to reflect how it is currently used. When the
SSLConfigurationReloader was originally written the files that it
monitored could change during runtime. This is no longer the case as
we stopped the monitoring of files that back dynamic SSLContext
instances. In order to support the ability for items to change during
runtime, the class made use of concurrent data structures. The use of
these concurrent datastructures has been removed.

Closes elastic#54867
Backport of elastic#54999
jaymode added a commit that referenced this issue Apr 18, 2020
This change reworks the loading and monitoring of files that are used
for the construction of SSLContexts so that updates to these files are
not lost if the updates occur during startup. Previously, the
SSLService would parse the settings, build the SSLConfiguration
objects, and construct the SSLContexts prior to the
SSLConfigurationReloader starting to monitor these files for changes.
This allowed for a small window where updates to these files may never
be observed until the node restarted.

To remove the potential miss of a change to these files, the code now
parses the settings and builds SSLConfiguration instances prior to the
construction of the SSLService. The files back the SSLConfiguration
instances are then registered for monitoring and finally the SSLService
is constructed from the previously parse SSLConfiguration instances. As
the SSLService is not constructed when the code starts monitoring the
files for changes, a CompleteableFuture is used to obtain a reference
to the SSLService; this allows for construction of the SSLService to
complete and ensures that we do not miss any file updates during the
construction of the SSLService.

While working on this change, the SSLConfigurationReloader was also
refactored to reflect how it is currently used. When the
SSLConfigurationReloader was originally written the files that it
monitored could change during runtime. This is no longer the case as
we stopped the monitoring of files that back dynamic SSLContext
instances. In order to support the ability for items to change during
runtime, the class made use of concurrent data structures. The use of
these concurrent datastructures has been removed.

Closes #54867
Backport of #54999
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Security/TLS SSL/TLS, Certificates
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants